climateprediction.net home page
stuck task (1006)

Message boards : Number crunching : stuck task (1006)

Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 70629 - Posted: 8 Mar 2024, 10:31:04 UTC

I have a task that is stuck on 98.058%. Task properties shows --- against time since last checkpoint. Restarting client and manager makes no difference.
1. Is there anything I can do to bump start the task?
2. Is there anything I can look at to try and work out what has gone wrong? Nothing in event log or event log backup from before re-starting.

I shall wait till after the second task has completed before looking at whether I can start the wah2_8.29_windows_intelx86.exe manually. (Or I could just abort.)
ID: 70629
Glenn Carver

Joined: 29 Oct 17
Posts: 807
Credit: 13,593,584
RAC: 7,495
Message 70630 - Posted: 8 Mar 2024, 11:30:10 UTC - in response to Message 70629.  
Last modified: 8 Mar 2024, 11:33:39 UTC

Hi Dave,

Are you sure the task processes are actually running? That's the first thing to check. There are 3 processes per task. Count the number of wah2_8.29_windows_intelx86.exe, wah2am* & wah2rm* processes you have running in Task Manager (or Resource Monitor). If the counts don't match the number of tasks boincmgr shows as running, let me know.
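
As a rough illustration of that check, here is a minimal Python sketch that counts processes by name prefix. It assumes the three prefixes named above; the helper names, and the wah2am/wah2rm file names used in any example, are hypothetical. The optional Windows helper reads image names from the standard `tasklist` command.

```python
import csv
import io
import subprocess
from collections import Counter

# Prefixes taken from the post: monitor executable, wah2am* and wah2rm*.
PREFIXES = ("wah2_", "wah2am", "wah2rm")


def count_model_processes(process_names):
    """Count how many process names start with each of the three prefixes."""
    counts = Counter({prefix: 0 for prefix in PREFIXES})
    for name in process_names:
        lower = name.lower()
        for prefix in PREFIXES:
            if lower.startswith(prefix):
                counts[prefix] += 1
                break
    return dict(counts)


def running_process_names():
    """Windows only: return image names from 'tasklist' in CSV form."""
    result = subprocess.run(
        ["tasklist", "/fo", "csv", "/nh"],
        capture_output=True, text=True, check=True,
    )
    return [row[0] for row in csv.reader(io.StringIO(result.stdout)) if row]
```

With N tasks running you would expect each of the three counts to equal N; any count that is lower points at a missing process.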

If they match then have a look in the model log files to see what's going on. First, find the folder for the task in question. Using one of mine as an example, if boincmgr shows the name of the task as 'wah2_eas25_a3pf_200912_24_1007_012269659_0', go to your boinc data folder (might be hidden), then projects\climateprediction.net, and you should see a folder of the same name but without the trailing _0 (task try number).
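
The folder lookup described above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the BOINC data folder location is left as a parameter because it varies between installs, and it assumes the task name always ends in a numeric try number.

```python
import re
from pathlib import Path


def task_folder_name(task_name):
    """Drop the trailing try number (e.g. '_0') from a task name.

    Assumes the name really does end in '_<digits>'; a name without a
    try number would lose its last numeric field instead.
    """
    return re.sub(r"_\d+$", "", task_name)


def task_folder(boinc_data_dir, task_name):
    """Build the expected task folder path under the project directory."""
    return (
        Path(boinc_data_dir)
        / "projects"
        / "climateprediction.net"
        / task_folder_name(task_name)
    )
```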

In the task folder you should see a text file: stdout_mon.txt. This contains a print of the timesteps completed. Check the 'Date modified' column: was the file updated recently? It's normally updated every few minutes.

Open the file up. It'll contain lines like this:
wah2_eas25_a3pf_200912_24_1007_012269659 - PH 1 TS 0131015 A - 12/11/2010 17:45 - H:M:S=0061:51:37 AVG= 1.70 DLT= 0.30
wah2_eas25_a3pf_200912_24_1007_012269659 - PH 1 TS 0131016 A - 12/11/2010 18:00 - H:M:S=0061:51:37 AVG= 1.70 DLT= 0.28
wah2_eas25_a3pf_200912_24_1007_012269659 - PH 1 TS 0131017 A - 12/11/2010 18:15 - H:M:S=0061:51:38 AVG= 1.70 DLT= 0.47
Your names & numbers will be different, but that's a timestep log of how far the model has got. You can reload this file every few minutes to see if the lines have changed. What you're looking for is changes to the middle of the line:
.. A - 12/11/2010 18:15 ....
That's the current model date & time.
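
If you'd rather not compare the lines by eye, the same check can be sketched in Python. This is a rough sketch that assumes the line layout matches the sample above (a 'TS' counter, a one-character flag, then the model date and time); the function names are hypothetical, and the date is kept as plain text because the sample alone doesn't say whether it is day/month or month/day.

```python
import re

# Matches e.g. "TS 0131017 A - 12/11/2010 18:15" from the sample lines.
LINE_RE = re.compile(r"TS\s+(\d+)\s+\S+\s+-\s+(\d{2}/\d{2}/\d{4} \d{2}:\d{2})")


def last_model_time(log_text):
    """Return (timestep, model date string) from the last matching line."""
    last = None
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            last = (int(match.group(1)), match.group(2))
    return last


def has_progressed(earlier, later):
    """True if the later reading shows a higher timestep than the earlier one."""
    return later is not None and (earlier is None or later[0] > earlier[0])
```

Read the file once, wait a few minutes, read it again, and pass the two results to has_progressed. The comparison uses only the timestep counter, on the assumption that it keeps increasing while the model runs.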

If that shows the model is not progressing, despite the process running, that's unusual. Never seen that before.

Zip up the files: stdout_mon.txt, stdout_rm.txt, stdout_um.txt in the task folder, together with stderr.txt in the task's slot folder and email the zip to me. I'll take a look and see what's going on.
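
For anyone who prefers to script the bundling, a minimal Python sketch follows. It assumes only the four file names listed above; the function name and the idea of skipping missing files are my own choices, and the task and slot folder paths must be supplied by you.

```python
import zipfile
from pathlib import Path

# Log files expected in the task folder, as listed in the post.
TASK_LOGS = ("stdout_mon.txt", "stdout_rm.txt", "stdout_um.txt")


def bundle_logs(task_dir, slot_dir, out_zip):
    """Zip the task logs plus the slot's stderr.txt; return the names added.

    Files that don't exist are skipped rather than raising an error.
    """
    candidates = [Path(task_dir) / name for name in TASK_LOGS]
    candidates.append(Path(slot_dir) / "stderr.txt")
    added = []
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in candidates:
            if path.is_file():
                archive.write(path, arcname=path.name)
                added.append(path.name)
    return added
```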

p.s. had you made any hand edits to the client_state.xml file at all?
p.p.s. it's not possible to 'hand-start' the task. It has to be done under boinc.
---
CPDN Visiting Scientist
ID: 70630
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 70631 - Posted: 8 Mar 2024, 14:18:23 UTC - in response to Message 70630.  

Just two tasks in the vm running windows (tiny10)

Two wah2_8.29_windows_intelx86.exe processes are running. Both say they are using 0.5MB RAM but only one has any disk usage.
Two wah2am processes are running, one using just 0.8MB RAM, the other 152.1MB.
Two wah2rm processes show as running, one averaging about 26% cpu usage (VM has 4 cpus allocated). That one shows about 257MB of RAM. The other shows 0% cpu usage and just 1MB of RAM.

Pretty obvious that the task crashed I think. Three or four instances of this in stderr.

Model crashed: READHIST: End of file in READ from history file for namelist NLIHISTO                                                                                                                                                                                           tmp/xadae.pipe_dummy 


I couldn't find anything informative in the other files you mentioned. There was a power outage for five minutes which is probably relevant. If you still want the files to check I can send them but I am not hopeful of finding much. I am just glad none of my Linux testing branch tasks suffered!
ID: 70631
Glenn Carver

Joined: 29 Oct 17
Posts: 807
Credit: 13,593,584
RAC: 7,495
Message 70632 - Posted: 8 Mar 2024, 19:41:00 UTC - in response to Message 70631.  

Could you please send the text files? I'd like to have a look.

The monitor process has obviously disappeared, but the models should have stopped as well, since each process checks that the others are still running. Not sure what's happening there.

Thanks.
---
CPDN Visiting Scientist
ID: 70632
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 70633 - Posted: 8 Mar 2024, 21:32:31 UTC - in response to Message 70632.  
Last modified: 9 Mar 2024, 17:34:29 UTC

Will send in morning. No hand edits to client_state.xml or other BOINC files.

Edit: Done. It will be interesting if there is anything significant. I was a bit surprised not to lose anything else from an unplanned powerdown. I did notice that some of the aborted tasks still had their folders sitting there to delete. Not a biggie as I check these things periodically anyway and as you said, that has been fixed or will be before further batches go out.
ID: 70633
Glenn Carver

Joined: 29 Oct 17
Posts: 807
Credit: 13,593,584
RAC: 7,495
Message 70634 - Posted: 9 Mar 2024, 22:33:22 UTC - in response to Message 70633.  
Last modified: 9 Mar 2024, 22:33:51 UTC

Having looked at the output, I think what's happened is that the power cut killed the task as it was writing out the history files (or checkpoint files if you want to call them that). That left them incomplete, so when the model restarted it couldn't read the input it needed.

The puzzle for me is why the two model processes didn't get killed as well. I should be able to create a test with a corrupted history file and see if I get the same behaviour.
---
CPDN Visiting Scientist
ID: 70634
