Message boards : Number crunching : Tasks stuck for days
Author | Message |
---|---|
Send message Joined: 9 Nov 20 Posts: 6 Credit: 6,943,088 RAC: 2,922 |
I have a few tasks stuck at around 86% with a reported elapsed time of 14d. They have actually run longer than that, as the display has shown 14d for a while now, and the remaining time shows nothing. Should I leave them or abort them? Properties from an example unit: Application: Weather At Home 2 (wah2) (region independent) 8.29; Name: wah2_eas25_a14t_201212_24_1015_012278165; State: Suspended - user request; Received: 16/04/2024 11:54:52; Report deadline: 25/06/2024 11:54:51; Estimated computation size: 3,801,388 GFLOPs; CPU time: 10d 20:05:20; CPU time since checkpoint: ---; Elapsed time: 14d 18:29:19; Estimated time remaining: ---; Fraction done: 86.824%; Virtual memory size: 185.28 MB; Working set size: 75.98 MB; Directory: slots/39; Process ID: 26168; Progress rate: 0.360% per hour; Executable: wah2_8.29_windows_intelx86.exe |
Send message Joined: 15 May 09 Posts: 4541 Credit: 19,039,635 RAC: 18,944 |
State: "Suspended - user request". Did you suspend the task to see whether setting it running again would get it going? |
Send message Joined: 9 Nov 20 Posts: 6 Credit: 6,943,088 RAC: 2,922 |
I suspend BOINC during the day and let it run overnight. |
Send message Joined: 29 Oct 17 Posts: 1051 Credit: 16,655,437 RAC: 10,602 |
Hi Ryan, Abort them. That task you showed is from batch 1015 (notice the _1015_ in the name). That batch was closed way back in July because there was a problem with that version of the Weather@Home app. Abort any other tasks from 1015. Glenn --- CPDN Visiting Scientist |
Send message Joined: 9 Nov 20 Posts: 6 Credit: 6,943,088 RAC: 2,922 |
Will do, thanks :) |
Send message Joined: 14 Sep 08 Posts: 127 Credit: 43,257,301 RAC: 72,605 |
I just had something similar on newer batches after an unplanned power outage. Usually when such incidents happen, tasks either resume successfully or fail with a computation error soon after restart. This time, however, I have 4 WUs that neither errored out nor made any progress. I noticed this when checking CPU utilization, because they were not consuming any CPU cycles. After waiting 15 minutes or so, I restarted the BOINC client to force a restart, but they got stuck in the same way. From the logs, they all have "Model crashed" in the output. The WUs that I had to abort manually are the following: https://main.cpdn.org/result.php?resultid=22468845 https://main.cpdn.org/result.php?resultid=22503701 https://main.cpdn.org/result.php?resultid=22505634 https://main.cpdn.org/result.php?resultid=22498243 For comparison, these are the ones that errored out, which is the normal behavior that occasionally happens after an unplanned outage (still much reduced compared to 8.24): https://main.cpdn.org/result.php?resultid=22504490 https://main.cpdn.org/result.php?resultid=22476951 |
Send message Joined: 29 Oct 17 Posts: 1051 Credit: 16,655,437 RAC: 10,602 |
I looked at the first task on that list and it failed with this error: "Model crashed: READHIST: End of file in READ from history file for namelist NLIHISTO". Any reference to a history file means one of the models was trying to restart when the client started the task. If it hits end of file, it usually means the previous process did not finish writing the history (restart) file for some reason, or the file has been corrupted on disk. I've seen this behaviour in testing and logged an issue for it. I don't know if it's a code issue that's always been there or one I've introduced. Without looking at Ryan's earlier tasks I can't say if it's the same behaviour. --- CPDN Visiting Scientist |
Send message Joined: 14 Sep 08 Posts: 127 Credit: 43,257,301 RAC: 72,605 |
Thanks for taking a look. Among my samples, this line indeed appeared only in the jobs that were stuck. I found Ryan's task with the same log line too: https://main.cpdn.org/result.php?resultid=22426672. From what you described, the task is likely not salvageable at that point. If it could report an error instead of hanging around, that would be rather helpful. I guess for now, if another unplanned shutdown happens, I need to double-check whether any WU has stopped making progress. |
Send message Joined: 29 Oct 17 Posts: 1051 Credit: 16,655,437 RAC: 10,602 |
Ok, thanks for checking Ryan's. I will investigate further why tasks with those errors are not stopping, though not immediately as I'm working on HadAM4 at the moment. Yep, those tasks are not recoverable. In an operational environment the system would save all the history (restart) files as it runs through the forecast, so if the latest one is corrupt it can drop back to a previous good one. But that's too expensive in storage for home use. --- CPDN Visiting Scientist |
Send message Joined: 14 Sep 08 Posts: 127 Credit: 43,257,301 RAC: 72,605 |
"In an operational environment the system would save all the history (restart) files as it runs through the forecast, so if the latest one is corrupt it can drop back to a previous good one." Ah, you don't need to save all history to deal with interrupted file writes. You only need space for one additional checkpoint temporarily. Say the existing checkpoint is saved in folder A. The new checkpoint is written to folder B, and then fsync followed by `mv B A` should effectively give you a transaction on Linux. The resulting folder A should always be in a consistent state, and the program always loads from A. I'm just throwing out the idea, and I don't know about Windows' guarantees. I'm not saying you ever need to change this either, given the success rate seems to be pretty high now. |
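
To make the write-then-rename idea above concrete, here is a minimal Python sketch. It is not how the Weather@Home app writes its history files. It also simplifies the suggestion from a folder to a single file, because a plain `mv B A` onto an existing directory A moves B inside A instead of replacing it, whereas renaming a file over an existing file is atomic on POSIX. The function name `save_checkpoint_atomically` and the file name in the usage note are hypothetical, and behaviour on Windows should be treated as an assumption to verify.

```python
import os
import tempfile


def save_checkpoint_atomically(path: str, data: bytes) -> None:
    """Write `data` to `path` so a reader sees either the old or the new file.

    Hypothetical helper for illustration only; not part of any CPDN code.
    """
    directory = os.path.dirname(os.path.abspath(path))

    # Write the new checkpoint to a temporary file in the SAME directory,
    # so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push file contents to disk first

        # Swap the new file into place; atomic on POSIX filesystems.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, remove the partial temp file and keep the old checkpoint.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise

    # On POSIX, also fsync the directory so the rename itself survives a power cut.
    if os.name == "posix":
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Calling `save_checkpoint_atomically("restart.bin", payload)` repeatedly only ever needs room for one extra copy of the checkpoint. If the power fails mid-write, the previous `restart.bin` is left untouched and only a stray `.ckpt-*` temp file may remain.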
©2024 cpdn.org