model crashes at end of run

hogwell

Joined: 24 Nov 04
Posts: 5
Credit: 16,157,177
RAC: 5,307
Message 14508 - Posted: 19 Jul 2005, 17:42:17 UTC


The last three runs I've done with v4.12 (BOINC v4.19) on different P-IV Windows XP and Server 2003 computers have run fine (for months).

Then, at the end of each run, when it would have been time to upload the final results, the model crashes with a computation error.

climateprediction.net - 2005-07-18 09:59:36 - Unrecoverable error for result 2xak_300157837_1 ( - exit code -1073741819 (0xc0000005))
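For reference, the two numbers in that log line are the same value: the decimal figure is the signed 32-bit reading of the hex code, and 0xC0000005 is the Windows status code for an access violation. A minimal Python sketch of the conversion (the helper name `exit_code_to_hex` is only for illustration):

```python
def exit_code_to_hex(code: int) -> str:
    """Render a signed 32-bit exit code in unsigned hexadecimal form."""
    # Masking with 0xFFFFFFFF reinterprets a negative value as unsigned 32-bit.
    return f"0x{code & 0xFFFFFFFF:08x}"


print(exit_code_to_hex(-1073741819))  # prints 0xc0000005
```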

I've read other posts that report intermittent, recoverable computation errors during a model run, but for me this occurs only at the end of the run, on various Windows XP machines, and it prevents uploading any model results.

A couple of questions:

1. Are others encountering this symptom? Can I provide any additional information to those at CP who are investigating this?

2. Is there any value to the project in running a model for months, then having it crash without uploading the final results? If not, I might suspend running the model until a new version of CP is released that addresses this problem.


ID: 14508
Arnaud

Joined: 3 Sep 04
Posts: 268
Credit: 256,045
RAC: 0
Message 14509 - Posted: 19 Jul 2005, 18:07:13 UTC
Last modified: 19 Jul 2005, 18:13:26 UTC

Hi
1) Yes. Type 1073741819 in the search box of this forum or of the phpBB forum (http://www.climateprediction.net/board/index.php): there are several complaints about this problem.
It seems that this error occurs frequently at the end of a phase (for you, the 3rd phase).

2) Unfortunately, no. If you can't upload the results (the 5 zip files), it's useless to run CP: all the scientific results of the models are in these zip files.
HTH
ID: 14509
crandles
Volunteer moderator

Joined: 16 Oct 04
Posts: 692
Credit: 277,679
RAC: 0
Message 14519 - Posted: 19 Jul 2005, 22:02:35 UTC

Do you have a backup to try running the end of the model again?

It might be worth trying to run the last bit again. Perhaps a reboot and defragment prior to doing the end of phase processing might help.
ID: 14519
hogwell

Joined: 24 Nov 04
Posts: 5
Credit: 16,157,177
RAC: 5,307
Message 14551 - Posted: 20 Jul 2005, 20:44:58 UTC - in response to Message 14519.  

The project folder that crashed after finishing the model run is still present for the finished run, complete with the output files.

Here are the last few lines of stderr_um.txt, which seem to indicate that all the output files were created successfully before the crash...

....
OPEN: File dataout/2xakca.daq5bp0 Created on Unit 22
OPEN: File dataout/2xakca.daq5bs0 Created on Unit 22
OPEN: File dataout/2xakca.daq5c10 Created on Unit 22
CLOSE: WARNING: Unit 60 Not Opened
OPEN: File dataout/2xakca.paq6c10 Created on Unit 60
CLOSE: WARNING: Unit 63 Not Opened
OPEN: File dataout/2xakca.pdq6c10 Created on Unit 63
CLOSE: WARNING: Unit 64 Not Opened
OPEN: File dataout/2xakca.peq6c10 Created on Unit 64
CLOSE: WARNING: Unit 65 Not Opened
OPEN: File dataout/2xakca.pfq6c10 Created on Unit 65
CLOSE: WARNING: Unit 66 Not Opened
OPEN: File dataout/2xakca.pgq6c10 Created on Unit 66
CLOSE: WARNING: Unit 67 Not Opened
OPEN: File dataout/2xakca.phq6c10 Created on Unit 67
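A log like this can be summarised with a short script rather than read by eye. The sketch below is only an illustration: the line format is assumed from the excerpt above, `created_files` is a made-up helper name, and it lists what the log says was created without checking that the files exist or are complete.

```python
import re

# Matches lines such as "OPEN: File dataout/2xakca.daq5c10 Created on Unit 22".
OPEN_LINE = re.compile(r"^OPEN: File (\S+) Created on Unit (\d+)$")


def created_files(log_text: str) -> list[tuple[str, int]]:
    """Return (path, unit) pairs for every 'OPEN: File ... Created' line."""
    found = []
    for line in log_text.splitlines():
        match = OPEN_LINE.match(line.strip())
        if match:
            found.append((match.group(1), int(match.group(2))))
    return found


sample = """\
OPEN: File dataout/2xakca.daq5c10 Created on Unit 22
CLOSE: WARNING: Unit 60 Not Opened
OPEN: File dataout/2xakca.paq6c10 Created on Unit 60
"""

for path, unit in created_files(sample):
    print(unit, path)
```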

Perhaps I should zip up the whole folder and email/ftp it to climateprediction.net for recovery/analysis of the problem? (An entire model run is a terrible thing to waste.)

Or, even better, is there a way I can "trick" BOINC into zipping up the output files properly and uploading the results?


ID: 14551
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 14556 - Posted: 20 Jul 2005, 23:31:06 UTC

The upload files are a 'summary' of the other files. (Possibly not the best word.)
Unless hadsm has created the 5 zips, there is nothing to upload.
There is no facility to handle results sent by mail, email, or ftp.
Wasted runs are a fact of life with this project, and one just has to live with it, and investigate the hardware and software reasons why they fail.

One way to prevent loss of hair is to do a FULL backup of the BOINC folder just BEFORE you reach end-of-phase, i.e. just after trickles 23, 47, and 71. Then you can reload the backup and try again.

Also, try to avoid doing other work near end-of-phase. If you're using work computers, this could be tricky, and it may be the reason hadsm trips up. It doesn't like being interrupted at these times.
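The backup step suggested above could be scripted along these lines. This is a rough sketch, not project tooling: the function names are made up, the folder locations are whatever you pass in, and how you find out the current trickle count is left to you.

```python
import shutil
import time
from pathlib import Path

# Trickle counts named in the post above; a backup is due right after each.
BACKUP_TRICKLES = {23, 47, 71}


def backup_due(trickle_count: int) -> bool:
    """True when the run has just passed a trickle that precedes end-of-phase."""
    return trickle_count in BACKUP_TRICKLES


def backup_boinc_folder(boinc_dir: Path, backup_root: Path) -> Path:
    """Copy the whole BOINC folder into a new timestamped directory."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"boinc-backup-{stamp}"
    # copytree raises if target already exists, so an older backup is never overwritten.
    shutil.copytree(boinc_dir, target)
    return target
```

It is probably safest to exit BOINC before copying so the files are not changing mid-copy, and to restore by copying the backup over the BOINC folder while BOINC is still stopped.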

ID: 14556
hogwell

Joined: 24 Nov 04
Posts: 5
Credit: 16,157,177
RAC: 5,307
Message 14557 - Posted: 21 Jul 2005, 0:57:51 UTC - in response to Message 14556.  

Thanks for your comments on this.

I think you're right. In my case, the model crashed during the actual zipping up of the files prior to uploading. Perhaps the OS ran out of file handles?
Whatever the cause of the failure, it doesn't really matter: there should be a retry mechanism for the final model phase.

As a long-term software developer who has worked on numerical weather prediction myself, it is painful to throw away the results of months of computation due to a possibly transient situation with hardware, memory, etc.
The uncompressed results are sitting there on my disk!

I hope that someday the model will be able to recover from fatal errors during the final reporting stage, as it does pretty decently during the model run itself.

I think this could be done by resetting the project to the last timestep again and trying again in exponential fashion, as when other failures occur (e.g. uploading).
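A retry "in exponential fashion" might look roughly like the following. It is a generic sketch, not the project's code: `retry_with_backoff`, its parameters, and the default delay are illustrative assumptions, and `step` stands in for whatever redoes the end-of-run work.

```python
import time
from typing import Callable


def retry_with_backoff(
    step: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run step until it succeeds or attempts run out; return True on success."""
    for attempt in range(max_attempts):
        try:
            step()
            return True
        except Exception:
            if attempt == max_attempts - 1:
                break
            # Wait base_delay, then 2x, 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)
    return False
```

Note that this only covers failures that surface as exceptions. A crash such as 0xc0000005 takes the whole process down, so a real retry would have to live in whatever restarts the model.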

I hope this is useful feedback to the scientists working on the model's code.

Meanwhile, your point about my doing my own manual backups of the BOINC folder to "back up" the state of the model makes sense... I'll set that up on each machine to avoid this total loss of results in the future.


ID: 14557
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2169
Credit: 64,555,907
RAC: 5,858
Message 14561 - Posted: 21 Jul 2005, 4:16:01 UTC - in response to Message 14557.  

> I hope that someday the model will be able to recover from fatal errors during
> the final reporting stage, as it does pretty decently during the model run
> itself..

I think this will be even more important as we get into the longer, more computationally intense models. There needs to be some type of intelligent recovery mechanism for certain types of faults, instead of just throwing an error, halting computation, and downloading a new model. I know this would be difficult given all the possible errors, but at least some of these could be caught and sent back to a known good point.
ID: 14561
hogwell

Joined: 24 Nov 04
Posts: 5
Credit: 16,157,177
RAC: 5,307
Message 14571 - Posted: 21 Jul 2005, 15:52:59 UTC - in response to Message 14561.  

> I know this would be
> difficult given all the possible errors, but at least some of these could be
> caught and sent back to a known good point.

Rather than trying to catch and handle all the possible errors, it might be better to indicate completed uploading of results (success) with the presence of some file placed in the run's folder, e.g. "results_uploaded.txt".

Then, whenever the model is started (e.g. after a reboot or suspend), if this file is not present, but the model has finished all the time steps, failure can be assumed and the model could be reset to the "next to last" time step of phase 3, and the end of run sequence, file zipping, could be tried again. There would also have to be a counter kept on disk somewhere so that, e.g., after 5 attempts, the model will still give up and load a new one, like it does now for only one failure.
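The marker-file idea above can be sketched as a small startup check. Everything here is hypothetical: only the name "results_uploaded.txt" and the limit of 5 come from the post, while the counter file name, the function name, and the action labels are made up for illustration.

```python
from pathlib import Path

MARKER = "results_uploaded.txt"      # success marker, name taken from the post
COUNTER = "end_of_run_attempts.txt"  # hypothetical on-disk retry counter
MAX_ATTEMPTS = 5


def startup_action(run_dir: Path, all_steps_done: bool) -> str:
    """Decide what to do at model start: continue, done, retry, or give_up."""
    if not all_steps_done:
        return "continue"   # run is still in progress
    if (run_dir / MARKER).exists():
        return "done"       # results were uploaded last time
    counter = run_dir / COUNTER
    attempts = int(counter.read_text()) if counter.exists() else 0
    if attempts >= MAX_ATTEMPTS:
        return "give_up"    # fall back to loading a new model
    # Record the attempt before retrying, so a crash mid-retry still counts.
    counter.write_text(str(attempts + 1))
    return "retry"          # reset to the next-to-last time step and re-zip
```

Writing the counter before the retry starts is deliberate: if the end-of-run sequence crashes again, the next startup still sees the attempt recorded.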


ID: 14571
hogwell

Joined: 24 Nov 04
Posts: 5
Credit: 16,157,177
RAC: 5,307
Message 14862 - Posted: 1 Aug 2005, 21:18:14 UTC - in response to Message 14556.  


Is there a way to use the phase3.start.zip file to restart the project at the start of phase 3 and try again?


> The upload files are a 'summary' of the other files. (Possibly not the best
> word.)
> Unless hadsm has created the 5 zips, there is nothing to upload.
> There is no facility to handle files sent by mail, email, or ftp results.
> Wasted runs are a fact of life with this project, and one just has to live
> with it. And investigate hardware and software reasons why they fail.
>
> One way to prevent loss of hair, is to do a FULL backup of the BOINC folder
> just BEFORE you reach end-of-phase. Just after trickles 23, 47, and 71. Then
> you can reload the backup and try again.
>
> Also, try to avoid doing other work near end-of-phase. If you're using work
> computers, this could be tricky. And the reason hadsm trips up. It doesn't
> like being interrupted at these times.
ID: 14862
