model crashes at end of run

Author	Message
hogwell Joined: 24 Nov 04 Posts: 5 Credit: 16,157,177 RAC: 5,307	Message 14508 - Posted: 19 Jul 2005, 17:42:17 UTC
The last three runs I've done with v4.12 (BOINC v4.19) on different P-IV Windows XP and Server 2003 computers have run fine for months. Then, at the end of each run, when it was time to upload the final results, the model crashed with a computation error:
climateprediction.net - 2005-07-18 09:59:36 - Unrecoverable error for result 2xak_300157837_1 ( - exit code -1073741819 (0xc0000005))
I've read other posts that report intermittent, recoverable computation errors during a model run, but for me this occurs only at the end of the run, on various Windows XP machines, and it prevents uploading any model results. A couple of questions:
1. Are others encountering this symptom? Can I provide any additional information to those at CP who are investigating this?
2. Is there any value to the project in running a model for months, then having it crash without uploading the final results? If not, I might suspend running the model until a new version of CP is released that addresses this problem.
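For anyone searching on this error: the decimal exit code and the hex value in the log line are the same 32-bit number, and 0xc0000005 is the Windows status code for an access violation. A minimal sketch of the conversion (the function name `exit_code_to_hex` is made up for illustration):

```python
def exit_code_to_hex(code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    # Masking with 0xFFFFFFFF maps negative values onto their
    # two's-complement unsigned 32-bit equivalent.
    return f"0x{code & 0xFFFFFFFF:08x}"


print(exit_code_to_hex(-1073741819))  # prints 0xc0000005
```

Searching on either form (1073741819 or c0000005) should therefore turn up the same class of reports.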

Arnaud Joined: 3 Sep 04 Posts: 268 Credit: 256,045 RAC: 0	Message 14509 - Posted: 19 Jul 2005, 18:07:13 UTC Last modified: 19 Jul 2005, 18:13:26 UTC
Hi
1°) Yes. Type 1073741819 in the search box of this forum or of the <a href="http://www.climateprediction.net/board/index.php">phpBB forum</a>: there are several complaints about this problem. It seems that this error occurs frequently at the end of a phase (for you, the 3rd phase).
2°) Unfortunately no. If you can't upload the results (the 5 zip files), it's useless to run CP: all the scientific results of the model are in these zip files.
HTH
-----------------------------------------------
<a href="http://boinc-doc.net/boinc-wiki/index.php?title=Main_Page">Boinc Wiki</a> <a href="http://forum.boinc.fr/">L'Alliance Francophone</a>

crandles Volunteer moderator Joined: 16 Oct 04 Posts: 692 Credit: 277,679 RAC: 0	Message 14519 - Posted: 19 Jul 2005, 22:02:35 UTC
Do you have a backup to try running the end of the model again? It might be worth trying to run the last bit again. Perhaps a reboot and defragment prior to doing the end of phase processing might help.
_______________________________
Visit <a href="http://boinc-doc.net/boinc-wiki/index.php?title=Main_Page">BOINC WIKI</a> for help And join <a href="http://www.boincsynergy.com/">BOINC Synergy</a> for all the news in one place.

hogwell Joined: 24 Nov 04 Posts: 5 Credit: 16,157,177 RAC: 5,307	Message 14551 - Posted: 20 Jul 2005, 20:44:58 UTC - in response to Message 14519.
The project folder for the run that crashed after finishing the model is still present, complete with the output files. Here are the last few lines of stderr_um.txt, which seem to indicate that all the output files were created successfully before the crash:
....
OPEN: File dataout/2xakca.daq5bp0 Created on Unit 22
OPEN: File dataout/2xakca.daq5bs0 Created on Unit 22
OPEN: File dataout/2xakca.daq5c10 Created on Unit 22
CLOSE: WARNING: Unit 60 Not Opened
OPEN: File dataout/2xakca.paq6c10 Created on Unit 60
CLOSE: WARNING: Unit 63 Not Opened
OPEN: File dataout/2xakca.pdq6c10 Created on Unit 63
CLOSE: WARNING: Unit 64 Not Opened
OPEN: File dataout/2xakca.peq6c10 Created on Unit 64
CLOSE: WARNING: Unit 65 Not Opened
OPEN: File dataout/2xakca.pfq6c10 Created on Unit 65
CLOSE: WARNING: Unit 66 Not Opened
OPEN: File dataout/2xakca.pgq6c10 Created on Unit 66
CLOSE: WARNING: Unit 67 Not Opened
OPEN: File dataout/2xakca.phq6c10 Created on Unit 67
Perhaps I should zip up the whole folder and email/ftp it to climateprediction.net for recovery/analysis of the problem? (An entire model run is a terrible thing to waste.) Or, even better, is there a way I can "trick" BOINC into zipping up the output files properly and uploading the results?

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 14556 - Posted: 20 Jul 2005, 23:31:06 UTC
The upload files are a 'summary' of the other files. (Possibly not the best word.) Unless hadsm has created the 5 zips, there is nothing to upload. There is no facility to handle results sent by mail, email, or ftp. Wasted runs are a fact of life with this project, and one just has to live with it, and investigate the hardware and software reasons why they fail.
One way to prevent loss of hair is to do a FULL backup of the BOINC folder just BEFORE you reach end-of-phase, i.e. just after trickles 23, 47, and 71. Then you can reload the backup and try again.
Also, try to avoid doing other work near end-of-phase. If you're using work computers, this could be tricky, and it may be the reason hadsm trips up: it doesn't like being interrupted at these times.
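The backup advice above can be sketched as a small script. This is only an illustration under stated assumptions: the trickle numbers 23, 47, and 71 come from the post, but the function names, the backup folder naming, and the idea of passing in the trickle count by hand are hypothetical. It also assumes BOINC has been shut down before copying, so that the files are in a consistent state.

```python
import shutil
import time
from pathlib import Path

# Trickle counts after which end-of-phase is approaching (from the post above).
BACKUP_TRICKLES = {23, 47, 71}


def should_back_up(trickles_sent: int) -> bool:
    """Return True when the run has just sent a pre-end-of-phase trickle."""
    return trickles_sent in BACKUP_TRICKLES


def back_up_folder(boinc_dir: Path, backup_root: Path, trickles_sent: int) -> Path:
    """Copy the whole BOINC folder to a new, timestamped backup directory."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    # Hypothetical naming scheme; any unique directory name would do.
    target = backup_root / f"boinc-trickle{trickles_sent}-{stamp}"
    shutil.copytree(boinc_dir, target)  # fails if target already exists
    return target
```

To restore, one would stop BOINC, move the damaged folder aside, and copy the backup back into place before restarting.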

hogwell Joined: 24 Nov 04 Posts: 5 Credit: 16,157,177 RAC: 5,307	Message 14557 - Posted: 21 Jul 2005, 0:57:51 UTC - in response to Message 14556.
Thanks for your comments on this. I think you're right. In my case, the model crashed during the actual zipping up of the files prior to uploading. Perhaps the OS ran out of file handles? Whatever the cause of failure, it doesn't really matter: there should be a retry mechanism for the final model phase. As a long-term software developer who has worked on numerical weather prediction myself, I find it painful to throw away the results of months of computation due to a possibly transient situation with hardware, memory, etc. The uncompressed results are sitting there on my disk!
I hope that someday the model will be able to recover from fatal errors during the final reporting stage, as it does pretty decently during the model run itself. I think this could be done by resetting the project to the last timestep and retrying with exponentially increasing delays, as is done when other failures occur (e.g. uploading). I hope this is useful feedback to the scientists working on the model's code.
Meanwhile, your point about doing my own manual backups of the BOINC folder to "back up" the state of the model makes sense. I'll set that up on each machine to avoid this total loss of results in the future.
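The retry idea in this post might look roughly like the sketch below. It is not how the CP client works; `retry_with_backoff`, `step`, and the default delays are hypothetical names and values chosen for illustration. Note also that a hard crash such as an access violation kills the whole process, so in practice the retry would have to be driven by a supervising process or at the next startup rather than by an in-process exception handler.

```python
import time
from typing import Callable


def retry_with_backoff(
    step: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run `step` until it succeeds, doubling the wait after each failure.

    Returns True on success, False once max_attempts have all failed.
    """
    for attempt in range(max_attempts):
        try:
            step()
            return True
        except Exception:
            if attempt == max_attempts - 1:
                break  # no point waiting after the final attempt
            # Delays grow as base_delay * 1, 2, 4, 8, ...
            sleep(base_delay * 2 ** attempt)
    return False
```

Passing `sleep` as a parameter is a design choice that makes the function easy to test without actually waiting.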

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2169 Credit: 64,555,907 RAC: 5,858	Message 14561 - Posted: 21 Jul 2005, 4:16:01 UTC - in response to Message 14557.
> I hope that someday the model will be able to recover from fatal errors during
> the final reporting stage, as it does pretty decently during the model run
> itself..
I think this will be even more important as we get into the longer, more computationally intense models. There needs to be some type of intelligent recovery mechanism for certain types of faults, instead of just throwing an error, halting computation, and downloading a new model. I know this would be difficult given all the possible errors, but at least some of these could be caught and sent back to a known good point.

hogwell Joined: 24 Nov 04 Posts: 5 Credit: 16,157,177 RAC: 5,307	Message 14571 - Posted: 21 Jul 2005, 15:52:59 UTC - in response to Message 14561.
> I know this would be
> difficult given all the possible errors, but at least some of these could be
> caught and sent back to a known good point.
Rather than trying to catch and handle all the possible errors, it might be better to indicate that the results were uploaded successfully by placing a file in the run's folder, e.g. "results_uploaded.txt". Then, whenever the model is started (e.g. after a reboot or suspend), if this file is not present but the model has finished all the time steps, failure can be assumed. The model could then be reset to the "next to last" time step of phase 3, and the end-of-run sequence (file zipping) could be tried again. There would also have to be a counter kept on disk somewhere so that after, e.g., 5 attempts the model still gives up and loads a new one, as it does now after only one failure.
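The proposal above can be sketched as a startup check. Only the marker file name "results_uploaded.txt" and the limit of 5 attempts come from the post; the counter file name, the function names, and the returned action strings are hypothetical, and nothing here reflects how the actual client is written.

```python
from pathlib import Path

MARKER_NAME = "results_uploaded.txt"     # success marker suggested in the post
COUNTER_NAME = "finalize_attempts.txt"   # hypothetical on-disk retry counter
MAX_ATTEMPTS = 5                         # "after 5 attempts" from the post


def read_attempts(run_dir: Path) -> int:
    """Return the number of end-of-run retries recorded so far (0 if none)."""
    try:
        return int((run_dir / COUNTER_NAME).read_text().strip())
    except (FileNotFoundError, ValueError):
        return 0


def decide_on_startup(run_dir: Path, all_steps_done: bool) -> str:
    """Decide what to do when the model starts up in `run_dir`."""
    if (run_dir / MARKER_NAME).exists():
        return "done"            # results were uploaded; nothing left to do
    if not all_steps_done:
        return "continue"        # ordinary restart in the middle of the run
    attempts = read_attempts(run_dir)
    if attempts >= MAX_ATTEMPTS:
        return "give_up"         # fall back to today's behaviour
    # Record the attempt before retrying, so a crash still counts.
    (run_dir / COUNTER_NAME).write_text(str(attempts + 1))
    return "retry_finalize"      # reset to the next-to-last time step, zip again
```

Writing the counter before the retry matters: if the zipping step crashes the process again, the attempt has already been recorded, so the loop cannot run forever.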

hogwell Joined: 24 Nov 04 Posts: 5 Credit: 16,157,177 RAC: 5,307	Message 14862 - Posted: 1 Aug 2005, 21:18:14 UTC - in response to Message 14556.
Is there a way to use the phase3.start.zip file to restart the project at the start of phase 3 and try again?
> The upload files are a 'summary' of the other files. (Possibly not the best
> word.)
> Unless hadsm has created the 5 zips, there is nothing to upload.
> There is no facility to handle files sent by mail, email, or ftp results.
> Wasted runs are a fact of life with this project, and one just has to live
> with it. And investigate hardware and software reasons why they fail.
>
> One way to prevent loss of hair, is to do a FULL backup of the BOINC folder
> just BEFORE you reach end-of-phase. Just after trickles 23, 47, and 71. Then
> you can reload the backup and try again.
>
> Also, try to avoid doing other work near end-of-phase. If you're using work
> computers, this could be tricky. And the reason hadsm trips up. It doesn't
> like being interrupted at these times.