Posts by hogwell

1) Questions and Answers : Windows : model crashes at end of run (Message 14862)
Posted 1 Aug 2005 by hogwell
Post:

Is there a way to use the phase3.start.zip file to restart the project at the start of phase 3 and try again?


> The upload files are a 'summary' of the other files. (Possibly not the best
> word.)
> Unless hadsm has created the 5 zips, there is nothing to upload.
> There is no facility to handle files sent by mail, email, or ftp results.
> Wasted runs are a fact of life with this project, and one just has to live
> with it. And investigate hardware and software reasons why they fail.
>
> One way to prevent loss of hair, is to do a FULL backup of the BOINC folder
> just BEFORE you reach end-of-phase. Just after trickles 23, 47, and 71. Then
> you can reload the backup and try again.
>
> Also, try to avoid doing other work near end-of-phase. If you're using work
> computers, this could be tricky. And the reason hadsm trips up. It doesn't
> like being interrupted at these times.
>
>
>
2) Questions and Answers : Windows : model crashes at end of run (Message 14571)
Posted 21 Jul 2005 by hogwell
Post:
> I know this would be
> difficult given all the possible errors, but at least some of these could be
> caught and sent back to a known good point.

Rather than trying to catch and handle every possible error, it might be better to signal a successful upload of the results by placing a marker file in the run's folder, e.g. "results_uploaded.txt".

Then, whenever the model is started (e.g. after a reboot or suspend), if this file is not present but the model has finished all its time steps, failure can be assumed. The model could then be reset to the next-to-last time step of phase 3, and the end-of-run sequence, including the file zipping, could be tried again. A counter would also have to be kept on disk somewhere so that after, e.g., 5 attempts the model still gives up and loads a new one, as it does now after a single failure.
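The startup check described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the counter file name `upload_attempts.txt`, the function names, and the returned action strings are all made up for the example, and the real model would have to supply the "all time steps finished" flag itself.

```python
from pathlib import Path

MARKER = "results_uploaded.txt"   # marker file name suggested in the post
COUNTER = "upload_attempts.txt"   # hypothetical name for the on-disk counter
MAX_ATTEMPTS = 5                  # give up after this many end-of-run retries


def read_attempts(run_dir: Path) -> int:
    """Return the number of end-of-run attempts recorded on disk (0 if none)."""
    try:
        return int((run_dir / COUNTER).read_text().strip())
    except (FileNotFoundError, ValueError):
        return 0


def next_action(run_dir: Path, all_steps_done: bool) -> str:
    """Decide what to do when the model is (re)started."""
    if not all_steps_done:
        return "continue"              # model is still mid-run
    if (run_dir / MARKER).exists():
        return "done"                  # results were uploaded successfully
    attempts = read_attempts(run_dir)
    if attempts >= MAX_ATTEMPTS:
        return "give_up"               # same outcome as today's single failure
    # Record the attempt before retrying, so a crash still counts.
    (run_dir / COUNTER).write_text(str(attempts + 1))
    return "retry_end_of_run"


def mark_uploaded(run_dir: Path) -> None:
    """Create the marker file once the upload has really succeeded."""
    (run_dir / MARKER).write_text("ok\n")
```

Writing the counter before the retry starts is deliberate: if the zipping step crashes the whole process again, the attempt has already been counted, so the loop cannot run forever.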


3) Questions and Answers : Windows : model crashes at end of run (Message 14557)
Posted 21 Jul 2005 by hogwell
Post:
Thanks for your comments on this.

I think you're right. In my case, the model crashed during the actual zipping up of the files prior to uploading. Perhaps the OS ran out of file handles?
Whatever the cause of failure, it doesn't really matter - there should be a retry mechanism for the final model phase.

As a long-time software developer who has worked on numerical weather prediction myself, I find it painful to throw away the results of months of computation because of a possibly transient problem with hardware, memory, etc.
The uncompressed results are sitting there on my disk!

I hope that someday the model will be able to recover from fatal errors during the final reporting stage, as it does pretty decently during the model run itself.

I think this could be done by resetting the project to the last timestep and retrying with exponentially increasing delays, as is done when other failures occur (e.g. uploading).
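A minimal sketch of that kind of exponential retry is shown below. It makes no claim about how BOINC or the model actually schedules retries: `retry_with_backoff`, its parameters, and the 60-second base delay are assumptions chosen for the example, and `action` stands in for whatever would reset to the last timestep and rerun the end-of-run step.

```python
import time


def retry_with_backoff(action, max_attempts=5, base_delay=60.0, sleep=time.sleep):
    """Call action() until it succeeds, doubling the wait after each failure.

    Waits base_delay, 2*base_delay, 4*base_delay, ... seconds between tries
    and re-raises the last error once max_attempts have been used up.
    """
    for attempt in range(max_attempts):
        try:
            return action()
        except Exception:
            if attempt == max_attempts - 1:
                raise                        # out of attempts: report failure
            sleep(base_delay * 2 ** attempt)  # 60 s, 120 s, 240 s, ...
```

The `sleep` argument is only there so the waiting can be replaced in a test; in normal use the default `time.sleep` applies.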

I hope this is useful feedback to the scientists working on the model's code.

Meanwhile, your point about doing my own manual backups of the BOINC folder to "back up" the state of the model makes sense... I'll set that up on each machine to avoid this total loss of results in the future.
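A manual backup like this could be scripted along the following lines. This is a rough sketch under a few assumptions: the function name and archive naming are invented for the example, the backup folder must live outside the BOINC folder (otherwise the archive would try to include itself), and it is presumably safest to stop BOINC first so the files are not changing while they are copied.

```python
import shutil
import time
from pathlib import Path


def backup_folder(boinc_dir, backup_dir):
    """Zip the whole BOINC folder into a time-stamped archive.

    Returns the path of the archive that was written.
    """
    boinc_dir = Path(boinc_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base = backup_dir / f"boinc-backup-{stamp}"
    # make_archive appends ".zip" to base and archives everything under root_dir.
    return shutil.make_archive(str(base), "zip", root_dir=str(boinc_dir))
```

It would be called with the machine's own paths, for instance `backup_folder(r"C:\BOINC", r"D:\boinc-backups")` (both paths are placeholders), just after trickles 23, 47, and 71 as suggested in the quoted reply.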


4) Questions and Answers : Windows : model crashes at end of run (Message 14551)
Posted 20 Jul 2005 by hogwell
Post:
The project folder that crashed after finishing the model run is still present for the finished run, complete with the output files.

Here are the last few lines in stderr_um.txt, which seem to indicate that all the output files were created successfully before the crash...

....
OPEN: File dataout/2xakca.daq5bp0 Created on Unit 22
OPEN: File dataout/2xakca.daq5bs0 Created on Unit 22
OPEN: File dataout/2xakca.daq5c10 Created on Unit 22
CLOSE: WARNING: Unit 60 Not Opened
OPEN: File dataout/2xakca.paq6c10 Created on Unit 60
CLOSE: WARNING: Unit 63 Not Opened
OPEN: File dataout/2xakca.pdq6c10 Created on Unit 63
CLOSE: WARNING: Unit 64 Not Opened
OPEN: File dataout/2xakca.peq6c10 Created on Unit 64
CLOSE: WARNING: Unit 65 Not Opened
OPEN: File dataout/2xakca.pfq6c10 Created on Unit 65
CLOSE: WARNING: Unit 66 Not Opened
OPEN: File dataout/2xakca.pgq6c10 Created on Unit 66
CLOSE: WARNING: Unit 67 Not Opened
OPEN: File dataout/2xakca.phq6c10 Created on Unit 67

Perhaps I should zip up the whole folder and email/ftp it to climateprediction.net for recovery/analysis of the problem? (An entire model run is a terrible thing to waste.)

Or, even better, is there a way I can "trick" BOINC into zipping up the output files properly and uploading the results?


5) Questions and Answers : Windows : model crashes at end of run (Message 14508)
Posted 19 Jul 2005 by hogwell
Post:

The last three runs I've done with v4.12 (BOINC v4.19) on different P-IV Windows XP and Server 2003 computers have run fine (for months).

Then, at the end of each run, when it would have been time to upload the final results, the model crashes with a computation error.

climateprediction.net - 2005-07-18 09:59:36 - Unrecoverable error for result 2xak_300157837_1 ( - exit code -1073741819 (0xc0000005))
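For reference, the two numbers in that message are the same value: the negative exit code is the signed 32-bit reading of the hex code in parentheses. The small helper below (the name `to_ntstatus` is made up for this example) shows the conversion. 0xC0000005 is the Windows status code for an access violation, which says the process touched memory it should not have, but not why.

```python
def to_ntstatus(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex status code."""
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


print(to_ntstatus(-1073741819))  # prints 0xC0000005
```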

I've read other posts that report intermittent, recoverable computation errors during a model run, but for me this seems to occur only at the end of the run, on various Windows XP machines, and it prevents any model results from being uploaded.

A couple of questions:

1. Are others encountering this symptom? Can I provide any additional information to those at cp who are investigating this?

2. Is there any value to the project in running a model for months, then having it crash without uploading the final results? If not, I might suspend running the model until a new version of CP is released that addresses this problem.





