climateprediction.net home page
Can Someone help me get back to where I was?

Questions and Answers : Windows : Can Someone help me get back to where I was?

RoadWarrior

Joined: 7 Jan 08
Posts: 1
Credit: 217,502
RAC: 0
Message 32860 - Posted: 5 Mar 2008, 14:47:47 UTC

When I powered up this morning, I discovered that my climate prediction model had been reset to zero progress. The strange thing is that the CPU run time is where it left off when I powered down last night, 138:37:30. I also started up the graphic model to check on the model year. The model year was set back to 1810. Right before I powered down last night, I was already in Winter 1825. I would like to know if there is any way I can get back to where I was, Winter 1825. I would like to thank anyone who can help me solve this problem...
ID: 32860

Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 32862 - Posted: 5 Mar 2008, 15:22:47 UTC
Last modified: 5 Mar 2008, 15:28:14 UTC

Hi RoadWarrior and welcome to the message board.

This 'slab' model, 7093797, appears to be the one. It was just approaching the end of the first of three phases when it rewound to the start. The slab model appears to be rather sensitive at the phase change if interrupted before the phase post-processing is complete. One of my models did exactly that, rewound and finished quite happily.

If you have a backup, then that can be restored. Otherwise the only thing to do is to let it run through phase 1 again and it'll carry on as before. There isn't much point aborting the model, because you would only have to start another one at the beginning.

It might be an idea to watch for the phase change at timestep 259,248 and let it run for a couple of hours after, to clear the Zip file upload and any trickles (there can be quite a few trickles at the end of a phase).
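
For anyone wanting to work out how far a slab model is from its next phase change, here is a minimal Python sketch. It assumes all three phases are the same length as the first (259,248 timesteps), which this thread does not confirm, and `phase_position` is a made-up helper name rather than anything BOINC provides.

```python
# Assumption: every phase is as long as the first one, whose end is
# quoted above as timestep 259,248. Check against your own model.
PHASE_LENGTH = 259_248
PHASES = 3


def phase_position(timestep: int) -> tuple[int, int]:
    """Return (phase number counted from 1, timesteps left in that phase)."""
    if not 0 <= timestep < PHASE_LENGTH * PHASES:
        raise ValueError("timestep outside the model run")
    phase = timestep // PHASE_LENGTH + 1
    remaining = PHASE_LENGTH * phase - timestep
    return phase, remaining


# A model at timestep 259,000 is still in phase 1, close to the change.
print(phase_position(259_000))
```

If the second number is small, it is probably worth leaving the machine on until the phase change and its uploads have gone through.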

Iain

PS This model, 7100731, is the one of mine that misbehaved. You can see that the seconds/timestep doubles at the end of phase 1. That's because the CPU time continues to accumulate - as you noticed.
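
As a rough illustration of why the seconds/timestep figure doubles, the arithmetic can be sketched in Python. The CPU time is the 138:37:30 from the first post; treating the rewind as happening exactly at timestep 259,248 is an assumption made only to keep the numbers simple, and these are not the figures for model 7100731.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' run time string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


cpu_seconds = hms_to_seconds("138:37:30")  # CPU time shown after the rewind
timesteps = 259_248  # assumption: the rewind happened right at the phase change

first_pass = cpu_seconds / timesteps
# The timestep counter restarts after a rewind but the CPU time does not,
# so reaching the same timestep again takes about twice the CPU time.
second_pass = (2 * cpu_seconds) / timesteps

print(f"first pass:  {first_pass:.2f} s/timestep")
print(f"second pass: {second_pass:.2f} s/timestep")
```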
ID: 32862

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 32863 - Posted: 5 Mar 2008, 15:27:53 UTC
Last modified: 5 Mar 2008, 15:30:27 UTC

There are 2 possibilities:
1) The model has rewound, in which case it will get back to where it was by itself. Eventually.
2) The model has crashed, in which case, only restoring from a backup made BEFORE the failure can get it back.

Check the links in my sig below; one is about making backups, the other is a set of README files, each containing links to posts on different subjects.
The set for Crashes and other problems may help you.

Post again if you need more help.

edit
I see that Iain's already onto it. :)


Backups: Here
ID: 32863

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,115,532
RAC: 2,545
Message 32899 - Posted: 10 Mar 2008, 4:35:59 UTC - in response to Message 32862.  

Hi, everyone. This seems to be the place to post about unintended model resets. I am running 2 HadCM3’s on a dual core machine with 2GB of RAM. This morning the models were crunching their way through 1942. When I checked on them tonight they had reset to 1921, about 8000 timesteps into the model! Since both reset, I don’t think it is likely that there is a flaw in the WU’s that is causing them to loop. I don’t think that I could have accidentally clicked the reset tab in “Projects”.

Fortunately, I made a backup this morning so I was able to empty the BOINC folder and refill it with the backup copy. I am back in 1942 and I only lost about 10 hours of crunching. I don’t know how this happened and I hope it doesn’t happen again.
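
The backup and restore steps described here (copy the whole BOINC folder, then later empty it and refill it from the copy) could be scripted roughly as follows. This is only a sketch: the function names are made up, the folder locations are whatever applies on your machine, and it assumes BOINC has been shut down completely before either function runs.

```python
import shutil
from pathlib import Path


def back_up_boinc(data_dir: Path, backup_root: Path, label: str) -> Path:
    """Copy the whole BOINC folder to backup_root/label and return the copy.

    Assumption: BOINC is not running, so no model files change mid-copy.
    """
    target = backup_root / label
    shutil.copytree(data_dir, target)  # raises if this backup already exists
    return target


def restore_boinc(backup: Path, data_dir: Path) -> None:
    """Empty the BOINC folder and refill it from a backup copy."""
    if data_dir.exists():
        shutil.rmtree(data_dir)
    shutil.copytree(backup, data_dir)
```

Keeping each backup under its own label (a date, say) means an older copy is still there if the latest one turns out to contain the crash.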


ID: 32899

Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 32900 - Posted: 10 Mar 2008, 9:53:43 UTC
Last modified: 10 Mar 2008, 9:55:11 UTC

Jim,

It looks as if the HADCM3 models are these two: 7202708 and 7202703.

These two models appear to have crashed for some reason and two more were downloaded - so they didn't really reset, though the progress indicators for the new models would have certainly started from zero.

Well done for taking and restoring a backup. To prevent new models arriving when models crash, just press the 'No new tasks' button in BOINC Manager (and press it again when the models finish, to allow new tasks to be downloaded).
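
The same toggle can usually be set from a script through BOINC's command-line tool, whose `--project` option accepts `nomorework` and `allowmorework`. A hedged Python sketch follows; the project URL is an assumption (use the one your client actually lists), the tool may be named differently in older BOINC versions, and it may need to be run from the BOINC folder so it can find the RPC password.

```python
import subprocess

# Assumption: the project URL as your BOINC client has it on record.
PROJECT_URL = "http://climateprediction.net/"


def work_fetch_command(allow: bool, url: str = PROJECT_URL) -> list[str]:
    """Build the boinccmd call that mirrors the 'No new tasks' toggle."""
    operation = "allowmorework" if allow else "nomorework"
    return ["boinccmd", "--project", url, operation]


def set_work_fetch(allow: bool) -> None:
    """Run the command; needs boinccmd on the PATH and a running client."""
    subprocess.run(work_fetch_command(allow), check=True)
```

Calling `set_work_fetch(False)` would correspond to pressing 'No new tasks', and `set_work_fetch(True)` to allowing downloads again.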

Iain

PS When a backup is restored a duplicate computer appears in the computer list (here). If you display the 'computer summary' page by clicking on one of your computer links, then there is a 'merge this computer' link at the bottom of the page, which will merge duplicate computer records.
ID: 32900

MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 32901 - Posted: 10 Mar 2008, 12:40:01 UTC


The two crashes show an error code often associated either with the graphics being shut down when 'not responding' or with the Vista shutdown process. So I'd suggest firstly disabling the screensaver (use 'blank' instead), and secondly shutting down BOINC prior to shutting down Vista. There are also some system settings which will reduce the chance of Vista killing the model (see the 'crashes and other problems' readme, in the 'vista' section; the link is in my signature).
I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 32901

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,115,532
RAC: 2,545
Message 32906 - Posted: 11 Mar 2008, 4:19:34 UTC

Dear Mike and Iain:

I guess you’re right. I checked “Projects” and the “Allow New Tasks” tab was clicked. I don’t know how this happened, because I am sure that I clicked “No New Tasks” when I downloaded the crashed WU’s, to stop automatic downloads of new WU’s in the event of a crash. I don’t know why they crashed, because I always exit the manager before shutting down.

ID: 32906

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 32909 - Posted: 11 Mar 2008, 5:11:35 UTC


The error code for both of your failed models is: exit code 1073807364 (0x40010004)
Codes starting with 107 are Windows "stop" errors (there are 4 or 5 of these, with the last few digits being different) and, as Mike said, can be associated with a graphics problem.
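
To see how the decimal and hex forms of the code line up, and why related codes share the same leading decimal digits, here is a small Python sketch. The field split follows the standard 32-bit Windows status layout (severity in the top two bits, facility in bits 16 to 27, code in the low 16 bits); `split_ntstatus` is just an illustrative name, and the list of neighbouring values is only an example, not a list of codes seen in this thread.

```python
EXIT_CODE = 1073807364  # the exit code reported for both failed models


def split_ntstatus(code: int) -> dict[str, int]:
    """Split a 32-bit Windows status value into its standard fields."""
    return {
        "severity": (code >> 30) & 0x3,  # 0 success, 1 informational, 2 warning, 3 error
        "facility": (code >> 16) & 0xFFF,
        "code": code & 0xFFFF,
    }


print(hex(EXIT_CODE))
print(split_ntstatus(EXIT_CODE))

# Values that differ only in the low field differ only in the last
# decimal digits, which is why these codes all look alike at the front.
print([0x40010000 + n for n in (3, 4, 5, 8)])
```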

Updating the drivers for the graphics card (from the card maker's web site) often fixes the problem.


Backups: Here
ID: 32909

Jord
Joined: 5 Aug 04
Posts: 250
Credit: 93,274
RAC: 0
Message 32941 - Posted: 13 Mar 2008, 1:47:47 UTC - in response to Message 32909.  

The error code for both of your failed models is: exit code 1073807364 (0x40010004)
Codes starting with 107 are Windows "stop" errors (there are 4 or 5 of these, with the last few digits being different) and, as Mike said, can be associated with a graphics problem.

On Windows Vista these errors also occur if you shut down Windows/reboot without exiting BOINC first. Vista's fast shutdown mode ignores any programs still running, doesn't allow them to write any state to disk and corrupts them.

See this BOINC FAQ for workarounds.
Jord.
ID: 32941
