Model stopped running
Author | Message |
---|---|
Joined: 4 Sep 06 Posts: 79 Credit: 5,583,517 RAC: 0 |
One of the 4 models running on my Q6600 suddenly stopped running, or at least it seemed so. When I turned on the graphics, the earth was only blue and I could not change it. It was stuck on 336 timesteps until checkpoint. I suspended BOINC, right-clicked and chose Exit, then restarted the machine. The model started again but stopped at the same point. The graphics window is slower to open and close too. Is it something with the model or something with my PC? All models have run without any problems until now, and the other 3 models are still running without problems. The model is: hadcm3iozn_cpyj_2000_80_45899412_7 using hadcm3i version 544. After I wrote this, the model has advanced 3 timesteps. I run antivirus, defragment, have updated drivers, and did not use the PC for anything else for hours before it happened. Edit: I took a backup yesterday :-))) Should I reinstall? Thx Steinar |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
There is no definitive answer yet to this problem. A 'blue world' is what you get initially when starting a model, as the graphics data is not yet available for the display. Whether this has anything to do with it, I don't know. There is an info thread here, with a link near the bottom to a discussion thread. This is for the slab models, but the symptoms are often similar for the Coupled Ocean models. As to re-running from a backup, there's only one way to find out. :( However, the problem may have started before the backup was made. |
Joined: 4 Sep 06 Posts: 79 Credit: 5,583,517 RAC: 0 |
The model finished and uploaded tonight. I got the following messages: 08.10.2007 01:28:26|climateprediction.net|Reason: Unrecoverable error for result hadcm3iozn_cpyj_2000_80_45899412_7 (The device does not recognize the command. (0x16) - exit code 22 (0x16)) 08.10.2007 01:28:26|climateprediction.net|Computation for task hadcm3iozn_cpyj_2000_80_45899412_7 finished 08.10.2007 01:28:26|climateprediction.net|Output file hadcm3iozn_cpyj_2000_80_45899412_7_5.zip for task hadcm3iozn_cpyj_2000_80_45899412_7 absent 08.10.2007 01:28:26|climateprediction.net|Output file hadcm3iozn_cpyj_2000_80_45899412_7_6.zip for task hadcm3iozn_cpyj_2000_80_45899412_7 absent 08.10.2007 01:28:26|climateprediction.net|Output file hadcm3iozn_cpyj_2000_80_45899412_7_7.zip for task hadcm3iozn_cpyj_2000_80_45899412_7 absent 08.10.2007 01:28:26|climateprediction.net|Output file hadcm3iozn_cpyj_2000_80_45899412_7_8.zip for task hadcm3iozn_cpyj_2000_80_45899412_7 absent Exit code 22 is a failure on "my side", isn't it? Should I restore from the (thank god) newly created backup I took after the .zip files for all 4 models had uploaded? I have suspended BOINC just to be sure. |
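For readers who want to pull the exit code and the missing output files out of a message log like the one above, here is a minimal sketch. It assumes only the `timestamp|project|text` layout visible in the pasted lines; the function names are made up for illustration and are not part of BOINC. Note that `0x16` is simply 22 in hexadecimal, so the two numbers in the message are the same code.

```python
import re
from datetime import datetime

# Patterns are based only on the message wording quoted in this thread.
EXIT_RE = re.compile(r"exit code (\d+) \((0x[0-9a-fA-F]+)\)")
ABSENT_RE = re.compile(r"Output file (\S+) for task (\S+) absent")

SAMPLE_LOG = [
    "08.10.2007 01:28:26|climateprediction.net|Reason: Unrecoverable error for result "
    "hadcm3iozn_cpyj_2000_80_45899412_7 (The device does not recognize the command. "
    "(0x16) - exit code 22 (0x16))",
    "08.10.2007 01:28:26|climateprediction.net|Output file "
    "hadcm3iozn_cpyj_2000_80_45899412_7_5.zip for task "
    "hadcm3iozn_cpyj_2000_80_45899412_7 absent",
]


def parse_message(line):
    """Split one log line into (timestamp, project, text)."""
    stamp, project, text = line.split("|", 2)
    return datetime.strptime(stamp, "%d.%m.%Y %H:%M:%S"), project, text


def summarize(lines):
    """Return (exit_code or None, list of output files reported absent)."""
    exit_code = None
    absent = []
    for line in lines:
        _, _, text = parse_message(line)
        match = EXIT_RE.search(text)
        if match:
            exit_code = int(match.group(1))
        match = ABSENT_RE.search(text)
        if match:
            absent.append(match.group(1))
    return exit_code, absent
```

Calling `summarize(SAMPLE_LOG)` on the two sample lines gives the decimal exit code and the name of the one zip file reported absent.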
Joined: 9 Jan 07 Posts: 467 Credit: 14,549,176 RAC: 317 |
...Exit code 22 is a failure on "my side", isn't it? Yes, the exit code comes from the PC. Restore the model as quickly as possible: since all models on your quad will be restored, the quicker you restore, the less time is wasted on the other models. I don't know of any method to restore just one of the four. Before restoring, it might be a courtesy to the project to abort the new model that has downloaded but not yet trickled. That way, they know for sure the model isn't going to restart at a later date. |
Joined: 4 Sep 06 Posts: 79 Credit: 5,583,517 RAC: 0 |
After reinstalling following the model crash, the model now works fine and has passed the point where it crashed. All the other models also work fine :-) I deleted the model that downloaded after the upload of the crashed model. I did so before I thought of aborting that model, so nobody in the project probably knows it's not going to finish. Sorry :/ I hope the models will finish without any problems. Steinar |
Joined: 9 Jan 07 Posts: 467 Credit: 14,549,176 RAC: 317 |
Good to hear that it all worked out. |
Joined: 4 Sep 06 Posts: 79 Credit: 5,583,517 RAC: 0 |
The same model stopped running again. I have 4 models running, and I backed up yesterday after 3 of them had sent their zip files. The fourth zip file uploaded tonight, and when I restore from backup, it will upload again. Will that cause any problems? I have looked in the forums for a solution but didn't find anything. Thx Steinar |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
No problems. The trickles also contain the timestep, so the server will be able to compare them with what it already has, and ignore duplicates. |
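The idea of ignoring duplicates by timestep can be sketched as follows. This is only an illustration of the principle described in the reply above, not the project's actual server code; the function name, the data layout and the timestep values are all hypothetical.

```python
def accept_trickles(stored, incoming):
    """Keep only trickles whose (task, timestep) pair has not been seen yet.

    stored   -- set of (task_name, timestep) pairs already on record (mutated)
    incoming -- iterable of (task_name, timestep) pairs just received
    Returns the list of newly accepted pairs, in arrival order.
    """
    accepted = []
    for task, timestep in incoming:
        key = (task, timestep)
        if key in stored:
            # Same task and timestep already recorded: a re-sent duplicate,
            # e.g. after restoring from a backup. Ignore it.
            continue
        stored.add(key)
        accepted.append(key)
    return accepted
```

With a restore from backup, the re-sent trickles carry timesteps the server already holds, so under this scheme only the genuinely new ones would be recorded.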
Joined: 4 Sep 06 Posts: 79 Credit: 5,583,517 RAC: 0 |
The same model crashed again, and right afterwards another model crashed for the first time. I get the "blue screen" or ice world every time. I don't know what is happening here, but my PC is the same as before and is not used for much else. All drivers are updated, antivirus is running, etc. It is not OC'ed either. The models ran fine in the beginning, but since the first crash it happens more and more often. No model had problems before 50% was crunched. I restore from backups, but now there is a crash every day. The models have run 75% - 80% and it would be nice to see them reach the end. Should I reinstall BOINC (it is ver. 5.10.20), and/or should I terminate the models and reinstall the PC? Is it better to run only 3 models at the same time? I think it would be sad not to finish them. Steinar |
Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Grasping at straws with this -- This is recorded for OS and installed memory: Operating System Microsoft Windows XP I have no experience with 4GB installed with a 32-bit OS but am aware that WinXP_x32 can't address all of it, which probably explains the 3318 MB entry. Perhaps someone with experience on that configuration can help... Is it possible that, as the Models progress, 'memory leaks' develop, migrate above the three Gig level and, eventually, Windoze, with its legendary space-management abilities, trips over itself? (I said I was grasping at straws!) Expert help, anyone...? (My quad has 4GB but has 64-bit OSes installed: Vista Home Premium_x64 [so far, unused] and openSuSE 10.3_x64. I have a lock-up issue between boinc and openSuSE 10.3 in 'Tools' and 'Advanced' options, possibly related to new Linux security measures, but no other boinc or CPDN problems.) "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
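The arithmetic behind the 3318 MB figure can be worked through in a few lines. This is a sketch under the assumption that the reported value is in MiB-sized megabytes; it only shows how much of a 32-bit address space the OS is not reporting as RAM, and says nothing about whether this has any bearing on the model crashes.

```python
ADDRESS_BITS = 32
MIB = 1024 * 1024

# A 32-bit address can distinguish 2**32 bytes in total.
addressable_mib = 2**ADDRESS_BITS // MIB

# Value quoted in the post above for the machine in question.
reported_mib = 3318

# Part of the 4 GiB address range that is not available as reported RAM
# (on 32-bit systems some of the range is reserved for devices).
hidden_mib = addressable_mib - reported_mib

print(addressable_mib, reported_mib, hidden_mib)
```

So of the 4096 MB installed, 778 MB is not visible to the 32-bit OS on this machine.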
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Hi Steinar There are a lot of 'blue worlds' lately. They may be related to the values being used in the recent batches of models. I wouldn't start reinstalling things; it's much more likely to be the models. ************ A note for those people who aren't aware: the object of the climate models is NOT to run them to the due completion date. It's to run them as far as they'll go. The researchers need to know which combinations of values produce a long-term stable model and which don't, as much as they want to know the end 'climate' result. The object of the project is to improve climate forecasting, so it's just as important to know what doesn't work as it is to know what does. Backups: Here |
Joined: 4 Sep 06 Posts: 79 Credit: 5,583,517 RAC: 0 |
Ok. I am running the Prime95 torture test at the moment to check that everything is ok. Should I reinstall after that and, if the first model keeps crashing, abort it and keep running the others? |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
If the model continues to run at a normal speed, then I'd keep it running regardless of whether it turns into an iceworld or not. The only case where I'd recommend aborting it is if the model grinds to a halt. I'm a volunteer and my views are my own. News and Announcements and FAQ |
Joined: 4 Sep 06 Posts: 79 Credit: 5,583,517 RAC: 0 |
I saw something that might be interesting: the backup folder is 1.37 GB, but the folder that contains the crashed models, the folder in c:/program files, is 2.12 GB. Does that indicate something relevant? I have just finished 24 hrs of the Prime95 torture test and all iterations passed. I will now restore from backup and start again. If the model(s) fail again after a short time, should I abort them and finish the others? |
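One way to see where a size difference like this comes from is to total each folder per subfolder and compare the two listings. The sketch below does that with the standard library only; the function names are made up for illustration, and any paths passed in are up to the reader. It does not answer whether the difference is relevant to the crashes.

```python
import os


def folder_size_bytes(root):
    """Total size in bytes of all files below root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total += os.path.getsize(path)
            except OSError:
                # File vanished or is unreadable; skip it.
                pass
    return total


def size_by_subfolder(root):
    """Map each direct child of root (file or folder) to its size in bytes."""
    sizes = {}
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            sizes[entry.name] = folder_size_bytes(entry.path)
        elif entry.is_file(follow_symlinks=False):
            sizes[entry.name] = entry.stat().st_size
    return sizes
```

Running `size_by_subfolder` on both the live folder and the backup folder and comparing the two dictionaries would show which entries account for the extra space.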
Joined: 4 Sep 06 Posts: 79 Credit: 5,583,517 RAC: 0 |
After some problems with 2 of the models stopping, all 4 models have now finished computation after reaching the end. Nice to see :) |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Congratulations! These models are so long that every one completed is a personal success. Cpdn news |
Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Good show, Steinar! Congratulations. For what it's worth, I've had more problems with my Q6600 (G0 stepping) than any earlier build. Issues range from Fedora7 installation twice trashing the Master Boot Record (preventing Windoze boot), to BSOD pushing updates and utilities into Vista Home Premium. The box tests 'stable' with Memtest and four copies of Prime-95, is not overclocked, yet seems a pit to collect all ills. Perhaps there are issues with boinc running certain combinations of four CPDN Models...? One wonders. (Not to mention questionable Linux distros and M$ inadequacies.) Regardless, job well done in completing the four Models. Many more successes! "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |