Questions and Answers :
Windows :
model crashed, from which point shall I restore backup?
Author | Message |
---|---|
Joined: 12 Apr 07 Posts: 26 Credit: 12,681 RAC: 0 |
Me again: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7041767 Shall I restore the backup from 30 Dec, right before the error happened, or the one from before the model restart, or the backup from when the 5th trickle starts (after 28 Dec 2007 15:35:28)? |
Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
The most recent one prior to the crash is the way to go. Good luck with it! "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
The error text page isn't giving us a clear reason for the crash ('(null)'), so the restored model may or may not crash again at the same point. It would be interesting for us to know the outcome, so it is a useful experiment either way. I'm a volunteer and my views are my own. News and Announcements and FAQ |
Joined: 12 Apr 07 Posts: 26 Credit: 12,681 RAC: 0 |
OK, I will do so after the computer finishes the current task (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7078423). I have already clicked on "no new tasks". |
Joined: 12 Apr 07 Posts: 26 Credit: 12,681 RAC: 0 |
Small report: When I restored the backup from 1h before the crash, the model crashed again immediately. Then I restored the backup from the day before the crash (~2h of computing time before the crash happened). It's running now. We will see what happens to the model :D I will report later. |
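The approach described in the posts above (try the most recent backup from before the crash, then fall back to an earlier one if that crashes too) can be sketched roughly as follows. This is only an illustration: the function name `backups_to_try` and all timestamps are made up, and nothing here is part of BOINC or the CPDN software.

```python
from datetime import datetime

def backups_to_try(backup_times, crash_time):
    """Return the backups taken before the crash, newest first."""
    earlier = [t for t in backup_times if t < crash_time]
    return sorted(earlier, reverse=True)

# Hypothetical timestamps, for illustration only.
crash = datetime(2007, 12, 30, 12, 0)
backups = [
    datetime(2007, 12, 28, 16, 0),
    datetime(2007, 12, 29, 20, 0),
    datetime(2007, 12, 30, 11, 0),
    datetime(2007, 12, 30, 18, 0),  # taken after the crash, so it is skipped
]

candidates = backups_to_try(backups, crash)
for t in candidates:
    print("try restoring backup from", t)
```

Restoring the first candidate and moving to the next one only if the model crashes again keeps the amount of lost computing time as small as possible.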
Joined: 12 Apr 07 Posts: 26 Credit: 12,681 RAC: 0 |
Update: Good news: The hadsm-model works again -> http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7041767. Bad news: The other model, a hadcm which was running on Linux, just crashed -> http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7117962. Can someone tell me why the stderr out message is so long? I already had the feeling that something was going wrong with the Linux BOINC model, because the graphic didn't show and the data looked weird (especially the ocean temperature). Shall I restore the backup from 6 Jan? Unfortunately, that's the last backup I made. Or would it be better to reinstall BOINC and start a new model (and hope that everything works fine)? |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The Linux model possibly has so many error messages because you've had a lot of problems with it, each one adding to the list. Near the top, "process exited with code 22", coupled with this near the bottom, "Model crashed: umshell1.f: TRANSO2A: Missing data in ocean UV fields", is usually permanently fatal. I'd say abort it and start with a new model. |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
It's very unusual for a HadAM3 model to crash, and I've never seen one with an Ocean UV error before. Usually that indicates that floating point errors are occurring on the PC (these can be caused by a faulty memory stick, overclocking, overheating, or a bad contact). The Ocean UV message appears six times at the end, but worse than that, it appears several times in the messages prior to that. This indicates that the model crashed, recovered, ran for a while, then crashed again and recovered, until finally the model data was too damaged to continue. It might be a good idea to run a stress test on the machine; prime95's torture test is very good. You need to run a copy per core (using the -A affinity flag) for around 24 hours. The Linux version is called mprime. Memtest86+ is also useful for testing that the memory is working OK. I find the Linux error output is much more verbose than the Windows output. -- Edit: Les got there first :-) I'm a volunteer and my views are my own. News and Announcements and FAQ |
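To see how often the Ocean UV message occurs in a long stderr dump, and whether it is also the last thing in the dump, something like the following rough sketch could be used. The helper name `marker_line_numbers` and the sample text are assumptions made up for illustration; only the two quoted message strings come from this thread.

```python
MARKER = "Missing data in ocean UV fields"

def marker_line_numbers(stderr_text, marker=MARKER):
    """Return the 1-based numbers of the lines that contain the marker."""
    return [
        number
        for number, line in enumerate(stderr_text.splitlines(), start=1)
        if marker in line
    ]

# Hypothetical, heavily shortened stderr text, for illustration only.
sample_stderr = "\n".join([
    "process exited with code 22",
    "Model crashed: umshell1.f: TRANSO2A: Missing data in ocean UV fields",
    "(hypothetical filler: model restarted and ran on)",
    "Model crashed: umshell1.f: TRANSO2A: Missing data in ocean UV fields",
    "Model crashed: umshell1.f: TRANSO2A: Missing data in ocean UV fields",
])

hits = marker_line_numbers(sample_stderr)
total_lines = len(sample_stderr.splitlines())
# True when the marker is on the very last line of the dump.
ends_with_marker = bool(hits) and hits[-1] == total_lines
print(len(hits), "occurrences; dump ends with the marker:", ends_with_marker)
```

Hits that are followed by further output suggest earlier crashes the model recovered from; a hit on the final line belongs to the crash that ended the run.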
Joined: 12 Apr 07 Posts: 26 Credit: 12,681 RAC: 0 |
@MikeMarsUK: Thanks for the links. I already knew them, except for the Linux version. I have SUSE 10.3; I can even start memtest at startup: http://bp2.blogger.com/_VN8zHqq8Ns8/ReRkagjSZ5I/AAAAAAAAAD0/4I-5y-gesnc/s1600/103grubcr2.png But under Windows XP, which I had before, the computer didn't produce so many errors. I don't think something is wrong with the computer; I think I did something wrong when installing BOINC. I don't know what I am going to do next... BTW: The graph of the new model after the crash also doesn't show. One more thing: BOINC is showing this in the stdoutdae.txt again:
hadsm3fub_028m_005919313_3 using hadsm3 version 504
Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0000145 A - 04/12/1810 00:30 - H:M:S=0000:09:19 AVG= 3.86 DLT= 1.98
Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0000289 A - 07/12/1810 00:30 - H:M:S=0000:18:48 AVG= 3.91 DLT= 2.98
Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0000433 A - 10/12/1810 00:30 - H:M:S=0000:28:27 AVG= 3.94 DLT= 1.96
Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0000577 A - 13/12/1810 00:30 - H:M:S=0000:38:05 AVG= 3.96 DLT= 3.26
Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0000721 A - 16/12/1810 00:30 - H:M:S=0000:47:33 AVG= 3.96 DLT= 2.98
Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0000865 A - 19/12/1810 00:30 - H:M:S=0000:57:07 AVG= 3.96 DLT= 2.73
Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0001009 A - 22/12/1810 00:30 - H:M:S=0001:06:37 AVG= 3.96 DLT= 2.03
Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0001153 A - 25/12/1810 00:30 - H:M:S=0001:15:59 AVG= 3.95 DLT= 1.88
Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0001297 A - 28/12/1810 00:30 - H:M:S=0001:25:21 AVG= 3.95 DLT= 1.98
Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0001441 A - 01/01/1811 00:30 - H:M:S=0001:34:56 AVG= 3.95 DLT= 2.76
Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0001585 A - 04/01/1811 00:30 - H:M:S=0001:44:34 AVG= 3.96 DLT= 2.92
Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN! Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0001729 A - 07/01/1811 00:30 - H:M:S=0001:54:00 AVG= 3.96 DLT= 1.92
In the stderr.txt so far:
shmget: No such file or directory
No protocol specified
GLUT: Fatal Error in screensaver: could not open display: :0.0
No protocol specified
GLUT: Fatal Error in screensaver: could not open display: :0.0
I have tried to open the graph 2 times. |
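For anyone who wants to pull the numbers out of a stdoutdae.txt excerpt like the one above, here is a minimal sketch. The regular expression is inferred only from the status lines quoted in this post, so it is an assumption about the format, not a documented one; the reading of AVG as roughly "seconds per timestep" is likewise a guess that happens to fit the first line. The sample is a shortened excerpt with fewer "Resuming CPDN!" repeats than the original.

```python
import re

# Pattern inferred from the status lines quoted above (an assumption).
STATUS = re.compile(
    r"TS (\d+) A - (\S+ \S+) - H:M:S=(\d+):(\d+):(\d+) "
    r"AVG=\s*([\d.]+) DLT=\s*([\d.]+)"
)

def parse_status(text):
    """Extract timestep, model date, elapsed seconds, AVG and DLT."""
    rows = []
    for m in STATUS.finditer(text):
        hours, minutes, seconds = (int(m.group(i)) for i in (3, 4, 5))
        rows.append({
            "timestep": int(m.group(1)),
            "model_date": m.group(2),
            "elapsed_s": hours * 3600 + minutes * 60 + seconds,
            "avg": float(m.group(6)),
            "dlt": float(m.group(7)),
        })
    return rows

# Shortened excerpt of the log above, for illustration only.
sample = (
    "Resuming CPDN! Resuming CPDN! "
    "hadsm3fub_028m_005919313 - PH 1 TS 0000145 A - 04/12/1810 00:30 - "
    "H:M:S=0000:09:19 AVG= 3.86 DLT= 1.98 "
    "Resuming CPDN! "
    "hadsm3fub_028m_005919313 - PH 1 TS 0000289 A - 07/12/1810 00:30 - "
    "H:M:S=0000:18:48 AVG= 3.91 DLT= 2.98"
)

rows = parse_status(sample)
resume_count = sample.count("Resuming CPDN!")
for row in rows:
    # Elapsed seconds divided by timesteps, to compare against AVG.
    print(row["timestep"], row["elapsed_s"], row["elapsed_s"] / row["timestep"])
```

Counting the "Resuming CPDN!" entries between two status lines gives a rough idea of how often the model was suspended and resumed during that stretch.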
Joined: 9 Jan 07 Posts: 467 Credit: 14,549,176 RAC: 317 |
T!R0, One thing to remember is that the messages in the dump don't necessarily relate to the crash itself; they accumulate through the model's life. The bit at the end is the crash for sure ... Iain |
©2024 cpdn.org