model crashed, from which point shall I restore backup?

Questions and Answers : Windows : model crashed, from which point shall I restore backup?
old_user442168

Joined: 12 Apr 07
Posts: 26
Credit: 12,681
RAC: 0
Message 32026 - Posted: 6 Jan 2008, 1:07:33 UTC
Last modified: 6 Jan 2008, 1:08:22 UTC

Me again,
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7041767

Shall I restore the backup from 30 Dec, taken right before the error happened, the one from before the model restart, or the one from when the 5th trickle started (after 28 Dec 2007 15:35:28)?

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 32029 - Posted: 6 Jan 2008, 6:23:13 UTC


The most recent one prior to the crash is the way to go. Good luck with it!

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.

MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 32032 - Posted: 6 Jan 2008, 11:19:44 UTC


The error text page isn't giving us a clear reason for the crash ('(null)'), so the restored model may or may not crash again at the same point. It would be interesting for us to know the outcome, so it's a useful experiment either way.

I'm a volunteer and my views are my own.
News and Announcements and FAQ

old_user442168
Joined: 12 Apr 07
Posts: 26
Credit: 12,681
RAC: 0
Message 32038 - Posted: 6 Jan 2008, 13:29:22 UTC - in response to Message 32032.  

OK, I will do so, after the computer finishes the current task (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7078423).
I have already clicked on "no new tasks".

old_user442168
Joined: 12 Apr 07
Posts: 26
Credit: 12,681
RAC: 0
Message 32097 - Posted: 9 Jan 2008, 16:19:27 UTC - in response to Message 32038.  

Small report:
When I restored the backup from 1 h before the crash, the model crashed again immediately.
Then I restored the backup from the day before the crash (~2 h of computing time before the crash).
It's running now.
We will see what happens to the model :D
I will report later.
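
The restore order tried above (newest backup before the crash first, then the next older one) can be sketched as follows. This is only a minimal sketch, not a CPDN or BOINC tool: the function name `backups_before_crash` and all timestamps are hypothetical and chosen for illustration.

```python
from datetime import datetime

def backups_before_crash(backup_times, crash_time):
    """Return the backups taken before the crash, newest first."""
    earlier = [t for t in backup_times if t < crash_time]
    return sorted(earlier, reverse=True)

# Hypothetical timestamps, for illustration only.
crash = datetime(2007, 12, 30, 18, 0)
backups = [
    datetime(2007, 12, 28, 16, 0),
    datetime(2007, 12, 29, 20, 0),
    datetime(2007, 12, 30, 17, 0),
    datetime(2007, 12, 31, 9, 0),  # taken after the crash, so it is skipped
]

# Try the candidates in this order until the model keeps running.
candidates = backups_before_crash(backups, crash)
for t in candidates:
    print(t.isoformat(sep=" "))
```

If the newest candidate crashes again straight away, as happened here, the next entry in the list is the one to restore.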

old_user442168
Joined: 12 Apr 07
Posts: 26
Credit: 12,681
RAC: 0
Message 32123 - Posted: 10 Jan 2008, 19:42:47 UTC - in response to Message 32097.  
Last modified: 10 Jan 2008, 19:48:18 UTC

Update:
Good news: The hadsm model works again -> http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7041767

Bad news: The other model, a hadcm that was running on Linux, just crashed -> http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7117962

Can someone tell me why the stderr out message is so long?
I already had the feeling that something was going wrong with the Linux BOINC model, because the graphics didn't show and the data looked weird (especially the ocean temperature).

Shall I restore the backup from 6 Jan? Unfortunately, that's the last backup I made.
Or would it be better to reinstall BOINC and start a new model (and hope that everything works fine)?

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 32125 - Posted: 10 Jan 2008, 20:41:47 UTC
Last modified: 10 Jan 2008, 20:42:17 UTC

The Linux model possibly has so many error messages because you've had a lot of problems with it, each one adding to the list.


Near the top: process exited with code 22

coupled with this near the bottom: Model crashed: umshell1.f: TRANSO2A: Missing data in ocean UV fields

is usually permanently fatal.

I'd say abort it and start with a new model.


MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 32126 - Posted: 10 Jan 2008, 21:28:41 UTC
Last modified: 10 Jan 2008, 21:53:33 UTC

It's very unusual for a HadAM3 model to crash, and I've never seen one with an Ocean UV error before. Usually that indicates that floating-point errors are occurring on the PC (these can be caused by a faulty memory stick, overclocking, overheating, or a bad contact).

The Ocean UV message appears six times at the end, but worse than that, it appears several times in the messages prior to that. This indicates that the model crashed, recovered, ran for a while, then crashed and recovered again, until finally the model data was too damaged to continue.
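
A rough way to see this crash-and-recover pattern in a saved copy of the stderr text is to find where the Ocean UV message occurs and group adjacent occurrences. This is only a sketch: the message string is the one quoted in this thread, but the grouping rule (consecutive lines count as one burst), the function names, and the sample text are assumptions made for illustration, not a description of how CPDN formats its output.

```python
OCEAN_UV = "TRANSO2A: Missing data in ocean UV fields"

def ocean_uv_line_numbers(stderr_text):
    """Return the 1-based line numbers on which the Ocean UV message appears."""
    return [
        n
        for n, line in enumerate(stderr_text.splitlines(), start=1)
        if OCEAN_UV in line
    ]

def crash_episodes(line_numbers):
    """Group consecutive line numbers; each group is one burst of messages."""
    episodes = []
    for n in line_numbers:
        if episodes and n == episodes[-1][-1] + 1:
            episodes[-1].append(n)
        else:
            episodes.append([n])
    return episodes

# Invented sample text, for illustration only.
sample = "\n".join([
    "(start of output)",
    "Model crashed: umshell1.f: TRANSO2A: Missing data in ocean UV fields",
    "(model recovers and runs for a while)",
    "Model crashed: umshell1.f: TRANSO2A: Missing data in ocean UV fields",
    "Model crashed: umshell1.f: TRANSO2A: Missing data in ocean UV fields",
])

hits = ocean_uv_line_numbers(sample)
episodes = crash_episodes(hits)
print(len(hits), "occurrences in", len(episodes), "separate bursts")
```

More than one burst suggests the model crashed and recovered at least once before the final failure.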

It might be a good idea to run a stress test on the machine - prime95's torture test is very good. You need to run a copy per core (using the -A affinity flag) for around 24 hours. The Linux version is called mprime.

Memtest86+ is also useful for testing that the memory is working OK.

I find the Linux error output is much more verbose than the Windows output.

-- Edit: Les got there first :-)

I'm a volunteer and my views are my own.
News and Announcements and FAQ

old_user442168
Joined: 12 Apr 07
Posts: 26
Credit: 12,681
RAC: 0
Message 32128 - Posted: 10 Jan 2008, 22:08:45 UTC - in response to Message 32126.  
Last modified: 10 Jan 2008, 22:35:41 UTC

@MikeMarsUK: Thanks for the links. I already knew them, except for the Linux version. I have SUSE 10.3, so I can even start memtest at startup: http://bp2.blogger.com/_VN8zHqq8Ns8/ReRkagjSZ5I/AAAAAAAAAD0/4I-5y-gesnc/s1600/103grubcr2.png

But under Win XP, which I had before, the computer didn't produce so many errors.
I don't think something is wrong with the computer. I think I did something wrong with the installation of BOINC.
I don't know what I am going to do next...

BTW: The graph of the new model after the crash also doesn't show.


One more thing:
BOINC is showing:

hadsm3fub_028m_005919313_3 using hadsm3 version 504
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0000145 A - 04/12/1810 00:30 - H:M:S=0000:09:19 AVG= 3.86 DLT= 1.98
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0000289 A - 07/12/1810 00:30 - H:M:S=0000:18:48 AVG= 3.91 DLT= 2.98
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0000433 A - 10/12/1810 00:30 - H:M:S=0000:28:27 AVG= 3.94 DLT= 1.96
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0000577 A - 13/12/1810 00:30 - H:M:S=0000:38:05 AVG= 3.96 DLT= 3.26
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0000721 A - 16/12/1810 00:30 - H:M:S=0000:47:33 AVG= 3.96 DLT= 2.98
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0000865 A - 19/12/1810 00:30 - H:M:S=0000:57:07 AVG= 3.96 DLT= 2.73
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0001009 A - 22/12/1810 00:30 - H:M:S=0001:06:37 AVG= 3.96 DLT= 2.03
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0001153 A - 25/12/1810 00:30 - H:M:S=0001:15:59 AVG= 3.95 DLT= 1.88
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0001297 A - 28/12/1810 00:30 - H:M:S=0001:25:21 AVG= 3.95 DLT= 1.98
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0001441 A - 01/01/1811 00:30 - H:M:S=0001:34:56 AVG= 3.95 DLT= 2.76
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0001585 A - 04/01/1811 00:30 - H:M:S=0001:44:34 AVG= 3.96 DLT= 2.92
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0001729 A - 07/01/1811 00:30 - H:M:S=0001:54:00 AVG= 3.96 DLT= 1.92

in the stdoutdae.txt again!
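
To make a log like this easier to read, the status lines can be pulled out and the "Resuming CPDN!" lines between them counted. The sketch below assumes the line layout shown above; the regular expression and the function name `summarize` are illustrative, and the sample is an abbreviated excerpt of the log, not the full output.

```python
import re

# Matches the timestep counter, the elapsed H:M:S and the logged AVG value.
STATUS = re.compile(
    r"TS (?P<ts>\d+) .*H:M:S=(?P<h>\d+):(?P<m>\d+):(?P<s>\d+)\s+AVG=\s*(?P<avg>[\d.]+)"
)

def summarize(log_text):
    """Return (timestep, elapsed_seconds, logged_avg, resumes_since_last_status)
    for every status line in the log."""
    rows, resumes = [], 0
    for line in log_text.splitlines():
        if line.strip() == "Resuming CPDN!":
            resumes += 1
            continue
        match = STATUS.search(line)
        if match:
            elapsed = (
                int(match["h"]) * 3600 + int(match["m"]) * 60 + int(match["s"])
            )
            rows.append((int(match["ts"]), elapsed, float(match["avg"]), resumes))
            resumes = 0
    return rows

# Abbreviated excerpt of the log above (fewer "Resuming CPDN!" lines).
sample = """Resuming CPDN!
Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0000145 A - 04/12/1810 00:30 - H:M:S=0000:09:19 AVG= 3.86 DLT= 1.98
Resuming CPDN!
hadsm3fub_028m_005919313 - PH 1 TS 0000289 A - 07/12/1810 00:30 - H:M:S=0000:18:48 AVG= 3.91 DLT= 2.98
"""

rows = summarize(sample)
for ts, elapsed, avg, resumes in rows:
    print(ts, elapsed, avg, resumes, round(elapsed / ts, 2))
```

In the first status line, 559 s over 145 timesteps is about 3.86, which matches the logged AVG, so AVG looks like elapsed seconds per timestep. That reading is inferred from the numbers and is not stated anywhere in the thread.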


In the stderr.txt so far:

shmget: No such file or directory
No protocol specified
GLUT: Fatal Error in screensaver: could not open display: :0.0
No protocol specified
GLUT: Fatal Error in screensaver: could not open display: :0.0

I have tried to open the graph 2 times.

Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 32129 - Posted: 10 Jan 2008, 22:37:45 UTC

T!R0,

One thing to remember is that the messages in the dump don't necessarily relate to the crash itself - they accumulate through the model's life.

The bit at the end is the crash for sure ...

Iain