What is the meaning of this?

Questions and Answers : Unix/Linux : What is the meaning of this?

copycat

Joined: 24 Feb 05
Posts: 28
Credit: 121,749
RAC: 0
Message 15754 - Posted: 6 Sep 2005, 21:18:34 UTC

It would seem my CPDN just lost a month's worth of results. Why is it doing this? You can see the lines before and after the rewinding message; nothing extraordinary happened.

4843_200297411 - PH 1 TS 0004027 A - 24/02/1811 21:30 - H:M:S=0004:41:59 AVG= 4.20 DLT= 2.85
Preparing for restart...
Rewinding a model-month...
Copying restart files for model retry...
Starting model ID 4843_200297411 Phase 1
Waiting for model startup, this may take a minute...
4843_200297411 - PH 1 TS 0002881 A - 01/02/1811 00:30 - H:M:S=0004:42:12 AVG= 5.88 DLT= 0.00
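
In a log like this, a rewind shows up as the timestep counter (the `TS` field) going backwards. Below is a minimal Python sketch that scans the status lines and reports how much was lost. The regular expression is written only against the two status lines above, and the figure of 48 timesteps per model day is inferred from those same lines (1146 timesteps separate 01/02/1811 00:30 and 24/02/1811 21:30), not taken from CPDN documentation. `find_rewinds` and the dictionary keys are hypothetical names chosen for illustration.

```python
import re

# Matches status lines such as:
# 4843_200297411 - PH 1 TS 0004027 A - 24/02/1811 21:30 - H:M:S=...
LINE_RE = re.compile(
    r"^(?P<model>\S+) - PH (?P<phase>\d+) TS (?P<ts>\d+) \S+ - "
    r"(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2})"
)

# Assumption: one timestep is 30 model minutes, inferred from the log above.
TIMESTEPS_PER_MODEL_DAY = 48

SAMPLE_LOG = [
    "4843_200297411 - PH 1 TS 0004027 A - 24/02/1811 21:30 - H:M:S=0004:41:59 AVG= 4.20 DLT= 2.85",
    "Preparing for restart...",
    "Rewinding a model-month...",
    "4843_200297411 - PH 1 TS 0002881 A - 01/02/1811 00:30 - H:M:S=0004:42:12 AVG= 5.88 DLT= 0.00",
]


def find_rewinds(lines):
    """Return one record for every point where the timestep counter drops."""
    rewinds = []
    previous = None  # (timestep, model date string) of the last status line
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a status line, e.g. "Preparing for restart..."
        current = (int(match["ts"]), f'{match["date"]} {match["time"]}')
        if previous is not None and current[0] < previous[0]:
            lost = previous[0] - current[0]
            rewinds.append({
                "from": previous[1],
                "to": current[1],
                "lost_timesteps": lost,
                "lost_model_days": lost / TIMESTEPS_PER_MODEL_DAY,
            })
        previous = current
    return rewinds


if __name__ == "__main__":
    for rewind in find_rewinds(SAMPLE_LOG):
        print(rewind)
```

Run on the lines above, it reports a single rewind of 1146 timesteps, i.e. just under 24 model days under the 30-minute assumption.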
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 15756 - Posted: 6 Sep 2005, 21:28:41 UTC

In certain types of model instability, the model will go back to a hopefully known good point, in this case at the start of the last month, and start from there again. This gives it a chance to continue on after an error. If the error was just some odd hardware glitch that doesn't reoccur, then it will continue on OK. If the model is unstable, or the computer is unstable again, it will give up and download a new model. Usually it rewinds a day, then a month, then a year. You may not have noticed the rewind a day messages.
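
The escalation described above can be sketched roughly as follows. This is only an illustration of the day, then month, then year order and of giving up when no matching restart point exists; it is not the actual CPDN client code. The function name `choose_restart` and the idea of passing the available restart points as a set are assumptions made for the example.

```python
# Order in which the model falls back, as described in the post above.
REWIND_LEVELS = ("day", "month", "year")


def choose_restart(crash_count, available):
    """Pick the restart point to try after `crash_count` consecutive crashes.

    `available` is the set of restart points that exist so far, for example
    {"day", "month"} for a model that has not yet completed a model year.
    Returns the level to rewind to, or None when the model should give up.
    """
    if crash_count < 1 or crash_count > len(REWIND_LEVELS):
        return None  # more crashes than there are fallback levels
    level = REWIND_LEVELS[crash_count - 1]
    if level not in available:
        return None  # e.g. no yearly restart file written yet
    return level


if __name__ == "__main__":
    young_model = {"day", "month"}  # hypothetical: less than a model year done
    for crashes in (1, 2, 3):
        print(crashes, choose_restart(crashes, young_model))
```

For a model that has not yet reached its first model year, the third crash finds no yearly restart point, so the sketch gives up at that step.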
copycat

Joined: 24 Feb 05
Posts: 28
Credit: 121,749
RAC: 0
Message 15788 - Posted: 7 Sep 2005, 21:20:46 UTC - in response to Message 15756.  

> In certain types of model instability, the model will go back to a hopefully
> known good point, in this case at the start of the last month, and start from
> there again. This gives it a chance to continue on after an error. If the
> error was just some odd hardware glitch that doesn't reoccur, then it will
> continue on OK. If the model is unstable, or the computer is unstable again,
> it will give up and download a new model. Usually it rewinds a day, then a
> month, then a year. You may not have noticed the rewind a day messages.
Oh yes, I noticed the rewind a day message too, that just happened some time before. And yes, here we go again:
'Preparing for restart...
Rewinding a model-year...
Error: Restart files for dataout/restart.year not found
Giving up, this result exceeded crash count for available restart files.'
The EXACT same thing happened the first time I tried CPDN, many months ago (March 2005). Apparently, it's STILL not fixed. :-\ First it rewinds a day, then it rewinds a month, then it tries to rewind a year, but can't because it hasn't gotten that far yet, and then it gives up. It has even happened so fast that it couldn't even rewind a MONTH! I can understand that S@H is not comparable, because a S@H WU can be finished in one day, but an E@H WU takes several days too, so why can those WUs make it through that time without crashing when a CPDN WU can't? I am very careful to suspend BOINC before shutting down, so everything can safely restart the next time, but it would seem that still isn't enough. I can't track the evolution of the SC application, but the hadsm application seems to have gone through two versions since that time. I'll abort SC, since it's either that or restart and probably face it crashing again. I'll see if the new hadsm3 application fares better than its predecessor, although it looks doubtful.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 15790 - Posted: 7 Sep 2005, 22:00:02 UTC

I see that you don't have any trickles recorded, even after a month.
Are they all still in your climateprediction.net folder, or have you done a computer merge and had them allocated to another ID?
Have you read / tried the maintenance / stress testing written by UK_Nick?

copycat

Joined: 24 Feb 05
Posts: 28
Credit: 121,749
RAC: 0
Message 15858 - Posted: 10 Sep 2005, 21:35:14 UTC - in response to Message 15790.  

> I see that you don't have any trickles recorded, even after a month.
> Are they all still in your climateprediction.net folder, or have you done a computer merge and had them allocated to another ID?
> Have you read / tried the maintenance / stress testing written by UK_Nick?

A small situation report:
My machine has two OSes, Window$ XP Home and Linux (actually there are two Linux OSes, but we'll treat them as one). BOINC is installed on both. Window$ has become rather unstable (as all Window$ do), but I only need it to play certain games and it can still manage that, as long as I save often enough. When I try to run BOINC on it, the OS crashes. Since I managed to make one of my Linux OSes a DVD player, I don't start it up that often anyway. On the Window$ BOINC there are some S@H WUs (behind deadline) and CPDN WUs (before deadline), but I can't finish them for the reasons I outlined above. On the Linux BOINC I first had S@H, then CPDN, abandoned CPDN (full detach), then E@H, and have now re-attached CPDN.
The reason why I don't have any trickles recorded is rather obvious: none of the WUs I (try to) process can run long enough to produce any trickles! In Linux the WU crashes (as you've seen) and in Window$ the OS crashes.
I have done no computer merge on my CPDN account, only on the S@H and E@H accounts a short while ago. Strange, the number of computers seems to match the number of CPDN WUs on each OS, but I guess that's a coincidence.
I've got a great many client errors on my results page. Some of the (non-errored) WUs are from before I detached (and thus I don't have them anymore), some are on my Window$ partition, and two are currently in my Linux BOINC CPDN folder. Currently BOINC is in EDF mode, and since there are two E@H WUs there, which obviously have shorter deadlines than the CPDN WU, they are being crunched first.
Maintenance / stress test?
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 15860 - Posted: 10 Sep 2005, 22:20:15 UTC
Last modified: 10 Sep 2005, 22:21:11 UTC

<a href="http://www.climateprediction.net/board/viewtopic.php?t=2124">Maintenance</a>

<a href="http://www.climateprediction.net/board/viewtopic.php?t=2126">Tests</a>

edit:
Sorry about the long strings. Carl has updated the server software.

copycat

Joined: 24 Feb 05
Posts: 28
Credit: 121,749
RAC: 0
Message 15866 - Posted: 11 Sep 2005, 1:01:19 UTC - in response to Message 15860.  
Last modified: 11 Sep 2005, 1:05:02 UTC

> <a href="http://www.climateprediction.net/board/viewtopic.php?t=2124">Maintenance</a>
>
> <a href="http://www.climateprediction.net/board/viewtopic.php?t=2126">Tests</a>
>
> edit:
> Sorry about the long strings. Carl has updated the server software.

You DO know this is a Linux-forum and those links are to exe-files, right? At least one of them seems rather un-wine-able.
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 15867 - Posted: 11 Sep 2005, 1:37:46 UTC - in response to Message 15866.  

> You DO know this is a Linux-forum and those links are to exe-files, right? At least one of them seems rather un-wine-able.

And you said you have both Windows and Linux on those PCs. Testing hardware to see if it's reliable in Windows should suffice for determining hardware stability in Linux for CPDN. And, while the Prime95 Windows executable might be linked from that thread, there is a Linux binary for Prime95 as well.
copycat

Joined: 24 Feb 05
Posts: 28
Credit: 121,749
RAC: 0
Message 15911 - Posted: 11 Sep 2005, 16:40:54 UTC - in response to Message 15867.  

> And you said you have both Windows and Linux on those PCs. Testing hardware to see if it's reliable in Windows should suffice for determining hardware stability in Linux for CPDN. And, while the Prime95 Windows executable might be linked from that thread, there is a Linux binary for Prime95 as well.

I believe I also said Windows XP is only on here to play certain games, and since I've still got lots of DVDs I need to watch on this PC (I've only got ONE PC, despite what it may say in my profile, and no stand-alone DVD player) I rarely start it up anymore. I've got something monitoring my hardware in Windows, from when I was trying to determine why I was unable to (re-)install one of my two Linux OSes, but it seems some buggy sectors on the XP partition were the cause. In fact, that's MBM version 5. However, although I tried, I was unable to set up logging, so I can't determine what goes wrong when that OS (XP) crashes. As long as it didn't crash, all parameters were within safe range. Also, yesterday, the link to the Prime95 tests was unavailable, and I searched but could not find the Prime95 Linux version whilst googling. Also, if something goes wrong: a) shouldn't S@H and E@H suffer too and b) can't it (CPDN) tell me WHAT's wrong in the logfile (or at least nudge me in the right direction) BEFORE it dumps the restart/rewind message?
copycat

Joined: 24 Feb 05
Posts: 28
Credit: 121,749
RAC: 0
Message 15947 - Posted: 12 Sep 2005, 21:25:37 UTC

I would also like to add that Prime95 DID find discrepancies whilst running its torture test, but a) they were found very quickly, much more quickly than CPDN even gives a rewind message, and b) alongside the torture tests an E@H WU was crunching and it did not give even the smallest hint of something going wrong!
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 15950 - Posted: 12 Sep 2005, 21:56:53 UTC - in response to Message 15947.  

> I would also like to add that Prime95 DID find discrepancies whilst running its torture test, but a) they were found very quickly, much more quickly than CPDN even gives a rewind message, and b) alongside the torture tests an E@H WU was crunching and it did not give even the smallest hint of something going wrong!

CPDN stresses both processor and memory. In particular, the memory is stressed much more in CPDN than in any other distributed computing project. If you have errors in Prime95, I have no doubt you will also eventually error out in CPDN.
copycat

Joined: 24 Feb 05
Posts: 28
Credit: 121,749
RAC: 0
Message 16002 - Posted: 14 Sep 2005, 20:05:00 UTC - in response to Message 15950.  

> In particular, the memory is stressed much more in CPDN than in any other distributed computing project. If you have errors in Prime95, I have no doubt you will also eventually error out in CPDN.

In that case, there's probably something wrong with my memory, which doesn't affect S@H or E@H because they don't take up as much memory as CPDN, but in CPDN it eventually does, and always at about the same time. Guess I'll have to suspend CPDN then, until I can get my memory modules checked and (probably) replaced.
