climateprediction.net home page
Problems after Climate site down

old_user65973
Joined: 21 Mar 05
Posts: 13
Credit: 1,886
RAC: 0
Message 11531 - Posted: 31 Mar 2005, 11:36:11 UTC
Last modified: 2 Apr 2005, 11:21:28 UTC

I run boinc under SuSE 9.0. I have my time evenly divided between seti@home and climateprediction.net.

Everything appeared to be running OK, and still does on seti@home. But it appears that the climateprediction site was down or inaccessible to me for a couple of days. After a day or so of this, boinc was still getting and processing data from seti@home but was unable to connect to climateprediction. I stopped boinc and restarted it, telling it to stop when it was done with the current seti data. When it stopped, I ran the old seti@home for about 3 days. When the climate site came back up yesterday, I had the old seti@home stop when it finished the current data set. Then I restarted boinc. Now all of my results on the climate site say "client error". I figured the software was smart enough to pick up where it left off when the site went down. This did not happen to the seti results: boinc appears to have picked up at seti right where it left off and is crunching away. Any thoughts on what is wrong on the climate side and how to fix it?

This is still happening to every work unit. They all end due to client error.
Looking at the error codes I see that they were all 26 or 251.

<core_client_version>4.19</core_client_version>
process exited with code 251 (0xfb)10
No heartbeat from core client for 31 sec - exiting
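
The last log line suggests that the science application watches for periodic heartbeats from the core client and exits when they stop. A minimal sketch of that kind of watchdog follows. The 30-second limit is an assumption inferred from the "31 sec" in the message, not taken from the BOINC source, and the class and method names are hypothetical:

```python
import time


class HeartbeatWatchdog:
    """Tracks the time since the last heartbeat from a supervising process."""

    def __init__(self, timeout_sec=30.0, clock=time.monotonic):
        # timeout_sec is an assumed value, not BOINC's documented setting.
        self.timeout_sec = timeout_sec
        self.clock = clock
        self.last_beat = clock()

    def beat(self):
        """Record that a heartbeat just arrived."""
        self.last_beat = self.clock()

    def silence(self):
        """Seconds elapsed since the last heartbeat."""
        return self.clock() - self.last_beat

    def expired(self):
        """True once the silence exceeds the timeout."""
        return self.silence() > self.timeout_sec
```

A worker loop would call `beat()` whenever the supervisor checks in and stop cleanly once `expired()` returns true. Injecting the clock keeps the sketch easy to test.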


ID: 11531

old_user65973
Joined: 21 Mar 05
Posts: 13
Credit: 1,886
RAC: 0
Message 11570 - Posted: 2 Apr 2005, 11:30:27 UTC

I have searched the site looking for the meaning of error code 251 and cannot find it. Does anyone have any idea why this is happening to every work unit?
ID: 11570

Andrew Hingston
Volunteer moderator
Joined: 17 Aug 04
Posts: 753
Credit: 9,804,700
RAC: 0
Message 11572 - Posted: 2 Apr 2005, 13:02:43 UTC

Error 251 seems to be a variant of error -5 as in <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=2348">this thread</a>. It is difficult to give clear advice, because it could be hardware, OS, program incompatibility, etc. But it could also be a problem with the CPDN client - 4.12 has only been released recently to fix other problems.
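
One way to see why 251 and -5 may be the same error: a process exit status is commonly reported as a single unsigned byte, so a negative return code wraps around modulo 256. The sketch below shows only that arithmetic. The helper names are illustrative and not part of BOINC, and it is an assumption that this wrap-around is what produced the 251 seen here:

```python
def as_exit_byte(code):
    """Map a signed return code to the 0-255 value a shell would report."""
    return code % 256


def as_signed_byte(status):
    """Reinterpret a 0-255 exit status as a signed 8-bit value."""
    return status - 256 if status >= 128 else status


print(as_exit_byte(-5))       # 251
print(hex(as_exit_byte(-5)))  # 0xfb
print(as_signed_byte(251))    # -5
```

The 0xfb value matches the hexadecimal code shown in the log in the first post.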
ID: 11572

old_user65973
Joined: 21 Mar 05
Posts: 13
Credit: 1,886
RAC: 0
Message 11573 - Posted: 2 Apr 2005, 14:38:20 UTC
Last modified: 2 Apr 2005, 14:39:15 UTC

Andrew,

Thank you for the response. I have run seti@home classic for over 4000 work units without a problem. seti@home under boinc runs fine. But the Climateapps seem to error out on every work unit. None have gone without error. Should I just stop this until a new version of the client software comes out? If I continue to run it like this, will it screw up things on the science end? When a new version of the client comes out, will boinc download it and use it automagically, or do I need to do something?

Thanks Steve
ID: 11573

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 11576 - Posted: 2 Apr 2005, 19:19:43 UTC
Last modified: 2 Apr 2005, 19:23:22 UTC

Hi, Stephen,

I see that you had four Trickles from the first WU. Judging from the Model number, it was possibly one of the not-so-good WU. Yesterday's failure might be from the same WU set. Not today's, though. (If the failures leave a RunID Directory in Projects Directory with 3 files, the end of the zipped yabsd.out file may have the reason for the failure, if Negative Pressure or Negative Theta.)
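
The check described above can be sketched in a few lines of Python. This assumes the file is a standard zip archive with a member whose name ends in `yabsd.out` (it may instead be gzip-compressed, in which case the `gzip` module would be needed). The function names and the 20-line tail are illustrative choices, not anything defined by CPDN:

```python
import zipfile

# The two failure reasons named in the post above.
FAILURE_MARKERS = ("NEGATIVE PRESSURE", "NEGATIVE THETA")


def tail_of_zipped_log(zip_path, member_suffix="yabsd.out", n_lines=20):
    """Return the last n_lines of the first member ending in member_suffix."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(member_suffix):
                text = archive.read(name).decode("ascii", errors="replace")
                return text.splitlines()[-n_lines:]
    return []


def find_failure(lines):
    """Return the first known failure marker found in lines, or None."""
    for line in lines:
        upper = line.upper()
        for marker in FAILURE_MARKERS:
            if marker in upper:
                return marker
    return None
```

Calling `find_failure(tail_of_zipped_log(path))` on the archive in the RunID directory returns one of the two markers when the model ended that way, and `None` otherwise.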

Is your Athlon overclocked? CPDN hammers a machine, both CPU and HD, and is apt to fail, whereas Projects with short WU get through okay. Overclocked machines are especially vulnerable. Verifying with Prime95 is a good idea.

Folks running two or more Projects report a lot of the problems we see on the Boards. Do you have your Preferences option (Edit: in "Your account") set to leave in memory when suspended? (It's a good idea for CPDN.)

Looks like your OS could be SuSE 9.0. Should not be problems there.

Which boinc version? How much memory?
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 11576

old_user65973
Joined: 21 Mar 05
Posts: 13
Credit: 1,886
RAC: 0
Message 11577 - Posted: 2 Apr 2005, 20:51:09 UTC

Boinc version is 4.19. My cpu is an AMD xp2400+ and is not overclocked. I have 1GB of PC2700 ddr memory. I will look at the files you mentioned if it fails again, and I will post the data here.

As far as folks with Boinc running more than one project having problems, I thought that was what Boinc was for?

Thanks Steve
ID: 11577

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 11582 - Posted: 3 Apr 2005, 5:30:34 UTC - in response to Message 11577.  

> Boinc version is 4.19. My cpu is an AMD xp2400+ and is not overclocked. I
> have 1GB of PC2700 ddr memory. I will look at the files you mentioned if it
> fails again, and I will post the data here.
>
> As far as folks with Boinc running more than one project having problems, I
> thought that was what Boinc was for?
>
> Thanks Steve

Hi, Steve,

To be sure. And in my case, running P4s, boinc allows parallel CPDN runs -- something we couldn't do in Classic CPDN, thanks to M$ Registry limitations. (There were no Linux or Mac versions in Classic.)

You have a heavy setup and I don't see an obvious problem. From what I've read over time on these Boards, though, some AMD rigs have problems with CPDN, though they more than meet Specs required to run this beast. ... Someone with more tech savvy than me will have to wade in to help.

This creature we run is a million-plus-line Fortran program developed over decades by climate scientists to run on super-computers. (In fact, the British Met. Office runs it on such machines for daily forecasts.) That it was ported and runs on PCs at all, I find quite amazing. Perhaps we shouldn't be surprised that some hardware combinations have difficulties -- while similar machines continually turn out successfully completed Models.

I hope you find the culprit and are able to stay with the Project.

Regards,
Jim
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 11582

old_user65973
Joined: 21 Mar 05
Posts: 13
Credit: 1,886
RAC: 0
Message 11587 - Posted: 3 Apr 2005, 12:04:50 UTC

I found the file. At the end it said:

*********************************************************************************
Model aborted with error code - 1 Routine and message:-
P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED.
*********************************************************************************

Steve
ID: 11587

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 11593 - Posted: 3 Apr 2005, 18:16:26 UTC - in response to Message 11577.  
Last modified: 3 Apr 2005, 18:26:54 UTC

> Boinc version is 4.19. My cpu is an AMD xp2400+ and is not overclocked. I
> have 1GB of PC2700 ddr memory. I will look at the files you mentioned if it
> fails again, and I will post the data here.
>
> As far as folks with Boinc running more than one project having problems, I
> thought that was what Boinc was for?
>
> Thanks Steve

Hi, Steve,

To be sure. ...

[Edit. Now I see that this WAS posted last evening, so I removed the Body of text. No evidence of successful posting was given and I couldn't connect with any other part of the BB. Odd.]

(Rats. The Board went down while I wrote this ... Sorry for the delay.)

Wrote that last night, US Pacific Coast time. Just saw your "Negative Pressure" post. That confirms that at least that one Model was from the bad batch.

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 11593

old_user65973
Joined: 21 Mar 05
Posts: 13
Credit: 1,886
RAC: 0
Message 11597 - Posted: 3 Apr 2005, 20:40:29 UTC

Jim,

I looked at the latest run and it is looking good. It may be the case that I got 6 consecutive bad batches. What luck. I hope that this is the case, as this was making me crazy. I could not find anything wrong on my end. You won't believe how many hours I devoted to going through my computer with a fine-tooth comb trying to find something wrong. I'll keep you posted.

Thank You very much for your help. I'll let you know how this goes.
Stephen Hawkins NG0G
ID: 11597

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 11598 - Posted: 3 Apr 2005, 21:13:06 UTC - in response to Message 11597.  

> Jim,
>
> I looked at the latest run and it is looking good. It may be the case that I
> got 6 consecutive bad batches. What luck. I hope that this is the case, as
> this was making me crazy. I could not find anything wrong on my end. You
> won't believe how many hours I devoted to going through my computer with a
> fine-tooth comb trying to find something wrong. I'll keep you posted.
>
> Thank You very much for your help. I'll let you know how this goes.
> Stephen Hawkins NG0G


Pleased to see it, Steve. Best of luck.

Jim
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 11598

old_user65973
Joined: 21 Mar 05
Posts: 13
Credit: 1,886
RAC: 0
Message 11605 - Posted: 4 Apr 2005, 16:18:26 UTC

Although it went a lot farther this time, it happened again. But this time with a different error msg. See Below:

Result ID: 694277 Name 1wgy_300109643_0

*********************************************************************************
Model aborted with error code - 1 Routine and message:-
ATM_DYN : NEGATIVE THETA DETECTED.
*********************************************************************************
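
For anyone collecting several of these banners, a small parser can split each one into the routine name and the message. This is a sketch that assumes the banner always has the three-line shape seen in this thread, with the routine and message on the line after "Model aborted". The function name and regular expression are illustrative only:

```python
import re

# Matches lines such as "ATM_DYN : NEGATIVE THETA DETECTED."
ABORT_LINE = re.compile(r"^\s*([A-Z0-9_]+)\s*:\s*(.+?)\s*$")


def parse_abort_banner(text):
    """Return (routine, message) from the line after 'Model aborted', or None."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("Model aborted") and i + 1 < len(lines):
            match = ABORT_LINE.match(lines[i + 1])
            if match:
                return match.group(1), match.group(2)
    return None
```

Applied to the two banners posted in this thread, it would give `P_TH_ADJ` with the negative pressure message and `ATM_DYN` with the negative theta message.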
ID: 11605

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 11606 - Posted: 4 Apr 2005, 17:56:27 UTC - in response to Message 11605.  

> Although it went a lot farther this time, it happened again. But this time
> with a different error msg. See Below:
>
> Result ID: 694277 Name 1wgy_300109643_0
>
> *********************************************************************************
> Model aborted with error code - 1 Routine and message:-
> ATM_DYN : NEGATIVE THETA DETECTED.
> *********************************************************************************
>
Hmmm. More bad news about the current Linux version. Apparently, it's unstable, too. In a message from Tolu replying to my Email about a similar problem in Alpha, he stated that it is his #1 priority. That's good news, given the many high-priority things he has to do.
<a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=2353"> See this Thread </a>
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 11606

old_user65973
Joined: 21 Mar 05
Posts: 13
Credit: 1,886
RAC: 0
Message 11609 - Posted: 4 Apr 2005, 19:37:56 UTC

Well I guess the last two pieces of information that I need are:

1. Should I stop running this until a new, stable version is out, and if so, how will I know? I mean, are these repeated aborts due to "client error" screwing up your data?

2. Is boinc smart enough to see that there is a new version of your software, and download and install it without intervention from me?
ID: 11609

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 11616 - Posted: 5 Apr 2005, 0:15:13 UTC - in response to Message 11609.  
Last modified: 5 Apr 2005, 0:21:21 UTC

> Well I guess the last two pieces of information that I need are:
>
> 1. Should I stop running this until a new, stable version is out, and if so,
> how will I know? I mean, are these repeated aborts due to "client error"
> screwing up your data?
>
> 2. Is boinc smart enough to see that there is a new version of your software,
> and download and install it without intervention from me?

Hi, again, Steve,

We have to stop meeting like this; people will talk!

Seriously, though, one of my machines crashed and downloaded 4.13 and a new Workunit about a half hour ago.

I have no information on this release -- haven't seen a post here or on the Alpha BB yet. ... at least it is new and hope springs eternal. Or some such thing.

Edit: Oops. Re. your #2, in the course of processing a completed run, normal or crashed, the new version will be detected and downloaded. Or, you can force the issue with -detach_project, then go through the -attach_project drill again. (You'll get a new machine ID in that process and have to do a "merge machines" drill to put the pieces together.)

Jim
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 11616

old_user65973
Joined: 21 Mar 05
Posts: 13
Credit: 1,886
RAC: 0
Message 11640 - Posted: 5 Apr 2005, 22:45:00 UTC - in response to Message 11616.  

>
> Hi, again, Steve,
>
> We have to stop meeting like this; people will talk!

Jim,

I know. I just heard about this on the BBC World News on the 49 meter band, and, dare I say it, Foxnews, and CNN. What will we do when Mom hears about this????

Seriously, I am still cooking along on the new data but on 4.12. I will keep you posted and try not to let the media know.

Secret password = 1.4142135 * .707

Thank You,
73 49 111 01001001
Stephen Hawkins NG0G
ID: 11640

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,487,919
RAC: 4,541
Message 11642 - Posted: 5 Apr 2005, 23:56:25 UTC - in response to Message 11616.  

> Edit: Oops. Re. your #2, in the course of processing a completed run, normal
> or crashed, the new version will be detected and downloaded. Or, you can
> force the issue with -detach_project, then go through the -attach_project
> drill again. (You'll get a new machine ID in that process and have to do a
> "merge machines" drill to put the pieces together.)
>
He should be able to do a -reset_project and then won't have to merge any hosts. At least it's worked that way for me.

George
ID: 11642

old_user65973
Joined: 21 Mar 05
Posts: 13
Credit: 1,886
RAC: 0
Message 11669 - Posted: 6 Apr 2005, 21:57:03 UTC

George,

Thank You. I will give that a shot.

Steve NG0G
73 49 111 01001001
ID: 11669

©2024 climateprediction.net