crash with code 251

Author	Message
old_user57798 Joined: 25 Feb 05 Posts: 6 Credit: 11,512 RAC: 0	Message 10220 - Posted: 2 Mar 2005, 15:52:23 UTC I've got boinc/climate running on a couple of windoz OSs and a linux box but am having trouble with a second linux box (GenuineIntel Intel(R) Pentium(R) 4 CPU 1700MHz). I've run mprime for 24 hrs on this machine without difficulties (as recommended elsewhere) but (with the same Debian version OS as the other one) I get "Model crash ....(process exited with code 251) .... Defering communication for 1 hr..." and no processing going on. Any suggestions? kyle ID: 10220

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 10247 - Posted: 2 Mar 2005, 20:17:17 UTC Hi kyle Are you using a network drive for the files? If so, this is a bad move. Les ID: 10247

old_user57798 Joined: 25 Feb 05 Posts: 6 Credit: 11,512 RAC: 0	Message 10295 - Posted: 3 Mar 2005, 12:10:59 UTC - in response to Message 10247. > Hi kyle > Are you using a network drive for the files? > If so, this is a bad move. > > Les > No. And now, through no change on my part (except to restart) the model is running (for about 24 hrs). I notice in the 'Nature' article that some parameter choices are unstable and cause the calculation to crash. Could the first set of model parameters on this machine have been unstable, and do I now have a different set of parameters? How often does something like that happen? Is there a way to tell if that happened? Thanks, kyle ID: 10295

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 10304 - Posted: 3 Mar 2005, 13:47:17 UTC Last modified: 3 Mar 2005, 13:47:54 UTC I've seen the 251 error code mentioned before, but I think the explanation was on the forum that is down. You're right about unstable parameter sets. My 1st model crashed with -5 error, a general 'catch all' code. Last year there was a spate of crashes that were called 'fast processing ice balls', because the earth, in the vis, showed ice everywhere early in the crunching. As it says <a href="http://www.climateprediction.net/science/strategy.php">here</a>, under experiment 1, the current models are about finding out what is stable and what isn't. So, keep crunching, and hopefully your new model will be successful. But if not, there's always someone here to help. Les ID: 10304

Thyme Lawn Volunteer moderator Joined: 5 Aug 04 Posts: 1283 Credit: 15,824,334 RAC: 0	Message 10305 - Posted: 3 Mar 2005, 13:55:19 UTC - in response to Message 10304. > I've seen the 251 error code mentioned before, but I think the explanation was > on the forum that is down. > You're right about unstable parameter sets. My 1st model crashed with -5 > error, a general 'catch all' code. Just a thought to throw into this. If you mask -5 down to a single byte and read it as unsigned you get ... 251 "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer ID: 10305
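The arithmetic in the post above can be checked in a few lines. On Linux a process's exit status is reported to its parent as a single unsigned byte, so a program that exits with -5 is seen from outside as 251. The sketch below is only an illustration in Python; the helper names `as_exit_status` and `as_signed_byte` are invented for this example and are not part of BOINC or the climate model.

```python
def as_exit_status(code: int) -> int:
    """Keep only the low 8 bits of an exit code, read as unsigned."""
    return code & 0xFF


def as_signed_byte(status: int) -> int:
    """Reinterpret an unsigned byte (0..255) as a signed one (-128..127)."""
    return status - 256 if status >= 128 else status


print(as_exit_status(-5))    # 251
print(as_signed_byte(251))   # -5
```

Read this way, the 251 kyle reported and the -5 'catch all' code Les saw could be the same error viewed through different lenses, although nothing in the thread confirms that the two crashes had the same cause.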

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2168 Credit: 64,541,825 RAC: 6,664	Message 10306 - Posted: 3 Mar 2005, 14:37:12 UTC - in response to Message 10305. > > If you mask -5 down to a single byte and read it as unsigned you get ... 251 > That's right. Astro and I were having this error in sulphur alpha in Linux. The yabsd.out file had negative theta detected errors. ID: 10306

astroWX Volunteer moderator Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0	Message 10773 - Posted: 12 Mar 2005, 18:19:29 UTC Last modified: 17 Mar 2005, 2:38:48 UTC Two recent -251 errors (after 55 straight successfully-completed Models). Bbox, SuSE 9.0, P4 3.0, ASUS P4P800 MB, 1 gig dual-channel CL2.0 mem. Model #0m6u_n...nnn error -251 @ Phase 1, T.S.114,913 From zipped residual yabsd.out: im,sm,ngroup,new_im,new_sm 1 1 48 T F NEGATIVE THETA AT POINT 1 LEVEL 18 ******************************************************************************* Model aborted with error code - 1 Routine and message:- ATM_DYN : NEGATIVE THETA DETECTED. ******************************************************************************* Negative Theta error was experienced twice in Sulfur Alpha tests, as George noted. They were also logged at Point 1, Level 18. (On different machines [Abox & Ebox, SuSE 9.0 & 9.1, P4 2.8 & P4 3.4], hence I'm not jumping to blame Bbox. Yet.) Tolu diagnosed it as a compiler problem, fell back, and released 4.10 for Sulfur Alpha Linux. If it's the same 4.10 we're running here, it's especially distressing to see the beast raise its ugly head again. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ [EDIT: Another Negative Theta! This machine had 15 straight completed CPDNboinc runs, now three failures in a row -- all while the parallel run on the P4 CPU continued on its merry way. -- This failure was identical to the one above and at almost the same point in the run; this one at T.S. 114193 rather than 114913. Well... Enough of that! Prime95 is crunching away on its seventh set of Self-Test iterations without error. I'll let it run overnight, after restarting boinc so the two CPDN Models will bring the CPU up to its usual operating temperature/load for the continuing test.] 
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ More on a Bbox Model crash reported earlier: Model #08fk_n...nnn error -251 @ Phase 1, T.S. 9073 From zipped residual yabsd.out: REPLANCA: UPDATE REQUIRED FOR FIELD 38 REPLANCA - time interpolation for field 38 time,time1,time2 1260.000 720.0000 1440.000 hours,int,period 1260 720 8640 Information used in checking ancillary data set: position of lookup table in dataset: 6 Position of first lookup table referring to data type 3 Interval between lookup tables referring to data type 3 Number of steps 1 STASH code in dataset 32 STASH code requested 32 'Start' position of lookup tables for dataset in overall lookup array 241 im,sm,ngroup,new_im,new_sm 1 1 48 T F NEGATIVE PRESSURE AT POINT 1837 NEGATIVE PRESSURE AT POINT 1838 NEGATIVE PRESSURE AT POINT 1839 NEGATIVE PRESSURE AT POINT 1933 NEGATIVE PRESSURE AT P "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. ID: 10773

astroWX Volunteer moderator Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0	Message 11032 - Posted: 17 Mar 2005, 15:05:53 UTC Re. the Edit in my post, below: Torture Test ran 13 hours, 56 minutes - 0 errors, 0 warnings. jim@Bbox:~/Desktop/Prime95> So, why the failures? A "bad" batch of parameter sets? Any other possibilities? "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. ID: 11032

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2168 Credit: 64,541,825 RAC: 6,664	Message 11036 - Posted: 17 Mar 2005, 16:08:49 UTC - in response to Message 11032. > Re. the Edit in my post, below: > > Torture Test ran 13 hours, 56 minutes - 0 errors, 0 warnings. > jim@Bbox:~/Desktop/Prime95> > > So, why the failures? A "bad" batch of parameter sets? Any other > possibilities? > > Probably a bad batch of parameter sets. Jim, on the Prime95 on Linux, were both processors running at 100%? On Windows, depending on what torture test I specify, I get a lot of writing to the page file (this was on the default torture test), resulting in not 100% CPU utilization. By choosing specific tests, I can max out the processor. ID: 11036

astroWX Volunteer moderator Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0	Message 11047 - Posted: 17 Mar 2005, 20:42:37 UTC - in response to Message 11036. Last modified: 17 Mar 2005, 20:44:37 UTC > > Re. the Edit in my post, below: > > > > Torture Test ran 13 hours, 56 minutes - 0 errors, 0 warnings. > > jim@Bbox:~/Desktop/Prime95> > > > > So, why the failures? A "bad" batch of parameter sets? Any other > > possibilities? > > > > > Probably a bad batch of parameter sets. > > Jim, on the Prime95 on Linux, were both processors running at 100%? On > Windows, depending on what torture test I specify, I get a lot of writing to > the page file (this was on the default torture test), resulting in not 100% > CPU utilization. By choosing specific tests, I can max out the processor. Hi, George, An hour, at most, ran stand-alone. I could have ginned-up a second copy of Prime95 but chose to restart CPDN and then stop one of the CPDN runs. So, a good 13 hours were with the CPU maxed-out. [Edit: No unusual disk activity noted.] More evidence for a bad batch, I think: Ebox upchucked its first ever run (not counting Sulfur_Alpha) -- with a negative Theta. Curious, after noting Ebox's Model-time of failure, I went back and looked at the date/time for the other three failures. Recall the Alpha problem where runs all failed at a 144-TS boundary? Well, all four of these runs failed on the first TS after a 144-TS boundary. Negative Theta @ TS 111457=13/05/1817 0030Z, TS 114193=10/07/1817 0030Z, TS 114913=25/07/1817 0030Z Negative Pressure @ TS 9073=10/06/1811 0030Z Looks familiar, eh? Jim "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. ID: 11047
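Jim's observation about the 144-TS boundary is easy to confirm with modulo arithmetic: a timestep that is the first one after a boundary leaves a remainder of 1 when divided by 144. The following Python sketch uses only the four timestep numbers quoted in the post above; the function name `steps_past_boundary` and the constant name are made up for illustration.

```python
STEPS_PER_BLOCK = 144  # the 144-TS boundary mentioned in the post

# Timestep of each failure and the error logged, as quoted in the thread.
failures = {
    111457: "Negative Theta",
    114193: "Negative Theta",
    114913: "Negative Theta",
    9073: "Negative Pressure",
}


def steps_past_boundary(ts: int, block: int = STEPS_PER_BLOCK) -> int:
    """Number of timesteps elapsed since the last multiple of `block`."""
    return ts % block


for ts, kind in failures.items():
    print(f"TS {ts:>6} ({kind}): {steps_past_boundary(ts)} step(s) past a boundary")
```

Every one of the four timesteps gives a remainder of 1, which matches the claim that each run failed on the first TS after a 144-TS boundary. The check says nothing about why the failures cluster there.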

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2168 Credit: 64,541,825 RAC: 6,664	Message 11054 - Posted: 18 Mar 2005, 1:52:36 UTC - in response to Message 11047. > Curious, after noting Ebox's Model-time of failure, I went back and looked at > the date/time for the other three failures. Recall the Alpha problem where > runs all failed at a 144-TS boundary? Well, all four of these runs failed on > the first TS after a 144-TS boundary. > > Negative Theta @ TS 111457=13/05/1817 0030Z, TS 114193=10/07/1817 0030Z, TS > 114913=25/07/1817 0030Z > Negative Pressure @ TS 9073=10/06/1811 0030Z > > Looks familiar, eh? > > Jim > Hmmm. Did these download 4.11 slab? Something's going on. See <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=2237">this post</a> for additional suspicious problems. You may want to e-mail Tolu. I'd hate to think 4.11 is causing all this. ID: 11054

astroWX Volunteer moderator Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0	Message 11057 - Posted: 18 Mar 2005, 2:54:46 UTC - in response to Message 11054. Last modified: 18 Mar 2005, 16:12:48 UTC > Hmmm. Did these download 4.11 slab? Something's going on. See <a href="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=2237">this post</a> for additional suspicious problems. You may want to e-mail Tolu. > I'd hate to think 4.11 is causing all this. In the Slots in both Bbox & Ebox where the fatalities were run, the replacements have 4.11. (The remaining active Slot on Bbox is 4.04, running okay. On Ebox, it's Sulfur_Alpha.) I don't know how to determine from the corpses' three residual files which version ran them. [To be sure, current Models activated from the same group were suspended -- so, Ebox runs a Sulfur Alpha Model stand-alone and Bbox runs a 4.04 Model stand-alone. That leaves wasted P4 resources but, at least, they aren't trying to pour more water up a rope.] Much as I hate to add a brick to Tolu's hod, I'll send an email. Edit: Tolu replied with thanks and said he's looking into it now. "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. ID: 11057

astroWX Volunteer moderator Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0	Message 11340 - Posted: 24 Mar 2005, 2:46:33 UTC Last modified: 24 Mar 2005, 2:48:30 UTC One more time -- fifth 4.11 crash, this time on Dbox, third of my machines to have a "hardware failure" with this version. Increasingly disgusting that there is no information about a potential solution. [Edit: Corrected a typo.] Is this affecting only the minority of us posting the errors? Or is it widespread and unnoticed, ignored, or silently driving people away from CPDN? Once again, on a 144-TS boundary. (Yawn.) From residual zipped files: {PH}1{/PH} {TS}127153{/TS} {DAY}10{/DAY} {MTH}4{/MTH} {YR}1818{/YR} {HR}0{/HR} {MIN}30{/MIN} MODEL DUMP SUCCESSFULLY WRITTEN - 1292182 WORDS TO UNIT 22 Number of Words Written to Disk was 1292678 im,sm,ngroup,new_im,new_sm 1 1 48 T F NEGATIVE THETA AT POINT 1 LEVEL 18 ******************************************************************************* Model aborted with error code - 1 Routine and message:- ATM_DYN : NEGATIVE THETA DETECTED. ******************************************************************************* "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. ID: 11340
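For anyone wanting to check their own residual files the same way, the tagged status line quoted above can be pulled apart with a regular expression. This is a rough Python sketch that assumes the line always has the `{NAME}value{/NAME}` shape shown in this one post and that every value is an integer; the names `TAG_RE` and `parse_status` are hypothetical and the real file format may differ.

```python
import re

# Matches {NAME}value{/NAME}; the backreference \1 forces the closing
# tag to carry the same name as the opening tag.
TAG_RE = re.compile(r"\{(\w+)\}(.*?)\{/\1\}")


def parse_status(line: str) -> dict:
    """Turn a tagged status line into a dict of integer fields."""
    return {name: int(value) for name, value in TAG_RE.findall(line)}


# The status line exactly as quoted in the post.
line = ("{PH}1{/PH} {TS}127153{/TS} {DAY}10{/DAY} {MTH}4{/MTH} "
        "{YR}1818{/YR} {HR}0{/HR} {MIN}30{/MIN}")

status = parse_status(line)
print(status["TS"], status["TS"] % 144)
```

For this line the timestep is 127153 and the remainder after dividing by 144 is 1, so this crash also falls on the first timestep after a 144-TS boundary.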

cetus Joined: 7 Aug 04 Posts: 9 Credit: 139,753,972 RAC: 19,927	Message 11462 - Posted: 27 Mar 2005, 3:46:46 UTC - in response to Message 11340. > One more time -- fifth 4.11 crash, this time on Dbox, third of my machines to > have a "hardware failure" with this version. Increasingly disgusting that > there is no information about a potential solution. > > [Edit: Corrected a typo.] > > Is this affecting only the minority of us posting the errors? Or is it > widespread and unnoticed, ignored, or silently driving people away from CPDN? > I appear to have the same problem too. I've had three 4.11 crashes in phase one on a P4 linux box. The same machine is still successfully running a 4.04 model, so a hardware issue seems unlikely. ID: 11462