Reporting - Errors while computing -

Byron Leigh Hatch @ team Carl ... Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0
Message 45195 - Posted: 29 Oct 2012, 9:38:52 UTC
Reporting - Errors while computing -
I would like to report the following - Errors while computing - in case it would give Andy and Jonathan some information they need:
Task 15408322
Task 15407533
Task 15407290
Task 15404774
Task 15402557
Task 15401805
all with the following Stderr file:
<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
The device does not recognize the command. (0x16) - exit code 22 (0x16)
</message>
<stderr_txt>
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
Sorry, too many model crashes! :-(
Called boinc_finish
</stderr_txt>
]]>
I hope this helps,
Byron

Iain Inglis Volunteer moderator Joined: 16 Jan 10 Posts: 1081 Credit: 6,980,320 RAC: 3,893
Message 45196 - Posted: 29 Oct 2012, 11:28:49 UTC - in response to Message 45195.
Thanks, Byron. Reports of models failing with "REPLANCA" errors have been passed on to the project team, and the cause is currently being investigated.

DouglasRH Joined: 21 Jan 09 Posts: 1 Credit: 615,987 RAC: 0
Message 45306 - Posted: 4 Dec 2012, 1:48:01 UTC
Hi, I have BOINC with ClimatePrediction.net running with no problems on my quad-core x86 / Vista 32 machine. However, when I try to run it on my hex-core x64 machine with Windows 7 x64, all I get are computation errors: Exit status 22 (0x16), communications deferred for an hour, and 'No work available to process'. I've tried extensive computing preference changes, including GPU on/off etc. Nothing works.
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1255061
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=15468768
Thanks for any and all assistance.
Regards, DougRH

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0
Message 45307 - Posted: 4 Dec 2012, 2:44:08 UTC - in response to Message 45306.
The programs here are 32-bit, and require 32-bit libraries on any 64-bit OS.
You're getting no new work from the project because there isn't any. There's a thread in this section of the board called "Project has no tasks available". Rambles a bit near the end, but ...
The Server Status page has a link in the blue menu to the left of here, 5 from the bottom.
Backups: Here

Byron Leigh Hatch @ team Carl ... Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0
Message 45479 - Posted: 19 Jan 2013, 2:29:20 UTC Last modified: 19 Jan 2013, 2:40:07 UTC
- approx. 2 hours ago I had the following - Full Resolution Ocean v6.07 Model Crash at 45%
so I would just like to report the following - Errors while computing - in case it might give Andy, Jonathan and the crew at Oxford ... some hints or information they might need ?? or is this crash the fault of me or my computer ??
hadcm3n_38o6_1940_40_008261906_1
Workunit 8417030
Created 9 Jan 2013 13:53:48 UTC
Sent 9 Jan 2013 13:54:16 UTC
Received 18 Jan 2013 23:43:06 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 22 (0x16)
Computer ID 1167855
Report deadline 10 Apr 2013 21:21:27 UTC
Run time 811,078.29
CPU time 645,155.60
Validate state Invalid
Claimed credit 0.00
Granted credit 5,598.72
application version UK Met Office Coupled Model Full Resolution Ocean v6.07
Stderr
<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
The device does not recognize the command. (0x16) - exit code 22 (0x16)
</message>
<stderr_txt>
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048
Sorry, too many model crashes! :-(
Called boinc_finish
</stderr_txt>
]]>
18/01/2013 3:41:45 PM | climateprediction.net | Computation for task hadcm3n_38o6_1940_40_008261906_1 finished
18/01/2013 3:41:45 PM | climateprediction.net | Output file hadcm3n_38o6_1940_40_008261906_1_2.zip for task hadcm3n_38o6_1940_40_008261906_1 absent
18/01/2013 3:41:45 PM | climateprediction.net | Output file hadcm3n_38o6_1940_40_008261906_1_3.zip for task hadcm3n_38o6_1940_40_008261906_1 absent
18/01/2013 3:41:45 PM | climateprediction.net | Output file hadcm3n_38o6_1940_40_008261906_1_4.zip for task hadcm3n_38o6_1940_40_008261906_1 absent
I hope this helps,
Byron

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0
Message 45480 - Posted: 19 Jan 2013, 3:00:53 UTC - in response to Message 45479.
Hi Byron
Invalid Theta is when the physics goes wrong, and built-in checks stop the model. It's what the researchers are looking for, so that they know what the result is of starting a model with the values that it was given.
So, No Worries. :)
Backups: Here

Byron Leigh Hatch @ team Carl ... Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0
Message 45488 - Posted: 23 Jan 2013, 19:33:59 UTC - in response to Message 45480. Last modified: 23 Jan 2013, 19:37:50 UTC
> Hi Byron
> Invalid Theta is when the physics goes wrong, and built-in checks stop the model. It's what the researchers are looking for, so that they know what the result is of starting a model with the values that it was given.
> So, No Worries. :)
Hi Les
I have been away for a couple of days ... so just now reading your message.
aha ... ok thanks for that info and that explanation ... I understand now :) it's good to hear that this might provide the project team with info that could be useful to them.
Byron

old_user490835 Joined: 23 Dec 07 Posts: 3 Credit: 682,099 RAC: 0
Message 45548 - Posted: 12 Feb 2013, 20:00:48 UTC
i did get some window telling me "library of c++ has crashed" etc and there goes my 80 hours of computing :(
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=15599675
Client state Compute error
Exit status 22 (0x16)
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=2096, iMonCtr=1
Model crash detected, will try to restart...
Sorry, too many model crashes! :-(
Called boinc_finish
any ideas ? my machine broke or the model (wu) ?

Greg van Paassen Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0
Message 45549 - Posted: 12 Feb 2013, 21:44:25 UTC - in response to Message 45548. Last modified: 12 Feb 2013, 21:49:05 UTC
Looking at the tasks page for your computer, GuruFin, there has been a great variety of reasons for models crashing on your computer. Possibly there is more than one issue.
For best results with climate models, which stress the CPU and memory more heavily than almost anything else, and which are fussy about disk access, the following are recommended, in this order:
1. Do not overclock.
2. Ensure that your virus scanner excludes the Boinc data folder and all sub-folders. (That is the folder with two sub-folders "projects" and "slots".)
3. In Boinc preferences - disk and memory usage, ensure that "leave applications in memory when suspended" is selected, and allow Boinc to use up to 75% of memory. (At least 1 GB per running task is best; mostly 500MB works too. Mostly.) Also ensure that Boinc has enough disk space, 2 GB per CPU at least.
4. Shut down Boinc (suspend all work) when you play games that have demanding video requirements.
5. For a multi-processor system such as yours, in processor usage preferences, set Boinc to "Use at most 100% of CPU time", and control the amount of work with for example "use at most 75 % of processors" (change the 75 to whatever you like).
If you have done these, and are still getting errors, your RAM may be running out of specification. Run a memory test program such as memtest86+ for at least 48 hours to check. Alternatively, the power supply for your computer may be unable to supply enough power, or the motherboard is using the not-recommended "voltage boost" feature that some have.
Edit: I should point out that there will still be some apparent failures even after doing all of this. Some climate models fail because they generate physically impossible atmospheric pressures or potential temperatures. A few others have been sent out with the wrong data files - these normally crash straight away, though.
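The memory and disk rules of thumb in the post above (at least 1 GB of RAM per running task, 500 MB as a bare minimum, and 2 GB of disk per CPU) can be turned into a quick sanity check. This is only a sketch: `BoincBudget`, `running_tasks` and `check_budget` are hypothetical names made up for illustration, not part of BOINC, and the assumption that one model runs per allowed CPU (rounded down) is mine rather than something stated in the thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BoincBudget:
    """Hypothetical container for the preference values discussed above."""
    total_ram_gb: float      # RAM installed in the machine
    ram_pct_allowed: float   # "use at most N% of memory"
    disk_allowed_gb: float   # disk space Boinc is allowed to use
    cpu_count: int           # CPUs visible to Boinc
    cpu_pct_allowed: float   # "use at most N% of processors"


def running_tasks(b: BoincBudget) -> int:
    # Assumption: one model per allowed CPU, rounded down, never fewer than one.
    return max(1, int(b.cpu_count * b.cpu_pct_allowed / 100))


def check_budget(b: BoincBudget) -> List[str]:
    """Return warnings for settings below the thresholds quoted in the post."""
    warnings = []
    ram_per_task = b.total_ram_gb * b.ram_pct_allowed / 100 / running_tasks(b)
    disk_per_cpu = b.disk_allowed_gb / b.cpu_count
    if ram_per_task < 0.5:
        warnings.append(f"only {ram_per_task:.2f} GB RAM per task (below 500 MB)")
    elif ram_per_task < 1.0:
        warnings.append(f"{ram_per_task:.2f} GB RAM per task (works, mostly; 1 GB is best)")
    if disk_per_cpu < 2.0:
        warnings.append(f"only {disk_per_cpu:.2f} GB disk per CPU (2 GB recommended)")
    return warnings


if __name__ == "__main__":
    # Example values are invented: 16 GB installed, half of it allowed,
    # 20 GB of disk, 8 CPUs with 75% of processors in use.
    for line in check_budget(BoincBudget(16, 50, 20, 8, 75)) or ["settings look fine"]:
        print(line)
```

A check like this only covers items 3 and 5 of the list; overclocking, virus scanner exclusions and power supply problems still have to be ruled out by hand.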

old_user490835 Joined: 23 Dec 07 Posts: 3 Credit: 682,099 RAC: 0
Message 45550 - Posted: 12 Feb 2013, 22:13:08 UTC - in response to Message 45549. Last modified: 12 Feb 2013, 22:19:11 UTC
> Looking at the tasks page for your computer, GuruFin, there has been a great variety of reasons for models crashing on your computer. Possibly there is more than one issue.
> 1. Do not overclock.
little overclock ? :) (4.2ghz turbo freq)
> 2. Ensure that your virus scanner excludes the Boinc data folder and all sub-folders. (That is the folder with two sub-folders "projects" and "slots".)
not interfering.. no problems
> 3. In Boinc preferences - disk and memory usage, ensure that "leave applications in memory when suspended" is selected, and allow Boinc to use up to 75% of memory. (At least 1 GB per running task is best; mostly 500MB works too. Mostly.) Also ensure that Boinc has enough disk space, 2 GB per CPU at least.
been already this way or better (boinc can use 8gb of mem @16gb installed)
> 4. Shut down Boinc (suspend all work) when you play games that have demanding video requirements.
yep.. done this way too
> 5. For a multi-processor system such as yours, in processor usage preferences, set Boinc to "Use at most 100% of CPU time", and control the amount of work with for example "use at most 75 % of processors" (change the 75 to whatever you like).
yep.. done this way too
> If you have done these, and are still getting errors, your RAM may be running out of specification. Run a memory test program such as memtest86+ for at least 48 hours to check. Alternatively, the power supply for your computer may be unable to supply enough power, or the motherboard is using the not-recommended "voltage boost" feature that some have.
my memory modules are running at 1066mhz (not 1333mhz as specs say) because i like a stable pc. (memtest runs fine and "intel burn test" too)
> Edit: I should point out that there will still be some apparent failures even after doing all of this. Some climate models fail because they generate physically impossible atmospheric pressures or potential temperatures. A few others have been sent out with the wrong data files - these normally crash straight away, though.
that's why i asked for an opinion on my failed tests :) maybe my issue is a corrupt "c++".dll ? but what is curious is that my wu's do run 60-120 hours fine and then fail suddenly...
thank you !

Greg van Paassen Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0
Message 45551 - Posted: 12 Feb 2013, 22:53:58 UTC - in response to Message 45550.
Intel's turbo boost won't be a problem. According to Intel's literature it operates for up to two or three seconds when one process is using a lot of one core and the other cores are idle. In this situation the chip won't get too hot and unstable. But if you are running more than one CPDN model at a time, turbo boost won't operate. And even with only one model, it will exceed the "a few seconds" time limit, so the CPU will cycle: a few seconds on turbo, then 10 or 20 seconds at normal speed, turbo for 2 or 3 seconds, back to normal... Interesting to watch, if you like that kind of thing.
Manually overclocking to the turbo boost frequency is not recommended. Together with underclocking the RAM, you may get just the results you are seeing. Now I've given you the overclocking lecture. :-)
Several other people have reported the C++ DLL crash over the last few years (that I have seen). Solving the problem was always difficult. Sometimes the problem was blamed on video drivers, but I can't remember whether ATI/AMD or Nvidia is the bigger suspect. Sometimes the screen saver was suspected, or other software such as Microsoft SQL Server, which will try to grab all the memory for itself. Sometimes a corrupt download of the BOINC software was suspected.
If you are confident in your video card, its drivers, and in the power supply, then the way forward is probably to disconnect from CPDN and all other projects, uninstall boinc, delete its data folder and program folder via windows explorer, download a fresh copy of boinc, and re-connect to CPDN.
But that may not work either. Some combinations of CPU, RAM, and motherboard just seem less reliable. I had a core i3 (Clarkdale) on a Gigabyte H55 board with Hynix memory that was like that. Worked perfectly for everything except CPDN.

old_user490835 Joined: 23 Dec 07 Posts: 3 Credit: 682,099 RAC: 0
Message 45554 - Posted: 15 Feb 2013, 4:35:37 UTC - in response to Message 45551.
thanks for the good answers :) i'll do some crunching and adjust my system to see if it's stable enough for cpdn :)
i did previously (a month ago) have a different mb and processor (i2550k/p8p67) but now have a new setup (i3770k/sabertooth), so let's see if that helps or not :)

Byron Leigh Hatch @ team Carl ... Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0
Message 45616 - Posted: 7 Mar 2013, 16:17:03 UTC Last modified: 7 Mar 2013, 16:51:07 UTC
- Hello everyone
I would like to report the following - Errors while computing - in case it might give the project team some information or clues they might need ? maybe someone could pass the errors on to the project team and the cause could be investigated ?
the same Model crashed twice with the same <stderr_txt> file - with two different computers.
application version UK Met Office Coupled Model Full Resolution Ocean v6.07
name hadcm3n_zipn_2000_40_008323389
Workunit 8474524
application UK Met Office Coupled Model Full Resolution Ocean v6.07
created 2 Mar 2013 1:57:59 UTC
my computer 1167855
on this computer I'm only running one project - only Climate Prediction.net - 24/7/365 - and no GPU apps - 8 physical CPU - no Hyper threading -
Stderr
<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
The device does not recognize the command. (0x16) - exit code 22 (0x16)
</message>
<stderr_txt>
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
Sorry, too many model crashes! :-(
Called boinc_finish
</stderr_txt>
]]>
07/03/2013 12:13:18 AM | climateprediction.net | Computation for task hadcm3n_zipn_2000_40_008323389_1 finished
07/03/2013 12:13:18 AM | climateprediction.net | Output file hadcm3n_zipn_2000_40_008323389_1_1.zip for task hadcm3n_zipn_2000_40_008323389_1 absent
07/03/2013 12:13:18 AM | climateprediction.net | Output file hadcm3n_zipn_2000_40_008323389_1_2.zip for task hadcm3n_zipn_2000_40_008323389_1 absent
07/03/2013 12:13:18 AM | climateprediction.net | Output file hadcm3n_zipn_2000_40_008323389_1_3.zip for task hadcm3n_zipn_2000_40_008323389_1 absent
07/03/2013 12:13:18 AM | climateprediction.net | Output file hadcm3n_zipn_2000_40_008323389_1_4.zip for task hadcm3n_zipn_2000_40_008323389_1 absent
I'm just curious, what does the following line in the <stderr_txt> file mean ?
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
I hope this helps,
Byron

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0
Message 45617 - Posted: 7 Mar 2013, 20:03:21 UTC - in response to Message 45616.
'REPLANCA' etc, means that one of the supporting data files has the wrong number of entries, so they don't match what the main program is expecting. Someone at the research place has gotten it wrong. :(
The project people were notified yesterday that the entire batch appears to be faulty.
Backups: Here
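For anyone sorting through failed tasks, the two explanations given in this thread (REPLANCA means a mismatched ancillary data file; Invalid Theta means the model physics went wrong and the built-in checks stopped it) can be applied to a pasted stderr with a few lines of Python. This is a rough sketch only: `summarize_stderr` and the other names are invented for illustration, the regular expression assumes every crash line ends with `tmp/pipe_dummy` as in the logs quoted here, and any message not discussed in the thread is simply reported as unknown.

```python
import re
from collections import Counter

# Assumption: each crash line looks like the ones quoted in this thread,
# i.e. "Model crashed: <message> tmp/pipe_dummy 2048".
CRASH_RE = re.compile(r"Model crashed:\s*(.+?)\s*tmp/pipe_dummy", re.DOTALL)

# Explanations paraphrased from the moderators' replies in this thread.
EXPLANATIONS = [
    ("REPLANCA", "supporting data file does not match what the model expects (faulty batch)"),
    ("INVALID THETA", "the physics went wrong and built-in checks stopped the model"),
]


def explain(message):
    """Map a crash message to the explanation given in the thread, if any."""
    for keyword, text in EXPLANATIONS:
        if keyword in message:
            return text
    return "not discussed in this thread"


def summarize_stderr(stderr_txt):
    """Return {crash message: (number of occurrences, explanation)}."""
    counts = Counter(CRASH_RE.findall(stderr_txt))
    return {msg: (n, explain(msg)) for msg, n in counts.items()}


if __name__ == "__main__":
    sample = (
        "Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy 2048\n" * 6
        + "Sorry, too many model crashes! :-(\nCalled boinc_finish\n"
    )
    for msg, (n, why) in summarize_stderr(sample).items():
        print(f"{n} x {msg}: {why}")
```

Either way, neither message points to a fault in the volunteer's computer, which matches the moderators' advice above.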

old_user671679 Joined: 30 Jan 12 Posts: 38 Credit: 10,197,388 RAC: 0
Message 45619 - Posted: 7 Mar 2013, 20:25:37 UTC
Oh man, are you serious? Another bad batch? Maybe those boys need a vacation. I hope my computers don't get put on probation from these and the PNW wu's. Ah well, I guess we wait.
Hey Les, how long do you suppose it would take before they roll out the new Africa project?

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0
Message 45620 - Posted: 7 Mar 2013, 22:52:10 UTC - in response to Message 45619.
Re: Africa project
No information, but wild guess: a year. Christmas present? :)
As for the bad batch: According to this page, there are a lot of research centres involved in the current hadcm3 work, so it could be a work experience person at any of them. Only the Uni of Oregon is involved with the PNW models, so someone there.
ANZ should be along soon.
Backups: Here

old_user671679 Joined: 30 Jan 12 Posts: 38 Credit: 10,197,388 RAC: 0
Message 45621 - Posted: 7 Mar 2013, 23:31:49 UTC
I'm sorry Les, what is ANZ? I don't think I've heard of that one. I hope you don't mind me picking your brain for a moment; I am wondering about the Full Resolution Ocean models. What ocean are they modeling and what exactly are they looking for? I've read a lot of your posts and you seem very knowledgeable about this project. Thanks in advance.

Byron Leigh Hatch @ team Carl ... Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0
Message 45622 - Posted: 8 Mar 2013, 0:53:35 UTC - in response to Message 45617.
> 'REPLANCA' etc, means that one of the supporting data files has the wrong number of entries, so they don't match what the main program is expecting. Someone at the research place has gotten it wrong. :(
> The project people were notified yesterday that the entire batch appears to be faulty.
aha ... ok thank you Les.
Best Wishes
Byron

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0
Message 45623 - Posted: 8 Mar 2013, 1:38:27 UTC - in response to Message 45621.
ANZ = Australia - New Zealand area, which is a big one. It's been in beta for a while, and is now held up by problems at the researcher's end. There's a thread a little way down this section about ANZ.
The Ocean is "all of it". Several of the other model types use HadSM3 which has what's called a 'slab' ocean, i.e. it has certain fixed values, which makes modelling simpler and faster, when the aim is to study the atmosphere.
The Coupled Ocean model, HadCM3, has both an ocean part (at lower resolution, because changes there are much slower) and an atmosphere part. This can be seen by watching the data part of the graphics display for a while. The current use of this model is for the RAPID-RAPIT experiment. There is also a thread about it here by the researchers.
And then there are the 'regional' models. These use a simplified model for the bulk of the globe, with a high resolution model for a small area. To run this high res model for the full globe would require a supercomputer to finish it in a reasonable time.
Backups: Here

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4342 Credit: 16,497,933 RAC: 6,477
Message 45624 - Posted: 8 Mar 2013, 11:24:17 UTC
Is the data still useful from those, or will they be re-issued anyway when fixed?
Also, even if they are not useful, don't abort without checking the graphics - the last one I got didn't start in 2000, so was clearly not from that batch.