Computing error

tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 40850 - Posted: 13 Oct 2010, 19:01:13 UTC

The task hadcm3istd_cslz_1920_160_06021211 ended with an "error while computing" message after 4600 hours on my Linux box. It had run from 1920 to about 2070. Is it possible to find out what caused this error? I am running 32-bit SuSE Linux 11.1 (PAE) on an Opteron 1210, not overclocked, and I very rarely get a computing error in any of my 6 BOINC projects.
Tullio

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 40851 - Posted: 13 Oct 2010, 19:53:27 UTC - in response to Message 40850.  
Last modified: 13 Oct 2010, 19:54:02 UTC

What happens to a model, success OR failure, is recorded in stderr, on the project's web page for each model. Click on the plus sign alongside it.

In this case, it was "cannot open input file ...", which may indicate that an AV program had the file locked while checking it, at the moment the model's program wanted to use it.

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 40852 - Posted: 13 Oct 2010, 19:57:32 UTC
Last modified: 13 Oct 2010, 19:58:29 UTC

From "stderr":
Model crashed: 
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/atmos_restart.day
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/ocean_restart.day

My guess is that something external to boinc had the files locked, a virus scan, for example.
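
A quick way to see what state those two restart files are in now is a small check like the one below. It is only a sketch: the function name `check_restart_files` is made up for illustration, and the directory you pass in is whatever `dataout` path appears in your own stderr. It can tell a missing file from an unreadable one, but it cannot tell you which program, if any, had the file locked at the time of the crash.

```python
import os

def check_restart_files(dataout_dir, names=("atmos_restart.day", "ocean_restart.day")):
    """Return a dict mapping each restart file name to a short status string."""
    report = {}
    for name in names:
        path = os.path.join(dataout_dir, name)
        if not os.path.exists(path):
            report[name] = "missing"
            continue
        try:
            # Opening for reading mirrors what the monitor failed to do.
            with open(path, "rb") as handle:
                handle.read(1)
            report[name] = "readable"
        except OSError as exc:
            report[name] = f"unreadable: {exc.strerror}"
    return report
```

Calling `check_restart_files(".../dataout")` with the model's own `dataout` directory returns one status per file, which is a little more to go on than the bare "cannot open input file" line.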

"stderr" is visible on the model's page; click the " + " sign to see the diagnostics.

If you have a recent backup, the model could be restarted from that point -- however, it would also restart work on your other projects. There is a convoluted way to get around the other projects but only CPDN would run until the CPDN Task completed.

EDIT: Beat me to it again, Les!
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.

tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 40853 - Posted: 14 Oct 2010, 4:19:04 UTC - in response to Message 40852.  

Thanks Les and astroWX. I am using Linux and have not run any virus scans. I use only a firewall, plus a modem with built-in firewall protection from Telecom Italia. I shall read the stderr.txt file.
Tullio

tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 40854 - Posted: 14 Oct 2010, 7:52:43 UTC

This is the message I got on my terminal:
12-Oct-2010 22:24:16 [climateprediction.net] Restarting task hadcm3istd_cslz_1920_160_06021211_4 using hadcm3i version 604
12-Oct-2010 22:24:28 [climateprediction.net] Computation for task hadcm3istd_cslz_1920_160_06021211_4 finished
12-Oct-2010 22:24:28 [climateprediction.net] Output file hadcm3istd_cslz_1920_160_06021211_4_16.zip for task hadcm3istd_cslz_1920_160_06021211_4 absent

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 40856 - Posted: 14 Oct 2010, 8:40:22 UTC - in response to Message 40854.  

Which isn't very useful.
Use the stderr in your account on this web site.

tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 40857 - Posted: 14 Oct 2010, 8:50:51 UTC - in response to Message 40856.  

"Which isn't very useful.
Use the stderr in your account on this web site."

Here is its last part:
CPDN Monitor - Quit request from BOINC...
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/atmos_restart.day
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/ocean_restart.day

Model crashed:
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/atmos_restart.day
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/ocean_restart.day

Model crashed:
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/atmos_restart.day
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/ocean_restart.day

Model crashed:
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/atmos_restart.day
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/ocean_restart.day

Model crashed:
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/atmos_restart.day
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/ocean_restart.day

Model crashed:
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/atmos_restart.day
cpdnmonitor: cannot open input file /home/tchersi/BOINC/projects/climateprediction.net/hadcm3istd_cslz_1920_160_06021211/dataout/ocean_restart.day

Model crashed:
Sorry, too many model crashes! :-(
called boinc_finish

</stderr_txt>

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 40861 - Posted: 14 Oct 2010, 16:38:22 UTC

I've never seen so many messages in stderr before, nor many of these particular messages. All I can say is that whenever I've seen a mention of lockfile it has proved fatal for the model. The model tries again and again, but I don't think I've ever seen a model recover from this (e.g. one or two instances of lockfile, after which the file miraculously unlocks and the model marches on again).

The lockfile messages you had are not the same as the ones we see on some computers that mean the person must upgrade their Boinc version; and in any case that situation only happened on Windows.

Is Boinc in the trusted zone of both your firewall and AV? I wouldn't let automatic AV scans run while Boinc is running.

The only consolation is that the computer produced 15 decadal files which will all be used by the researchers.

tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 40862 - Posted: 14 Oct 2010, 17:27:11 UTC - in response to Message 40861.  

I have a backup made on October 9 (I make one every week) but I am not going to use it, also because I have AQUA, Einstein, QuantumFIRE, QMC and SETI (when not down) all running happily. I am glad if my 4600 hours of runtime and 4000 hours of CPU time have served any purpose.
Tullio

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 40864 - Posted: 15 Oct 2010, 12:02:01 UTC
Last modified: 15 Oct 2010, 12:06:42 UTC

If by chance you have a single-core Linux machine, even quite an old one (I don't think HadCM needs SSE2), you could transfer the whole backup to it when it has no work, suspend and never run the tasks that are being run on your good computer, and just let the HadCM crunch to the end. You'd get a message from the server that the task had already been reported as completed, but the final file would still be accepted, added to the model's other results and used.

I'm going to do this with a CPDN FAMOUS that crashed on my quad when I had a Big Problem. When my single-core machine has finished its current work next week I'll let it complete just this one model from the restore of a multi-model backup. After it's finished I'll delete the restored contents of the Boinc Data folder and put back the original Data folder package.
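
The folder swap I'm describing can be sketched roughly as below. This is only an illustration under some assumptions: the function names `swap_in_backup` and `restore_original` are invented here, the three paths are whatever your own Data folder, backup and spare location happen to be, and BOINC must be fully stopped before either step is run.

```python
import shutil
from pathlib import Path

def swap_in_backup(data_dir, backup_dir, parked_dir):
    """Park the live Data folder and put a copy of the backup in its place."""
    data_dir, backup_dir, parked_dir = Path(data_dir), Path(backup_dir), Path(parked_dir)
    if parked_dir.exists():
        raise FileExistsError(f"{parked_dir} already exists; refusing to overwrite")
    if not backup_dir.is_dir():
        raise FileNotFoundError(f"no backup folder at {backup_dir}")
    shutil.move(str(data_dir), str(parked_dir))  # keep the original package safe
    shutil.copytree(backup_dir, data_dir)        # restore a copy; the backup stays intact

def restore_original(data_dir, parked_dir):
    """Delete the restored contents and put the parked original back."""
    data_dir, parked_dir = Path(data_dir), Path(parked_dir)
    shutil.rmtree(data_dir)
    shutil.move(str(parked_dir), str(data_dir))
```

Copying the backup rather than moving it means the backup itself is still there if the restored model crashes again.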

CPDN has a task indexing system that puts together all the files from models even if the files upload to several different servers or from more than one computer.

tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 40868 - Posted: 15 Oct 2010, 19:16:35 UTC - in response to Message 40864.  

I had a 400 MHz PII Deschutes but I gave it to my son. He lives in Tuscany and visits me about monthly. I could give him a Flash memory stick with the BOINC directory. I just bought a 1.4 TB external hard disk to save my personal files so I can upgrade my SuSE distro to 11.3. Thanks for your suggestion.
Tullio

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 40870 - Posted: 16 Oct 2010, 10:01:14 UTC
Last modified: 16 Oct 2010, 10:05:55 UTC

How much RAM has that computer got? HadCM is much more likely to crunch without crashing if it has 512 MB of RAM, not just 256 MB. Maybe you'll need to look around to see whether you have a spare old RAM card too. Of course the computer's memory may be built into its motherboard, in which case you'd just have to take a chance. If it has less than 256 MB of RAM I don't think the model would be likely to succeed. This opinion is based on what happened to BBC members' models, which were almost the same as yours.

Trying this is all much more worthwhile for long models than for short ones, especially if they crashed near the end.

Lucky son, living in Tuscany. And lucky that you manage to see each other so often.

tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 40874 - Posted: 17 Oct 2010, 4:34:21 UTC - in response to Message 40870.  

If I remember correctly, the PII has only 384 MB of RAM. I gave up running CPDN on it, but it did some work on the BBC model. It was also running SETI and Einstein. I now have 5 GB of RAM on my Linux box. But I also have an AT&T Olivetti UNIX PC with 2.5 MB of RAM and a 40 MB disk. It is still running UNIX System V, with a threadbare windowing system and a three-button mouse. Cheers.
Tullio

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 40878 - Posted: 17 Oct 2010, 11:20:21 UTC

If the computer could crunch a BBC model it should be able to crunch this HadCM.

tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 41036 - Posted: 15 Nov 2010, 8:13:02 UTC

My first FAMOUS task ended with a compute error after 56 hours. I got a second one; I hope it works.
Tullio

tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 41053 - Posted: 16 Nov 2010, 10:18:06 UTC
Last modified: 16 Nov 2010, 10:21:55 UTC

4 out of 7 computers, all with different CPUs and OS, errored on this task. Three are still working. Too many total results, says a red line.
Tullio
On my second Famous task, my wingman has already errored.

[B^S] mavau
Joined: 30 Aug 04
Posts: 142
Credit: 9,936,132
RAC: 0
Message 41056 - Posted: 16 Nov 2010, 20:48:34 UTC

There is a thread about famous here.
To sum up, the error rate seems to be about one in three.
It's a good idea to look at how other computers are doing, to check if there's a problem at your end.
Note that there are also discrepancies according to OS and CPU.
As an example, although I've been fairly successful, here are two quick failures with Invalid Theta:

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6959088

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6961241

Good luck with your crunching.


mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 41078 - Posted: 18 Nov 2010, 8:55:22 UTC
Last modified: 18 Nov 2010, 8:57:00 UTC

We can forget about the "Too many total results" message, which is irrelevant to CPDN and really shouldn't be there.

What I'm going to say now refers to HadSM and FAMOUS only. If you look at the workunit page for a crashed model of one of these types, you can see which other computers got a model from the same WU. Computers with the same combination of operating system + CPU type (AMD or Intel) should produce the same results. For example, if one computer with Linux + AMD crashes one of these models after a particular timestep, any other computer with the same combination should produce the same result.

Sometimes there's no other computer in the WU of the same type as our own, but often we can compare. If two computers of the same type crash a HadSM or FAMOUS at the same processing point we know the problem lies in the model.

But if our model crashes while another computer of the same type completes it we know there's probably something wrong with one of those computers. The computer with the problem is more likely to be the one that crashed the model, though not necessarily.

For these two model types, computers with the same OS + CPU type should generate bit-identical results.
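
The comparison rule above can be written down as a short sketch. Everything about the input here is an assumption made for illustration: the function name `diagnose_workunit`, the dictionary keys, and the idea that you have copied the results off the workunit page by hand. It does not read anything from the CPDN server, and it ignores models that are still running.

```python
from collections import defaultdict

def diagnose_workunit(results):
    """results: list of dicts with keys 'os', 'cpu', 'outcome', 'timestep'.

    'outcome' is "completed" or "crashed"; 'timestep' is where a crash
    happened (None for completed runs). Returns {(os, cpu): verdict}.
    """
    groups = defaultdict(list)
    for r in results:
        groups[(r["os"], r["cpu"])].append(r)

    verdicts = {}
    for platform, members in groups.items():
        if len(members) < 2:
            verdicts[platform] = "no comparison possible"
            continue
        outcomes = {(m["outcome"], m["timestep"]) for m in members}
        kinds = {m["outcome"] for m in members}
        if len(outcomes) == 1 and kinds == {"crashed"}:
            # Same platform, same crash point: the model itself is suspect.
            verdicts[platform] = "problem probably in the model"
        elif kinds == {"completed"}:
            verdicts[platform] = "all completed"
        else:
            # Same platform but different outcomes: bit-identical results
            # were expected, so one of the computers is suspect.
            verdicts[platform] = "probably a problem on one computer"
    return verdicts
```

Only computers in the same OS + CPU group are compared with each other, because that is the only case where identical results are expected.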

In any case, for FAMOUS if you go to a model's web page and click on stderr +, you see its messages. If you see NEGATIVE PRESSURE or INVALID THETA it's almost certain that a crash was caused by the model's parameter values. If one of these messages appears 5 or 6 times all together one after the other it's even more certain.

If these PRESSURE or THETA messages appear here and there one at a time interspersed with lots of other messages, then it will often be that a problem with the computer is the cause.
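
As a rough sketch of that rule of thumb, the snippet below looks for the longest unbroken run of PRESSURE or THETA messages in a pasted stderr. The function names and the default threshold of 5 (taken from the "5 or 6 times" above) are illustrative assumptions, and the verdicts are deliberately worded as likelihoods, not diagnoses.

```python
MARKERS = ("INVALID THETA", "NEGATIVE PRESSURE")

def longest_marker_run(stderr_text):
    """Length of the longest run of consecutive non-blank lines containing a marker."""
    longest = current = 0
    for line in stderr_text.splitlines():
        if not line.strip():
            continue  # blank lines separate messages but do not break a run
        if any(marker in line for marker in MARKERS):
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # any other message interrupts the run
    return longest

def likely_cause(stderr_text, threshold=5):
    """Apply the rule of thumb: a long unbroken run points at the parameters."""
    run = longest_marker_run(stderr_text)
    if run == 0:
        return "no pressure/theta message found"
    if run >= threshold:
        return "probably the model's parameter values"
    return "possibly a problem with the computer"
```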

tullio
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 41086 - Posted: 19 Nov 2010, 9:25:33 UTC

Here is my stderr.txt:
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
process exited with code 22 (0x16, -234)
</message>
<stderr_txt>
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...

BUFFIN: Read Failed: No such file or directory
BUFFIN: C I/O Error feof - Unit 60 - Return code = 1

BUFFIN: Read Failed: No such file or directory
BUFFIN: C I/O Error feof - Unit 61 - Return code = 1

BUFFIN: Read Failed: No such file or directory
BUFFIN: C I/O Error feof - Unit 68 - Return code = 1

BUFFIN: Read Failed: No such file or directory
BUFFIN: C I/O Error feof - Unit 69 - Return code = 1
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...

BUFFIN: Read Failed: No such file or directory
BUFFIN: C I/O Error feof - Unit 60 - Return code = 1

BUFFIN: Read Failed: No such file or directory
BUFFIN: C I/O Error feof - Unit 61 - Return code = 1

BUFFIN: Read Failed: No such file or directory
BUFFIN: C I/O Error feof - Unit 68 - Return code = 1

BUFFIN: Read Failed: No such file or directory
BUFFIN: C I/O Error feof - Unit 69 - Return code = 1
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...

BUFFIN: Read Failed: No such file or directory
BUFFIN: C I/O Error feof - Unit 60 - Return code = 1

BUFFIN: Read Failed: No such file or directory
BUFFIN: C I/O Error feof - Unit 61 - Return code = 1

BUFFIN: Read Failed: No such file or directory
BUFFIN: C I/O Error feof - Unit 68 - Return code = 1

BUFFIN: Read Failed: No such file or directory
BUFFIN: C I/O Error feof - Unit 69 - Return code = 1
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy

Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Sorry, too many model crashes! :-(
(2478): called boinc_finish

</stderr_txt>
]]>

Perturbed Parameters for Result # 118
The second Famous unit is still crunching after 48+ hours.
Tullio

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 41089 - Posted: 19 Nov 2010, 10:42:28 UTC - in response to Message 41086.  

The relevant text for the failure is: INVALID THETA DETECTED

Which is the cause of most FAMOUS failures.
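
To pull the relevant text out of a long stderr without reading every "Quit request" line, something like the following would do. It is a minimal sketch: `crash_reasons` is a made-up name, and it relies only on the "Model crashed:" prefix seen in the logs in this thread.

```python
from collections import Counter

def crash_reasons(stderr_text):
    """Count the text that follows each 'Model crashed:' prefix."""
    prefix = "Model crashed:"
    reasons = Counter()
    for line in stderr_text.splitlines():
        if line.startswith(prefix):
            reason = line[len(prefix):].strip()
            # Some logs put the reason on the following lines instead.
            reasons[reason or "(no reason on this line)"] += 1
    return reasons
```

`crash_reasons(text).most_common()` then lists each distinct reason with the number of times it occurred.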


©2024 climateprediction.net