climateprediction.net home page
Curious about "Error while computing..."

Jean-David Beyer

Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 47797 - Posted: 17 Dec 2013, 14:32:26 UTC

Sometimes I get a work unit, such as Workunit 8515857, that has "Error while computing" status for several other users. I always allow my machine to attempt them, and it very often succeeds. It did for Workunit 8515857 for example. And all the others before me failed in one way or another.

Is this because the other users' computers are less reliable than mine? If we are running essentially the same program, with the same data, I would expect us all to fail or all to succeed.

The difference I notice is that several of the failures were running various versions of Windows, though one was running Darwin, and I run Red Hat Enterprise Linux 6.

Also, mine is an x86_64 machine with an Intel processor, and the others were either 32-bit or 64-bit machines.

geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 47798 - Posted: 17 Dec 2013, 15:11:55 UTC

It varies wildly. You can go to individuals and look at their computers and see that nearly every model they run finishes successfully, except the ones that were sent out with incorrectly set up files. Then you can go to other individuals and see that 90% of their tasks fail on one or more computers. A lot has to do with how their computers are set up, how they use them, and how they have configured their BOINC preferences. We know that most of the CPDN models sent out don't like being interrupted at certain points. The more frequently they are interrupted, the more likely some failure is to occur. So if BOINC is configured to remove the task from memory when suspended, or to suspend when CPU usage is higher than xx%, or if the computer is shut down or hibernated without cleanly exiting BOINC, the likelihood of task failures increases.

Iain Inglis did an analysis of failures by processor and operating system several years ago. The results were in a thread on the old phpBB forum. After removing the computers on which every task failed immediately (obviously misconfigured computers), some configurations seemed more likely to succeed than others. However, even then, it had more to do with how the computer was configured and used than with whether it was an AMD or Intel machine running Linux, Windows or Darwin.

Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1079
Credit: 6,907,363
RAC: 6,402
Message 47799 - Posted: 17 Dec 2013, 17:57:34 UTC
Last modified: 6 May 2014, 23:01:12 UTC

Here's one version of the analysis geophi mentioned, for HADCM3N models.

The first chart includes all models, of which only ~40% got to the first trickle (thicker blue line). The second chart looks at those models that submitted at least one trickle, of which just over 30% complete (again, thicker blue line). Platform-specific problems tend to come and go, so one platform might look bad for a particular batch of models and better at another time. For example, these charts show the devastating effect of the Mac permissions problem, which stops many Mac users from getting to the first trickle; if they do get that far, however, they do relatively well.

Chart 1: Progress of HADCM3N Models

Chart 2: Progress of HADCM3N Models (that submit at least one trickle)
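To put the two chart figures together: if roughly 40% of all models reach the first trickle, and just over 30% of those go on to complete, the overall completion rate is the product of the two. The sketch below is only that arithmetic, using the rounded values quoted in this post; the function name is made up for illustration, and the result is an estimate, not a number read from the charts.

```python
def overall_completion(reach_first_trickle: float, complete_given_trickle: float) -> float:
    """Fraction of all models that complete, given the two conditional fractions."""
    return reach_first_trickle * complete_given_trickle


# Rounded values quoted in the post (approximate, not exact chart readings).
estimate = overall_completion(0.40, 0.30)
print(f"Roughly {estimate:.0%} of all HADCM3N models complete")
```

With those inputs the estimate is about 12%, which is why the first chart looks so much worse than the second.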
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 47802 - Posted: 17 Dec 2013, 19:09:32 UTC

Nice charts. The really interesting one there is the 'Darwin' entry ... only a handful of Darwin boxes are correctly configured to be able to run CPDN, but when they do run past the first trickle, they're the most reliable.

I'm a volunteer and my views are my own.
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 47812 - Posted: 18 Dec 2013, 16:09:25 UTC - in response to Message 47798.  

I see. I see I was wise to draw no conclusions from the limited data I examined, for one thing.

I have the BOINC client set up to always leave tasks in memory. I have 8 GBytes of RAM, and 2 GBytes is surely all I really need. I could probably put 256 or 384 GBytes in the box if I were crazy enough to do that. I normally have about 40 to 50 megabytes swapped out.

Now, sometimes the BOINC processes do get suspended (they have a Linux nice value of 19), but my machine gets stopped only about once a month, when I need to reboot it to run Windows or to replace the Linux kernel. So mostly I do not experience problems like that, though I notice them at times.

Like this one:

Task 15937757
Work Unit 8561112
Stderr starts out like this:

<core_client_version>6.10.45</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...

and ends like this:

Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
20:05:15 (2239): No heartbeat from core client for 30 sec - exiting
Suspended CPDN Monitor - No 'heartbeat' from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Signal 15 received, exiting...
Called boinc_finish
Signal 15 received, exiting...
Called boinc_finish
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Signal 15 received, exiting...
Called boinc_finish
Suspended CPDN Monitor - Suspend request from BOINC...
*** glibc detected *** ../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu: double free or corruption (out): 0x090f2de0 ***
======= Backtrace: =========
/lib/libc.so.6[0x6e3df1]
/lib/libc.so.6[0x6e6531]

and more boring stuff not worth including here.
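Since interruptions seem to matter, one way to see how often a task was suspended or killed is to tally the message types in its stderr. This is a minimal sketch that matches only the strings visible in the excerpt above; the function name and category labels are made up for illustration, and other CPDN versions may word these messages differently.

```python
from collections import Counter


def tally_stderr(text: str) -> Counter:
    """Count suspend, quit, SIGTERM and missed-heartbeat lines in a stderr dump."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if "Suspend request from BOINC" in line:
            counts["suspend"] += 1
        elif "Quit request from BOINC" in line:
            counts["quit"] += 1
        elif line.startswith("Signal 15 received"):
            counts["sigterm"] += 1  # signal 15 is SIGTERM on Linux
        elif "No heartbeat from core client" in line:
            counts["no_heartbeat"] += 1
    return counts


# A few lines copied from the excerpt above.
sample = """Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
20:05:15 (2239): No heartbeat from core client for 30 sec - exiting
Signal 15 received, exiting...
Called boinc_finish
"""
print(dict(tally_stderr(sample)))
```

Running it over the full stderr of a task gives a rough picture of how many times the model was interrupted before it finished or failed.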

The funny thing is that it seems to actually have completed successfully with all the trickles delivered.
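For reference, the exit code line in the stderr above gives the same value three ways: 193 in decimal, 0xc1 in hexadecimal, and -63 when the low byte is read as a signed 8-bit number. The sketch below only reproduces that arithmetic; the function name is made up, and it says nothing about what BOINC means by the code.

```python
from typing import Tuple


def describe_exit_code(code: int) -> Tuple[str, int]:
    """Return the hex form and the signed 8-bit reading of an exit code (0-255)."""
    if not 0 <= code <= 255:
        raise ValueError("exit code must fit in one byte")
    signed = code - 256 if code >= 128 else code
    return hex(code), signed


print(describe_exit_code(193))  # ('0xc1', -63)
```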