climateprediction.net home page
Lost users due to lost WUs

old_user36084
Joined: 15 Jan 05
Posts: 31
Credit: 1,249,348
RAC: 0
Message 7427 - Posted: 21 Jan 2005, 12:16:11 UTC

I’ve been participating in distributed computing for several years, running SETI@home and then the United Devices grid application, clocking up over four years of CPU time. I thought I would try the climate prediction modelling, but it does have one critical drawback compared to the other DC applications: WU size and time to process.

The WU size and time to process are not an issue for me.

Take the WU size: it requires, I believe, 700 MB to 1 GB, which, considering how cheap hard drives are at less than £30 for 40 GB, should not be a problem for any high-spec PC.

The time to process, which in my case on a P4 2.60 GHz HT is 10 hours per trickle, is not a problem, as the PC is continuously on and running boinc_cli as a service, with the CP models going 24/7 at 100% processor usage. The time to process only becomes a problem when the model crashes, reports a client error to climateprediction.net and starts a new model. To the user (like me this week) this feels very frustrating, as those CPU cycles appear to have been wasted. My solution would be for BOINC to automatically back up the vital data files it creates and then recover from them after a fatal crash. This would require extra HD space, but it would easily fit in the default 10 GB required. If users can only donate 1 GB of HD space, the recovery option could be disabled.

It must be expected that BOINC will have a fatal crash on some computers. But if it were able to recover automatically, the user would not get so irritated and wonder whether it is worth running CP models. Compare this to other DC projects where a WU only takes a day to complete; in those cases a fatal crash does not feel so frustrating.

I must add that the major plus of the CP project over other DC projects is the unique model given to each computer to process. That makes it feel more personal, as does the fact that each completed model adds to the vast pool of CP data to then be sifted by some low-paid PhD student! Moreover, you can actually see the post-processing progress of the project as a whole, unlike UD grid, where the post-processing of the protein folding for cancer, anthrax, smallpox and human genome is secret due to commercial impacts etc.

Ian

ID: 7427
old_user17525
Joined: 13 Sep 04
Posts: 161
Credit: 284,548
RAC: 0
Message 7429 - Posted: 21 Jan 2005, 13:09:20 UTC
Last modified: 21 Jan 2005, 13:10:28 UTC

Have a look at this thread for a discussion of the same problem.

Just lost a WU...... (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=1425)

Marj
_________________________________
ID: 7429
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 7430 - Posted: 21 Jan 2005, 13:23:52 UTC
Last modified: 21 Jan 2005, 13:26:08 UTC

CP does have the ability to recover from problems with computations of a model: it can rewind and try again,
first a day, then a month, then a year. What it can't do is recover from problems caused by
finicky hardware.

The real fix is to find out what it is about your computer that keeps causing WUs to crash.
There are many possibilities, so each person with a problem is a bit on their own.

First of all, heat is a big problem. Is your computer in a hot room? Is the computer case of a size to allow
good air circulation inside? (The very small "cube" cases look cute, but are useless for continuous
running). Does the case have a bottom/front, and a top/back fan? These can help push/pull cooling air
through the case. Even making sure the cabling inside the case is out of the way can help airflow.

Then there is the quality of the RAM being used. In some cases, this has been a problem.
And is the heatsink properly seated onto the cpu chip? NOTE: fiddling with this can damage the processor chip
if you're not careful!

There are several programs that can be run to check how good a computer is, but the details are on the
php boards, which are currently 'under repair'. One is prime95, and another, (I think), mem86.
Or maybe it's memtest86.

I haven't had to do any 'find and fix', as my machine appears to be rock solid. I built it that way. :)

Other users who know more about these matters can probably help you more if you need it, so think, look,
fix, and post again if you need more help. And also if you find and fix the problem.
It's nice to know how people get on, and may help others.

Les


edit
Marj got in a reply while I was composing. Oh well.

Les
ID: 7430
old_user17525
Joined: 13 Sep 04
Posts: 161
Credit: 284,548
RAC: 0
Message 7432 - Posted: 21 Jan 2005, 13:44:22 UTC

> Marj got in a reply while I was composing. Oh well.
Ah yes, but your answer was different.
2 answers are better than 1 :)
Marj
_________________________________
ID: 7432
LochDhu
Joined: 5 Aug 04
Posts: 27
Credit: 13,339,226
RAC: 0
Message 7435 - Posted: 21 Jan 2005, 14:31:02 UTC

Also, about one in 25 models fails through no fault of your own.

CPDN's goal is to make climate prediction more accurate. Certain combinations of the 20 or so model parameters make the Earth turn into Titan (very cold). When that happens the model rewinds a day, then a month, then a year to make sure this wasn't an intermittent calculation error. If all the rewinds go Titan, then it uploads and starts another.
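
As a rough illustration of the rewind behaviour described above (a day, then a month, then a year, then give up), here is a minimal Python sketch. It is not CPDN's actual code: the names `REWIND_STEPS`, `run_with_rewinds` and the `rerun_from` callback are hypothetical and only show the escalation order.

```python
from typing import Callable, Optional

# Escalating rewind distances, in the order described in the thread.
# These labels are illustrative; the real model's checkpoints are not shown here.
REWIND_STEPS = ("day", "month", "year")


def run_with_rewinds(rerun_from: Callable[[str], bool]) -> Optional[str]:
    """Retry a failed model from progressively earlier points.

    `rerun_from(step)` is a hypothetical callback that rewinds by `step`
    and returns True if the model then gets past the failure point.
    Returns the step that worked, or None if every rewind failed
    (the case where the model is uploaded and a new one is started).
    """
    for step in REWIND_STEPS:
        if rerun_from(step):
            return step
    return None
```

If a one-off calculation glitch caused the failure, the first rewind should already succeed; a model that fails after all three is most likely a genuine "Titan" result.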

I have had only one of my 20 models go Titan, and that was on the pre-BOINC version. The only other models that have failed on me were download problems due to a firewall, so I didn't waste any processing. If you have one upload prematurely, don't worry about it. But if two in a row are short, then you probably have a hardware issue.
ID: 7435
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 7437 - Posted: 21 Jan 2005, 15:13:54 UTC
Last modified: 21 Jan 2005, 15:21:15 UTC

Also see:
Why can't my pc crunch cpdn wus? (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=1484)
for a followup to Kenneth Larsen's heat problem, and solution.

Another possibility is your video card. If this is a bit old, or the software (drivers?) is not up to date,
the main cpu may be doing some of the work now done by modern display cards. I vaguely remember this
being mentioned before, perhaps in the php forum.

Les
ID: 7437
old_user36084
Joined: 15 Jan 05
Posts: 31
Credit: 1,249,348
RAC: 0
Message 7443 - Posted: 21 Jan 2005, 17:53:17 UTC

Thank you for the useful tips on overheating PCs.

I would like to check whether a client error with exit status -5 (0xfffffffb) means that the model fatally stopped due to a computation error by the computer and NOT by the model.
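
For reference, -5 and 0xfffffffb are the same 32-bit value read two ways: the hex form is the unsigned (two's-complement) view of the signed exit status. A small Python sketch of the conversion, with helper names chosen purely for illustration:

```python
def to_signed32(value: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    # If the top bit is set, the value is negative in two's complement.
    return value - 0x100000000 if value & 0x80000000 else value


def to_unsigned32(value: int) -> int:
    """Return the 32-bit unsigned bit pattern of a (possibly negative) integer."""
    return value & 0xFFFFFFFF


print(to_signed32(0xFFFFFFFB))   # -5
print(hex(to_unsigned32(-5)))    # 0xfffffffb
```

This only shows that the two notations name the same exit status; it says nothing about what caused it.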

I believe I've got stable PCs, as I've built both of them myself with overclocking in mind. However, I’m not overclocking the PCs for BOINC because of possible instability. Both PCs are Pentium 2.60 GHz with HT enabled, on Abit and Asus motherboards with good cooling. The Abit-based PC has been running BOINC continuously for nearly a week now with no problems, whereas the Asus-based PC has been having problems with client errors and fatally stopping models.

I don’t think the fatal CP model errors are due to hardware, whether overheating or not. I say this because the errors do not appear to be random, nor do they occur when I’ve left the PC unattended for long hours. Instead, the model errors occur when I’m using the PC.

I should highlight that I’m running boinc_cli as a service and I don’t use the boinc_gui or the screensaver. Therefore it should not be related to graphics drivers or card error.

In task manager it is shown as two hadsm3um_4.04 processes taking 50% CPU usage each, plus two other hadsm3_4.04 processes appearing to do nothing. When an error occurs only the hadsm3um_4.04 in slot 1 of the cpu dies.

Could I recover from this error by restoring a backup of the boinc program file directory?

Ian

ID: 7443
Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 7446 - Posted: 21 Jan 2005, 18:36:38 UTC - in response to Message 7443.  

> Could I recover from this error by restoring a backup of the boinc program
> file directory?

Almost certainly if it's as a result of an unexpected system restart, hardware problem or user error (I'm particularly good at that one!). If it's a genuine model problem (e.g. a Titan or negative pressure calculation) it should fail at exactly the same point.

I've successfully restored a couple of models from backup.
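
A minimal Python sketch of that kind of manual backup and restore. It assumes the whole BOINC directory is copied while the client is stopped, so that no file is caught half-written; that is an assumption, not something confirmed in this thread. The function names are hypothetical, and this is not a built-in BOINC feature.

```python
import shutil
import time
from pathlib import Path


def backup_dir(source: Path, backup_root: Path) -> Path:
    """Copy `source` into a new timestamped folder under `backup_root`."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"{source.name}-{stamp}"
    shutil.copytree(source, target)  # target must not exist yet
    return target


def restore_dir(backup: Path, target: Path) -> None:
    """Replace `target` with the contents of an earlier backup."""
    if target.exists():
        shutil.rmtree(target)  # discard the failed state
    shutil.copytree(backup, target)
```

As noted above, a restore only helps when the failure came from the machine or the user; a genuine model problem should fail again at exactly the same point.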
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 7446
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 7451 - Posted: 21 Jan 2005, 21:02:22 UTC
Last modified: 21 Jan 2005, 21:04:16 UTC

The -5 error has been mentioned a bit, somewhere. It is apparently a general 'catch-all' message.
Someone from the core team once said that -5 often meant that a 'cell' in the model
had a negative pressure as a result of calculations, and the program then aborts the run.

So, yes, if you get this error, it's software.

But there is speculation on several threads in the php forum about the possibility of unstable computers
causing the program to hiccup. Basically, the theory is an overheated cpu causing the transistors to
malfunction for a moment, and thus getting the maths wrong. Cool down the processor, and things
get back to normal. But it's just a theory.
Best to accept that another model has failed and start on a new one.

But I see that you have had all 17 of your models fail on one computer.
So I'd be inclined to 'hit the computer with the maintenance manual' until it behaved.
No point in crunching if it won't play nicely.

Les
ID: 7451
Arnaud
Joined: 3 Sep 04
Posts: 268
Credit: 256,045
RAC: 0
Message 7452 - Posted: 21 Jan 2005, 21:34:16 UTC

Hi,
About the -5 error code and unrecoverable error.
I had it almost every time I connected to the internet with my dial-up (PSTN) modem.
So I thought that my machine was unstable and I started to do regular backups.
I had more than 30 unrecoverable errors on my WU with BOINC 4.13, but always used a backup to recover it. (It became a sort of personal challenge not to lose this WU :o))
Since I upgraded to a BOINC version > 4.13, this error has completely disappeared.
I thought that my machine was unstable (although I had run multiple tests with SuperPI, memtest, Prime95, etc., without any problem) but it looks like it was BOINC 4.13 that was unstable on this particular machine.
Strange, isn't it?



Arnaud
ID: 7452
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 7454 - Posted: 21 Jan 2005, 22:41:07 UTC
Last modified: 21 Jan 2005, 22:48:45 UTC

Strange indeed, Arnaud. So many problems, so little time.
BOINC and Windows; two (great?) American products!
The MIB will probably be here soon!

juicedry,
Before you try an update of BOINC, read these threads, and THEN choose your version.

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=1551

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=1552

Les

ID: 7454
old_user2147
Joined: 27 Aug 04
Posts: 55
Credit: 1,106,201
RAC: 0
Message 7474 - Posted: 23 Jan 2005, 5:46:21 UTC - in response to Message 7452.  

>

> I had more than 30 unrecoverable errors on my WU with BOINC 4.13, but always
> used a backup to recover it. (It became a sort of personal challenge not to
> lose this WU :o))
> Since I upgraded to a BOINC version > 4.13, this error has completely
> disappeared.
> I thought that my machine was unstable (although I had run multiple tests with
> SuperPI, memtest, Prime95, etc., without any problem) but it looks like it was
> BOINC 4.13 that was unstable on this particular machine.
> Strange, isn't it?
>

Yes, extremely strange. From my happenstance glances on the phpBB, many peeps seem to think V4.13 is a pretty stable build; often more so than its predecessors. I upgraded all 5 of my machines from V4.05 to V4.13 while they were all mid-model around 2 months ago. Haven't had one error since the upgrade.

Glad to see you're now enjoying the satisfaction of successfully completing error-free models, irrespective of the reason! :-)

Strat
ID: 7474
Arnaud
Joined: 3 Sep 04
Posts: 268
Credit: 256,045
RAC: 0
Message 7477 - Posted: 23 Jan 2005, 7:29:12 UTC

Thanks Strat.
It's pretty "cool" to have no error.

I upgraded to 4.16 to crunch for E@H (they want 4.16) and discovered this particular instability by chance, because on my second machine 4.13 was ultra-stable too (no error on 4 finished WUs).
Arnaud
ID: 7477
LochDhu
Joined: 5 Aug 04
Posts: 27
Credit: 13,339,226
RAC: 0
Message 7535 - Posted: 24 Jan 2005, 14:59:54 UTC - in response to Message 7477.  

> Thanks Strat.
> It's pretty "cool" to have no error.
>
> I upgraded to 4.16 to crunch for E@H (they want 4.16) and discovered this
> particular instability by chance because on my second machine 4.13 was
> ultra-stable too (no error on 4 finished WUs).
>

I just upgraded to 4.16 also for E@H. It reset my CPDN work unit and started to download another. So I backed up the 4.16 BOINC and restored from backup my 4.13. This unit is at 70%, so I'll be done with it sometime this weekend. Then I'll restore to the 4.16 and split time between CPDN and E@H. I didn't happen to notice the deadline on the E@H unit; I hope it doesn't expire on me.
ID: 7535
LochDhu
Joined: 5 Aug 04
Posts: 27
Credit: 13,339,226
RAC: 0
Message 7545 - Posted: 24 Jan 2005, 16:01:48 UTC - in response to Message 7535.  

> I just upgraded to 4.16 also for E@H. It reset my CPDN work unit and started
> to download another. So I backed up the 4.16 BOINC and restored from backup
> my 4.13. This unit is at 70%, so I'll be done with it sometime this weekend.
> Then I'll restore to the 4.16 and split time between CPDN and E@H. I didn't
> happen to notice the deadline on the E@H unit; I hope it doesn't expire on me.
>

I just learned about v4.17. I backed up again and upgraded to that. It went fine, and now CPDN and E@H are happily sharing CPU. I suspect my upgrade issue was a user error (but I don't know what I did wrong) rather than a bug in the installer.
ID: 7545

©2024 climateprediction.net