Has my WU completed?

Questions and Answers : Windows : Has my WU completed?

old_user17289
Joined: 13 Sep 04
Posts: 228
Credit: 354,979
RAC: 0
Message 30712 - Posted: 26 Sep 2007, 8:40:33 UTC
Last modified: 26 Sep 2007, 8:41:19 UTC

I just came back from vacation and found that my WU has stopped running (result ID 6181611). I forgot to check the logs before I cleaned out the BOINC folders and reinstalled the latest version of BOINC.

The WU had once crashed before, but I allowed it to resume from a backup. However, it seems the WU is still in a crashed state.

Has the WU successfully completed? The image in the result seems to indicate that it goes through to 2080, but the state still indicates a client error.
ID: 30712

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 30713 - Posted: 26 Sep 2007, 8:52:45 UTC


Once you have an error recorded on the server, it will stay there.
This message is only of interest on other projects. Here, it's ignored in favour of other indicators (because this project uses backups).

You have to go by the year at the top of the graph.

ID: 30713

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 30717 - Posted: 26 Sep 2007, 15:24:40 UTC

And the year at the top of the graph is 2080, so thanks for 160 years of simulation, pwillener.


Cpdn news
ID: 30717

old_user17289
Joined: 13 Sep 04
Posts: 228
Credit: 354,979
RAC: 0
Message 30726 - Posted: 27 Sep 2007, 3:45:15 UTC

Thanks for the clarification.

I did have other WUs in the past that crashed, but were able to complete via backups. These were marked as 'Success' after the final upload. This must have changed with the newer models/software, so that's why I was concerned. Not any more...
ID: 30726

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 30735 - Posted: 27 Sep 2007, 9:59:47 UTC

The Over/Success/Done etc. descriptions on members' results pages are not all reliable and I've no idea why. Crashed and restored models that then complete should really be described as a Success, but this doesn't happen; the server refuses to forget the earlier failure. And some models that crash and don't complete are called a Success. So the true indication of a model's outcome is where the trickles got to and the graphs.

Maybe the descriptors were designed for other projects that treat results differently from cpdn.
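
As a rough sketch of that rule of thumb (the function and parameter names are hypothetical and not part of BOINC or the cpdn server; the only figure taken from this thread is the final year 2080), a model can be judged by the last year its trickles reached, with the server's status label deliberately ignored:

```python
def model_outcome(last_trickle_year: int, final_year: int, server_status: str) -> str:
    """Judge a model by how far its trickles got, not by the server's label.

    All names here are illustrative assumptions, not a real cpdn or BOINC API.
    """
    # server_status is accepted only to make the point that it is ignored:
    # a restored model keeps its old error label even if it later finishes.
    if last_trickle_year >= final_year:
        return "completed"
    return "incomplete"


# A run labelled "Client error" whose graph reaches 2080 still counts as completed.
print(model_outcome(2080, 2080, "Client error"))  # completed
```

The reverse case works the same way: a model labelled Success whose trickles stopped short of the final year would come out as "incomplete".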
Cpdn news
ID: 30735

John Eric Hopkinson
Joined: 27 Jan 05
Posts: 74
Credit: 1,047,809
RAC: 0
Message 30755 - Posted: 28 Sep 2007, 16:38:18 UTC

mo.v and Les:
Perhaps you could help me with this situation:

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6044145

which appears to be similar to the problems described by PWILLENER.
ID: 30755

John Eric Hopkinson
Joined: 27 Jan 05
Posts: 74
Credit: 1,047,809
RAC: 0
Message 30756 - Posted: 28 Sep 2007, 16:44:14 UTC

mo.v and Les: (continued)
I hit the enter button before completing my message.

I replaced a WU from a "Les Bayliss" type backup, and it is not clear at all whether this model is behaving properly.
Note the other computers (not mine) shown in the list under "WORKUNITS".
How did they get into this system?
The WU appears to be plodding along garnering credits, but that's not what some of the report functions indicate.
Anyway, not to panic.

Almost 50% complete and ready for another backup, unless you decide there is a problem.

ID: 30756

Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 30757 - Posted: 28 Sep 2007, 17:44:53 UTC

Each of the models in a work unit is allocated to a different computer. This is sometimes done to improve the chances of at least one of the models from that work unit finishing; it could also be that the models in a work unit are similar, but not identical.

The model of yours that comes from work unit 6044145 hasn't crashed and hasn't yet got any credits (i.e. 6542982) - so I wonder whether you mean that one of your other models has been restored?

A possibility is model 6563560, which crashed on 10 September, is about 50% done, and is trickling again. There's nothing in the crash details that suggests the model is doomed, so it looks like a viable proposition to me - keep crunching!
ID: 30757

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 30758 - Posted: 28 Sep 2007, 20:09:55 UTC

Hi John

Like Iain I think the model you've restored is #6563560, which now seems to have reached millennium year, so it's doing very well. The error code 22 that it previously crashed with won't be forgotten by the server, whose memory is longer than an elephant's, and this will prevent you from getting the word Success against the model when it completes. But that doesn't matter one bit. As long as the model progresses and trickles, all the work it does is accepted by the server and used by the researchers.

Congratulations on the successful restore of this model. I also use Les's backup method, which seems failsafe.

That 22 error code probably meant that something went wrong momentarily on your computer and crashed the model. If you have a look at the project READMEs in my signature and go to the one about avoiding model crashes, you'll find that item #5 by Mike suggests lots of ways to avoid model crashes.

Even taking all the precautions, it's quite easy for such long models to crash before they finish, so regular backups are the best defence.

Your computer is crunching its model at about 4.2 sec per timestep. I'm no hardware expert, but I wonder whether your computer should be crunching a bit faster.

*In your cpdn preferences, check that 'Keep application in memory while suspended?' is set to YES. If it says NO, change it.

*Disable the screensaver if you haven't done so already, as it really slows the crunching down. You can still view the globe using the graphics button in boinc manager.

*Before you turn off the computer, look at your globe, if necessary click Z then 8 on the keyboard to see all the figures, and if possible turn off when the model has just gone past a checkpoint, i.e. turn it off shortly after number 432.

Cpdn news
ID: 30758

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 30759 - Posted: 28 Sep 2007, 20:49:18 UTC


Hi John

I think that it's going OK, so keep going.

My 3.2GHz computer has been running a model on the beta site, and a few weeks ago it suddenly slowed right down. It went from about 15 hours a trickle to 29 hours, and the sec/TS started climbing from about 2.5 to a calculated value of about 4.0.
It's not the first of my beta models to do this either.

So your 2.4GHz computer is probably running at about the right speed.

Your graph seems OK too. This is mine. I'm going to continue with it to see just how it ends up.

I'm trying to get a quad core running to replace it. Computers aren't as easy to put together as they were 20 years ago, but I'll get there. :)
And just in time for summer. :( Yesterday was very hot, and it's only spring. I think.

ID: 30759

John Eric Hopkinson
Joined: 27 Jan 05
Posts: 74
Credit: 1,047,809
RAC: 0
Message 30805 - Posted: 4 Oct 2007, 20:47:12 UTC

Iain, mo.v and Les:

Your helpful replies are acknowledged and thank you for the assistance.
My investigations indicate that the crash occurred because of a "perfect storm" of repeated power failures, antivirus program interference, and automated defragmentation during a file transfer.
Individually, none of these would bother post-4.xx versions of BOINC, which appear to be very fault-tolerant. But there are limits.
My old G40 is quite capable but, like its owner, a plodder. I tend to leave it alone, perhaps too much.

Les: Good luck with the quad core and the weather. We will be watching. If things get too hot, come on up to NS, where you will be perpetually wet. Thanks for the simple backup/restore procedure.

mo.v: your comment re Error 22 and "SUCCESS" makes me (and probably many others) feel some relief about the missing "SUCCESS" on apparently completed models. And I will make better use of the READMEs in future. The G40 should be crunching faster, but I don't know how to accelerate the process. As Les says, things aren't as simple as they were 20 (light)years ago.

Iain: 6563560 was restored and should finish, but without "SUCCESS". I have not been able to figure out where all the models on my results page come from.
ID: 30805

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 30809 - Posted: 4 Oct 2007, 21:50:08 UTC
Last modified: 4 Oct 2007, 22:09:32 UTC

I'm glad you got to the root of the problems, John.

Mike has now posted his recently updated and improved README item about the most common causes of model crashes. As the list of potential model-killers is quite long (!) and not all the problems can necessarily be foreseen on a particular computer, as usual the advice is to make regular backups if at all possible.

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=4231

John said 'I have not been able to figure out where all the models on my results page come from.'

Most of those models are from quite a while ago so don't spend time thinking about them now. The way to be sure you won't be sent extra models you don't want is to go to the boinc manager Projects tab, highlight cpdn and click No new tasks. The day you DO want a new model you'll need to click the button again.

Cpdn news
ID: 30809

©2024 climateprediction.net