climateprediction.net home page
Work unit taking too long


Questions and Answers : Wish list : Work unit taking too long


ramgarden

Joined: 9 Mar 06
Posts: 1
Credit: 566,546
RAC: 0
Message 32420 - Posted: 31 Jan 2008, 22:26:08 UTC

You won't be able to get very many results if the work units take over 2,000 hours each! Right now I have a task with a report deadline of 7/19/2010 and 2768 hours left to complete! Am I the only one? Is this a fluke? Does everyone else see this? What kind of distributed computing can be done if it takes more than a year to work out one work unit?
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 32422 - Posted: 31 Jan 2008, 22:48:45 UTC


Currently there are three different types of model that you can download; they take different times and have different memory requirements. Obviously the CPU will also affect the speed of the model.

For more information, see the 'README - running the model' (link via my signature); the first couple of posts within the 'information' section discuss the different types of model.


I'm a volunteer and my views are my own.
News and Announcements and FAQ
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 32424 - Posted: 31 Jan 2008, 22:57:31 UTC


If you look at the stats on the front page of this site, you'll see that LOTS of climate models of various types have been completed.
If you look through the forum topics, you'll see that lots of people get a shock when they find out how long a climate model takes to create.

The one "deadline" is just there because it's a built-in part of BOINC, and a number is compulsory.
But this project ignores deadlines.

And it doesn't actually take a year to complete a model, unless one is not serious about the project and only allows a small amount of time for the climate models, in favour of work units from other projects.

Also, if you look at the climateprediction prefs on your account page, you'll find an option to select a type of model. Some types are a lot shorter, although one of the short models is also a high-resolution model and requires a lot of RAM. If you don't tick a type, you get issued one at random.

It takes me about 3 months for a long model on an Intel P4, and 12 days for a short model, on an Intel quad.
1 model just finished, and 7 more in about 2 days.

Happy crunching.


Backups: Here
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 32426 - Posted: 31 Jan 2008, 23:22:20 UTC
Last modified: 31 Jan 2008, 23:27:27 UTC

Hi Ramgarden, welcome to the forum.

The HADSM model on your AMD could take 3 or 4 weeks to complete depending on how many hours a day you let the computer crunch.

As you'll see if you go to the posts Mike suggested, the 160-year HADCM on your Intel is the longest type of model. Yes, they are massive models and you could call this extreme distributed computing. I'm running one on my Intel. By the time it completes it will have crunched for about 2560 hours, which is a bit more than 3 months running 24/7.
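As a rough guide, the calendar time is just the CPU hours divided by the number of hours per day the computer crunches. A minimal Python sketch, assuming the model gets one full core for those hours (the function name is made up for illustration, and real models run slower if the screensaver or other projects share the core):

```python
def wall_clock_days(cpu_hours: float, hours_per_day: float = 24.0) -> float:
    """Calendar days needed to accumulate cpu_hours when the model crunches hours_per_day each day.

    Assumes the model gets a full core for those hours; screensaver graphics or
    other projects sharing the core would stretch this further.
    """
    if not 0 < hours_per_day <= 24:
        raise ValueError("hours_per_day must be greater than 0 and at most 24")
    return cpu_hours / hours_per_day


if __name__ == "__main__":
    # 2560 hours is the HADCM figure quoted above.
    for hours_per_day in (24, 12, 8):
        days = wall_clock_days(2560, hours_per_day)
        print(f"{hours_per_day:>2} h/day: {days:6.1f} days (about {days / 30.4:.1f} months)")
```

At 24 hours a day this comes to about 107 days; at 8 hours a day it is 320 days, which is why a part-time cruncher can see estimates approaching a year.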

There's no need to run these models 24/7 if you don't want to. The researchers will be needing more of these models for quite some time, which is why they have a long deadline for completion. Even if you overrun the deadline it doesn't matter, as the CPDN servers accept overdue models and still give you all the credits you've earned.

If you disable the model screensaver and instead occasionally view the globe using the View Graphics button in BOINC manager, the model will run faster.

To keep a model running for so long without crashing requires a few precautions, mostly easy to implement; you just need to know about them. In the README about crashes and problems, I'd recommend item #5 by Mike, who posted above. It's a comprehensive overview of the precautions.

Many of us regularly back up the contents of the BOINC folder so that if the model does crash, we can restore the backup and continue the model. In the README about backups, the first manual method explained by Les is quick and easy.
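The heart of any manual backup is copying the whole BOINC data folder somewhere safe while BOINC is not running. A minimal Python sketch of that idea, not the README method itself, assuming BOINC has already been shut down completely and that you know where your BOINC data folder lives (the function name and the paths in the usage note are hypothetical and differ between operating systems and installs):

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_boinc_folder(boinc_dir: Path, backup_root: Path) -> Path:
    """Copy the whole BOINC data folder into a new timestamped folder under backup_root.

    Assumption: BOINC has been exited completely beforehand, so no model is
    writing its files while they are being copied.
    """
    if not boinc_dir.is_dir():
        raise FileNotFoundError(f"BOINC folder not found: {boinc_dir}")
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = backup_root / f"boinc-backup-{stamp}"
    # copytree creates dest (and any missing parents) and raises if dest
    # already exists, so an earlier backup is never silently overwritten.
    shutil.copytree(boinc_dir, dest)
    return dest
```

For example, `backup_boinc_folder(Path("C:/ProgramData/BOINC"), Path("D:/boinc-backups"))`, with both paths adjusted to your own machine. Keeping each backup in its own timestamped folder means you can fall back to an older copy if the newest one already contains the problem.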

If one of these HADCM models does crash and can't be restored, it will still have sent data useful to the researchers at the end of each model decade. And some crashed models are reissued and continued from part-way through on other computers.

Hope that helps.

Mo
Cpdn news
astroWX
Volunteer moderator

Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 32457 - Posted: 5 Feb 2008, 6:35:50 UTC

It's even more difficult to complete work under these conditions:
<message>
aborted by user
</message>

Did you perchance sign on to this project without reading anything about what it entails?

Please note the utterly meaningless credits to the left of my post. Perhaps I should say meaningless except to indicate that it might actually be possible to run these over-long Models. That tends to suggest that your "You won't be able to get very many results if the work units take over 2,000 hours each!" might be ill-advised, eh?

You might also check the amount of work done by the tens of thousands of participants (your ID# will give you a clue as to how many). Part of it is shown via the "Project Stats" link in the blue area on the left.

Your reaction is typical of those who jump into the water without testing the temperature. We hope that you will find the temperature is reasonable and that you can participate and gain the satisfaction that so many CPDN participants feel for doing an important job -- and seeing a long Model through to completion.

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
SekeRob

Joined: 21 Nov 06
Posts: 20
Credit: 318,377
RAC: 0
Message 32458 - Posted: 5 Feb 2008, 11:12:48 UTC - in response to Message 32457.  
Last modified: 5 Feb 2008, 11:31:20 UTC

We all have our bad hair days. I've not many credits going for this project because time and again the model goes and breaks off, be it at 100 hours or at 1681 hours as one did. Make it robust and fix it so that it can restore without all the hoopla contributors have to go through. Also fix the due date issue and set it to something that won't push BOINC into Earliest Deadline First mode as that 'fictitious' deadline approaches.

2 cents
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 32459 - Posted: 5 Feb 2008, 14:00:50 UTC
Last modified: 6 Feb 2008, 1:19:38 UTC

Hi Sekerob

The deadline date for all models currently being issued has been lengthened considerably, which should help multi-project crunchers.

Backing up the BOINC folder regularly shouldn't be difficult, even for newbies, who can use Les's easy manual method, which only takes a few minutes. It's the first method described in the README collection about backups (link in my sig). Restoring a backup after a model crash is almost always successful in my experience. Les's restore method, described click-by-click, is just as easy.

As you're not a BOINC newbie you may prefer to try one of the more sophisticated backup methods!


astroWX
Volunteer moderator

Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 32466 - Posted: 6 Feb 2008, 0:24:21 UTC - in response to Message 32458.  

> I've not many credits going for this project because time and again the model goes and breaks off

Your computers are hidden so we can\'t see the failed Models and their error messages. If you make the machines visible, perhaps we can help you solve the problem(s).

SekeRob

Send message
Joined: 21 Nov 06
Posts: 20
Credit: 318,377
RAC: 0
Message 32468 - Posted: 6 Feb 2008, 11:56:21 UTC - in response to Message 32466.  

> > I've not many credits going for this project because time and again the model goes and breaks off
>
> Your computers are hidden so we can't see the failed Models and their error messages. If you make the machines visible, perhaps we can help you solve the problem(s).

That's the way they are going to remain. I know what caused each failure except the longest one, which ran for 1681 hours, and I am fairly sure that one was file corruption on a disk progressively going up the famous creek.

In the backup / restore procedures there are a number of items I may have overlooked:

1. What if the unattended client already communicated back to CPDN that the model crashed?
2. How do you stop CPDN from communicating a 'crash' message on a multiproject / multicore system? Suspend networking altogether? But then,
3. Running the projects dry takes time on a 4-core machine, particularly if there are a few days of buffered work and some projects like QMC are running day-long models. Eventually the result upload has to be done. How can you be sure the bad CPDN project does not 'tell' that things went bad on the first internet connection?

One thing I don't/won't engage in is meddling in the client_state.xml

The wiki suffers from tunnel vision and is long behind on present crunching conditions. Partial quotation:
#5 We don't want BOINC to contact other projects with information about runs that were in progress at the time of the backup but have since been completed, so:

1. Disconnect internet
2. Start BOINC and detach all projects except ClimatePrediction.net.
3. Reconnect internet

#5 Finish the Climateprediction.net Work Unit

Is this a serious proposition, with the present model at 69 hours done and 330 hours to go on a multicore? Maybe CPDN should recommend participating only with single-core machines (ancient, slow, disproportionate amount of electricity use).

Don't want to be a PITA, but the solution could be more of the 21st century.

cheers


Coelum Non Animum Mutant, Qui Trans Mare Currunt (They change their sky, not their soul, who run across the sea)
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 32469 - Posted: 6 Feb 2008, 13:24:34 UTC - in response to Message 32468.  
Last modified: 6 Feb 2008, 13:38:31 UTC

> ...
> In the backup / restore procedures there are a number of items I may have overlooked:
>
> 1. What if the unattended client already communicated back to CPDN that the model crashed?


Doesn't matter - you do get a 'result refused' message when the result reports for the second time, but this makes no difference. Similarly the original error report will stick on the website, but this isn't important (trickles are still received, credit still generated, and the scientific data still collected).

> 2. How do you stop CPDN from communicating a 'crash' message on a multiproject / multicore system? Suspend networking altogether? But then,
> 3. Running the projects dry takes time on a 4-core machine, particularly if there are a few days of buffered work and some projects like QMC are running day-long models. Eventually the result upload has to be done. How can you be sure the bad CPDN project does not 'tell' that things went bad on the first internet connection?


As long as you don\'t mind that the web site claims that the result has crashed out (which has no effect on subsequent processing) you needn\'t bother to try to block the crash message.


> One thing I don't/won't engage in is meddling in the client_state.xml
>
> The wiki suffers from tunnel vision and is long behind on present crunching conditions. Partial quotation:
> #5 We don't want BOINC to contact other projects with information about runs that were in progress at the time of the backup but have since been completed, so:
>
> 1. Disconnect internet
> 2. Start BOINC and detach all projects except ClimatePrediction.net.
> 3. Reconnect internet
>
> #5 Finish the Climateprediction.net Work Unit



As you have noted, the wiki is very out of date now that Paul D Buck has stopped maintaining it. We're trying to maintain more up-to-date info on the forums instead. I don't know why the wiki says there is a problem with sending duplicate 'finished' reports to other projects; I've never tried that. Is it a real problem? Won't the duplicate simply be rejected as 'already received'?


> Is this a serious proposition, with the present model at 69 hours done and 330 hours to go on a multicore? Maybe CPDN should recommend participating only with single-core machines (ancient, slow, disproportionate amount of electricity use).
>
> Don't want to be a PITA, but the solution could be more of the 21st century.
>
> cheers



BOINC is quite awkward when it comes to restoring single models from multicore systems. I've raised this in /trac + bugzilla reports, but so far they've been ignored. The easiest thing to do is just to restore everything together, although this results in some CPU time being wasted for the other climate models on the same host. RRodway wrote an automated backup system so that you can take daily backups without having to spend time at each host.

Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 32477 - Posted: 6 Feb 2008, 20:05:26 UTC


My solution for a model crashing on a quad core is to just let it stay dead. There are millions more combinations to try.

Most of the problems people have are caused either by operator ignorance (e.g. not shutting down BOINC before 'pulling the plug') or by hardware problems, possibly the most common being a power supply unit that is too small, as people try to use anything they can get their hands on to run science apps that are too 'big' for the computer.

As Mike said, the Wiki is 'dead', Long Live the README files. :)

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 32488 - Posted: 7 Feb 2008, 1:24:37 UTC
Last modified: 7 Feb 2008, 1:53:49 UTC

Has anyone got a link to the Wiki page Sekerob referred to? (I've never learned to navigate the unofficial BOINC Wiki and only find random pages by chance... and rarely find my way back to the same pages.)

Maybe some of us should learn how to Wiki-edit, or at least delete UBW stuff that's no longer advisable, or provide links to README posts?

Mike said

> I've raised this in /trac + bugzilla reports, but so far they've been ignored.

My comments on this Trac ticket (unrelated to the content of this thread) have also been ignored for months, though I think I asked for something that should matter to us all.

When you make a backup of the BOINC folder contents, it does help if you haven't got 100+ tasks from other projects in progress or waiting to be crunched (yes, this is possible!), because if you restore the backup you need to know which tasks have already been completed and therefore must be aborted to avoid crunching them a second time.

So backups are easier if you don't also crunch other projects that send large numbers of short workunits. This is why some people who have more than one computer reserve one or more computers to crunch only CPDN.
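The bookkeeping after a restore boils down to a set intersection: any task that came back with the restored backup but was already completed after the backup was taken should be aborted. A minimal Python sketch with hypothetical task names and a made-up function name; how you gather the two lists is left open, and the longer the lists, the more error-prone doing this by hand becomes:

```python
def tasks_to_abort(restored_tasks: set[str], completed_since_backup: set[str]) -> list[str]:
    """Return the tasks that came back with the restore but were already finished after the backup.

    Crunching these again would only duplicate work that has already been done.
    """
    return sorted(restored_tasks & completed_since_backup)


if __name__ == "__main__":
    # Hypothetical task names, for illustration only.
    restored = {"cpdn_model_1", "other_wu_17", "other_wu_18", "other_wu_19"}
    completed = {"other_wu_17", "other_wu_18", "other_wu_20"}
    print(tasks_to_abort(restored, completed))  # ['other_wu_17', 'other_wu_18']
```

With only a CPDN model on the machine, the intersection is empty and there is nothing to abort, which is the point of dedicating a computer to CPDN.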
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 32499 - Posted: 7 Feb 2008, 8:21:46 UTC
Last modified: 7 Feb 2008, 8:34:13 UTC


It's here: http://www.boinc-wiki.info/Main_Page I also find it impossible to navigate. There are 1,300 pages ...


UBW?
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 32505 - Posted: 7 Feb 2008, 15:01:14 UTC

> UBW?


Unofficial BOINC Wiki! At first I thought it might be a cricketing term meaning umbilicus before wicket.
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 32512 - Posted: 7 Feb 2008, 19:01:41 UTC


<grins>



©2024 climateprediction.net