Name BOINC mis-estimates runtime for hadam3pm2_k00w

Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,888,554
RAC: 1,481,373
Message 50894 - Posted: 28 Nov 2014, 9:18:18 UTC

There has been a smallish batch of these (alias MOSES) that came out in the last few days.

What BOINC reports about estimated completion time is obviously totally wrong. Either one day of full-time computing is 1%, meaning that completing the wu would take 100 days, or, if there are only 10 uploads for these old MOSES things, it would only be 10-20 days.

The problem is that I got flooded with these wu's because of the initial estimate of a day or 3, and they now look like they might run for half a year.

If they are like the long-ago MOSES models, they probably run for 10 uploads, which would give an estimate of only a week or two.

What to think?

I've choked down my acceptance factor from 6 days to 1, because I've no clue how long these things will run.

Any ideas?

ID: 50894
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,376,846
RAC: 3,590
Message 50895 - Posted: 28 Nov 2014, 10:04:10 UTC

You can probably work it out by now based on percentage completed. :)
ID: 50895
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50896 - Posted: 28 Nov 2014, 10:39:13 UTC - in response to Message 50894.  

Yes, bad estimate.
This is still set for either 1 or 2 years, whatever was used in testing.
So the total time is either 5 or 10 times the figure used.

Got one on each of my machines.

ID: 50896
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,888,554
RAC: 1,481,373
Message 50899 - Posted: 29 Nov 2014, 0:39:38 UTC - in response to Message 50895.  
Last modified: 29 Nov 2014, 0:45:20 UTC

Naah, the percent-completed in boincmgr is totally out-of-range.

The 3 uploads in 3 days do give a rough estimate, though, if there are 10 uploads to completion.

So what I figure is maybe 300-500 hours on my fastest machine. Give or take a week or so.

So that's okay. It seems that in reality, but NOT in the BOINC estimate, these wu's will complete (we hope) in a reasonable week or two on recent machines. Very good.

So, anybody looking at this thread: please don't worry about the crazy BOINC percents and completion times. Both are wrong; expect a week or three.

Don't panic.


And don't go killing these MOSES or hadam3pm2 wu's -- there were problems with them a few months ago, but this batch seems OK -- we really need to test these models again.

Keep on crunching
e



ID: 50899
Helmer Bryd

Joined: 16 Aug 04
Posts: 147
Credit: 7,748,561
RAC: 8,366
Message 50901 - Posted: 29 Nov 2014, 22:05:10 UTC

Seems the same as before, can't handle stop-restarts, so I aborted it: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=17516381
ID: 50901
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50902 - Posted: 30 Nov 2014, 19:16:49 UTC - in response to Message 50899.  

The one running solo on my Ivy Bridge has just created zip 8 at 96 hours.
This means about 144 hours for the full run, if it goes that far.

The one on the Haswell is a bit faster, so a few less hours.

ID: 50902
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50914 - Posted: 1 Dec 2014, 20:39:42 UTC

Both have now finished, with a total of 10 zips. The last one is a bit over 95 Megs.

The Ivy Bridge took 120 hours 37 minutes.
The Haswell took 113 hours.
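
For anyone wanting to repeat this kind of estimate on their own tasks, here is a minimal Python sketch of extrapolating from the zips uploaded so far. It assumes the run produces 10 zips in total (as these two did) and that the zips arrive at roughly even intervals, which may not hold for every batch. The function name is made up for illustration and is not part of BOINC or the CPDN software.

```python
def estimate_total_hours(elapsed_hours, zips_done, zips_total=10):
    """Linear extrapolation of total runtime from the zips uploaded so far.

    Assumes zips are produced at evenly spaced intervals and that the
    run ends after zips_total zips (10 for the tasks in this thread).
    """
    if zips_done <= 0:
        raise ValueError("need at least one completed zip")
    return elapsed_hours * zips_total / zips_done


# Ivy Bridge figures from earlier in the thread: zip 8 appeared at 96 hours.
ivy_estimate = estimate_total_hours(96, 8)
print(f"Estimated total: {ivy_estimate:.0f} hours")
```

With those Ivy Bridge figures the extrapolation lands close to the 120 hours 37 minutes the task actually took.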


ID: 50914
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,888,554
RAC: 1,481,373
Message 50955 - Posted: 10 Dec 2014, 7:38:45 UTC

I got overloaded with the MOSES global II things -- mostly by mishap of bad timing combined with the initial misunderestimation of probable runtime.
I've suspended all others. The fastest I get, running 5 tasks on 4 real cores + 1 HT on Ivy Bridge, is 180 hours. The slowest is 340+ hours on the oldest AMD (no HT).

I think the difficult part of running these things is just the "never interrupt after starting" thing.

The mis-estimation of time to end is mostly fixed, with a few bizarre remaining glitches. For example, why does the next-to-last (but not the last) model think it needs 1700 hours to complete?

Anyhow, the underestimation seems to be over, so I'll work through the couple dozen of this batch remaining here OK, probably.

Thanks all.
ID: 50955
WB8ILI

Joined: 1 Sep 04
Posts: 161
Credit: 81,421,805
RAC: 1,225
Message 50987 - Posted: 17 Dec 2014, 13:20:39 UTC
Last modified: 17 Dec 2014, 13:24:19 UTC

I know (or think I know) that the MOSES models run for 10 years. I am referring to the am3pm2 (LINUX) models only.

Does anyone know what the timestep is? I know ANZ is 5 minutes and the PNW tasks don't show the time step on the graphic display.

If I knew how many timesteps there are in 10 years, I could look at my trickles and estimate how long the task might run.
ID: 50987
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,376,846
RAC: 3,590
Message 50989 - Posted: 17 Dec 2014, 16:19:58 UTC - in response to Message 50987.  
Last modified: 17 Dec 2014, 16:43:59 UTC

I haven't had any MOSES models from this, the main site, yet, only from the beta site, where some of the estimates have been wildly out.

Don't know if it helps, but on the i3 CPU in this box, the one I have completed, an East Asia model, took 779,535.90 s of CPU time.
A second East Asia model is almost 99% complete at 347 hours, with about 4.5 hours to go.
An Afr MOSES task is showing as 13.2% complete after 113 hours, with 194.5 hours showing as remaining time.

The eas task checkpoints about once an hour, the afr about every 3 hours 20 minutes.

I am going to wait till the afr completes before drawing any conclusions.
ID: 50989
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50990 - Posted: 17 Dec 2014, 20:21:50 UTC - in response to Message 50987.  
Last modified: 18 Dec 2014, 3:10:58 UTC

Mine too are mostly on the beta site, but there's this one from earlier in the year. Click on the trickles page for the full list.

Keep in mind that the "MOSES" models are changing a bit over time, at least on beta. The ones I'm testing at the moment are MOSES II + Triffid.
I don't know which ones will be used on the main site.
ID: 50990
WB8ILI

Joined: 1 Sep 04
Posts: 161
Credit: 81,421,805
RAC: 1,225
Message 50993 - Posted: 18 Dec 2014, 0:50:24 UTC

Les -

Thanks for the info.

Assuming the current models are the same as the beta, the link you gave me shows about 309,000 time steps. My task that I am looking at has gone about 31,000 time steps. So, maybe it is about 10% done. Not the 0.832% done as BOINC shows.
ID: 50993
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,888,554
RAC: 1,481,373
Message 51000 - Posted: 21 Dec 2014, 5:15:49 UTC

I think that if you multiply the "progress" shown in BoincManager by 12 and apply that to the "elapsed" you will get a real good estimate of the actual progress.

In other words, to get a good estimate of total run time (CPU), figure elapsed / (progress * 12), with elapsed and progress as shown in BoincMgr 7.42.

For example: elapsed = 127 hours, progress = 5.792%, so 127 / (0.05792 * 12) = about 183 hours.
(example from one of my running tasks)
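
The same rule of thumb as a minimal Python sketch, for anyone who wants to plug in their own numbers. The factor of 12 is only what has been observed on this batch of hadam3pm2 MOSES tasks and may not apply to other models; the function name is made up for illustration.

```python
def estimate_total_runtime(elapsed_hours, progress_percent, scale=12):
    """Estimate total runtime from BOINC Manager's elapsed time and progress.

    scale=12 is the empirical correction seen on this batch: the real
    progress appears to be about 12 times what BOINC Manager displays.
    """
    real_fraction = progress_percent / 100 * scale
    if real_fraction <= 0:
        raise ValueError("progress must be greater than zero")
    return elapsed_hours / real_fraction


# Figures from the example above: 127 hours elapsed, 5.792% shown.
total = estimate_total_runtime(127, 5.792)
print(f"Estimated total runtime: {total:.0f} hours")
print(f"Estimated remaining:     {total - 127:.0f} hours")
```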

This has been true for all the few dozen of these hadam3pm2 global MOSES from the main site that my machines have completed, and the ones running now. Why I dunno.
This estimate is much, much better than whatever insane underestimate caused my boxes to load up so many of these models when they first came out. It is also better than some later estimates, where the corrections made by Oxford staff worked but, combined with BOINC trying to correct for the earlier underestimate, pushed some of BOINC's estimates up into the 1600 hour range :) I'm thinking the outlandish values of 4 or 5 for <duration_correction_factor> in client_state.xml had something to do with that.

Anyhow, like I posted elsewhere a while ago, and saw on some announcements thread, these models are priority, not quite ready for prime time (Windows hosts), and need all the help we can give.

They are also fragile, in that they will lose an upload file any time they are suspended by user or stopped and restarted for reboot. I think the _k*** and _i*** series from the main site (not the beta site) are actually like a beta-2 series.
Anyhow,
ID: 51000
WB8ILI

Joined: 1 Sep 04
Posts: 161
Credit: 81,421,805
RAC: 1,225
Message 51016 - Posted: 22 Dec 2014, 19:26:11 UTC

Eirik -

I think you have something there.

How do yours look if you just ignore the Progress %, and just look at the Elapsed and Remaining time (added together for total run time)?


ID: 51016
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 51044 - Posted: 26 Dec 2014, 1:37:22 UTC

I have a couple of these running right now. Not _k00w.

One claims to be 7.935% done, elapsed 280:54:41, remaining 333:32:32
One claims to be 7.836% done, elapsed 280:51:42, remaining 333:35:04

Do not pay attention to the seconds as they were changing as I wrote them down.

I am not worried about this, but it is sure confusing.
ID: 51044
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 51045 - Posted: 26 Dec 2014, 2:24:12 UTC

Sorry, I forgot to re-visit this thread after my 2 finished.

here on an Ivy Bridge, which took 120 hours and 37 minutes, and
here on a Haswell which took 113 hours.

ID: 51045
WB8ILI

Joined: 1 Sep 04
Posts: 161
Credit: 81,421,805
RAC: 1,225
Message 51053 - Posted: 27 Dec 2014, 14:09:30 UTC

Les - I think you gave me the piece of information I was looking for. The last trickle on the two links you sent shows 348,548 time steps.

Therefore, on one of my tasks the last trickle is at 129,000 time steps, so I am thinking it is a little more than 1/3 done.

The task shows 3.183 percent done. This coincides with Eirik's observation (stated a different way) that the task will be done when the percentage reaches 8.333%.
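
A small Python sketch of that comparison, for checking other tasks the same way. It assumes a full run is 348,548 timesteps, as on the two finished tasks, and that other tasks in the batch have the same length. The function name is made up for illustration.

```python
def timestep_progress(steps_done, steps_total):
    """Fraction of the run completed, judged by model timesteps."""
    if steps_total <= 0:
        raise ValueError("steps_total must be positive")
    return steps_done / steps_total


# Last trickle on my task versus the last trickle on a finished task.
real_fraction = timestep_progress(129_000, 348_548)
boinc_fraction = 3.183 / 100  # progress as displayed by BOINC

# How far BOINC's figure is off, and where it would stand at completion.
scale = real_fraction / boinc_fraction
print(f"Real progress: {real_fraction:.1%}")
print(f"BOINC is low by a factor of about {scale:.1f}")
print(f"BOINC percent at completion: about {100 / scale:.1f}%")
```

The factor comes out a little under 12, which is about as close as can be expected when the trickle count and the displayed percentage are read at slightly different moments.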
ID: 51053
Helmer Bryd

Send message
Joined: 16 Aug 04
Posts: 147
Credit: 7,748,561
RAC: 8,366
Message 51054 - Posted: 27 Dec 2014, 18:39:02 UTC
Last modified: 27 Dec 2014, 18:55:43 UTC

Yeah, progress on these is calculated for a 120-year run.

I complained about this months ago in beta, but nothing has happened. Well, well...
ID: 51054
MyLittleBoinc

Joined: 31 Mar 13
Posts: 44
Credit: 6,950,896
RAC: 0
Message 51072 - Posted: 30 Dec 2014, 10:07:38 UTC

Some of my hadam3pm2 wu's have had interruptions and I guess that's why they finish with an error of zip files being absent. Are these wu's still useful for something or could they just as well have been aborted?
ID: 51072
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,888,554
RAC: 1,481,373
Message 51074 - Posted: 30 Dec 2014, 10:27:34 UTC - in response to Message 51072.  

I've (or rather my machines have) run a few dozen of these, and my experience has been that, after the models actually start running, any interruption will result in a lost upload and, eventually, in a failed wu.

What that's worth for the science -- I have no clue. Hope it helps.

I've just let them run to end, mostly.

Luckily, there have been no power outages where I live in the last few months, and I've learned to postpone upgrades that need a reboot until these MOSES jobs are cleared from my boxes. Like I said elsewhere, these are beta-2, and by what was posted, they are priority models.

Right now, I run these things as a top priority, and will have another dozen completing near New Year's. But they are fragile and don't deal well with any kind of interruption (after they actually start).

About my original complaint, totally whacko underestimates of runtime that left me with more than a months worth of models that "shoulda" finished in a week -- heh heh -- minor minor worry.



ID: 51074

©2024 climateprediction.net