Sorting for platform

Message boards : Number crunching : Sorting for platform

Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4347
Credit: 16,541,921
RAC: 6,087
Message 41211 - Posted: 3 Dec 2010, 15:56:34 UTC

I read a while ago that with FAMOUS, if one task crashes on a particular OS/CPU combination it is likely that the rest of the tasks from that work unit will do likewise; conversely, if one task succeeds with a particular combination, the others are likely to succeed too. Would it be possible to get a greater number of models through by seeing what happens with the first tasks of a batch to go out, and then sending models to where they are most likely to complete? I know there must be other criteria around sending out models which may make this impossible, or too time-intensive (computer or human) for it to work. I am sure others have thought of this, but I haven't seen it discussed here....
ID: 41211 · Report as offensive     Reply Quote
Profile Iain Inglis
Volunteer moderator

Send message
Joined: 16 Jan 10
Posts: 1081
Credit: 7,026,771
RAC: 4,684
Message 41212 - Posted: 3 Dec 2010, 17:31:51 UTC
Last modified: 3 Dec 2010, 17:33:32 UTC

Unless I'm missing something, it ought to be quite easy to arrange. If you're a Mac/Linux user and try to request a HADAM3P regional model, no model will be supplied, because there is no Mac/Linux application (at the moment). So, suppose three related 'applications' were created, each supporting one platform, instead of the current system of one 'application' supporting three platforms. Three identical sets of work units could be created with restrictive 'initial replication' etc., and each platform cohort would work its way through its WUs independently. To avoid the complaint that minor platforms are just reproducing work already done by major platforms, adjust the WU generation process so that each platform starts from a different place in the master WU list (e.g. Linux from the top down, Mac from the middle up).

All the WUs would eventually be covered by each platform.

Then add result validation, not as a credit-allocation method but as a work-allocation strategy to prevent multiple identical completions, and the efficiency of CPDN would be transformed.

Or perhaps not.
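The per-platform ordering sketched above could look something like this. A toy sketch only: the function name is invented, and the Windows direction is an assumption (Iain only specified Linux and Mac), so this is not actual CPDN server code.

```python
# Each platform-specific 'application' walks the same master WU list,
# but from a different starting point, so minor platforms don't just
# repeat whatever the major platform has already finished.

def platform_order(master_wus, platform):
    """Return the master WU list reordered for one platform."""
    n = len(master_wus)
    if platform == "linux":            # Linux: top down, as suggested
        return list(master_wus)
    if platform == "mac":              # Mac: from the middle, wrapping round
        mid = n // 2
        return master_wus[mid:] + master_wus[:mid]
    if platform == "windows":          # Windows: bottom up (assumed here)
        return list(reversed(master_wus))
    raise ValueError(platform)

wus = list(range(10))
print(platform_order(wus, "mac"))      # [5, 6, 7, 8, 9, 0, 1, 2, 3, 4]
```

Eventually every cohort covers every WU, just in a different order.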
ID: 41212 · Report as offensive     Reply Quote
Profile mo.v
Volunteer moderator

Send message
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 41214 - Posted: 3 Dec 2010, 18:23:04 UTC
Last modified: 3 Dec 2010, 18:26:58 UTC

I agree with Iain.

If the concept of a 'trusted computer' (one which almost always completes its tasks and generates valid results) could be added, it would eliminate more duplication: a trusted computer would get the only task from a work unit. I don't know whether BOINC has already adopted this concept or whether it's just an idea for future development.
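A minimal sketch of that idea, with an invented threshold and function names (not actual BOINC server logic):

```python
# A host that has returned enough consecutive valid results gets a work
# unit on its own; anyone else gets a backup copy sent alongside.

TRUST_THRESHOLD = 10  # consecutive valid results needed (assumed value)

def initial_replication(consecutive_valid_results, min_quorum=1):
    """How many copies of a new work unit to send out."""
    if consecutive_valid_results >= TRUST_THRESHOLD:
        return min_quorum          # trusted: a single copy is enough
    return min_quorum + 1          # untrusted: keep a backup copy

print(initial_replication(12))     # 1
print(initial_replication(3))      # 2
```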
Cpdn news
ID: 41214 · Report as offensive     Reply Quote
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4347
Credit: 16,541,921
RAC: 6,087
Message 41218 - Posted: 3 Dec 2010, 20:56:08 UTC

That was, I think, what I was groping towards; I wanted to float it to see if I was missing something obvious. I had also noticed that sometimes four or five tasks from a work unit go out, all to the same platform, and all fail. I also wondered whether tasks that crash on Linux could then be tried on Windows or Mac, and vice versa?
ID: 41218 · Report as offensive     Reply Quote
Profile astroWX
Volunteer moderator

Send message
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 41219 - Posted: 3 Dec 2010, 21:45:18 UTC

For what it's worth, WCG provides capability for Projects to use the "trusted computer" technique; see Single Validation – Type 1:
http://www.worldcommunitygrid.org/help/viewTopic.do?shortName=points#174

No indication that BOINC is involved; my guess is that it's IBM/WCG server code.
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 41219 · Report as offensive     Reply Quote
Ingleside

Send message
Joined: 5 Aug 04
Posts: 108
Credit: 19,311,535
RAC: 33,371
Message 41227 - Posted: 5 Dec 2010, 13:09:30 UTC - in response to Message 41219.  

For what it's worth, WCG provides capability for Projects to use the "trusted computer" technique; see Single Validation – Type 1:
http://www.worldcommunitygrid.org/help/viewTopic.do?shortName=points#174

No indication that BOINC is involved; my guess is that it's IBM/WCG server code.

While both "Adaptive replication" and "need_reliable" were developed by WCG, they have been part of the standard BOINC code for a long time.

While "Adaptive replication" is great for min_quorum = 2 projects, where it can reduce the average from 2.xx tasks per WU to around 1.05 - 1.10, CPDN uses min_quorum = 1, so "Adaptive replication" can't reduce this any further.

When it comes to "need_reliable", this could be an advantage, since any re-issue is guaranteed to go only to "Reliable" computers with fast turnaround times, so chances are the re-issue will be returned fairly quickly. It's also possible to set the WU priority so high at generation time that the WUs "need_reliable" from the start. But the big problem with FAMOUS is that, apart from some computers that routinely error out all WUs, most FAMOUS errors are WU-specific, so a "Reliable" computer will give the same error...

Also worth remembering: for a computer to become "Reliable", it must have enough validated results, but CPDN has never used a validator...
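The arithmetic behind the min_quorum point can be sketched with toy numbers. The reliable fraction and spot-check rate below are illustrative, not measured WCG or CPDN figures.

```python
# Adaptive replication only helps when min_quorum >= 2: if a fraction f
# of work units can go out single-copy to proven-reliable hosts (with an
# occasional spot check at rate s), the average tasks per WU drops from
# 2 toward 1. With min_quorum = 1 every WU already gets one task, so
# there is nothing left to save.

def avg_tasks_per_wu(min_quorum, reliable_fraction=0.9, spot_check_rate=0.1):
    if min_quorum < 2:
        return float(min_quorum)   # already minimal; nothing to reduce
    single = reliable_fraction * (1 + spot_check_rate)  # trusted path
    double = (1 - reliable_fraction) * min_quorum        # normal quorum path
    return single + double

print(round(avg_tasks_per_wu(2), 2))  # ~1.19 in this toy model
print(avg_tasks_per_wu(1))            # 1.0: CPDN's situation
```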
ID: 41227 · Report as offensive     Reply Quote
Profile tullio

Send message
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 41228 - Posted: 5 Dec 2010, 13:17:33 UTC

My Linux box has errored out on 3 FAMOUS tasks. The fourth has been running for 150 hours and is still going. The CPU is an AMD Opteron 1210; the OS is SuSE Linux 11.1.
Tullio
ID: 41228 · Report as offensive     Reply Quote
Profile mo.v
Volunteer moderator

Send message
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 41229 - Posted: 5 Dec 2010, 14:13:46 UTC

Yes, the FAMOUS error rate probably makes the detection of reliable computers impossible. And any computer that's run a lot of slabs will also very probably have had iceworlds.

Would failed downloads make a computer unreliable?


Cpdn news
ID: 41229 · Report as offensive     Reply Quote
Ingleside

Send message
Joined: 5 Aug 04
Posts: 108
Credit: 19,311,535
RAC: 33,371
Message 41232 - Posted: 5 Dec 2010, 22:20:26 UTC - in response to Message 41229.  

Yes, the FAMOUS error rate probably makes the detection of reliable computers impossible. And any computer that's run a lot of slabs will also very probably have had iceworlds.

Would failed downloads make a computer unreliable?

Well, without a validator no computer will become reliable... :)

But as far as download errors are concerned, they decrease the daily quota, and any computer with a decreased daily quota is not "Reliable". Since the quota increases again on "success" reports, if there's no other reason for being unreliable, a computer can very quickly be back to "Reliable" again.
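The quota behaviour described here, as this thread understands it for CPDN's server version of the time, can be sketched as follows. The cap is an assumed setting, and the exact rules differ between BOINC server versions, so treat this as illustrative only.

```python
# Each error (including a failed download) drops the daily quota by 1;
# each "success" report doubles it again, up to the maximum.

MAX_QUOTA = 100  # assumed server setting
MIN_QUOTA = 1

def update_quota(quota, reported_success):
    if reported_success:
        return min(MAX_QUOTA, quota * 2)
    return max(MIN_QUOTA, quota - 1)

q = MAX_QUOTA
for _ in range(5):                 # five failed downloads in a row
    q = update_quota(q, False)
print(q)                           # 95
q = update_quota(q, True)          # one success report
print(q)                           # back at the cap of 100
```

This is why, under these rules, a host recovers quickly unless it errors out on everything.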


The problem with FAMOUS is that if, for example, the first copy is sent to an "Intel + Windows" machine and this gives an error, there's a fairly good chance that the "Reliable" computer that gets the re-issue will also be "Intel + Windows", and in most instances this means the exact same error. So being "Reliable" doesn't really mean much for FAMOUS, since it's the WUs themselves that are unstable, not the majority of computers.
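A hypothetical way a scheduler could act on this, retrying on a platform that has not yet failed the work unit, as Dave asked about earlier in the thread (purely invented logic, not BOINC server code):

```python
# When a task errors, prefer re-issuing to a platform that hasn't yet
# failed on this work unit, rather than to another host of the same kind.

PLATFORMS = ["windows", "linux", "mac"]

def pick_retry_platform(failed_platforms):
    """Choose a platform for the re-issued task, if any remain untried."""
    untried = [p for p in PLATFORMS if p not in failed_platforms]
    return untried[0] if untried else None  # None: probably a bad WU

print(pick_retry_platform({"windows"}))                   # 'linux'
print(pick_retry_platform({"windows", "linux", "mac"}))   # None
```

If every platform has failed, the WU itself is likely unstable and not worth re-issuing at all.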
ID: 41232 · Report as offensive     Reply Quote
Les Bayliss
Volunteer moderator

Send message
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 41233 - Posted: 5 Dec 2010, 22:36:38 UTC - in response to Message 41232.  

So: a non-BOINC script that scans the computer list based on project-defined criteria.


Backups: Here
ID: 41233 · Report as offensive     Reply Quote
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4347
Credit: 16,541,921
RAC: 6,087
Message 41236 - Posted: 6 Dec 2010, 13:12:03 UTC - in response to Message 41228.  

Worth checking what happened to the other tasks in those work units. Having read some of the posts on this subject in the past: if one task in a work unit crashes with a particular CPU/OS combination, then the others likely will too. The spinup models are also more likely than others to crash, but those that do work are used for generating more models; the spinup models all start with a year somewhere around 499. I have also had quite a few models crash, but on looking here I see it is not the fault of the computer: looking at the work units, the other tasks have also failed to complete, both on Windows and on Linux. For some, the model itself is unstable and ends up with a negative value for air pressure. There are other impossible values that cause a crash too, at which point the most useful thing for your computer to do is report the problem and download another work unit.
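The "impossible values" check described above could, in spirit, look like this. The field names and limits are invented for illustration and are not taken from the actual HadCM/FAMOUS model code.

```python
# If the model state reaches an impossible value (e.g. non-positive
# surface pressure), the run cannot be rescued, so the client should
# report the failure and fetch new work.

def state_is_physical(surface_pressure_hpa, temperature_k):
    """Return False when the model state is unrecoverable."""
    if surface_pressure_hpa <= 0:            # negative pressure: model blown up
        return False
    if not (100.0 < temperature_k < 400.0):  # wildly unphysical temperature
        return False
    return True

print(state_is_physical(1013.25, 288.0))  # True: normal sea-level state
print(state_is_physical(-5.0, 288.0))     # False: report and move on
```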
ID: 41236 · Report as offensive     Reply Quote
transient

Send message
Joined: 3 Oct 06
Posts: 43
Credit: 8,017,057
RAC: 0
Message 41238 - Posted: 6 Dec 2010, 17:01:19 UTC - in response to Message 41232.  
Last modified: 6 Dec 2010, 17:02:35 UTC

But as far as download errors are concerned, they decrease the daily quota, and any computer with a decreased daily quota is not "Reliable". Since the quota increases again on "success" reports, if there's no other reason for being unreliable, a computer can very quickly be back to "Reliable" again.


There are no actual "success" tasks, right? Only completed ones. By that measure, a daily quota can never recover. Does that mean a 'failure' on CPDN will not actually decrease the quota for that host? Or is the original quota restored?
Hence, maybe, the problem with minussed hosts that do not stay minussed?
ID: 41238 · Report as offensive     Reply Quote
Profile mo.v
Volunteer moderator
Avatar

Send message
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 41244 - Posted: 7 Dec 2010, 15:11:13 UTC
Last modified: 7 Dec 2010, 15:15:45 UTC

I think the problem with the minussed computers that do not remain minussed is something completely separate. This minussing/unminussing problem isn't linked to the usual quota mechanism; there must be some defect in CPDN's BOINC server version that nobody has been able to identify.

(If I am honest I have to say that this is not the only defect.)

Even if CPDN had a validator that could identify reliable computers, and could identify the HadSM and FAMOUS models that cannot complete on some types of computer because of inherent but usually unpredictable problems with certain parameter values, there would still be the current problem of failed downloads, because the server appears unable to cope continuously.

I counted the successful and failed downloads for all the members who joined CPDN on one day a few days ago: 103 models downloaded successfully and 23 failed. In most cases this isn't the fault of the computer, but its daily quota is still reduced by 1 for each failed download.
Cpdn news
ID: 41244 · Report as offensive     Reply Quote
Ingleside

Send message
Joined: 5 Aug 04
Posts: 108
Credit: 19,311,535
RAC: 33,371
Message 41246 - Posted: 8 Dec 2010, 21:43:31 UTC - in response to Message 41238.  

There are no actual "success" tasks, right? Only completed ones. By that measure, a daily quota can never recover. Does that mean a 'failure' on CPDN will not actually decrease the quota for that host? Or is the original quota restored?
Hence, maybe, the problem with minussed hosts that do not stay minussed?

I didn't mean a "success" task, but "reported as success": at least with the "old-style" quota code, every time a "success" was reported by the client the quota was doubled (if it wasn't already at max).

So, a "success" report means "the client reports the task has finished without any errors".

Whether this later changes to invalid or something else, once a validator gets involved, is another matter...

BTW, apparently the web pages have been changed, so task status doesn't say "success" any longer, but rather "Completed, waiting for validation" or another variant of "Completed...".


As for how the "new-style" per-application quota system works, I haven't looked up the new code, but at least going by another project, "pending" tasks don't change the quota; only validated tasks do...

But this obviously can't be the case for the server code CPDN is using, since if it were, most CPDN computers would by now sit with a quota of 1 per application.
ID: 41246 · Report as offensive     Reply Quote
