Sorting for platform

Message boards : Number crunching : Sorting for platform

Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4347
Credit: 16,541,921
RAC: 6,087
Message 41211 - Posted: 3 Dec 2010, 15:56:34 UTC

I read a while ago that with FAMOUS, if one task crashes on a particular OS/CPU combination it is likely that the rest of the tasks from that work unit will do likewise; conversely, if one task succeeds with a particular combination, the others are likely to succeed too. Would it be possible to get a greater number of models through by seeing what happens with the first tasks of a batch to go out, and then sending models to where they are most likely to complete? I know there must be other criteria around sending out models which may make this impossible, or too time-intensive (computer or human) for it to work. I am sure others have thought of this, but I haven't seen it discussed here....
ID: 41211 · Report as offensive     Reply Quote
Profile Iain Inglis
Volunteer moderator

Send message
Joined: 16 Jan 10
Posts: 1081
Credit: 7,026,771
RAC: 4,684
Message 41212 - Posted: 3 Dec 2010, 17:31:51 UTC
Last modified: 3 Dec 2010, 17:33:32 UTC

Unless I'm missing something, it ought to be quite easy to arrange. If you're a Mac/Linux user and try to request a HADAM3P regional model, no model will be supplied, because there is no Mac/Linux application (at the moment). So, suppose three related 'applications' were created, each supporting one platform, instead of the current system of one 'application' supporting three platforms. Three identical sets of work units could be created with restrictive 'initial replication' etc., and each platform cohort would work its way through its WUs independently. To avoid the complaint that minor platforms are just reproducing work already done by major platforms, adjust the WU generation process so that each platform starts from a different place in the master WU list (e.g. Linux from the top down, Mac from the middle up).

All the WUs would eventually be covered by each platform.

Then add result validation, not as a credit-allocation method but as a work-allocation strategy to prevent multiple identical completions, and the efficiency of CPDN would be transformed.

Or perhaps not.
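The per-platform ordering sketched above could look something like this. A toy sketch only: the function name is invented, and the Windows direction is an assumption (Iain only specified Linux and Mac), so this is not actual CPDN server code.

```python
# Each platform-specific 'application' walks the same master WU list,
# but from a different starting point, so minor platforms don't just
# repeat whatever the major platform has already finished.

def platform_order(master_wus, platform):
    """Return the master WU list reordered for one platform."""
    n = len(master_wus)
    if platform == "linux":            # Linux: top down, as suggested
        return list(master_wus)
    if platform == "mac":              # Mac: from the middle, wrapping round
        mid = n // 2
        return master_wus[mid:] + master_wus[:mid]
    if platform == "windows":          # Windows: bottom up (assumed here)
        return list(reversed(master_wus))
    raise ValueError(platform)

wus = list(range(10))
print(platform_order(wus, "mac"))      # [5, 6, 7, 8, 9, 0, 1, 2, 3, 4]
```

Eventually every cohort covers every WU, just in a different order.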
ID: 41212 · Report as offensive     Reply Quote
Profile mo.v
Volunteer moderator

Send message
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 41214 - Posted: 3 Dec 2010, 18:23:04 UTC
Last modified: 3 Dec 2010, 18:26:58 UTC

I agree with Iain.

If the concept of a 'trusted computer' (one which almost always completes its tasks and generates valid results) could be added, it would eliminate more duplication: a trusted computer would get the only task from a work unit. I don't know whether BOINC has already adopted this concept or whether it's just an idea for future development.
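A minimal sketch of that idea, with an invented threshold and function names (not actual BOINC server logic):

```python
# A host that has returned enough consecutive valid results gets a work
# unit on its own; anyone else gets a backup copy sent alongside.

TRUST_THRESHOLD = 10  # consecutive valid results needed (assumed value)

def initial_replication(consecutive_valid_results, min_quorum=1):
    """How many copies of a new work unit to send out."""
    if consecutive_valid_results >= TRUST_THRESHOLD:
        return min_quorum          # trusted: a single copy is enough
    return min_quorum + 1          # untrusted: keep a backup copy

print(initial_replication(12))     # 1
print(initial_replication(3))      # 2
```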
Cpdn news
ID: 41214 · Report as offensive     Reply Quote
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4347
Credit: 16,541,921
RAC: 6,087
Message 41218 - Posted: 3 Dec 2010, 20:56:08 UTC

That was, I think, what I was groping towards; I wanted to float it to see if I was missing something obvious. I had also noticed that sometimes four or five tasks from a work unit go out, all to the same platform, and all fail. I also wondered whether tasks that crash on Linux could then be tried on Windows or Mac, and vice versa?
ID: 41218 · Report as offensive     Reply Quote
Profile astroWX
Volunteer moderator

Send message
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 41219 - Posted: 3 Dec 2010, 21:45:18 UTC

For what it's worth, WCG provides capability for Projects to use the "trusted computer" technique; see Single Validation – Type 1:
http://www.worldcommunitygrid.org/help/viewTopic.do?shortName=points#174

No indication that BOINC is involved; my guess is that it's IBM/WCG server code.
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 41219 · Report as offensive     Reply Quote
Ingleside

Send message
Joined: 5 Aug 04
Posts: 108
Credit: 19,311,535
RAC: 33,371
Message 41227 - Posted: 5 Dec 2010, 13:09:30 UTC - in response to Message 41219.  

For what it's worth, WCG provides capability for Projects to use the "trusted computer" technique; see Single Validation – Type 1:
http://www.worldcommunitygrid.org/help/viewTopic.do?shortName=points#174

No indication that BOINC is involved; my guess is that it's IBM/WCG server code.

While both "Adaptive replication" and "need_reliable" were developed by WCG, they have been part of the standard BOINC code for a long time.

While "Adaptive replication" is great for min_quorum = 2 projects, where it can reduce the average from 2.xx tasks per WU to around 1.05 - 1.10, CPDN uses min_quorum = 1, so "Adaptive replication" can't reduce this any further.

When it comes to "need_reliable", this could be an advantage, since any re-issue is guaranteed to go only to "Reliable" computers with fast turnaround times, so chances are the re-issue will be returned fairly quickly. It's also possible to set the WU priority so high at generation time that the WUs "need_reliable" from the start. But the big problem with FAMOUS is that, apart from some computers that routinely error out all WUs, most FAMOUS errors are WU-specific, so a "Reliable" computer will give the same error...

Also worth remembering: for a computer to become "Reliable", it must have enough validated results, but CPDN has never used a validator...
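The arithmetic behind the min_quorum point can be sketched with toy numbers. The reliable fraction and spot-check rate below are illustrative, not measured WCG or CPDN figures.

```python
# Adaptive replication only helps when min_quorum >= 2: if a fraction f
# of work units can go out single-copy to proven-reliable hosts (with an
# occasional spot check at rate s), the average tasks per WU drops from
# 2 toward 1. With min_quorum = 1 every WU already gets one task, so
# there is nothing left to save.

def avg_tasks_per_wu(min_quorum, reliable_fraction=0.9, spot_check_rate=0.1):
    if min_quorum < 2:
        return float(min_quorum)   # already minimal; nothing to reduce
    single = reliable_fraction * (1 + spot_check_rate)  # trusted path
    double = (1 - reliable_fraction) * min_quorum        # normal quorum path
    return single + double

print(round(avg_tasks_per_wu(2), 2))  # ~1.19 in this toy model
print(avg_tasks_per_wu(1))            # 1.0: CPDN's situation
```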
ID: 41227 · Report as offensive     Reply Quote
Profile tullio

Send message
Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 41228 - Posted: 5 Dec 2010, 13:17:33 UTC

My Linux box has errored out on 3 FAMOUS tasks. The fourth has been running for 150 hours and is still going. The CPU is an AMD Opteron 1210; the OS is SuSE Linux 11.1.
Tullio
ID: 41228 · Report as offensive     Reply Quote
Profile mo.v
Volunteer moderator

Send message
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 41229 - Posted: 5 Dec 2010, 14:13:46 UTC

Yes, the FAMOUS error rate probably makes the detection of reliable computers impossible. And any computer that's run a lot of slabs will also very probably have had iceworlds.

Would failed downloads make a computer unreliable?


Cpdn news
ID: 41229 · Report as offensive     Reply Quote
Ingleside

Send message
Joined: 5 Aug 04
Posts: 108
Credit: 19,311,535
RAC: 33,371
Message 41232 - Posted: 5 Dec 2010, 22:20:26 UTC - in response to Message 41229.  

Yes, the FAMOUS error rate probably makes the detection of reliable computers impossible. And any computer that's run a lot of slabs will also very probably have had iceworlds.

Would failed downloads make a computer unreliable?

Well, without a validator no computer will become reliable... :)

But as far as download errors are concerned, they decrease the daily quota, and any computer with a decreased daily quota is not "Reliable". Since the quota increases again on "success" reports, if there's no other reason for being unreliable, a computer can very quickly be back to "Reliable" again.
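The quota behaviour described here, as this thread understands it for CPDN's server version of the time, can be sketched as follows. The cap is an assumed setting, and the exact rules differ between BOINC server versions, so treat this as illustrative only.

```python
# Each error (including a failed download) drops the daily quota by 1;
# each "success" report doubles it again, up to the maximum.

MAX_QUOTA = 100  # assumed server setting
MIN_QUOTA = 1

def update_quota(quota, reported_success):
    if reported_success:
        return min(MAX_QUOTA, quota * 2)
    return max(MIN_QUOTA, quota - 1)

q = MAX_QUOTA
for _ in range(5):                 # five failed downloads in a row
    q = update_quota(q, False)
print(q)                           # 95
q = update_quota(q, True)          # one success report
print(q)                           # back at the cap of 100
```

This is why, under these rules, a host recovers quickly unless it errors out on everything.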


The problem with FAMOUS is that if, for example, the first copy is sent to an "Intel + Windows" machine and this gives an error, there's a fairly good chance that the "Reliable" computer that gets the re-issue will also be "Intel + Windows", and in most instances this means the exact same error. So being "Reliable" doesn't really mean much for FAMOUS, since it's the WUs themselves that are unstable, not the majority of computers.
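A hypothetical way a scheduler could act on this, retrying on a platform that has not yet failed the work unit, as Dave asked about earlier in the thread (purely invented logic, not BOINC server code):

```python
# When a task errors, prefer re-issuing to a platform that hasn't yet
# failed on this work unit, rather than to another host of the same kind.

PLATFORMS = ["windows", "linux", "mac"]

def pick_retry_platform(failed_platforms):
    """Choose a platform for the re-issued task, if any remain untried."""
    untried = [p for p in PLATFORMS if p not in failed_platforms]
    return untried[0] if untried else None  # None: probably a bad WU

print(pick_retry_platform({"windows"}))                   # 'linux'
print(pick_retry_platform({"windows", "linux", "mac"}))   # None
```

If every platform has failed, the WU itself is likely unstable and not worth re-issuing at all.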
ID: 41232 · Report as offensive     Reply Quote
Les Bayliss
Volunteer moderator

Send message
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 41233 - Posted: 5 Dec 2010, 22:36:38 UTC - in response to Message 41232.  

So: a non-BOINC script that scans the computer list based on project-defined criteria.


Backups: Here
ID: 41233 · Report as offensive     Reply Quote
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4347
Credit: 16,541,921
RAC: 6,087
Message 41236 - Posted: 6 Dec 2010, 13:12:03 UTC - in response to Message 41228.  

Worth checking what happened to the other tasks in those work units. Having read some of the posts on this subject in the past: if one task in a work unit crashes with a particular CPU/OS combination, then the others likely will too. The spinup models are also more likely than others to crash, but those that do work are used for generating more models; the spinup models all start with a year somewhere around 499. I have also had quite a few models crash, but on looking here I see it is not the fault of the computer: looking at the work units, the other tasks have also failed to complete, both on Windows and on Linux. For some, the model itself is unstable and ends up with a negative value for air pressure. There are other impossible values that cause a crash too, at which point the most useful thing for your computer to do is report the problem and download another work unit.
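The "impossible values" check described above could, in spirit, look like this. The field names and limits are invented for illustration and are not taken from the actual HadCM/FAMOUS model code.

```python
# If the model state reaches an impossible value (e.g. non-positive
# surface pressure), the run cannot be rescued, so the client should
# report the failure and fetch new work.

def state_is_physical(surface_pressure_hpa, temperature_k):
    """Return False when the model state is unrecoverable."""
    if surface_pressure_hpa <= 0:            # negative pressure: model blown up
        return False
    if not (100.0 < temperature_k < 400.0):  # wildly unphysical temperature
        return False
    return True

print(state_is_physical(1013.25, 288.0))  # True: normal sea-level state
print(state_is_physical(-5.0, 288.0))     # False: report and move on
```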
ID: 41236 · Report as offensive     Reply Quote
transient

Send message
Joined: 3 Oct 06
Posts: 43
Credit: 8,017,057
RAC: 0
Message 41238 - Posted: 6 Dec 2010, 17:01:19 UTC - in response to Message 41232.  
Last modified: 6 Dec 2010, 17:02:35 UTC

But as far as download errors are concerned, they decrease the daily quota, and any computer with a decreased daily quota is not "Reliable". Since the quota increases again on "success" reports, if there's no other reason for being unreliable, a computer can very quickly be back to "Reliable" again.


There are no actual "success" tasks, right? Only completed ones. By that measure, a daily quota can never recover. Does that mean a 'failure' on CPDN will not actually decrease the quota for that host? Or is the original quota restored?
Hence, maybe, the problem with minussed hosts that do not stay minussed?
ID: 41238 · Report as offensive     Reply Quote
Profile mo.v
Volunteer moderator
Avatar

Send message
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 41244 - Posted: 7 Dec 2010, 15:11:13 UTC
Last modified: 7 Dec 2010, 15:15:45 UTC

I think the problem with the minussed computers that do not remain minussed is something completely separate. This minussing/unminussing problem isn't linked to the usual quota mechanism; there must be some defect in CPDN's BOINC server version that nobody has been able to identify.

(If I am honest I have to say that this is not the only defect.)

Even if CPDN had a validator that could identify reliable computers, and could identify the HadSM and FAMOUS models that cannot complete on some types of computer because of inherent but usually unpredictable problems with certain parameter values, there would still be the current problem of failed downloads, because the server appears unable to cope continuously.

I counted the successful and failed downloads for all the members who joined CPDN on one day a few days ago: 103 models downloaded successfully and 23 failed. In most cases this isn't the fault of the computer, but its daily quota is still reduced by 1 for each failed download.
Cpdn news
ID: 41244 · Report as offensive     Reply Quote
Ingleside

Send message
Joined: 5 Aug 04
Posts: 108
Credit: 19,311,535
RAC: 33,371
Message 41246 - Posted: 8 Dec 2010, 21:43:31 UTC - in response to Message 41238.  

There are no actual "success" tasks, right? Only completed ones. By that measure, a daily quota can never recover. Does that mean a 'failure' on CPDN will not actually decrease the quota for that host? Or is the original quota restored?
Hence, maybe, the problem with minussed hosts that do not stay minussed?

I didn't mean a "success" task, but "reported as success": at least with the "old-style" quota code, every time a "success" was reported by the client the quota was doubled (if it wasn't already at max).

So, a "success" report means "the client reports the task has finished without any errors".

Whether this later changes to invalid or something else, once a validator gets involved, is another matter...

BTW, apparently the web pages have been changed, so task status doesn't say "success" any longer, but rather "Completed, waiting for validation" or another variant of "Completed...".


As for how the "new-style" per-application quota system works, I haven't looked up the new code, but at least going by another project, "pending" tasks don't change the quota; only validated tasks do...

But this obviously can't be the case for the server code CPDN is using, since if it were, most CPDN computers would by now sit with a quota of 1 per application.
ID: 41246 · Report as offensive     Reply Quote
