WAH2 CREDITS SET TOO LOW

JIM Joined: 31 Dec 07 Posts: 1152 Credit: 22,060,840 RAC: 733	Message 52604 - Posted: 22 Sep 2015, 17:38:50 UTC
I believe an adjustment is needed in the number of credits awarded for running the new wah2 tasks. I am presently running wah2 on 8 of the 10 cores (on 3 different machines) that I devote to CPDN. Since I began running them, my RAC has been dropping like a stone. The problem is not the number of credits awarded per model completed, but the fact that wah2 runs much slower than other model types. On my fastest machine the hadam3p_eu models take ~102 hours to finish, whereas the wah2 models take about 300 hours. Both have the same number of time steps and produce the same number of trickles. The hadam3p_eu models take 2.52 s/timestep; wah2, on the other hand, takes 7.62 s/timestep. So the wah2 models produce only 1/3 as many credits for a given amount of time invested in crunching. If this problem isn't fixed, these tasks could become unpopular and be deselected by many users in favor of better-compensated types. The solution is to increase the number of credits per trickle (by a factor of about 3) so that they yield a similar return per hour spent crunching as the 'had' models. Since the credits don't really cost BOINC anything, the fix is a no-brainer.
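The "factor of about 3" follows directly from the two per-timestep speeds. A minimal sketch, assuming (as JIM states) that both model types have the same number of timesteps and trickles and that credit is paid per trickle; the helper names are made up for illustration:

```python
def credit_multiplier(slow_s_per_ts: float, fast_s_per_ts: float) -> float:
    """Factor by which per-trickle credit for the slower model would have to rise
    to match the faster model's credit per hour, assuming both models have the
    same number of timesteps and trickles (hypothetical helper)."""
    return slow_s_per_ts / fast_s_per_ts


def implied_timesteps(hours: float, s_per_ts: float) -> float:
    """Rough total timesteps implied by a run time and a seconds-per-timestep speed."""
    return hours * 3600 / s_per_ts


factor = credit_multiplier(7.62, 2.52)  # wah2 vs hadam3p_eu on JIM's fastest machine
print(f"credit factor needed: {factor:.2f}")
print(f"hadam3p_eu timesteps: {implied_timesteps(102, 2.52):,.0f}")
print(f"wah2 timesteps:       {implied_timesteps(300, 7.62):,.0f}")
```

The factor comes out at about 3.02, and the two implied timestep counts (roughly 146,000 and 142,000) agree to within a few percent, which fits the statement that both model types run the same number of timesteps.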

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4342 Credit: 16,497,933 RAC: 6,477	Message 52605 - Posted: 22 Sep 2015, 17:42:39 UTC - in response to Message 52604.
Noted on the moderators' email list. I can't comment personally, as these tasks aren't available for my OS. Dave

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4342 Credit: 16,497,933 RAC: 6,477	Message 52607 - Posted: 22 Sep 2015, 21:47:00 UTC - in response to Message 52606.
Based on experience, it will almost certainly get sorted at some point now that the project people are aware of it. I should check on my own tasks some time and see how they compare for value. Something I have noticed is that some of the short tasks initially show up as giving only half the credits they should, but a few hours later or the following day they show the full credit.

jrapdx Joined: 4 Jul 15 Posts: 63 Credit: 3,223,760 RAC: 0	Message 52608 - Posted: 23 Sep 2015, 0:24:47 UTC Last modified: 23 Sep 2015, 0:35:41 UTC
I haven't looked too closely at the wah2 tasks running on my Windows system; quite possibly they are slower than others. What I have noticed is the new model, hadam3pm2_*, that started ~3.5 hours ago under Linux. BOINC gave an initial estimate of remaining time of 77:05:26, but more than 3 hours later the time left was 77:01:24. Over this interval, progress was ~0.09%. The salient issue is seeing only a 4-minute reduction in time to completion while the task ran for >3 hours. IOW, going by the situation at this early point, reaching 100% progress (zero remaining hours) will require >3000 hours wall time, or ~2500 hours estimated elapsed time. Either way, that is about 10 times what, for example, hadam3prm3pm2t_eu tasks have used. It appears the hadam3pm2 model badly underestimates run time, perhaps due to a "bug" or logic error. Anyway, if my initial estimates hold up, I think it could be very discouraging for volunteers to observe such glacial progress, let alone the effect of the greatly reduced credits that result. I believe the research is important and I want to support it. Unexplained inconsistency of the sort under discussion wouldn't help convince people to continue participating. Certainly, if I've misconstrued something, I hope someone will tell me what I'm missing.

JIM Joined: 31 Dec 07 Posts: 1152 Credit: 22,060,840 RAC: 733	Message 52609 - Posted: 23 Sep 2015, 4:11:34 UTC - in response to Message 52608.
I wouldn't worry about this just yet. The time-to-completion estimates are just that, estimates, and they are often wildly wrong at first. I have never run Linux, but with Windows I have had new WU types where the time estimate ran up for the first day or so before starting down and completing in a reasonable amount of time. After a few models have completed, the estimates self-correct.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4342 Credit: 16,497,933 RAC: 6,477	Message 52611 - Posted: 23 Sep 2015, 6:18:05 UTC
If you are talking about the hadam3pm2 (hadam3p model with MOSES II land scheme) (currently no graphics) (Linux only), as opposed to the one below it on the server status page, hadam3prm3pm2t_eu (hadam3p global model with hadrm3p regional model with MOSES II land scheme and TRIFFID available) (currently no graphics) (Linux only), then unless you are going to run the computer 24/7 with no risk of the task being interrupted, it may be better to abort it. Even suspending these tasks often leads to them completing but not producing the data needed.

jrapdx Joined: 4 Jul 15 Posts: 63 Credit: 3,223,760 RAC: 0	Message 52612 - Posted: 23 Sep 2015, 7:08:21 UTC - in response to Message 52609.
"Completing in a reasonable amount of time" is exactly what I was wondering about. At this point the hadam3pm2 task has been running for ~10.5 hours and shows only a 13-minute drop in remaining time to completion (and just 0.289% progress). Definitely not a "strong" start. I understand that the estimates of time remaining are simply approximations, subject to big changes as the computation goes on, and I'm willing to let it continue to see how it shapes up. So far, though, the pattern has been consistent: at the rate it's going, it will take roughly 3633 hours (real time) to finish. That's 151 days, assuming the computer is operational 24/7 and not used for other compute-intensive work such as compiling software. Definitely not the typical course of model-running I've seen so far, but I don't know, maybe it's more "normal" than I would expect.
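The 3633-hour figure is a straight-line extrapolation of the progress percentage, which is not the same thing as BOINC's own time-remaining estimate. A minimal sketch, assuming the progress rate stays constant to the end (early in a run it often doesn't, as JIM points out); the function name is illustrative:

```python
def extrapolate_total_hours(elapsed_hours: float, progress_percent: float) -> float:
    """Linear extrapolation: if progress_percent was reached after elapsed_hours,
    assume the same rate of progress holds all the way to 100%."""
    if progress_percent <= 0:
        raise ValueError("progress must be positive to extrapolate")
    return elapsed_hours * 100.0 / progress_percent


total = extrapolate_total_hours(10.5, 0.289)  # the hadam3pm2 task after ~10.5 hours
print(f"{total:,.0f} hours, or about {total / 24:.0f} days of 24/7 running")
```

Applied to the later reading of 0.699% after about 26 hours, the same calculation gives roughly 3,720 hours.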

jrapdx Joined: 4 Jul 15 Posts: 63 Credit: 3,223,760 RAC: 0	Message 52613 - Posted: 23 Sep 2015, 7:39:09 UTC - in response to Message 52611.
Yes, it's the hadam3pm2 I was concerned about. My thought is to let it go on for a while to see if its "slowness" persists. It's possible I'm wrong, or getting excited prematurely; in any case it should be clearer after a few more days. Actually, I'm really curious to know whether anyone else running the model has encountered the same issues, or not.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4342 Credit: 16,497,933 RAC: 6,477	Message 52614 - Posted: 23 Sep 2015, 7:40:14 UTC - in response to Message 52612.
If you are talking about Computer 1373652, with only 1/2 GB of memory, that leaves very little to run BOINC and means the swap file will be in constant use, slowing things down a lot. I have seen higher amounts of memory suggested, but 500 MB per task plus 500 MB for the OS and anything else going on is really the absolute minimum. The CPU is also a bit on the slow side for running CPDN, though I have successfully run tasks on an Atom-powered netbook. They do, however, take thousands of hours to complete, and that is with 2 GB of memory shared between the two cores.
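The 500 MB rule of thumb above is easy to check against a given machine. A sketch, assuming "1/2 GB" means 512 MB and using the 500 MB figures as defaults; since higher amounts are often suggested, treat the result as a floor rather than a recommendation (the helper name is made up):

```python
def minimum_ram_mb(tasks: int, per_task_mb: int = 500, os_mb: int = 500) -> int:
    """'Absolute minimum' rule of thumb: 500 MB per running CPDN task
    plus 500 MB for the OS and everything else going on."""
    return tasks * per_task_mb + os_mb


installed_mb = 512  # assumed value for "only 1/2 GB of memory"
for n in (1, 2):
    need = minimum_ram_mb(n)
    status = "OK" if installed_mb >= need else "too little, expect heavy swapping"
    print(f"{n} task(s): need >= {need} MB, have {installed_mb} MB -> {status}")
```

Even a single task needs about 1000 MB by this rule, roughly twice what that machine has.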

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4342 Credit: 16,497,933 RAC: 6,477	Message 52615 - Posted: 23 Sep 2015, 8:19:34 UTC - in response to Message 52614.
This task failed because of a lack of RAM: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=18287630 Click on the plus sign next to stderr to see the error message.

Conan Joined: 6 Jul 06 Posts: 141 Credit: 3,511,752 RAC: 144,072	Message 52616 - Posted: 23 Sep 2015, 11:04:00 UTC - in response to Message 52613. Last modified: 23 Sep 2015, 11:12:23 UTC
jrapdx wrote: <quote>Yes, it's the hadam3pm2 I was concerned about. My thought is to let it go on for a while to see if its "slowness" persists. It's possible I'm wrong, or getting excited prematurely; in any case it should be clearer after a few more days. Actually, I'm really curious to know if anyone else running the model has encountered the same issues--or not.</quote>
G'Day jrapdx, I have noticed the same on my AMD 1090 Phenom: the first estimate was 96 hours to completion. After running for 8 3/4 hours, it is now up to 113 hours to run, with just 0.257% completed. That's an estimate of 3,284 hours run time. That is getting back to the old days when Climate ran for months to get a WU to complete. I doubt that it will take this long, and I think the original 96 hours is closer to the mark; however, going by a previous failed WU, it will take over 250 hours. Conan

jrapdx Joined: 4 Jul 15 Posts: 63 Credit: 3,223,760 RAC: 0	Message 52622 - Posted: 23 Sep 2015, 22:14:27 UTC - in response to Message 52614.
That computer (1373652) has been out of service for weeks and isn't involved at all. I'm talking about a task running on 1373243, namely "hadam3pm2_d7ni_*" (18995341), entirely distinct from the one you commented on.

jrapdx Joined: 4 Jul 15 Posts: 63 Credit: 3,223,760 RAC: 0	Message 52623 - Posted: 23 Sep 2015, 22:36:13 UTC - in response to Message 52616.
Conan, thanks for that info; it's good to know I haven't missed the boat altogether. Today, about 26 hours after the task started, it has reached 0.699% progress and still shows 76:33 to completion, down from the initial 77:06 estimate. The task is running very consistently at this rate, at which it will still take >3600 hours to finish. Curiously, it is communicating with the servers frequently: there have been 10 trickles reported since yesterday, one every 2 to 2.5 hours or so! So far it has incremented my total credits by 168.28. Very interesting. I hope I don't upset it too much when I have to move the computer a few hours from now. (Remodeling of the space where the computer normally resides is about done, and I need to move it back.) I'm keeping a good thought that shutting down and restarting the machines doesn't cause any problems. Jules.

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 52624 - Posted: 24 Sep 2015, 0:34:12 UTC - in response to Message 52623.
Shutting down:
1) Check the Transfers tab. If there are any uploads/downloads, wait.
2) In the menu under Activity, click on Network activity suspended. This prevents any more transfers from starting while you're doing the following.
3) In the menu under Activity, click on Suspend. This stops BOINC from computing.
4) In the menu under File, click Exit BOINC Manager.
5) In the pop-up window, click on the option to stop running tasks. (I forget the wording.)
6) Wait until BOINC disappears from the desktop.
7) Shut down the computer.
After restarting, reverse 3) and then 2) to start running again.
*********************
It's been said that BOINC takes about 10 tasks from a given project to work out times. On cpdn, it needs "several" of EVERY different model type to work out how long that type of model is going to take.
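For anyone who prefers the command line, roughly the same sequence can be driven through boinccmd, the command-line tool that ships with BOINC. This is a sketch, not the GUI procedure above: it assumes boinccmd is on the PATH and can reach the local client, and the helper name run_steps is made up. Step 1 stays manual: look at the transfer listing and wait if anything is pending.

```python
import subprocess

# boinccmd equivalents of the GUI steps (assumes boinccmd is on PATH).
CHECK_TRANSFERS = ["boinccmd", "--get_file_transfers"]  # step 1: inspect, wait if anything is listed

SHUTDOWN_STEPS = [
    ["boinccmd", "--set_network_mode", "never"],  # step 2: suspend network activity
    ["boinccmd", "--set_run_mode", "never"],      # step 3: suspend computing
    ["boinccmd", "--quit"],                       # roughly steps 4-5: tell the client to exit
]

# After the computer is back up and the client is running again,
# undo step 3 and then step 2 ("auto" = follow your preferences).
RESTART_STEPS = [
    ["boinccmd", "--set_run_mode", "auto"],
    ["boinccmd", "--set_network_mode", "auto"],
]


def run_steps(steps, dry_run=True):
    """Hypothetical helper: print each command; execute it only when dry_run is False."""
    executed = []
    for cmd in steps:
        line = " ".join(cmd)
        print(("would run: " if dry_run else "running: ") + line)
        if not dry_run:
            subprocess.run(cmd, check=True)
        executed.append(line)
    return executed


run_steps([CHECK_TRANSFERS])  # check the output yourself before continuing
run_steps(SHUTDOWN_STEPS)     # dry run; pass dry_run=False to actually do it
```

Dry run is the default, so nothing changes on the machine until you have checked the transfers and explicitly asked for the commands to be executed.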

Byron Leigh Hatch @ team Carl ... Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0	Message 52626 - Posted: 24 Sep 2015, 5:58:04 UTC - in response to Message 52604. Last modified: 24 Sep 2015, 6:17:45 UTC
Thank you to Les Bayliss, Jim, Dave Jackson, and all who have posted in this thread. A lot of good information here.
Dave Jackson wrote: <quote> Noted on moderators email list. I can't comment personally as these tasks aren't available for my OS. It will almost certainly get sorted at some point now the project people are aware of it based on experience. I should check on my own tasks some time and see how they compare for value. Something I have noticed is that some of the short tasks show up initially as only giving half the credits they should give but some time a few hours or the following day they show the full credit. Dave </quote>
Jim wrote: <quote> I believe that an adjustment is needed in the number of credits awarded for running the new wah2 tasks. I am presently running wah2 on 8 of the 10 cores (on 3 different machines) that I devote to CPDN. Since I began running them my RAC has been dropping like a stone. The problem is not the number of credits awarded per model completed, but, the fact that wah2 runs much slower than other model types. On my fastest machine the hadam3p_eu models take ~102 hours to finish. On the other hand, the wah2 models take about 300 hours to finish. Both have the same number of time steps and produce the same number of trickles. Hadam3p_eu models takes 2.52 S/ts. Wah2 on the other hand, takes 7.62 S/ts. The wah2 models produce only 1/3 as many credits for a given amount of time invested crunching. If this problem isn't fixed these tasks could become unpopular and be deselected by many users in favor of better compensated types. The solution is to increase the number of credits per trickle (by a factor of about 3) so that they yield a similar return per hour spent crunching as the 'had' models. Since the credits don't really cost Boinc anything the fix is a no-brainer. Jim </quote>
I hope the Mods don't mind if I post a couple of screenshots of my 40 Weather At Home 2 (wah2) v7.05 tasks running ... 24/7 ... since 10/Sept/2015.
[Screenshots: my computer 1364207]
On my computer, as you can see, wah2 v7.05 running 24/7 takes on average approximately 1,000,000 seconds to complete ... and gets on average only 2,389 credits.
On my computer, as you can see, UK Met Office HadAM3P-HadRM3P Australia New Zealand running 24/7 takes on average approximately 400,000 seconds to complete ... and gets on average 4,512 credits :-)
I am not worried about credits ... I love this project so much :-) ... I will happily crunch 24/7 whatever the climateprediction servers send me, for no credits. So this post is just for information ... in case it might help. Thank you to all the scientists, beta testers, staff, volunteer developers, volunteer moderators and volunteer crunchers. Keep on crunching :-) I love this project. Best Wishes to all, Byron
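Converting those averages to credits per hour of run time makes the gap concrete. A sketch using only the approximate averages quoted above; since they come from one heavily loaded machine, the ratio is indicative at best (the helper name is made up):

```python
def credits_per_hour(credits: float, seconds: float) -> float:
    """Average credit earned per hour of run time."""
    return credits / (seconds / 3600.0)


wah2 = credits_per_hour(2389, 1_000_000)  # wah2 v7.05 averages
anz = credits_per_hour(4512, 400_000)     # HadAM3P-HadRM3P Australia New Zealand averages
print(f"wah2: {wah2:.1f} credits/hour")
print(f"ANZ:  {anz:.1f} credits/hour")
print(f"ratio: {anz / wah2:.1f}x")
```

On these figures wah2 earns about 8.6 credits per hour against about 40.6 for the ANZ models, a ratio of roughly 4.7.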

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 52627 - Posted: 24 Sep 2015, 6:38:47 UTC - in response to Message 52626.
Hi Byron. That first screenshot is interesting to me, as it shows how long those new models are taking, something I'm never going to see now that I don't run Windows. That's something like 10 days, as against the 6 days that the hadam3prm3pm2t_eu are taking me. And for anyone who missed it, the "Which experiment is this work unit for?" page says: "with improved land surface models and at a 25km resolution over Europe". That means higher-resolution modelling, with more data returned, and s l o w e r running.

jrapdx Joined: 4 Jul 15 Posts: 63 Credit: 3,223,760 RAC: 0	Message 52628 - Posted: 24 Sep 2015, 7:30:30 UTC - in response to Message 52624.
Thanks for the instructions re: shutting down the computer. Your timing was perfect: the remodeling work wrapped up late in the day, prompting me to postpone the shutdown until morning, so I will soon get to use your good advice...

Richard Haselgrove Joined: 1 Jan 07 Posts: 943 Credit: 34,178,025 RAC: 6,312	Message 52629 - Posted: 24 Sep 2015, 8:25:36 UTC - in response to Message 52624.
<quote>It's been said that BOINC takes about 10 tasks from a given project to work out times. On cpdn, it needs to be "several" of EVERY different model type, to work out how long that type of model is going to take.</quote>
The remark about projects needing 10 completed tasks (actually 11, "more than 10") before calculating realistic initial runtime estimates applies to projects running the 'CreditNew' version of the server code. It doesn't matter whether they actually give credit that way; it's the runtime estimation, which is all done on the server in those cases. But we don't have that server code at this project. Any adjustment to the initial estimates is done locally by the old DCF (Duration Correction Factor) mechanism, and that can only keep track of one value at a time. Unless the task sizes estimated by the project are accurately in proportion to their eventual total running time, different model types will pull DCF in different directions, and it'll never be able to settle on a common value that is right for all models.
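A small simulation shows why a single correction factor can't serve two model types. It is a sketch loosely modelled on the DCF update in older BOINC clients as I understand it (jump straight up when a task runs more than 10% over its corrected estimate, otherwise move only 10% of the way down, or 1% for tasks that finish far sooner than estimated); treat the thresholds and weights as assumptions. Model types A and B are hypothetical: both are estimated at 100 hours by the project, but A really takes 100 hours and B takes 300.

```python
def update_dcf(dcf: float, raw_ratio: float) -> float:
    """raw_ratio = actual run time / project's uncorrected estimate."""
    adj_ratio = raw_ratio / dcf          # actual / DCF-corrected estimate
    if adj_ratio > 1.1:                  # underestimated: raise DCF at once
        new = raw_ratio
    elif adj_ratio < 0.1:                # finished far early: barely trust it
        new = 0.99 * dcf + 0.01 * raw_ratio
    else:                                # otherwise drift down slowly
        new = 0.9 * dcf + 0.1 * raw_ratio
    return min(max(new, 0.01), 100.0)    # clamp to [0.01, 100]


RATIOS = {"A": 1.0, "B": 3.0}  # hypothetical model types: actual / project estimate
dcf = 1.0
history = []
for model in "ABABABABAB":
    corrected_estimate = 100 * dcf  # project estimate (100 h) scaled by current DCF
    dcf = update_dcf(dcf, RATIOS[model])
    history.append((model, round(corrected_estimate), round(dcf, 3)))

for model, estimate, new_dcf in history:
    print(f"{model}: estimated {estimate:>3} h, actual {100 * RATIOS[model]:.0f} h, DCF now {new_dcf}")
```

After the first B task, DCF cycles between about 2.6 and 3.0 and never settles: type A tasks (really 100 hours) are shown as 282 to 300 hours, while some type B tasks (really 300 hours) are still underestimated at 264 to 280 hours.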

Richard Haselgrove Joined: 1 Jan 07 Posts: 943 Credit: 34,178,025 RAC: 6,312	Message 52630 - Posted: 24 Sep 2015, 9:21:54 UTC - in response to Message 52626.
<quote>my computer 1364207</quote>
Byron, I'd be wary about taking too much timing data from that machine - it looks to be under considerable stress and is throwing a lot of errors (see Error tasks for computer 1364207). Even with 256 GB of RAM to support the 40 running models (6 GB each should be plenty), can the hard disk system cope with all 40 models checkpointing and preparing upload files in quick succession? That could be a big bottleneck, since all the disk accesses have to go to the same drive (or presumably RAID array) over the same interface. CPDN models can be sensitive to delays around those upload-generation moments.

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 52633 - Posted: 24 Sep 2015, 20:39:01 UTC - in response to Message 52629.
<quote>applies to projects running the 'CreditNew' version</quote>
I'm fairly sure I remember something similar from way back in the mists of time, before CreditNew, where this applied. If not 10 tasks, then "several". But you're right about conflicts between the requirements of different model types. The end effect, though, is "you're going to be in for a bumpy ride if you run different types." BOINC just isn't equipped to handle it. And running large numbers of processors on one computer does lead to resource conflicts. Better to have several smaller computers.