21)
Message boards :
Number crunching :
New work Discussion
(Message 64772)
Posted 3 Nov 2021 by Aurum Post: And the researchers are well aware that these models take a long time to run. And it really shows by how poorly they run a BOINC server. They're so lazy they don't even send out a Server Abort when they abandon a project. Last night I completed 7 N144 WUs and they called them Abandoned. That's shameless. That's about seven CPU months of work I could've done for a project that actually cares.
22)
Message boards :
Number crunching :
New work Discussion
(Message 64771)
Posted 3 Nov 2021 by Aurum Post: or I get a year's worth of work in one delivery and must abort almost all of it. I have never received close to even six months of work even when the work cache is set to maximum. I've gotten a year's worth of work several times, most recently a couple of days ago. The main point is to specify the number of WUs to send.
23)
Message boards :
Number crunching :
New work Discussion
(Message 64769)
Posted 3 Nov 2021 by Aurum Post: Not only does this project have the worst-performing work server in all BOINCdom, it's so rude. I just turned in 7 completed N144 tasks and they were recorded as Abandoned.
24)
Message boards :
Number crunching :
New work Discussion
(Message 64768)
Posted 3 Nov 2021 by Aurum Post: That does nothing. Mine is set to 10 minutes. Then if it's at all possible make the checkpoints closer together. In the computing preferences menu item in "Options" there is a box: "checkpoint at most every.... seconds".
25)
Message boards :
Number crunching :
New work Discussion
(Message 64762)
Posted 2 Nov 2021 by Aurum Post: What improvements do you have in mind? Nothing even comes close to fixing the CPU cache issue but a few upgrades could make this project a whole lot more user-friendly. I'd start by fixing the work delivery bugs. Several projects use the "Preferences for this project" page to allow the BOINCer to specify how many WUs of which project they'd like to download and maintain on their computer. Also fix the perpetual 60-minute project backoff. It makes no sense how work is delivered, it's just feast or famine. I either go days or weeks getting no WUs on a particular computer, even though the Server Status page says there's work available and another computer is getting work, or I get a year's worth of work in one delivery and must abort almost all of it. I can't think of another BOINC project that behaves this way.
16946 climateprediction.net 11/2/2021 2:14:19 PM update requested by user
16950 climateprediction.net 11/2/2021 2:14:25 PM Sending scheduler request: Requested by user.
16951 climateprediction.net 11/2/2021 2:14:25 PM Not requesting tasks: don't need (CPU: ; NVIDIA GPU: )
16952 climateprediction.net 11/2/2021 2:14:27 PM Scheduler request completed
16953 climateprediction.net 11/2/2021 2:14:27 PM Project requested delay of 3636 seconds
"Don't need" is not true. I have one 921 WU running and would like to run another. If I do get lucky and I'm blessed with a second WU I'd switch to "No new work" and switch back after one completed. Then if it's at all possible make the checkpoints closer together.
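The delay in the log excerpt above can be checked against the "60-minute backoff" complaint with a few lines of Python. This is only a sketch: the function name `requested_delay` is hypothetical, and the pattern assumes the message wording is exactly as it appears in the pasted log line.

```python
import re
from datetime import timedelta

# Log line copied from the post above.
LOG_LINE = (
    "16953 climateprediction.net 11/2/2021 2:14:27 PM "
    "Project requested delay of 3636 seconds"
)

def requested_delay(line):
    """Return the project-requested delay as a timedelta, or None if absent."""
    match = re.search(r"Project requested delay of (\d+) seconds", line)
    if match is None:
        return None
    return timedelta(seconds=int(match.group(1)))

delay = requested_delay(LOG_LINE)
print(delay)  # 1:00:36
```

So 3636 seconds is one hour and 36 seconds, which matches the roughly 60-minute backoff described in the post.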
26)
Message boards :
Number crunching :
New work Discussion
(Message 64761)
Posted 2 Nov 2021 by Aurum Post: I assume the UK MetOffice owns the code. Or is it someone else? The main problem with that is not owning the source code - they’re not allowed to make changes to most of it. Ok. This project could easily do ten or twenty times as much work per unit time if they'd just make some improvements. This project could easily do ten or twenty times as much work if they'd just make some improvements. Only if it had ten or twenty times as many researchers asking Oxford to send work out for them.
The biggest problem I see is the CPU cache congestion problem. Running too many WUs on a computer slows it down to a snail's pace. I keep playing around trying to figure out the most CP work units I can run on a computer. I've tried disabling hyperthreading and that works better but I still can't run all CPUs because it still slows down. Besides if I can't run every CPU thread with CP then I'd like to support ARP etc. Right now as my older WUs complete I detach from CP and then reattach to sweep up the debris it leaves behind. Then I specify a max of two CPUs and under BOINC preferences use at most 33/36=92%. That leaves some headroom but it's still noticeably faster if I run only one CP WU. It's frustrating when I know I could be running 18 or more if not for the CPU Congestion Issue. Last time I suggested this someone said they'd have to rewrite a million lines of Fortran. I'm not a coder but I would think they'd only need to modify aspects of the code.
https://www.ibm.com/docs/en/aix/7.2?topic=implementation-design-coding-effective-use-caches "Repackaging techniques can yield significant improvements without recoding..."
https://hackernoon.com/programming-how-to-improve-application-performance-by-understanding-the-cpu-cache-levels-df0e87b70c90 This guy says his code ran 50x faster after optimizing for CPU cache usage. I've even seen a book dedicated to efficient CPU cache coding.
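The "33/36=92%" figure in the post above is the share of CPU threads left available to BOINC, rounded to a whole percent. A minimal sketch of that arithmetic, where the helper name `cpu_percent` is hypothetical and the 36-thread machine is taken from the post:

```python
def cpu_percent(threads_allowed, threads_total):
    """Percentage of CPU threads to allow, as entered in a 'use at most N%' box."""
    if threads_total <= 0 or not 0 <= threads_allowed <= threads_total:
        raise ValueError("need 0 <= threads_allowed <= threads_total, total > 0")
    return 100.0 * threads_allowed / threads_total

pct = cpu_percent(33, 36)
print(f"{pct:.1f}%")  # 91.7%
print(round(pct))     # 92
```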
27)
Message boards :
Number crunching :
New work Discussion
(Message 64756)
Posted 1 Nov 2021 by Aurum Post: Ok. This project could easily do ten or twenty times as much work per unit time if they'd just make some improvements. This project could easily do ten or twenty times as much work if they'd just make some improvements. Only if it had ten or twenty times as many researchers asking Oxford to send work out for them.
28)
Message boards :
Number crunching :
New work Discussion
(Message 64754)
Posted 1 Nov 2021 by Aurum Post: I was trained in Physics at 'the other place' - the Cavendish Laboratory in fenland. I well remember two bits of advice that they drilled into me: When I got my physics degree using my dad's slide rule we learned that the three things physicists do most are: add and subtract zero, multiply and divide by one, and call it something else. :-) It's amazing how much math my generation can do in our heads compared to kids today that need a calculator to do the most rudimentary arithmetic. This project could easily do ten or twenty times as much work if they'd just make some improvements.
29)
Message boards :
Number crunching :
New work Discussion
(Message 64740)
Posted 31 Oct 2021 by Aurum Post: Put this in your cc_config and it won't happen: <max_file_xfers_per_project>1</max_file_xfers_per_project>
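For anyone unsure where that line goes: in the BOINC client's cc_config.xml, settings like this sit inside an `<options>` element under the root `<cc_config>` element. The sketch below generates such a file's text in Python. The function name `build_cc_config` is hypothetical, and the value 1 is simply the one suggested in the post above, not a general recommendation.

```python
import xml.etree.ElementTree as ET

def build_cc_config(max_xfers_per_project=1):
    """Return cc_config.xml text that limits simultaneous transfers per project."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    limit = ET.SubElement(options, "max_file_xfers_per_project")
    limit.text = str(max_xfers_per_project)
    # ET.indent only exists on Python 3.9+; skip pretty-printing on older versions.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")

config_text = build_cc_config(1)
print(config_text)
```

The resulting text would be saved as cc_config.xml in the BOINC data directory. If a cc_config.xml already exists there, it is safer to add the one option to it by hand than to overwrite the file.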
30)
Message boards :
Number crunching :
New work Discussion
(Message 64710)
Posted 26 Oct 2021 by Aurum Post: The trick with these is to stagger the completion times. I already decided that I'm only going to run one CP WU per computer. So I've already got that covered. And make sure that nothing else wants to use your net connection at an upload time. Now I'm confused. I thought the error under discussion is: Output file hadam4h_h02w_200802_4_920_012115322_0_r75796790_4.zip for task hadam4h_h02w_200802_4_920_012115322_0 exceeds size limit. Now instead of exceeding a file size you're talking about how many files are being uploaded at the same time. I'm now running 3,201 WUs of various projects so that will be next to impossible. One of these commands in one's cc_config file may be useful: <max_file_xfers>32</max_file_xfers> <max_file_xfers_per_project>32</max_file_xfers_per_project>
31)
Message boards :
Number crunching :
New work Discussion
(Message 64707)
Posted 26 Oct 2021 by Aurum Post: I have a few 920s running. Should I abort them and lose a week's work now or let them fail at the end and lose a month's work? What is this catch and set a new limit you guys are talking about? Is that something we civilians can do?
32)
Message boards :
Number crunching :
Site problems
(Message 64695)
Posted 25 Oct 2021 by Aurum Post: I think it's a problem in the western US from this big storm that just hammered us. The ULs keep moving, that's the important thing.
33)
Message boards :
Number crunching :
Site problems
(Message 64691)
Posted 25 Oct 2021 by Aurum Post: Is the upload speed normally capped at 21 kBps?
34)
Message boards :
Number crunching :
Site problems
(Message 64659)
Posted 20 Oct 2021 by Aurum Post: Check the batch number. If it's closed, then that's the reason. Those that know how to properly run a BOINC server system would issue a Server Abort signal and that would never happen.
35)
Message boards :
Number crunching :
Site problems
(Message 64658)
Posted 20 Oct 2021 by Aurum Post: Check the batch number. If it's closed, then that's the reason. Batches that will not upload: 852, 883, 886, and 895.
16533 10/20/2021 5:43:57 AM Project communication failed: attempting access to reference site
16534 climateprediction.net 10/20/2021 5:43:57 AM Temporarily failed upload of hadam4h_d11e_206711_5_886_012041609_1_r1580940902_restart.zip: transient HTTP error
16535 climateprediction.net 10/20/2021 5:43:57 AM Backing off 04:20:06 on upload of hadam4h_d11e_206711_5_886_012041609_1_r1580940902_restart.zip
16536 climateprediction.net 10/20/2021 5:43:57 AM Temporarily failed upload of hadam4h_d11e_206711_5_886_012041609_1_r1580940902_5.zip: transient HTTP error
16537 climateprediction.net 10/20/2021 5:43:57 AM Backing off 04:14:14 on upload of hadam4h_d11e_206711_5_886_012041609_1_r1580940902_5.zip
16538 climateprediction.net 10/20/2021 5:43:57 AM Temporarily failed upload of hadam4h_d11e_206711_5_886_012041609_1_r1580940902_out.zip: transient HTTP error
16539 climateprediction.net 10/20/2021 5:43:57 AM Backing off 03:02:20 on upload of hadam4h_d11e_206711_5_886_012041609_1_r1580940902_out.zip
16540 10/20/2021 5:43:58 AM Internet access OK - project servers may be temporarily down.
36)
Message boards :
Number crunching :
Site problems
(Message 64653)
Posted 19 Oct 2021 by Aurum Post: Uploads failing.
37)
Message boards :
Number crunching :
Batches closed
(Message 64643)
Posted 18 Oct 2021 by Aurum Post: Then what? Server abort signals?
38)
Message boards :
Number crunching :
Site problems
(Message 64612)
Posted 11 Oct 2021 by Aurum Post: No, clearly they don't. If they know what they're doing then why do we have to downgrade 3 libraries in order to run CP WUs??? They should recompile their code to include current libraries that are maintained in the Linux repositories. They should also fix the numerous segmentation violations. But since they don't even care enough about this project to even read these forums I doubt anything will ever improve.
39)
Message boards :
Number crunching :
Site problems
(Message 64610)
Posted 11 Oct 2021 by Aurum Post: Extremely inefficient and wasteful. They really should learn how to use BOINC. They could greatly increase their throughput with a modest effort. It's a mystery to me how they ever finish a project. They know how many of a batch will come back in a reasonable time and send out a number of work units that will bring back that many results. Sometimes if a batch is pushing the physics to the limits and consequently gets a higher failure rate than allowed for they will send out some extras.
40)
Message boards :
Number crunching :
Site problems
(Message 64608)
Posted 10 Oct 2021 by Aurum Post: As a side issue, if you’re running 17 CPDN WUs at a time, 63 WUs reserve is over a month’s worth. Any particular reason for holding that many? I sit on none that are Ready to Start. I get the best results running one or two per computer. The 63 were all Waiting to Run and several are now running. I doubt it'll take a month to finish. It's a mystery to me how they ever finish a project. It seems like they'd take a couple of years at best with many holes in it. Maybe there's method to their madness but I suspect it's just tea and crumpets.