Thread 'New work discussion - 2'

Message boards : Number crunching : New work discussion - 2
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 66264 - Posted: 29 Oct 2022, 6:44:50 UTC

Most of the time I download work while at the computer, which lets me suspend some tasks while they are still downloading. There is an issue with the BOINC client code: if you have restricted BOINC to, say, 8 cores, with 0.2 days of work and 0 days of additional work, and the tasks each use 8 cores, it will download 8 times as many tasks as would fill that quota. Richard has raised this as a bug over on GitHub, but no one has assigned themselves to fixing it yet, even though it is probably not a difficult one.
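To illustrate the arithmetic with those numbers (my own illustration, on the assumption that the scheduler counts each task's estimated elapsed time once instead of multiplying by the number of cores the task uses):

    request: 8 cores × 0.2 days of work buffer = 1.6 core-days
    one 8-core task, estimated elapsed time 0.2 days:
       correct accounting: 8 cores × 0.2 days = 1.6 core-days → 1 task fills the request
       buggy accounting:   0.2 days, counted once             → 8 tasks sent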
ID: 66264
Bryn Mawr

Joined: 28 Jul 19
Posts: 150
Credit: 12,830,559
RAC: 228
Message 66265 - Posted: 29 Oct 2022, 9:19:23 UTC - in response to Message 66262.  

"BOINC starts multiple OpenIFS tasks because there are free CPU slots, even though the total memory for the tasks exceeds what's available. "

Can this be overcome by limiting the number of cores available to BOINC before downloading any of the IFS models? Although I have a four-core CPU, the box only has 24GB of RAM.


I will not restrict my 24 core box to running 4 cores with the other 20 waiting for memory - I’ll block the OpenIFS jobs if they won’t play happily.
ID: 66265
Glenn Carver

Joined: 29 Oct 17
Posts: 1051
Credit: 16,636,385
RAC: 11,909
Message 66266 - Posted: 29 Oct 2022, 9:35:59 UTC - in response to Message 66262.  

"BOINC starts multiple OpenIFS tasks because there are free CPU slots, even though the total memory for the tasks exceeds what's available. " Can this be overcome by limiting the number of cores available to BOINC before downloading any of the IFS models? Allthough I have a four core CPU the box only has 24Gb of RAM.
Yes, that's how I do it, though it would be better to have an app_config.xml control file do it if possible (not my area of expertise).
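For anyone who wants to try that route, a minimal app_config.xml sketch (the app name below is a placeholder - check the <name> field in client_state.xml, or the project's applications page, for the real one). It goes in the climateprediction.net folder under the BOINC data directory:

    <app_config>
       <app>
          <name>oifs_43r3</name>              <!-- placeholder; use the real app name -->
          <max_concurrent>2</max_concurrent>  <!-- run at most 2 OpenIFS tasks at once -->
       </app>
    </app_config>

The client re-reads it from BOINC Manager via Options → Read config files, so no restart is needed.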

I think the problem is the way boinc wants to work, not the model itself. We need to test this further; we might be able to avoid it by over-specifying the memory required (though that's not ideal). Anyway, the bottom line is we would not put the model out on the production site until we're happy with it. I mention it to show some of the issues involved and to explain why it takes time for workloads to appear. I often see comments asking why there is no work - there's a lot of development going on in the background.
ID: 66266
Glenn Carver

Joined: 29 Oct 17
Posts: 1051
Credit: 16,636,385
RAC: 11,909
Message 66267 - Posted: 29 Oct 2022, 9:38:28 UTC - in response to Message 66265.  

I will not restrict my 24 core box to running 4 cores with the other 20 waiting for memory - I’ll block the OpenIFS jobs if they won’t play happily.
I might be wrong but I think in this situation you would not get the OpenIFS tasks anyway, because the server would see there's not enough free memory available. Remember it's boinc making the decisions, not the model.
ID: 66267
AndreyOR

Joined: 12 Apr 21
Posts: 318
Credit: 14,977,739
RAC: 10,025
Message 66268 - Posted: 29 Oct 2022, 10:24:11 UTC - in response to Message 66261.  
Last modified: 29 Oct 2022, 10:27:22 UTC

... boinc client will monitor memory and suspend the tasks if memory is exceeded.

I don't know that I'd trust BOINC like this. I've had my PC restart, or get really bogged down and require intervention, because it ran out of memory from running too many tasks of a high-RAM project (LHC ATLAS in this case). That happened when some batches of tasks had a higher memory demand than normal, and as a user you don't really know that ahead of time.

... if you haven't got the RAM, the models will crash. ... If it is, we might need to put a health warning on running multiple OpenIFS instances if it's not possible to control this.

This does not give me good vibes. It'd be very disappointing if OpenIFS ends up being similar to Hadley, in that misconfigured or "underpowered" machines can easily get tasks and crash them. It's hard for me to see that putting out a warning will make enough people take note and do something about it. I hope you'll be able to figure something out. If not, it'd be good if the project explored some creative solutions to prevent, or at least minimize, the task trashing that happens with Hadley. I do believe there are a few ideas of that kind to explore and try, but I don't hold out much hope that the project will. I'm rooting for OpenIFS to be better. :-)
ID: 66268
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 66269 - Posted: 29 Oct 2022, 10:52:36 UTC - in response to Message 66268.  
Last modified: 29 Oct 2022, 10:57:49 UTC

... if you haven't got the RAM, the models will crash. ... If it is, we might need to put a health warning on running multiple OpenIFS instances if it's not possible to control this.
I ran some of the early OpenIFS tasks on my now-dead laptop, which only had 8GB of RAM and four cores. The tasks peaked at five or six GB of RAM, if I remember correctly. Even running four at once did not cause them to crash; they just ran a lot slower as data was paged between RAM and swap. And if you only go a bit over the limit, even that doesn't happen, as the tasks rarely peak in RAM usage at the same time.
ID: 66269
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,748,059
RAC: 5,647
Message 66270 - Posted: 29 Oct 2022, 10:56:41 UTC - in response to Message 66264.  

Richard has raised this as a bug over on GitHub, but no one has assigned themselves to fixing it yet, even though it is probably not a difficult one.
This bug is certainly relevant, but I don't think it's likely to affect us seriously. The main effect is to cause BOINC to download more work than can comfortably be completed in the time expected, which slows down the return rate for the ensemble as a whole but doesn't impact individual tasks. It really is trivial: details are in GitHub Issue 4151, where I've identified the precise routines and values which need a tweak. It's probably a one-liner, but I daren't code it as an outside volunteer, because it operates on the server and I have no access to a live project server for testing. Someone with Glenn's experience might be able to help, and I'll add a note referencing this project to the original report. I'm also in touch with Laurence Field (LHC, CERN), who oversees BOINC's server releases: there's a significant enhancement in the pipeline, so we could be just in time to catch the next update. I'll drop him a note.

This bug is just one example of a very general BOINC flaw: in general, it works very well and smoothly in the steady state, but boundary conditions and perturbations throw it into a tizzy. Much like the climate, really.

In particular, when a new application and set of tasks are released by a project which hasn't had much work recently, BOINC tends to overshoot. There's the 'total estimated runtime' bug discussed above, but there's also the 'number of idle cores' to consider. The server might very well allocate 6 tasks to Bryn Mawr's 24-core machine, irrespective of free memory and 'memory per task' declarations. I had an example of this with the recent test on the dev site: one of my six-core machines requested work for six idle cores, and got the lot. I've only just completed the last and returned the machine to normal service this morning.

There's a project setting which could help here: <max_wus_in_progress>, in Project Options (job limits). You could clamp it down a bit for IFS, by setting that to 2 ("one and a spare").
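In the project's config.xml that would be something like this sketch (though see my follow-up below: this basic option counts per core rather than per host, so it can't go as low as we'd want here):

    <config>
       <max_wus_in_progress>2</max_wus_in_progress>
    </config>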
ID: 66270
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 66271 - Posted: 29 Oct 2022, 14:06:58 UTC - in response to Message 66270.  

They could also just allow the user to set the maximum number of work units running, or at least downloaded at a time.
LHC does the former, and WCG does the latter.
ID: 66271
Bryn Mawr

Joined: 28 Jul 19
Posts: 150
Credit: 12,830,559
RAC: 228
Message 66272 - Posted: 29 Oct 2022, 14:34:58 UTC - in response to Message 66267.  

I will not restrict my 24 core box to running 4 cores with the other 20 waiting for memory - I’ll block the OpenIFS jobs if they won’t play happily.
I might be wrong but I think in this situation you would not get the OpenIFS tasks anyway, because the server would see there's not enough free memory available. Remember it's boinc making the decisions, not the model.


I’ll give them their chance, and I certainly won’t shoot the messenger. These new tasks sound perfect for those with the kit to run them but, same as with the Rosetta Python tasks, if my set-up is not up to the job of running them and trying restricts my ability to run other work, then I’ll block them and run what work I can.
ID: 66272
Glenn Carver

Joined: 29 Oct 17
Posts: 1051
Credit: 16,636,385
RAC: 11,909
Message 66273 - Posted: 29 Oct 2022, 15:29:23 UTC - in response to Message 66270.  

There's a project setting which could help here: <max_wus_in_progress>, in Project Options (job limits). You could clamp it down a bit for IFS, by setting that to 2 ("one and a spare").
That's a useful tip, thanks Richard; I'll discuss it with Andy. I would much prefer a server-side solution: I believe boinc should 'just work' and not involve lots of fiddling by the user.

Useful to get feedback from everyone here, thanks.
ID: 66273
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,748,059
RAC: 5,647
Message 66277 - Posted: 30 Oct 2022, 12:56:01 UTC - in response to Message 66273.  

On review, I should have referred you to Job limits advanced

The basic version only allows you to control the job limit down to 1 task per core, which isn't enough for this use case. The advanced version includes control at the per host level.
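Going by the documented format, that would be something like the following sketch in the project's config.xml (the app name is a placeholder again):

    <max_jobs_in_progress>
       <app>
          <app_name>oifs_43r3</app_name>   <!-- placeholder; use the real app name -->
          <total_limit>
             <jobs>2</jobs>                <!-- per host: one running and a spare -->
          </total_limit>
       </app>
    </max_jobs_in_progress>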
ID: 66277
Glenn Carver

Joined: 29 Oct 17
Posts: 1051
Credit: 16,636,385
RAC: 11,909
Message 66278 - Posted: 30 Oct 2022, 13:09:20 UTC - in response to Message 66277.  

On review, I should have referred you to Job limits advanced
The basic version only allows you to control the job limit down to 1 task per core, which isn't enough for this use case. The advanced version includes control at the per host level.
Yes, I read this and thought the same. For the higher memory configurations, I think we will definitely need to use this.

It's turning out to be more work to get the boinc side working correctly than to get the model ready to run under boinc :(
ID: 66278
SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 66280 - Posted: 1 Nov 2022, 23:25:46 UTC - in response to Message 66265.  
Last modified: 1 Nov 2022, 23:26:19 UTC

I will not restrict my 24 core box to running 4 cores with the other 20 waiting for memory - I’ll block the OpenIFS jobs if they won’t play happily.


24 cores, or 24 threads? There's a difference, and especially for CPDN tasks, "more threads" is not always better.

I've got a pair of 3900X boxes (12C/24T), and I've written some scripts that track "instructions retired per second." I rarely see a difference between 12 and 18 tasks running for most BOINC workloads (and if I do, the 18 task box is usually accomplishing less actual work per second), and the CPDN tasks typically seem to peak around 8 threads, though I don't recall seeing much of a difference dropping to 6. It's not just the cores that matter - it's the cache. I've absolutely seen "more threads mean lower aggregate system throughput," and CPDN is particularly bad for that.
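(The guts of the measurement is nothing exotic - roughly the following sketch, not my actual script, using Linux perf; it needs perf installed and permission for system-wide counters, e.g. kernel.perf_event_paranoid turned down or run as root:)

    #!/usr/bin/env python3
    # Sample system-wide instructions retired over an interval with `perf stat`.
    import subprocess

    INTERVAL = 10  # seconds to sample

    # -a = all CPUs; perf prints its counter summary to stderr.
    result = subprocess.run(
        ["perf", "stat", "-a", "-e", "instructions", "sleep", str(INTERVAL)],
        capture_output=True, text=True,
    )

    for line in result.stderr.splitlines():
        if "instructions" in line:
            # Counter line looks like: "  1,234,567,890      instructions"
            count = int(line.split()[0].replace(",", ""))
            print(f"{count / INTERVAL:.3e} instructions retired per second")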

I would expect 4C on a 12C/24T processor to be below optimum, but... depending on the tasks, not by much. Though we'll have to see once the actual OpenIFS tasks show up.

There are some WCG tasks that use very little cache and I get linear speedups with number of threads assigned, but CPDN is definitely not that.
ID: 66280
Bryn Mawr

Joined: 28 Jul 19
Posts: 150
Credit: 12,830,559
RAC: 228
Message 66281 - Posted: 2 Nov 2022, 7:28:46 UTC - in response to Message 66280.  

I will not restrict my 24 core box to running 4 cores with the other 20 waiting for memory - I’ll block the OpenIFS jobs if they won’t play happily.


24 cores, or 24 threads? There's a difference, and especially for CPDN tasks, "more threads" is not always better.

I've got a pair of 3900X boxes (12C/24T), and I've written some scripts that track "instructions retired per second." I rarely see a difference between 12 and 18 tasks running for most BOINC workloads (and if I do, the 18 task box is usually accomplishing less actual work per second), and the CPDN tasks typically seem to peak around 8 threads, though I don't recall seeing much of a difference dropping to 6. It's not just the cores that matter - it's the cache. I've absolutely seen "more threads mean lower aggregate system throughput," and CPDN is particularly bad for that.

I would expect 4C on a 12C/24T processor to be below optimum, but... depending on the tasks, not by much. Though we'll have to see once the actual OpenIFS tasks show up.

There are some WCG tasks that use very little cache and I get linear speedups with number of threads assigned, but CPDN is definitely not that.


24T. I run the 3900 rather than the 3900X as they only pull 65W.

I always run a mix: no more than 4 CPDN, no more than 6 Rosetta, and the rest WCG, TN-Grid and SIDock. I find that sort of mix is fairly happy.

I agree. Running fully loaded runs up against the peak package power of the CPU, whereas running fewer threads pulls the same power at faster clock speeds. But I’m more the big kid than the deep analytical thinker :-)
ID: 66281
Glenn Carver

Joined: 29 Oct 17
Posts: 1051
Credit: 16,636,385
RAC: 11,909
Message 66282 - Posted: 2 Nov 2022, 13:19:00 UTC - in response to Message 66281.  

I only have Intel machines rather than AMD. I did some tests a while ago, and I get the best throughput (i.e. credit) by allowing boinc to use N-1 cores (not threads) - i.e. for a 4C/8T machine, tell boinc it can use 3 CPUs.

This is especially true for multithreaded apps, where ideally the machine needs to be as quiet as possible to avoid threads being moved from one core to another, which would delay all the other threads in the app. I do have CPU binding set in the OpenMP environment variables for OpenIFS, but I'm not sure how well that's respected on home PCs. The quieter the machine the better; TLB misses and page faults are particularly costly.
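(For the curious, "CPU binding" here means OpenMP environment settings along these lines - illustrative values only, and it's an assumption that the CPDN wrapper sets the equivalents itself:)

    export OMP_NUM_THREADS=4    # threads for the model task
    export OMP_PROC_BIND=true   # pin threads; don't let the OS migrate them between cores
    export OMP_PLACES=cores     # one thread per physical core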
ID: 66282
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 66283 - Posted: 2 Nov 2022, 14:30:41 UTC - in response to Message 66282.  

The AMD Ryzens with a large cache-to-core ratio have always worked well for me on the CPDN projects.

That is why I am using Ryzen 3600s. They have 32MB of total L3 cache and only 12 virtual cores. I often restrict BOINC to 50% of the cores, so the work units are usually running on full physical cores in effect.
I have heard that OpenIFS does not take so much cache, so I may be able to use 75% of the cores.
ID: 66283
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1061
Credit: 36,748,059
RAC: 5,647
Message 66284 - Posted: 2 Nov 2022, 14:30:54 UTC - in response to Message 66282.  

It's not just the MT tasks. I mentioned I'd snagged a number of HadSM4 / N144 tasks from the dev site when prepping for the IFS tests. Look at dev host 80.

The final task of the batch - 10589 - ran in a clean environment, and recorded times of elapsed 63,553.44, CPU 62,995.77. That's about as good as it gets.

The earlier ones rather caught me by surprise, and ran while BOINC tasks from other projects were still running. I mostly run GPU tasks, and there's an unfortunate tendency nowadays for GPU programmers not to care about the number of CPU cycles they're stealing from the machine in the background.

The worst case is the first (10650), recording times of elapsed 58,056.28, CPU 29,823.74. That's over 94% excess elapsed time (58,056 / 29,824 ≈ 1.95), or barely 50% efficiency - the task took almost twice as long on the wall clock as the CPU time it was actually given.
ID: 66284
Glenn Carver

Joined: 29 Oct 17
Posts: 1051
Credit: 16,636,385
RAC: 11,909
Message 66287 - Posted: 3 Nov 2022, 13:52:02 UTC - in response to Message 66283.  

I have heard that OpenIFS does not take so much cache, so I may be able to use 75% of the cores.
I don't know whether OpenIFS uses less or more cache than the Hadley Centre models. If anything, I'd guess it uses more, because it's a spectral model, not a pure grid-point model like the Met Office models. Operational models like the IFS are highly tuned to whatever supercomputer they run on; cache use can be optimized to some extent by tuning the blocking algorithms and by making sure the code vectorizes well. ECMWF and the Met Office really do chase that extra 1% speedup for operational use (even down to making sure the model runs in the cabinets with the closest electrical connections).

OpenIFS memory use varies depending on which part of the code (e.g. dynamics, physics, surface) is executing - each has its own characteristics. For OpenIFS@CPDN I only did a basic tuning exercise, on Intel, because of the variety of machines in use in CPDN. However, I doubt it makes much difference in practice on volunteer machines, because they are more likely to be busy, as Richard demonstrates in the previous message.

Try it and see what gives the best throughput on your machines, for whatever other workload you might have on there (and what cpu temp you like to keep to).
ID: 66287
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 66288 - Posted: 3 Nov 2022, 15:02:02 UTC - in response to Message 66284.  

I mostly run GPU tasks, and there's an unfortunate tendency nowadays for GPU programmers not to care about the number of CPU cycles they're stealing from the machine in the background.
Not really stealing - some instructions have to run on the CPU because the GPU just can't do them.

Try it and see what gives the best throughput on your machines, for whatever other workload you might have on there (and what cpu temp you like to keep to).
If it's too warm, I get a bigger fan.

Or... decent heatsink paste. I've worked wonders on completely broken GPUs by replacing the heatsink paste with 17 W·m/m²·K stuff, and the same goes for CPUs as well. I get 45 W·m/m²·K pads for the memory chips on GPUs.

I refuse to call it W/mK because you can't divide an area by a thickness!
ID: 66288
Glenn Carver

Joined: 29 Oct 17
Posts: 1051
Credit: 16,636,385
RAC: 11,909
Message 66289 - Posted: 3 Nov 2022, 16:58:04 UTC

An update regarding OpenIFS tasks (only), following a meeting with the CPDN team: several projects are ongoing, with expected task counts of 5000, 3000 and 2000. These will be coming out in the run-up to Christmas. I'm sure one of the moderators will post an update nearer the time, when these go into testing.
ID: 66289