Posts by alanb1951

1) Message boards : Cafe CPDN : BOINC options Select Columns. (Message 70337)
Posted 8 Feb 2024 by alanb1951
Post:
It's also there in 7.20.5 (Linux) from Gianfranco's repository.

Cheers - Al.
2) Message boards : Cafe CPDN : World Community Grid mostly down for 2 months while transitioning (Message 69398)
Posted 25 Jul 2023 by alanb1951
Post:
@SolarSyonyk,

It would appear as though WCG is down, again.

I've not been able to get new WUs for a bit, uploads are failing, their feeder service seems to be down, reporting WUs doesn't seem to remove them from my machines, and their forums are a 503.
WCG's host provider seems to be SHARCNET[1], which appears to be the Graham cluster and Graham Cloud sited at the University of Waterloo.

Those facilities appear to have been having the occasional problem during July, and (as noted by Dave Jackson in a reply that landed before mine) there is a planned outage of two days scheduled for 25th-27th July; I think the problems of 21st July caused the initial WCG outage, and the systems didn't recover fully enough to make it worth restarting before said outage only to have to shut everything down again...

I'm not convinced WCG can be blamed for this outage (or other hardware-related problems), and I sometimes wonder what sort of [software] bag of nails IBM handed over :-)

For more information about the Alliance (in English) see https://alliancecan.ca/en and for [general] service status[2] see https://status.alliancecan.ca/

Cheers - Al.

[1] Part of the Digital Research Alliance of Canada (formerly Compute Canada).

[2] If some of the status reporting is the best that WCG are getting, I'm not surprised that they can't give proper status updates :-)

[Edited to note Dave's short response that got in before mine!]
3) Message boards : Cafe CPDN : World Community Grid mostly down for 2 months while transitioning (Message 68583)
Posted 10 Mar 2023 by alanb1951
Post:
The server status page is a poor version of this. No idea why they keep it there at all. No idea why some projects don't have an honest queue displayed. What are they hiding?
Can't comment on other projects, especially those I have had little involvement with, but CPDN only updates the server status page every couple of hours, which means small batches of work can be gone before they show up on the page.
This is common - the updating frequency is low on other projects too. Rosetta is odd, since the updating time of server status and the updating time of the main page are different, so one is usually ahead of the other, but not always the same one.

The standard BOINC server status page can only see work that has been prepared for sending out, as it can only see the BOINC database. That's probably the most honest indication of available work for most projects, since the total amount of work may be indeterminate for one reason or another! (And you can't hide what you don't know...)

There are some projects out there that have an exactly known number of work units (if all goes well) -- individual batches here at CPDN may be examples, ARP1 sub-project at WCG is another[1], and I'm sure there are many more. However, more common seem to be projects such as MilkyWay (both sub-projects) or WCG projects such as MCM1, OPN1/OPNG and SCC1[2] that run until some target is hit[3], whilst there may also be genuinely open-ended ones (lots of "mathematical" projects?)

I'd actually be interested to know where Rosetta gets that very large number from (and its likely accuracy) -- it's almost certainly not coming from anything in the BOINC database itself, otherwise it could possibly update at the same time(s) as the server status page :-)

Cheers - Al.

P.S. I wonder if knowing how much work is available long-term is only of major interest to badge-hunters? The clamour at WCG when a project was known to be nearing its end used to be quite something to behold... :-)

[1] The only argument about the number of work units was whether the year of data (at two days per "generation") would have 364 days or 366 -- a difference of 35609 work units!

[2] In the case of SCC1 there has been such a long hiatus that many think it won't return, but childhood cancers are a key research area so I mention it anyway...

[3] Those projects process data for a given "target" for as long as the scientists deem appropriate (e.g. MilkyWay runs a set of streams "until converged"), then they move on to another target.
4) Message boards : Number crunching : OpenIFS Discussion (Message 68494)
Posted 27 Feb 2023 by alanb1951
Post:
I don't know whether the below is of any diagnostic use, but I'll report it in case...

I just noticed that one of my tasks (for work unit 12215433) had stalled over 24 hours ago, and it appeared that model.exe had finished but for some reason the wrapper (which was still present but quiescent) hadn't dealt with it as the stderr.txt file ended with the following:
  15:37:36 STEP 2952 H=2952:00 +CPU= 20.358

That 15:37:36 was on the 25th, and when I checked my boinc log for around that time I saw the usual flurry of checkpoint messages that seem to accompany the construction and submission of a trickle, but the next scheduler request was for new work, not a trickle, and there was no sign of the files being uploaded. As well as checking the boinc log, I checked the system logs to see if there was anything odd around that time -- there wasn't anything obvious.

Rather than just aborting it I decided to suspend and resume it to see what would happen; I wasn't optimistic that it would recover successfully (as something had obviously broken initially) but it did seem to restart and, of course, it shut down more or less immediately (nothing more to do!) This time, it managed to upload the files and flesh out the end of stderr.txt; unfortunately it then reported "double free or corruption (!prev)", so no luck... ((!prev) instead of the seemingly more usual (out) -- an effect of not really having anything to do?)

I see that a retry has gone out promptly, and I suspect it'll run to completion without problems -- ah, well...

Cheers - Al.
5) Message boards : Number crunching : OpenIFS Discussion (Message 68445)
Posted 24 Feb 2023 by alanb1951
Post:
Someone can have the retry for my first one of this batch: it got a "double free or corruption"...

The system in use is a Ryzen 3700X with 32GB RAM, and it is only using about half that under the current load (including a second OIFS task it got when reporting this one.) I only run one CPDN task at a time, and none of the other BOINC stuff I'm currently running on that system (a maximum of 9 other processes) will get up to a single GB of RAM!

I'll keep an eye on both this system and the other one that also has a single CPDN task in its BOINC mix (a Ryzen 5600H with 32GB RAM, currently showing about 24GB free...)...

Cheers - Al.
6) Message boards : Number crunching : What does "Didn't need" mean on work-unit status webpage? (Message 68210)
Posted 5 Feb 2023 by alanb1951
Post:
AndreyOR wrote:
Jean-David wrote:
I do not understand how that applies on the current task my machine is working on.
If the one marked "Didn't need" ran before mine, why send it to me at all?
This doesn't apply to your tasks. I'm referring to a task that was never sent to anyone. It never ran.

I think it does; it seems to be exactly the same situation as yours. These tasks expired while still in the queue waiting to be sent out. I'm trying to remember if I've seen a similar thing at MilkyWay last year when they had validation issues and a very excessive queue of tasks was getting generated.

A task also gets marked "Didn't need" if the BOINC admin decides to cancel [part of?] a work unit -- that was what happened in bulk at MilkyWay, so you remembered correctly[1].

So if a work unit is to be withdrawn and resubmitted with changed parameters (or with a revised application, or...) there may be tasks marked "Didn't need" as a result of that...

Cheers - Al.

[1] I did a code dive at the time, then had an exchange of information with their Admin about the reason their generator had run away - I think they have fixed it now :-)
7) Message boards : Number crunching : OpenIFS Discussion (Message 67878)
Posted 19 Jan 2023 by alanb1951
Post:
What is shrss?
According to the man page for atop on XUbuntu 22.04 it is "the resident size of shared memory (`shrss`)" (same as SHR in top?)

Cheers - Al.
8) Message boards : Number crunching : One of my computers is hoarding tasks and I don't know why (Message 67737)
Posted 15 Jan 2023 by alanb1951
Post:
Steven,

I notice that your system with more memory is running a fairly recent BOINC client (7.20.2) whereas the others that I looked at seem to be running 7.18.1.

If I recall correctly, the fix for the "use of max_concurrent may lead to excess work being fetched" problem didn't make it into the Linux client until the 7.20 versions; it may well be that if you get hold of a 7.20 client the problem will go away!

With or without use of [project_]max_concurrent, I used to have to moderate CPDN downloads by using No New Tasks as the default for CPDN, cutting the number of available "CPUs" before allowing new tasks, updating, then setting No New Tasks again and resetting the CPU count (in that order!)[1] -- it would always send at least enough work to occupy every visible "CPU"... As CPDN and WCG were my only projects doing CPU work, and I could limit work downloads for WCG at the server end, I wasn't seeing the overload issue until WCG went on hiatus, at which point the alternative projects I took CPU work from started to over-load if I used [project_]max_concurrent. However, once I found a 7.20 client that stopped.
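For anyone who wants to try the [project_]max_concurrent route once they have a 7.20 client, a minimal app_config.xml (dropped in the climateprediction.net project directory) would look something like this -- the limit of 2 is just an example, of course:

  <app_config>
      <!-- cap the number of CPDN tasks running at once; adjust to taste -->
      <project_max_concurrent>2</project_max_concurrent>
  </app_config>

The client picks it up after "Options -> Read config files" in the Manager (or a client restart).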

Cheers - Al.

[1] Not very convenient -- much care needed to make sure no GPU tasks running at the time...
9) Message boards : Number crunching : OpenIFS tasks : make sure boinc client option 'Leave non-GPU tasks in memory' is selected! (Message 67305)
Posted 4 Jan 2023 by alanb1951
Post:
Richard,

Thanks for having a look to see what's going on...

The message is written by https://github.com/BOINC/boinc/blob/master/client/app_control.cpp#L1551, and seems to be controlled by

old_time = atp->checkpoint_cpu_time;  // the saved time of the last checkpoint
if (old_time != atp->checkpoint_cpu_time) {  // if they are different ...
so you shouldn't see two messages with the same time.

Edit - OK, so 1 second apart is indeed 'different'. But I can never unravel David Anderson's spaghetti code much beyond that.

But it isn't trying to checkpoint, so unless something is writing a non-zero value to the task's checkpoint_cpu_time it should just do nothing (always zero) - or have I misread/misunderstood that section of code? (Quite likely; my opinion of David's code is much the same as yours!)
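To illustrate what I'm getting at, here's a toy paraphrase of that check (made-up names, not the real app_control.cpp code):

  #include <cstdio>

  struct Task { double checkpoint_cpu_time = 0; };  // the value the client last saw for the task

  // Toy version of the check under discussion: only a *change* in the reported
  // checkpoint CPU time produces the "[checkpoint] ... checkpointed" message.
  void poll(Task& t, double reported) {
      double old_time = t.checkpoint_cpu_time;  // what we saw on the previous poll
      t.checkpoint_cpu_time = reported;         // what the app has just reported
      if (old_time != t.checkpoint_cpu_time) {
          std::printf("[checkpoint] result ... checkpointed\n");
      }
  }

  int main() {
      Task t;
      poll(t, 0.0);    // app never checkpoints: 0 == 0, nothing logged
      poll(t, 37.25);  // something reports a new non-zero value: message appears
  }

So a message every second would imply that something is handing the client a new, non-zero checkpoint time every second, which is precisely what puzzles me.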

And from a later message...

I think the wisest thing is to turn off that log flag (it'll spam the system journal in no time), and stop worrying about it.

That's what I did, but as soon as WCG's GPU application returns I either forego some performance analysis I'm doing that needs to know where it checkpoints (to make some sort of sense of a GPU activity trace) or I forego CPDN (as I currently have no machines with the capacity to consider setting up a second BOINC client with different log behaviour...)

Now, my tiny contribution wouldn't be missed (no sarcasm intended), but if someone could find out how to stop that spam...

Cheers - Al.
10) Message boards : Number crunching : OpenIFS tasks : make sure boinc client option 'Leave non-GPU tasks in memory' is selected! (Message 67292)
Posted 4 Jan 2023 by alanb1951
Post:
I asked about the checkpoint "log spam" in the OpenIFS discussion thread on 23rd December but, given when I posted it, that message is now two pages back :-)

For what it's worth, if you look at the task properties, BOINC Manager on my machine doesn't seem to think the application checkpoints via the client checkpoint mechanism at all whilst it's running (there was never a last checkpoint time); that's consistent with my understanding of some of what Glenn has said about the matter, but it doesn't explain this -- is it something odd in the client libraries, or something in the CPDN wrapper or main program?

It would be nice if it didn't do this, and it would be interesting to know why it does do it!

Cheers - Al.

P.S. Please tell me it's not using something in the BOINC checkpoint mechanism as a 1 second timer :-) ...
11) Message boards : Number crunching : OpenIFS Discussion (Message 67014)
Posted 23 Dec 2022 by alanb1951
Post:
I happened to try a couple of these tasks to see what effect they would have on the rest of my BOINC work-load. No problems there, but...

I usually have checkpoint debug turned on if I'm running certain WCG tasks or if I'm doing perf stat analyses (trying to dodge genuine checkpoints!). Imagine my surprise when I found that my BOINC log was being "spammed" with a checkpoint message once a second (or, more accurately, 9 or 10 times in every 10 or 11 seconds), with gaps of a few seconds whenever it was consolidating files or arranging an upload. Given that the BOINC standard checkpoint mechanism is apparently not being used by the application, this seems a bit strange :-)

[If this has already been discussed I missed it; that said, I don't suppose that many people do checkpoint debug most of the time!...]

Here's the front of the first task starting up...

22-Dec-2022 03:24:54 [climateprediction.net] Starting task oifs_43r3_ps_0497_1993050100_123_962_12179141_0
22-Dec-2022 03:24:54 [climateprediction.net] [cpu_sched] Starting task oifs_43r3_ps_0497_1993050100_123_962_12179141_0 using oifs_43r3_ps version 105 in slot 6
22-Dec-2022 03:25:00 [climateprediction.net] [checkpoint] result oifs_43r3_ps_0497_1993050100_123_962_12179141_0 checkpointed
22-Dec-2022 03:25:01 [climateprediction.net] [checkpoint] result oifs_43r3_ps_0497_1993050100_123_962_12179141_0 checkpointed
22-Dec-2022 03:25:03 [climateprediction.net] [checkpoint] result oifs_43r3_ps_0497_1993050100_123_962_12179141_0 checkpointed
22-Dec-2022 03:25:04 [climateprediction.net] [checkpoint] result oifs_43r3_ps_0497_1993050100_123_962_12179141_0 checkpointed

and here's one of the intervals where I believe it was doing file movement/uploading...

22-Dec-2022 03:36:40 [climateprediction.net] [checkpoint] result oifs_43r3_ps_0497_1993050100_123_962_12179141_0 checkpointed
22-Dec-2022 03:36:41 [climateprediction.net] [checkpoint] result oifs_43r3_ps_0497_1993050100_123_962_12179141_0 checkpointed
22-Dec-2022 03:37:00 [World Community Grid] [checkpoint] result MCM1_0193439_9713_3 checkpointed
22-Dec-2022 03:37:04 [climateprediction.net] [checkpoint] result oifs_43r3_ps_0497_1993050100_123_962_12179141_0 checkpointed
22-Dec-2022 03:37:05 [climateprediction.net] [checkpoint] result oifs_43r3_ps_0497_1993050100_123_962_12179141_0 checkpointed

Now, the writing of these lines isn't a major I/O nuisance, but it is a space-consuming one! So eventually I got fed up and turned off checkpoint debug logging :-) -- fortunately, I'm not running WCG work that I want to monitor at present, and I would quite like to see what happens to throughput with one of these in a machine at the same time as a WCG ARP1 task (though there aren't any at present...) so I'll carry on with my [infinitely small] contribution for now.
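For anyone else who wants to silence (or later re-enable) it, the flag lives in cc_config.xml; the stanza I toggle looks roughly like this (0 = off, 1 = on):

  <cc_config>
      <log_flags>
          <!-- per-checkpoint log messages; only worth 1 while actively monitoring something -->
          <checkpoint_debug>0</checkpoint_debug>
      </log_flags>
  </cc_config>

followed by boinccmd --read_cc_config (or Options -> Read config files in the Manager) so the change takes effect without a restart.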

If this shouldn't be happening, I hope it can be stopped... If, however, it's a natural part of how the programs are designed, I'd be interested to know why it happens.

Cheers - Al.
12) Message boards : Number crunching : Site problems (Message 64568)
Posted 2 Oct 2021 by alanb1951
Post:
Looks like file names may be slightly different in Red Hat. For the record, the ca-bundle.crt on Ubuntu is by default in /var/lib/boinc-client/

For what it's worth, on Ubuntu that entry in boinc-client is a link to /etc/ssl/certs/ca-certificates.crt, which gets updated as necessary - I would expect most Linux distributions that have a properly curated BOINC client package to do something similar, but I could be wrong :-)
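If anyone wants to check their own set-up, something along these lines will show where the bundle points and when it was last refreshed (paths are the Debian/Ubuntu package defaults, so adjust for other distributions):

  ls -l /var/lib/boinc-client/ca-bundle.crt    # should be a symlink on Debian/Ubuntu packages
  ls -l /etc/ssl/certs/ca-certificates.crt     # the real bundle, with its last-update timestamp
  sudo update-ca-certificates                  # rebuilds the bundle if the certificate store has changed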

And I note that we still have the (internal to CPDN) PHP warnings at the top of every page, so it looks like Oxford have some sorting out to do anyway...

Cheers - Al.

[Edit...] P.S. The certificate bundle was most recently updated on 28th September, 2021, round about when this root certificate expired.
13) Message boards : Number crunching : Site problems (Message 64549)
Posted 1 Oct 2021 by alanb1951
Post:
I wonder if any of their certificates have DST Root CA X3 at the top of their certificate chain. It lapsed on 29th September, I believe...

Cheers - Al.
14) Questions and Answers : Unix/Linux : *** Running 32bit CPDN from 64bit Linux - Discussion *** (Message 63311)
Posted 11 Jan 2021 by alanb1951
Post:
Disclaimer: I have never tried running CPDN in a VBox client, so what follows may be irrelevant, but it does relate to VBox performance issues.

I used to run 32-bit WCG tasks in a VBox client for throughput experiments (and usually the 32-bit tasks in the VM finished slightly quicker than similar 64-bit tasks on the host). I still use VBox for other, non-BOINC-related, tasks such as a Windows VM for my camera update software(!) and a Linux VM for home banking and such like.

A couple of observations about VirtualBox: it seems that whenever I did a significant version upgrade (rather than a maintenance update) something would need re-configuring. If lucky, it would just be the video or sound; however, it was often something that affected performance!

Firstly, if you don't install up-to-date Guest Additions, performance or behaviour may suffer.

Then there's the VBox version... One major update I did turned off the use of KVM paravirtualization, for instance. Performance hit! And somewhere around my move from XUbuntu 18.04 to 20.04 (host and client, in this particular case), it turned out that one of my VMs was not using the host cache for disk I/O - that VM was running like a slug until I found that issue via the VBox online forums.
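For anyone hitting the same two issues, commands along these lines should put the settings back ("MyVM" and the controller name are just placeholders -- check VBoxManage showvminfo for yours):

  VBoxManage modifyvm "MyVM" --paravirtprovider kvm              # re-enable KVM paravirtualization for a Linux guest
  VBoxManage storagectl "MyVM" --name "SATA" --hostiocache on    # turn the host I/O cache back on for that controller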

My apologies if none of the above is news, but as I don't see massive performance hits if I time some of my home-grown non-interactive programs in a VM and on the host of that VM (trying to ensure nothing else is busy at the time), I wonder if some of it is "tuning" related.

I have never tried using Windows as a host, by the way; the above is all Linux-hosted. Someone hosting on Windows might know whether something akin to the KVM or host-cache settings might be an issue there...

Cheers - Al.

[Edited to add the rider about Windows hosts...]
15) Message boards : Number crunching : Big models (Message 62639)
Posted 7 Aug 2020 by alanb1951
Post:
Question: How do people feel about a monthly upload of around 193Mb?

The answer to that rather depends on how long it takes to produce a month's worth of data to upload! If these are going to be models that do several years in a single job, that could be several "months" per real-time day, after all.

And there's another issue that may be critical - checkpointing. If the checkpoints are as frequent as they were on those HadCM3s ones we had last Autumn, folks with ext4 filestore are going to be a tad unhappy! And, of course, if there are too many jobs running at once the machine could become disk-bound if running spinning media rather than solid-state...

(ext4 is more or less the default nowadays, I believe, and as far as I am aware there is no way to avoid more or less immediate writes without turning journaling off, which rather defeats the point... One could work around it by putting [part of] /var/lib/boinc-client on a separate ext3 partition and playing with cache parameters, I suppose, but not everyone is a Linux guru...)
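By way of illustration only (device name and numbers made up, and this merely batches writes up a bit rather than avoiding them), the sort of tinkering I had in mind would be a longer journal commit interval plus more relaxed dirty-page flushing:

  # hypothetical fstab entry for a separate BOINC partition (longer commit interval, no atime updates)
  /dev/sdb1  /var/lib/boinc-client  ext4  noatime,commit=60  0  2

  # let dirty pages sit a little longer before writeback (values purely illustrative)
  sudo sysctl -w vm.dirty_writeback_centisecs=1500 vm.dirty_expire_centisecs=3000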

The above said, it's good to know there might be some new work in the pipeline, and perhaps it'll be 64-bit and more tuned to modern hardware???

Cheers - Al.
16) Message boards : Number crunching : "No tasks sent" (Message 62420)
Posted 10 May 2020 by alanb1951
Post:
There should be some way to better spread these small test batches (if that’s what this micro-batch was) around so that they don’t all get sucked into this kind of black hole machine. Otherwise this will become more and more of a problem as processors get ever larger numbers of cores.


I agree. Usually this is done via the testing site. I have, I suspect, the slowest machine on there, and on that these tasks would take about 20 days. The machine I use for the testing site runs 24/7, which the machine with the tasks clearly doesn't, as it is a lot faster yet still takes 65 days to turn tasks around.

I don't know why these were sent out on the main site rather than testing. The other option if there is a good reason for using the main site for them would be to send out say 100 instead of 15 which would hopefully get enough data back quickly.

Dave,

There is another reason a system can seem to take an age to send stuff back, which you may or may not have considered. I run CPDN on an i7-6700K and a Ryzen 3700X system, both of which take about 7.5 days to run one HadAM4h task but which seem to take a month or more to turn jobs around. In my case this is because, if I'm not careful, CPDN will send me as many tasks as I have "CPUs" available to BOINC but I only run one (i7) or two (Ryzen) CPDN tasks at once! Perhaps that machine that snagged all the Windows jobs hit the same problem?...

I wish every BOINC site had the facility they've implemented at WCG whereby you can say "I never want more than N of these on my machine at once" and it won't send work that would exceed that. The alternative, of course, would be if the BOINC client respected the max_concurrent option and sent that as the number of CPUs instead of the actual number, which would offer the same sort of control!

Cheers - Al.
17) Message boards : Number crunching : Work available and being requested but none downloaded (Message 62381)
Posted 1 May 2020 by alanb1951
Post:
Although it probably won't help Bryn Mawr solve the problem, I thought I'd post my observations on how fetching from CPDN has worked recently.

I have two Linux machines that request CPDN tasks on occasion. I use app_config.xml to divide the work-load as I require, and set CPDN to a lower resource share to reflect the relative workloads wanted, but whenever a necessary CPDN work fetch takes place it will try to fetch enough tasks to have one for each "CPU" BOINC has access to! From that, it would appear that the only part of the request that is of relevance to the CPDN server is the inst value!

To ensure I only get as many tasks as I want my routine is usually as follows: suspend WCG (my only other CPU task source), reduce %age of CPUs to use to match the number of CPDN tasks I want on the machine, allow CPDN to get work then update CPDN - it will fetch just enough tasks to make up the numbers - then set CPDN back to "No New Tasks" to avoid future unwanted fetches, restore the original %age of CPUs and resume WCG!
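For what it's worth, most of that routine can be driven from boinccmd if you'd rather script it (the CPU percentage still has to be changed via the Manager, or by editing global_prefs_override.xml and then running boinccmd --read_global_prefs_override); roughly, and with the project URLs as reported by boinccmd --get_project_status:

  boinccmd --project http://www.worldcommunitygrid.org/ suspend        # park the other CPU project
  # ...drop "Use at most N% of CPUs" to match the number of CPDN tasks wanted, then:
  boinccmd --project http://climateprediction.net/ allowmorework
  boinccmd --project http://climateprediction.net/ update              # fetches just enough tasks
  boinccmd --project http://climateprediction.net/ nomorework          # back to No New Tasks
  # ...restore the original CPU percentage, then:
  boinccmd --project http://www.worldcommunitygrid.org/ resume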

I had last observed this on 25th April when I forgot to follow my usual CPDN fetch routine on my i7-7700K (BOINC is allowed 5 "CPUs" for CPU work [there's a GPU too...] and I only run 1 CPDN task at a time, so I try for two tasks at any one time...) and when it ran out of tasks it fetched 5 new ones because I hadn't put No New Tasks on previously!

However, as there seems to be a suggestion that this might be a recent change, I have just allowed my other machine (a Ryzen 3700X, with 13 of the 16 "CPUs" allowed to BOINC (and CPDN restricted to two)) to get more work; it had 2 running and 4 waiting [already more than I really wanted, but never mind...], so I paused WCG, set CPUs to 50% and allowed CPDN to fetch - it got two tasks as I expected (and the request was as shown below...)

Fri 01 May 2020 06:43:36 BST | climateprediction.net | [work_fetch] set_request() for CPU: ninst 8 nused_total 6.00 nidle_now 2.00 fetch share 1.00 req_inst 2.00 req_secs 86400.00
Fri 01 May 2020 06:43:36 BST | climateprediction.net | [work_fetch] request: CPU (86400.00 sec, 2.00 inst) NVIDIA GPU (0.00 sec, 0.00 inst)

Fri 01 May 2020 06:43:36 BST | climateprediction.net | Sending scheduler request: To fetch work.
Fri 01 May 2020 06:43:36 BST | climateprediction.net | Requesting new tasks for CPU
Fri 01 May 2020 06:43:42 BST | climateprediction.net | Scheduler request completed: got 2 new tasks

Note that I have my cache control set to 0.45 + 0.05 days so it asked for 1 day of work. (The Ryzen usually takes just over 7 days to do one of these, so if it were paying any attention to that it wouldn't send anything!) Note also there's a GPU, but I doubt that's making any difference to the actual request.

Pursuant to Les's comment about older clients, and other things that might come into play, my Ryzen is on XUbuntu 18.04.04 with a 5.3 kernel and uses BOINC client 7.9.3 (I didn't want the version available from Gianfranco's repository when I set it up), and the Intel box is on an earlier 18.04 with client 7.14.2.

So, as I said, probably not helpful for Bryn Mawr but it is a "data point"... (and perhaps a marker to try a different client?)

Good luck - Al.

[Edited to add some (obvious?) steps I'd forgotten to list...]
18) Message boards : Number crunching : New work Discussion (Message 62050)
Posted 27 Jan 2020 by alanb1951
Post:
@Wolfman1360
I vaguely remember discussion of Rosetta eating up l3 cache as well, but can't find the discussion anywhere.
Is this still true today and should I be limiting it alongside the n216 and n144?

Jim1348 has referred to local threads where this has come up; if you look in the threads about UK Met Office HadAM4 at N216 resolution and UK Met Office HadAM4 at N144 resolution you'll find several mentions of L3 cache bashing (especially in the N216 thread). In this message in the N144 thread I actually replied to one of your posts, talking about workload mixes (and again in this message)... Jim1348 (and others) had some good contributions in those threads too. I don't recall many explicit references to Rosetta, but WCG MIP1 (which uses Rosetta) got some dishonourable mentions...

You may also have seen (or even participated in) threads about MIP1 at WCG -- because of the model construction it uses, the rule of thumb is one MIP1 per 4 or 5 MB of L3 cache! I haven't got time to track those down at the moment - sorry!

For what it's worth, if you run MIP1 alongside N216 you'll see the same sort of hit as if running extra N216 tasks; N144 is nowhere near as bad!

Cheers - Al.

[Edited to fix a broken link, then to fix a typo I'd missed!]
19) Message boards : Number crunching : UK Met Office HadAM4 at N144 resolution (Message 61757)
Posted 21 Dec 2019 by alanb1951
Post:
@Wolfman1360
Will I see much of a performance gain by allowing all cores to crunch away at N144 with hyper threading enabled, or will this just be twice the time for little if any gain?
Right now I have N144 limited to 8 concurrent on the Ryzen and 4 on the 4770 and 2600, thus theoretically eliminating HT, and 4 N216 on the Ryzen with 2 allowed on the 2 Intels. I'll change this to 1 per your advice, thank you. Rosetta is currently using the remaining 8 threads of the Ryzen and 4 of the other two. I've been a strong supporter of WCG since 2016 and need a bit of a break from there, though I will try and get some of the ARP.

Regarding cutting down the numbers - my comment about cache use for CPDN was specific to N216 tasks (which run HadAM4h - I presume the h is for high(er) resolution!) You would probably still be o.k. with 2 N144 tasks on an 8MB-cache machine most of the time, cutting down to one if more N216 turns up!

As for multiple tasks and hyperthreading - that one is a lot more complex, because unless you can guarantee which CPU thread runs a particular application you can't be sure whether you might get two heavy floating-point using applications on the same core (at which point you may well see a fairly substantial throughput drop per individual process, even if the applications aren't particularly hard on the L3 cache...) However, if you get lucky and one of the applications is mostly shuffling stuff around and doing pattern-matching or integer arithmetic a core may be able to keep both threads fairly busy!

I reckon on about a 15% drop on individual processes, but even then if I'm running 6 tasks on my 7700K or 14 tasks on my 3700X (instead of 4 or 8 respectively) I get more work done. If the drop was 40+ % I probably wouldn't use the hyperthreads...

Whatever your workload mix, there will probably be a point at which using more cores/threads will actually result in doing less science per hour - there was a thread over in the SETI forums about the best CPU workload on a Threadripper, and if I remember rightly the conclusion was that if more than 75% of the threads were used the performance drop became unacceptable. And I have noticed that if I don't leave two threads clear for operating system stuff on my 3700X performance takes a noticeable hit because of all the I/O services and such like causing extremely high numbers of cpu-migrations (expensive) as well as the expected context switches (reasonably efficient)...

I have lots of numbers about throughput for various workloads, but there really is no simple (or brief) answer to "what will happen if..." - you'll have to try things out. And if you are interested enough, and you have a Linux system on which you can get root access, you can get hold of the performance tools to check out throughput, context switches, cache access et cetera for yourself!
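As an example of the kind of thing I mean (event names can vary a little between kernels and CPUs, and the PID is obviously a placeholder):

  # watch a running model's counters for a minute (needs root, or a relaxed perf_event_paranoid)
  sudo perf stat -e task-clock,context-switches,cpu-migrations,cache-references,cache-misses \
       -p <pid_of_task> sleep 60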

Good luck arriving at a workload mix that meets your objectives!

Cheers - Al.

P.S. My best machine for running individual processes is an i5-7600 (4 cores, no hyperthreading), which runs tasks about 10% faster than either my 7700K or 3700X despite being run at about 12% lower CPU clock speed. It only does WCG work but I only allow 1 MIP1, don't run ARP1 and use only 3 cores for BOINC.
20) Message boards : Number crunching : UK Met Office HadAM4 at N144 resolution (Message 61744)
Posted 20 Dec 2019 by alanb1951
Post:
@Wolfman1360

Les is right about N144 (HadAM4) not being anywhere near as much of a "nuisance" as N216 (HadAM4h)! He's also right about the i7 and its cache limitations.

However, you mentioned running WCG work on the same system... You may or may not be aware of the issues regarding MIP1 and L3 cache (it likes about 4 or 5MB too!) Some of the other projects also work on quite large memory grids but don't seem to put so much stress on L3 cache (though the cumulative effect might mount up if you run several at once.)

Best WCG projects to run alongside CPDN are MCM1 and anything VINA-based (e.g. FAHV if/when it returns, SCC1 when it returns, and the recently completed OET1 and ZIKA projects); those tend to require less volume of L3 cache and their performance does not seem to degrade as dramatically (or put excessive stress on other applications). The worst is definitely MIP1!

Good luck deciding on a workload mix that suits you on that Ryzen, and don't try to run more than one HadAM4h or MIP1 on that 4770! (I have an i7-7700K (8MB L3 cache) and that shows severe performance degradation if I let it run more than one of either of those!!!)

Cheers - Al.

[Edited for sentence order...]

