About the new many core computers

Message boards : Number crunching : About the new many core computers

Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59755 - Posted: 10 Mar 2019, 0:49:30 UTC

I know from testing that models on hyperthreaded computers run better if the hyperthreading is left on, but not used.
E.g. on my Haswell, I only run 4 models.

This leaves the remaining threads for the OS.

But how well do the 16-, 24-, and 32-core computers fare?

It's possible that bottlenecks in getting so many threads at once to the FPUs may be a hindrance.

Any thoughts?

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 59756 - Posted: 10 Mar 2019, 1:40:15 UTC

I'm absolutely sure this is true, though large, fast caches on the processors and fast memory can make up for some of it. Those Xeons and Threadrippers with lots of L2 and L3 cache and quad-channel DDR4 won't have as much contention when running multiple models. But I can't imagine that turning on SMT/hyperthreading on those will result in much more throughput.

TR UNESCO Global Geopark
Joined: 7 Jul 17
Posts: 13
Credit: 94,077,591
RAC: 52,401
Message 59920 - Posted: 4 Apr 2019, 19:33:22 UTC - in response to Message 59755.  

This is certainly true on an 8-core/16-thread Ryzen. I've tested it with many configurations, and the very best is to run only 8 tasks and get the DDR4 memory clock as high as possible (performance scaled linearly with memory clock); CPU clock speed had less effect. I also leave SMT on, but found no significant difference with it off. After about a month of testing, I calculated that running 8 tasks instead of 16 on my 1st-gen Ryzen resulted in about 6% more productivity overall.

The hyperthreaded Intel (Ivy Bridge) systems I run did better running all 8 threads; effectively they did the same amount of work either way.

Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,888,554
RAC: 1,481,373
Message 59927 - Posted: 6 Apr 2019, 23:07:36 UTC
Last modified: 6 Apr 2019, 23:17:04 UTC

I agree with what's been posted so far (except that I don't tweak clocks; I know nothing about that).
A few years ago I did serious tests on the 4-core/8-thread Intels (Ivy and Sandy) and found that, on Linux, with SMT enabled and no non-CPDN use of the machines, running 5 models gained 5% throughput -- meaning the per-model slowdown from running 5 rather than 4 was barely made up for by the 25% increase in the number of models running. More than 5 models per 8 threads lost big, and the more models on SMT the bigger the loss, up to a large loss when running the maximum of 8.
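As a rough illustration of that arithmetic, here is a minimal Python sketch. The per-model speeds are hypothetical, except that 84% for the 5-model case is back-solved from the 5% gain above:

```python
# Relative throughput when oversubscribing SMT: throughput is the number
# of models running times the per-model speed, where 1.0 means one model
# per physical core with no contention. Speeds are hypothetical.
base = 4  # one model per physical core on a 4-core/8-thread chip
for models, rel_speed in [(4, 1.00), (5, 0.84), (6, 0.62), (8, 0.45)]:
    throughput = models * rel_speed
    print(f"{models} models at {rel_speed:.0%} per-model speed -> "
          f"{throughput / base - 1:+.0%} throughput vs 4 models")
```

With numbers like these, 5 models gains about 5%, while 6 or 8 models lose ground, matching the pattern above.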

Now running Wine under Linux. My more-than-4-core/8-thread experience is recent -- within the last year. I've done no formal tests on what I now have, and haven't done any tests with more models than cores.

I'm willing to try some tests if 6-core/12-thread Intel and 8-core/16-thread AMD info will help others. I can't buy a 16-core AMD Threadripper or a similar Xeon.

Both are fast: the Ryzen 7 2700+ is much faster than AMD's previous CPUs, and the Intel i7-8700K @ 3.70GHz is surprisingly much faster than the Intel i7-7700 @ 3.60GHz, even allowing for six cores versus four.

How to test? Please advise

Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,888,554
RAC: 1,481,373
Message 59928 - Posted: 6 Apr 2019, 23:25:13 UTC - in response to Message 59755.  

I'm thinking that nowadays the number of specialized FPUs per core is unknown to us users. It makes sense to me to experiment a bit.

I think the OS, web browsers, and email don't use floating point much, if at all. So yeah, it's the FPUs per core, and maybe the memory bandwidth, that may be the bottlenecks.


I know from testing that models on hyperthreaded computers run better if the hyperthreading is left on, but not used.
E.g. on my Haswell, I only run 4 models.

This leaves the remaining threads for the OS.

But how well do the 16-, 24-, and 32-core computers fare?

It's possible that bottlenecks in getting so many threads at once to the FPUs may be a hindrance.

Any thoughts?


geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 59929 - Posted: 6 Apr 2019, 23:47:28 UTC

This is getting into the weeds a bit, but it is complicated to test this even semi-rigorously any more. In the past, we might have batches of just one model type/region. Now that's the rarity: there are various batches for various regions, each with its own grid area and resolution, possibly testing different physics. So it's hard to get, say, 8 or 12 ANZ models and nothing else on a PC, which is what a proper test on an older i7 would need -- 4 at once vs. 8 at once -- to compare the effects of hyperthreading/SMT on throughput.

I know that in the past some models had better throughput (credits/day) when using all 8 threads on an i7. The more complex and memory-intensive the model, the more likely that is NOT to be the case. Eight SAM25 models, for example, would probably do very poorly utilizing all 8 threads, whereas a 50 km regional model over a relatively small region would do better. I remember the old FAMOUS models, which were not memory-intensive at all; one could easily produce many more credits in the same time period with hyperthreading than without.

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 59930 - Posted: 7 Apr 2019, 1:16:03 UTC - in response to Message 59927.  

How to test? Please advise

I use the write-rate to disk.
https://www.cpdn.org/cpdnboinc/forum_thread.php?id=8647&postid=58755#58755

TR UNESCO Global Geopark
Joined: 7 Jul 17
Posts: 13
Credit: 94,077,591
RAC: 52,401
Message 59933 - Posted: 8 Apr 2019, 5:01:18 UTC - in response to Message 59929.  

Geophi raises a very good point. After reviewing my data, I'm sad to report that my Ryzen SMT tests were all done on large PNW25 models.

I hope I can evaluate some smaller models soon.

I hope to test the 3rd-gen Ryzen when it releases; the 2nd gen seems like a strong evolution, and the 3rd gen's die shrink should provide good efficiency.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,378,503
RAC: 3,632
Message 60117 - Posted: 8 May 2019, 13:11:03 UTC - in response to Message 59933.  

I hope to test the 3rd-gen Ryzen when it releases; the 2nd gen seems like a strong evolution, and the 3rd gen's die shrink should provide good efficiency.

Looking at some of the figures:

Ryzen 9 3850X: 16 cores, 32 threads, clocked at 4.3GHz to 5.1GHz

I think there would be steam coming out of my internet connection.

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 60118 - Posted: 8 May 2019, 13:42:41 UTC

FWIW, I have been running my Ryzen 2600 on only 9 (out of 12) cores for the past couple of months, leaving three cores free.
https://www.cpdn.org/cpdnboinc/results.php?hostid=1480861
My current RAC is 29,913.11

But a day ago I increased that to all 12 cores, and will see what it does.

bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 24,487,746
RAC: 3,014
Message 60120 - Posted: 8 May 2019, 18:22:37 UTC - in response to Message 60118.  
Last modified: 8 May 2019, 18:23:49 UTC

FWIW, I have been running my Ryzen 2600 on only 9 (out of 12) cores for the past couple of months, leaving three cores free.
https://www.cpdn.org/cpdnboinc/results.php?hostid=1480861
My current RAC is 29,913.11

But a day ago I increased that to all 12 cores, and will see what it does.


Soon I will be testing a six-core/12-thread Xeon E5-2630 v2 @ 2.6 GHz on Win10 with only 16GB RAM, and will report back. However, any hints on how to measure read/write load would be appreciated (if that is a useful metric); otherwise I will stick to sec/TS and RAC.

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 60122 - Posted: 8 May 2019, 21:44:13 UTC - in response to Message 60120.  

However, any hints on how to measure read/write load would be appreciated (if that is a useful metric); otherwise I will stick to sec/TS and RAC.

To measure the writes to disk, I use SsdReady. There is a free version that is good enough.
http://www.ssdready.com/ssdready/

The writes are a useful proxy for the work done, as long as you account for all the other writes to disk. (I place the BOINC Data folder on a ramdisk, which appears as a separate disk drive, and so can just monitor it directly.)
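If you'd rather not use a separate tool, a rough alternative is to sample the operating system's disk counters yourself. Here is a minimal Python sketch, assuming the psutil package is installed; note that it counts system-wide writes, not just BOINC's, so quiesce other disk activity first:

```python
# Sample system-wide disk write counters over an interval and project
# the rate to GB/day. Writes include everything on the machine, so this
# is only a rough proxy for BOINC's output.
import time
import psutil

INTERVAL = 60  # seconds between samples

start = psutil.disk_io_counters()
time.sleep(INTERVAL)
end = psutil.disk_io_counters()

written = end.write_bytes - start.write_bytes
print(f"{written / 1e6:.1f} MB written in {INTERVAL}s "
      f"(~{written / 1e9 * 86400 / INTERVAL:.1f} GB/day at this rate)")
```

A tmpfs/ramdisk generally won't show up in these counters, so this measures physical-disk traffic only.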

This time I thought I would try RAC, since it had already stabilized. But I see that the figure I gave, 29,913.11, is for two machines, and one is no longer active. For the machine I will be monitoring now, the RAC is actually 27,000, so that is what I will use.

It will probably take at least a week, maybe more, to get a meaningful comparison. I normally pay no attention to RAC at all.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,378,503
RAC: 3,632
Message 60123 - Posted: 9 May 2019, 8:07:40 UTC

I think there would be steam coming out of my internet connection.

Now to find out how long it takes two 148MB uploads to go through on the free wifi on cross-country trains!

bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 24,487,746
RAC: 3,014
Message 60137 - Posted: 15 May 2019, 15:56:44 UTC - in response to Message 60122.  

I have 4 SAM50 batch 814 tasks running on the Xeon machine now (6/12 threads). At first it was only one WU plus a WCG WU, but after adding 3 more SAM50s the sec/TS of the 1st WU increased, though it is still below the sec/TS of the other 3. I will pause in the next few days to see whether the sec/TS changes before completion, and then will try with 6 or more cores. SsdReady crashed, but before it did, reads/writes were ~45GB/~28GB per day.

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 60138 - Posted: 15 May 2019, 17:24:36 UTC - in response to Message 60137.  

I have 4 SAM50 batch 814 tasks running on the Xeon machine now (6/12 threads). At first it was only one WU plus a WCG WU, but after adding 3 more SAM50s the sec/TS of the 1st WU increased, though it is still below the sec/TS of the other 3. I will pause in the next few days to see whether the sec/TS changes before completion, and then will try with 6 or more cores. SsdReady crashed, but before it did, reads/writes were ~45GB/~28GB per day.

The sec/TS is the average over the whole run, so when you added the other SAMs, cache contention and perhaps other things would slow the original one down. Depending on how far into the run it already was, that may be barely noticeable in the sec/TS (if it was pretty far in) or quite noticeable (if it wasn't very far along to begin with). If you look at individual timesteps, or at the CPU time between the newer trickles, you should see that it is on par with the more recently downloaded SAM50 tasks.

bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 24,487,746
RAC: 3,014
Message 60139 - Posted: 15 May 2019, 18:12:05 UTC - in response to Message 60138.  

I have 4 SAM50 batch 814 tasks running on the Xeon machine now (6/12 threads). At first it was only one WU plus a WCG WU, but after adding 3 more SAM50s the sec/TS of the 1st WU increased, though it is still below the sec/TS of the other 3. I will pause in the next few days to see whether the sec/TS changes before completion, and then will try with 6 or more cores. SsdReady crashed, but before it did, reads/writes were ~45GB/~28GB per day.

The sec/TS is the average over the whole run, so when you added the other SAMs, cache contention and perhaps other things would slow the original one down. Depending on how far into the run it already was, that may be barely noticeable in the sec/TS (if it was pretty far in) or quite noticeable (if it wasn't very far along to begin with). If you look at individual timesteps, or at the CPU time between the newer trickles, you should see that it is on par with the more recently downloaded SAM50 tasks.


Thanks. I was looking at individual timesteps, and there is a slight slowdown from 1.44 to 1.51 sec/TS for the 1st WU, while the other 3 SAM50s are at 1.58/1.59.

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 60140 - Posted: 15 May 2019, 18:51:02 UTC - in response to Message 60139.  

Thanks. I was looking at individual timesteps, and there is a slight slowdown from 1.44 to 1.51 sec/TS for the 1st WU, while the other 3 SAM50s are at 1.58/1.59.

Sorry, I wasn't being clear. There's no easy way to see individual timesteps as they occur. But if you look on the webpages for the tasks, the CPU time taken between trickles for your older task (after you started running the new tasks) should be similar to the CPU time taken between trickles for your newer tasks. So it's running at the same speed as the other tasks now; the cumulative sec/TS you see simply averages in the earlier period when it was running by itself.
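To illustrate the averaging, here is a minimal Python sketch with hypothetical timings chosen to resemble the numbers above:

```python
# A task runs 100,000 timesteps alone at 1.44 s/TS, then slows to
# 1.59 s/TS once three more tasks start. The instantaneous rate matches
# the new tasks immediately; the cumulative average only creeps up.
alone_ts, alone_rate = 100_000, 1.44
shared_rate = 1.59
for shared_ts in (0, 50_000, 100_000, 300_000):
    total_time = alone_ts * alone_rate + shared_ts * shared_rate
    cum_avg = total_time / (alone_ts + shared_ts)
    print(f"after {shared_ts:>7,} shared timesteps: "
          f"cumulative {cum_avg:.3f} s/TS, current {shared_rate:.2f} s/TS")
```

The cumulative figure passes 1.51 only once the task has run about as long alongside the others as it ran alone.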

bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 24,487,746
RAC: 3,014
Message 60143 - Posted: 16 May 2019, 6:40:43 UTC - in response to Message 60140.  

Thanks. I was looking at individual timesteps, and there is a slight slowdown from 1.44 to 1.51 sec/TS for the 1st WU, while the other 3 SAM50s are at 1.58/1.59.

Sorry, I wasn't being clear. There's no easy way to see individual timesteps as they occur. But if you look on the webpages for the tasks, the CPU time taken between trickles for your older task (after you started running the new tasks) should be similar to the CPU time taken between trickles for your newer tasks. So it's running at the same speed as the other tasks now; the cumulative sec/TS you see simply averages in the earlier period when it was running by itself.


I've been looking at exactly these figures, and yet WU1 is still faster than the other 3: both the CPU time (sec) and the average sec/TS are lower, though increasing with every trickle.

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 60144 - Posted: 16 May 2019, 7:46:03 UTC
Last modified: 16 May 2019, 8:18:20 UTC

Since increasing from 9 to 12 cores on my Ryzen 2600 about a week ago, I have gone from a RAC of 27k to almost 29k. It will probably take a month to stabilize.
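That lag is what you'd expect from BOINC's exponentially weighted RAC. My recollection is that the half-life is roughly a week, and the 31,000 steady-state figure below is hypothetical, picked so that day 7 lands near the observed "almost 29k"; treat both as assumptions. A minimal Python sketch:

```python
# RAC decays exponentially toward the new steady-state credit rate.
# HALF_LIFE_DAYS and new_steady are assumptions, not measured values.
HALF_LIFE_DAYS = 7.0
old_rac, new_steady = 27_000.0, 31_000.0
for day in (1, 7, 14, 30):
    decay = 0.5 ** (day / HALF_LIFE_DAYS)
    rac = new_steady + (old_rac - new_steady) * decay
    print(f"day {day:>2}: RAC ~ {rac:,.0f}")
```

On those assumptions the average is within about 1% of its new steady state after a month, consistent with taking "a month to stabilize".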

But note that the output per virtual core will decrease (though the output increases per physical core) with hyperthreading (or SMT) because you are dividing one physical core between two instruction streams. That keeps each physical core busy a higher percentage of the time, which is the whole point as it increases the total output.

Also note that you need enough memory to handle it, or the disk swapping will kill you. I have 32 GB, so no problem. Half that much would work with the present Windows work units on 12 cores, though the new Linux ones will apparently need more.
