Message boards : Number crunching : New work discussion - 2

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4352
Credit: 16,595,762
RAC: 6,274
Message 69016 - Posted: 27 Jun 2023, 18:01:07 UTC

I am now also getting transient http errors on my uploads. This may be just congestion but I have alerted the project people anyway.

SolarSyonyk
Joined: 7 Sep 16
Posts: 259
Credit: 32,091,992
RAC: 22,910
Message 69017 - Posted: 27 Jun 2023, 18:49:18 UTC - in response to Message 68998.  
Last modified: 27 Jun 2023, 18:51:08 UTC

I'm getting only 2% per day, but that's running 24 tasks on 24 threads.


Definitely don't do that for CPDN tasks. You literally lose throughput in "instructions retired per second" compared to far fewer tasks running at a time. I've got some 3900Xs and I find CPDN throughput (overall system IPS) peaks in the 8-12 tasks, and there's not a huge difference between them, so I typically run 8 tasks max - a bit less heat, and a slightly shorter return time per task. Or whatever I can with the RAM limits, some of the boxes are a bit tighter on RAM than ideal for the new big tasks.

Some of the other projects, you gain total IPS up to a full 24 tasks. Though I've found for World Community Grid and such, I improve performance if I manually limit it to only running one type of task at a time. Having the branch predictors and such all trained properly seems to help, or there's less instruction cache conflict, or... something of the sort. I've not pulled out the counters in great detail to inspect it, I just parse the per-core instructions retired MSR with a little C program.
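
For anyone wanting to try the same kind of measurement, below is a minimal C sketch of the idea. It is not the poster's program: it samples instructions retired per CPU through the Linux perf_event_open() syscall rather than reading the raw MSR, which avoids vendor-specific MSR addresses. It assumes a Linux box and needs root (or a permissive kernel.perf_event_paranoid setting) to count system-wide.

/* Sketch: system-wide "instructions retired per second", summed over CPUs,
 * using perf_event_open() instead of raw MSR reads. Build: gcc -O2 ipsmon.c */
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    int *fd = calloc(ncpu, sizeof(int));
    uint64_t *prev = calloc(ncpu, sizeof(uint64_t));
    const int interval = 5;                        /* seconds between samples */

    for (long cpu = 0; cpu < ncpu; cpu++) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;  /* instructions retired */
        /* pid = -1, cpu = N: count everything that runs on that CPU */
        fd[cpu] = perf_event_open(&attr, -1, (int)cpu, -1, 0);
        if (fd[cpu] < 0)
            perror("perf_event_open");
    }

    for (;;) {
        sleep(interval);
        double total = 0.0;
        for (long cpu = 0; cpu < ncpu; cpu++) {
            uint64_t count;
            if (fd[cpu] >= 0 &&
                read(fd[cpu], &count, sizeof(count)) == sizeof(count)) {
                total += (double)(count - prev[cpu]) / interval;
                prev[cpu] = count;
            }
        }
        printf("~%.1f G instructions retired/sec across %ld CPUs\n",
               total / 1e9, ncpu);
    }
}

Run it while BOINC is crunching and watch how the total changes as you vary the number of running tasks.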

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4352
Credit: 16,595,762
RAC: 6,274
Message 69018 - Posted: 27 Jun 2023, 19:05:08 UTC

My uploads have started again.

Aurum
Joined: 15 Jul 17
Posts: 94
Credit: 18,354,396
RAC: 6,590
Message 69019 - Posted: 27 Jun 2023, 19:20:07 UTC - in response to Message 68877.  

Looking forward to it. Which app would that be?
Weather at Home Windows tasks.


About how much RAM per task will be required?

My Win7 is running three 8.24 wah2 using 423 MB each.

rob
Joined: 5 Jun 09
Posts: 80
Credit: 3,046,017
RAC: 3,192
Message 69020 - Posted: 27 Jun 2023, 19:43:39 UTC - in response to Message 69012.  

Thanks Dave - I was a bit over-excited to see new work, and rather dismayed to see so many errors :-( Glad it's a known problem and folks are digging into it.

zombie67 [MM]
Joined: 2 Oct 06
Posts: 52
Credit: 26,209,214
RAC: 3,355
Message 69021 - Posted: 28 Jun 2023, 2:07:32 UTC - in response to Message 69019.  

My Win7 is running three 8.24 wah2 using 423 MB each.


Same. These tasks do not use a lot of RAM compared to others.

Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 69022 - Posted: 28 Jun 2023, 8:17:40 UTC - in response to Message 69017.  

I'm getting only 2% per day, but that's running 24 tasks on 24 threads.
Definitely don't do that for CPDN tasks. You literally lose throughput in "instructions retired per second" compared to far fewer tasks running at a time.
I changed it to 12 instead of 24 tasks, and the % per day has not yet increased. I'll leave it a bit longer, but so far I seem to have halved my total throughput by doing no HT.

Or whatever I can with the RAM limits, some of the boxes are a bit tighter on RAM than ideal for the new big tasks.
You seem to have a very small amount of RAM. You seem to have put more money into the CPUs, and I put more into RAM. Maybe it's because I run LHC a lot.

Some of the other projects, you gain total IPS up to a full 24 tasks.
I've found all the projects except this one benefit from using all the threads. What's different about these programs? Do I overload the cache?

Though I've found for World Community Grid and such, I improve performance if I manually limit it to only running one type of task at a time. Having the branch predictors and such all trained properly seems to help
Interesting. Do the predictors not look at only one running program? E.g. if you had 3 of one program running and 3 of another, they'd sit on separate cores, so use different predictors?

Glenn Carver
Joined: 29 Oct 17
Posts: 816
Credit: 13,672,275
RAC: 8,057
Message 69023 - Posted: 28 Jun 2023, 9:35:43 UTC - in response to Message 69022.  
Last modified: 28 Jun 2023, 9:37:44 UTC

I'm getting only 2% per day, but that's running 24 tasks on 24 threads.
Definitely don't do that for CPDN tasks. You literally lose throughput in "instructions retired per second" compared to far fewer tasks running at a time.
I changed it to 12 instead of 24 tasks, and the % per day has not yet increased. I'll leave it a bit longer, but so far I seem to have halved my total throughput by doing no HT.

Check out the graph in this message (and thread discussion) https://www.cpdn.org/forum_thread.php?id=9184&postid=68081 where I did some throughput tests with OpenIFS. Best throughput came with 50% threads in use. Throughput stops increasing at >50% threads in use as the runtime increases with more threads.

I would expect the same to apply with the UM family of models. These are highly numerical models and will compete for the same floating point units on the chip if you run on all threads. Cache size is not the main factor for performance of these models.
---
CPDN Visiting Scientist

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4352
Credit: 16,595,762
RAC: 6,274
Message 69024 - Posted: 28 Jun 2023, 10:28:27 UTC

I changed it to 12 instead of 24 tasks, and the % per day has not yet increased. I'll leave it a bit longer, but so far I seem to have halved my total throughput by doing no HT.


I am getting about 7.5% a day on each of my 7 running tasks.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4352
Credit: 16,595,762
RAC: 6,274
Message 69025 - Posted: 28 Jun 2023, 10:37:49 UTC

Still got seven zips to finish uploading. I think the issue is just congestion, and when the server gets hit a bit less often things should improve. (Though with so many computers crashing everything, it seems a bit odd to me that the congestion is still happening.)

Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 69026 - Posted: 28 Jun 2023, 11:58:20 UTC - in response to Message 69023.  

Check out the graph in this message (and thread discussion) https://www.cpdn.org/forum_thread.php?id=9184&postid=68081 where I did some throughput tests with OpenIFS. Best throughput came with 50% threads in use. Throughput stops increasing at >50% threads in use as the runtime increases with more threads.

I would expect the same to apply with the UM family of models. These are highly numerical models and will compete for the same floating point units on the chip if you run on all threads. Cache size is not the main factor for performance of these models.
I can believe your graph, especially since I see less heat production running CPDN than other projects. Clearly the "thought" part of the CPU is having to wait on something else. Perhaps RAM access, cache limitation, etc. Or maybe it isn't compatible with hyperthreading. I'm just confused as to why when I dropped it from 24 to 12 tasks on a 12/24 machine, I didn't see the % per day rising. I'll leave it another couple of days and see what happens. Just because the tasks started 24 at a time shouldn't mean they remain slow when given more of the CPU to themselves.

Glenn Carver
Joined: 29 Oct 17
Posts: 816
Credit: 13,672,275
RAC: 8,057
Message 69027 - Posted: 28 Jun 2023, 13:06:26 UTC - in response to Message 69026.  
Last modified: 28 Jun 2023, 13:07:50 UTC

I would expect the same to apply with the UM family of models. These are highly numerical models and will compete for the same floating point units on the chip if you run on all threads. Cache size is not the main factor for performance of these models.
I can believe your graph, ........ I'm just confused as to why when I dropped it from 24 to 12 tasks on a 12/24 machine, I didn't see the % per day rising.
Because the graph shows that we hit maximum throughput when running the same number of tasks as cores (not threads), i.e. 12 -> 24 tasks gives the same throughput because each extra task then slows down due to contention on the chip. The %/day will go up if you reduce tasks below the no. of cores, but then you are not getting maximum throughput (tasks completed per day).
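
As a toy illustration of that arithmetic (the task counts and %/day figures below are made up, not measured): total throughput stays the same while per-task turnaround halves.

/* Toy illustration: halving the task count roughly doubles %/day per task,
 * so the machine-wide total is unchanged -- only turnaround differs. */
#include <stdio.h>

int main(void)
{
    struct { int tasks; double pct_per_day; } c[] = {
        { 24, 2.0 },   /* all threads busy   (assumed figure) */
        { 12, 4.0 },   /* one task per core  (assumed figure) */
    };
    for (int i = 0; i < 2; i++) {
        double total_pct = c[i].tasks * c[i].pct_per_day; /* whole-box %/day */
        double days_each = 100.0 / c[i].pct_per_day;      /* per-task turnaround */
        printf("%2d tasks: %5.1f %%/day total, %5.1f days per task\n",
               c[i].tasks, total_pct, days_each);
    }
    return 0;
}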

You might get better throughput by mixing projects: numerical codes like WaH, which need the floating point units, alongside tasks from another project that are, say, mostly integer. But it would need testing.

Hyperthreading works best for tasks that are multithreaded (parallel).
---
CPDN Visiting Scientist

Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 69029 - Posted: 28 Jun 2023, 13:33:53 UTC - in response to Message 69027.  
Last modified: 28 Jun 2023, 13:36:02 UTC

Because the graph shows that we hit maximum throughput when running the same number of tasks as cores (not threads), i.e. 12 -> 24 tasks gives the same throughput because each extra task then slows down due to contention on the chip. The %/day will go up if you reduce tasks below the no. of cores, but then you are not getting maximum throughput (tasks completed per day).
But I'm watching the % per day on one task in BOINC. If the throughput is the same for the whole CPU, and I'm doing 12 instead of 24, each task should go twice as fast.

You might get better throughput by mixing projects: numerical codes like WaH, which need the floating point units, alongside tasks from another project that are, say, mostly integer. But it would need testing.
Ah, I didn't think of that. I've done that before with SP and DP on GPUs. Usually that happens anyway since I don't often get a large number of CPDN tasks.

I don't suppose there's any chance of using GPUs for these floating point parts? Projects which have done so have been rewarded with a massive speedup. I know this project isn't that busy anyway, but faster returns would presumably be good for the scientists.

Hyperthreading works best for tasks that are multithreaded (parallel).
It also works well if I run something like Universe@Home, which is single-threaded like here, but probably a much simpler program, since even my phone will run it. In fact on most projects I find 24 threads gives about 1.5 times the total throughput of 12 threads.

Ingleside
Joined: 5 Aug 04
Posts: 108
Credit: 19,565,133
RAC: 31,967
Message 69031 - Posted: 28 Jun 2023, 14:44:45 UTC - in response to Message 69027.  

Because the graph shows that we hit maximum throughput when running the same number of tasks as cores (not threads)
Well, at least to my eyes it seems "Tasks completed/day" increases from 1 to 7 running tasks: at 6 running tasks the blue point is below the 14-grid-line, while at 7, 8 and 9 the 14-grid-line seems to be hidden behind the blue line & dots. For 10 running tasks the blue line dips a little again.
Whether it's 7, 8 or 9 tasks that gives the highest tasks completed/day I can't really tell from the graph.
The increase is much steeper going from 1 to 6 running tasks than from 6 to 7, making it easy to overlook the small increase.

Whether the small increase is significant enough to justify running more than 1 task per core I can't really say.

Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 69032 - Posted: 28 Jun 2023, 15:15:26 UTC - in response to Message 69031.  
Last modified: 28 Jun 2023, 15:15:52 UTC

Looks like about 0.25 out of 14 to me, so insignificant. And the tasks get done quicker, although we don't know how quickly they want them back. Dave Jackson (?) told me about a month, even though they STILL show a year on the deadline, which can get doubled for a resend.

Glenn Carver
Joined: 29 Oct 17
Posts: 816
Credit: 13,672,275
RAC: 8,057
Message 69036 - Posted: 28 Jun 2023, 17:47:32 UTC - in response to Message 69031.  

Because the graph shows that we hit maximum throughput when running the same number of tasks as cores (not threads)
Well, at least to my eyes it seems "Tasks completed/day" increases from 1 to 7 running tasks: at 6 running tasks the blue point is below the 14-grid-line, while at 7, 8 and 9 the 14-grid-line seems to be hidden behind the blue line & dots. For 10 running tasks the blue line dips a little again.
Whether it's 7, 8 or 9 tasks that gives the highest tasks completed/day I can't really tell from the graph.
The increase is much steeper going from 1 to 6 running tasks than from 6 to 7, making it easy to overlook the small increase.

Whether the small increase is significant enough to justify running more than 1 task per core I can't really say.
Yes, the throughput increases up to 6 tasks but after that reaches its limit no matter how many more tasks are started.

I suggest not paying too much attention to the little variations in the graph after 6 tasks. This was done on as quiet a PC as I could make it; it's best if you run the same tests on your own machine, though I'm sure the advice of no more tasks than cores for best throughput would still apply.

SolarSyonyk
Joined: 7 Sep 16
Posts: 259
Credit: 32,091,992
RAC: 22,910
Message 69038 - Posted: 28 Jun 2023, 20:11:20 UTC - in response to Message 69022.  
Last modified: 28 Jun 2023, 20:17:17 UTC

I changed it to 12 instead of 24 tasks, and the % per day has not yet increased. I'll leave it a bit longer, but so far I seem to have halved my total throughput by doing no HT.


As far as I can tell, %/day doesn't update very quickly, and will only show "% done / total runtime" numbers, but I've not paid much attention to it - I optimize based on the instructions per second retired, based on the CPU counters, and try to maximize that (or at least get it into the big plateau region).

You seem to have a very small amount of RAM. You seem to have put more money into the CPUs, and I put more into RAM. Maybe it's because I run LHC a lot.


Yeah, one of my 3900X is at 16GB, the other is at 32GB. They're both somewhat under-memoried, but I've found 32GB not to really matter - that's enough to load up the CPU pretty well, and the 16GB one just runs fewer tasks. I'm somewhat power and thermally limited out here for large parts of the year anyway (they run in a solar powered off grid office), so a few less watts is fine with me. They're all scrap builds out of used parts obtained cheap from various sources - they're not high cost builds.

I've found all the projects except this one benefit from using all the threads. What's different about these programs? Do I overload the cache?


I'm not really sure. CPDN tasks tend to be pulling a *lot* more data out of main memory than most other BOINC-type tasks (in terms of GB/second of memory transfer), but I think they're also better optimized in terms of the execution units on the CPU core.

Hyperthreading only gains you anything if you have unused execution slots - if you have, say, some very well hand optimized code that is dispatching instructions at the limit of the processor, HT gains you nothing, and actually loses performance from the increased cache pressure. A 3900X has 32kb L1I, 32kb L1D, and 512kb L2 per physical core, with a shared 64MB L3 split in weird ways I'm not going to worry about. If you can fit your tasks substantially in that cache, with room to spare, and they're not making good use of the execution slots, HT is a clear win. But scientific compute tasks tend to not fit that general guideline, and the CPDN tasks are heavier and better optimized than a lot of other tasks.

Do the predictors not look at only one running program? E.g. if you had 3 of one program running and 3 of another, they'd sit on separate cores, so use different predictors?


It depends. But in general, they're not (or, at least, weren't...) flushed on task switches, and they tend to be rather blind to what's executing - they care about virtual address and the type of instruction, and that's it. Spectre and various other microarchitectural exploits rely on this to misdirect speculation into useful paths, so it's typically flushed on task switch, I think... but I've not looked too deeply into it lately.

I do run all my Linux compute boxes with "mitigations=off" on the kernel command line, though - that eliminates some of the state clearing between task switches, and gains me some performance back at the cost of possible side channels. Since they're literally only used for compute tasks, I really don't care.
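
For anyone wanting to check what state a box is actually running in, recent Linux kernels report it under /sys/devices/system/cpu/vulnerabilities; a small C sketch that just prints each entry:

/* Print the kernel's reported mitigation state for each CPU vulnerability.
 * With mitigations=off most entries read "Vulnerable". */
#include <dirent.h>
#include <stdio.h>

int main(void)
{
    const char *dirpath = "/sys/devices/system/cpu/vulnerabilities";
    DIR *dir = opendir(dirpath);
    if (!dir) { perror(dirpath); return 1; }

    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] == '.')
            continue;                              /* skip . and .. */
        char path[512], line[256];
        snprintf(path, sizeof path, "%s/%s", dirpath, ent->d_name);
        FILE *fp = fopen(path, "r");
        if (fp && fgets(line, sizeof line, fp))
            printf("%-24s %s", ent->d_name, line); /* line keeps its newline */
        if (fp)
            fclose(fp);
    }
    closedir(dir);
    return 0;
}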

I'll agree with you that for a lot of BOINC tasks, throughput goes up all the way to the hyperthreaded core limit - but that's just not the case for CPDN. And I've not had the desire to profile the tasks well enough to find out the details, because it doesn't really matter what it's limiting on - performance still peaks somewhere at "physical core count or fewer." For a 3900X (12C/24T), somewhere in the 8-12 task range seems to be the main plateau, so I just leave it there and don't worry about the details.

SolarSyonyk
Joined: 7 Sep 16
Posts: 259
Credit: 32,091,992
RAC: 22,910
Message 69039 - Posted: 28 Jun 2023, 20:15:10 UTC - in response to Message 69027.  
Last modified: 28 Jun 2023, 20:22:27 UTC

You might get better throughput by mixing projects: numerical codes like WaH, which need the floating point units, alongside tasks from another project that are, say, mostly integer. But it would need testing.


It needs testing, and will really depend on cache pressure, which particular execution slots they're using, etc. But I would generally expect CPDN tasks to "play nice" with integer heavy, small kernel stuff like some of the World Community Grid tasks.

... not that I've had enough work from various places recently to have it matter.

Hyperthreading works best for tasks that are multithreaded (parallel).


Eh. It depends on the code. It tends to work well for tasks that are either poorly optimized for whatever reason (can't keep the execution units busy), or, more commonly, tasks that are RAM heavy and spend a lot of time stalled waiting on data from DRAM. If it's waiting around for instructions or data, another executing thread with data in cache pretty much runs for free. Just, you now have two tasks splitting your cache.

//EDIT: Actually, I stand corrected about the WCG tasks.

I'm running a whole whopperload of Mapping Cancer Markers tasks right now, so I compared my two 3900X boxes - one running at 12 threads, one at 18.

The 12T box progress rate per task is 0.000209 - so a total progress of 0.002508

The 18T box progress rate per task is 0.000142 - so a total progress of 0.002556. Faster, but only barely. Maybe 2%.

The 12T box is retiring a rock solid 123G instructions per second.

The 18T box is retiring somewhere between 115G and 125G IPS, but jumps around a lot more.
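
Those totals are just the per-task rate multiplied by the number of running tasks; a trivial check of the ~2% figure using the rates quoted above:

/* Aggregate progress rate = per-task rate x number of running tasks
 * (per-task rates as quoted in this post). */
#include <stdio.h>

int main(void)
{
    double rate12 = 0.000209, rate18 = 0.000142;  /* quoted per-task rates */
    double total12 = rate12 * 12.0;
    double total18 = rate18 * 18.0;
    printf("12T box: %.6f total, 18T box: %.6f total (+%.1f%%)\n",
           total12, total18, 100.0 * (total18 / total12 - 1.0));
    return 0;
}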

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1067
Credit: 16,546,621
RAC: 2,321
Message 69041 - Posted: 28 Jun 2023, 22:23:05 UTC - in response to Message 69039.  

I have a Linux box, ID: 1511241:

Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors 	16
Red Hat Enterprise Linux 8.8 (Ootpa) [4.18.0-477.13.1.el8_8.x86_64|libc 2.28]
BOINC version 	7.20.2
Memory 	125.08 GB
Cache 	16896 KB


Where it says 16 processors, it means 8 real and 8 hyperthreaded ones. I tell the BOINC client to use only 12 processors for BOINC tasks.

There is a Linux utility that analyzes RAM use including processor caches.
At the moment I am running 3 Einstein, 4 WCG (MCM=1), and 5 DENIS. I get this.
# perf stat -aB -d -e cache-references,cache-misses
 Performance counter stats for 'system wide':

    10,157,999,539      cache-references                                              (66.67%)
     6,379,971,454      cache-misses              #   62.807 % of all cache refs      (66.67%)
 1,326,645,312,439      L1-dcache-loads                                               (66.67%)
    38,106,376,737      L1-dcache-load-misses     #    2.87% of all L1-dcache accesses  (66.67%)
     3,700,436,522      LLC-loads                                                     (66.67%)
     2,786,678,543      LLC-load-misses           #   75.31% of all LL-cache accesses  (66.67%)

      61.259220589 seconds time elapsed


I wish I knew what the Cache, L1-dcache, and LLC caches were.

Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 69042 - Posted: 29 Jun 2023, 2:07:14 UTC - in response to Message 69038.  

As far as I can tell, %/day doesn't update very quickly, and will only show "% done / total runtime" numbers, but I've not paid much attention to it - I optimize based on the instructions per second retired, based on the CPU counters, and try to maximize that (or at least get it into the big plateau region).
I was looking at the % complete for the task and the time taken, and dividing one by the other in my head. Obviously for a task which started out as one of 24 at a time, this is an average of the 24- and 12-task settings, but it should move towards 4% per day instead of 2%. It's STILL at 2%/day. And even stranger, my old Xeon X5650 (running with HT on) is getting 4%/day. That CPU should be 3.3 times slower according to benchmarks. My conclusion is that all PCs are different, especially with unusual collections of whatever RAM I could get hold of. I'll put it back on 24.

Yeah, one of my 3900X is at 16GB, the other is at 32GB. They're both somewhat under-memoried, but I've found 32GB not to really matter - that's enough to load up the CPU pretty well
I upped this one to 96GB, mainly for LHC, which uses 3GB per thread. Although now I've hit the 7Mbit upload limit from my ISP: their CMS tasks send a lot of data back while computing (and I total 106 cores across 8 machines), and if they can't get it sent, they pause computing until they can. I had to arrange a careful mix of their different tasks, since my phone company still has me connected to the next town (stupid historic wiring I guess) and won't do fibre to the premises for another two years.

and the 16GB one just runs fewer tasks. I'm somewhat power and thermally limited out here for large parts of the year anyway (they run in a solar powered off grid office), so a few less watts is fine with me.
I'd buy another solar panel; there are cheap second-hand ones on eBay.

They're all scrap builds out of used parts obtained cheap from various sources - they're not high cost builds.
You have two Ryzen 9 CPUs and you say they're not high cost? Just where did you get a scrap Ryzen?

Hyperthreading only gains you anything if you have unused execution slots - if you have, say, some very well hand optimized code that is dispatching instructions at the limit of the processor, HT gains you nothing, and actually loses performance from the increased cache pressure. A 3900X has 32kb L1I, 32kb L1D, and 512kb L2 per physical core, with a shared 64MB L3 split in weird ways I'm not going to worry about. If you can fit your tasks substantially in that cache, with room to spare, and they're not making good use of the execution slots, HT is a clear win. But scientific compute tasks tend to not fit that general guideline, and the CPDN tasks are heavier and better optimized than a lot of other tasks.
In my experience, CPDN is the only one not to gain from HT, so I guess the others aren't that tightly coded (which isn't surprising, most code isn't nowadays, let alone small underfunded projects).

It depends. But in general, they're not (or, at least, weren't...) flushed on task switches, and they tend to be rather blind to what's executing - they care about virtual address and the type of instruction, and that's it. Spectre and various other microarchitectural exploits rely on this to misdirect speculation into useful paths, so it's typically flushed on task switch, I think... but I've not looked too deeply into it lately.
I just looked up Spectre, and its cousin Meltdown:

https://en.wikipedia.org/wiki/Spectre_(security_vulnerability)
https://en.wikipedia.org/wiki/Meltdown_(security_vulnerability)

Ouch: "A purely software workaround to Meltdown has been assessed as slowing computers between 5 and 30 percent in certain specialized workloads"

I do run all my Linux compute boxes with "mitigations=off" on the kernel command line, though - that eliminates some of the state clearing between task switches, and gains me some performance back at the cost of possible side channels. Since they're literally only used for compute tasks, I really don't care.
Is this possible in Windows?

I'll agree with you for a lot of BOINC tasks, throughput goes up all the way to the hyperthreaded core limit - but that's just not the case for CPDN. And I've not had the desire to profile the tasks well enough to find out the details, because it doesn't really matter what it's limiting on - performance still peaks somewhere at "physical core count or fewer." For a 3900X (12C/24T), somewhere in the 8-12 task range seems to be the main plateau, so I just leave it there and don't worry about the details.
At least it just peaks, which means I'm not actually slowing throughput down by running 24, just returning the results later.