Message boards : Number crunching : Thread affinity and transparent huge pages benefits

wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 43,261,443
RAC: 72,712
Message 71000 - Posted: 16 Jun 2024, 21:37:43 UTC
Last modified: 16 Jun 2024, 21:42:25 UTC

Finally got enough WUs to have some nice plots while messing with optimizations. The benefits of pinning threads and enabling transparent huge pages are expected for memory-intensive workloads, so this is just putting some quantitative numbers behind empirical expectations for this specific oifs batch.

The vertical axis is task runtime in seconds. The horizontal axis is one point per sample, ordered by return time from oldest to newest.



The first plot is the 7950X, and you can easily see where I started the optimization, around sample 60. It reduced runtime by ~7-8%. (Samples ~44-58 are when I gambled with 13 tasks on the 64G host. While nothing errored out, it was not a bright idea for performance either. These points were excluded from the percentage calculation.)



The second plot is more complicated. It's a 7950X3D, but running a Linux VM on Windows. I run 6 oifs tasks in the VM and 16 tasks from other projects on Windows. The dots around samples 30, 42 and 58 are peak hours where I paused the Windows BOINC client but didn't pause the VM.

The setup is a 6C/12T VM bound to the X3D cluster, cores [2,14) on Windows. The first drop, around sample 40, is from enabling huge pages and pinning threads inside the VM. That's about a 10% improvement, whether I compare non-peak or peak samples. The second drop, around sample 70, is when I started pinning the Windows BOINC tasks away from the VM cores. Now it's getting pretty close to the peak samples, where I only run the 6 oifs tasks inside the VM without the 16 Windows tasks.

Appendix - Sharing the simple commands and code

Enabling huge pages at run time. This is only effective for the current boot. You can set `transparent_hugepage=always` on the kernel cmdline to make it persist across boots, but how you do that is distro dependent, so I'm leaving that out.
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
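To confirm which mode is active afterwards (the value in brackets is the current one):
cat /sys/kernel/mm/transparent_hugepage/enabled
# e.g. prints: [always] madvise never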

Verify huge page usage. Some huge pages could exist even before that due to the default `madvise` mode, but the number will increase significantly once you set it to `always`, which allows the kernel to combine pages as it sees fit. As far as I can tell, most oifs usage is covered by huge pages just going by the rough numbers.
grep Huge /proc/meminfo
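To attribute huge pages to one task rather than the whole system, one option (assuming a kernel new enough to have smaps_rollup; reusing the example pid 1234 from the taskset example below) is:
grep AnonHugePages /proc/1234/smaps_rollup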

Affinitizing threads on Linux is done with `taskset`.
# Pin pid 1234 (and all its threads, via -a) to cores 0 and 1
sudo taskset -apc 0,1 1234

To find out your CPU topology, i.e. which CPU number belongs to which L3 or SMT sibling, use `lstopo` and check the `P#`. This is important because we don't want to bind two tasks onto the same pair of SMT siblings; I bind each task to the two sibling threads of one core.
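If you don't have `lstopo` handy, a rough alternative (assuming a reasonably recent util-linux) is `lscpu -e`: CPUs sharing the same CORE value are SMT siblings, and the last id in the CACHE column should correspond to the L3:
lscpu -e=CPU,CORE,CACHE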

Putting it together, I have a script invoked from crontab every 10 minutes. Make sure you tune the `i=2`, `$i,$(($i+8))` and `i=$(($i+1))` parts to match your topology; they control how each task gets assigned to cores.

#!/bin/bash

# Assign each oifs task a pair of SMT sibling threads; tune the offsets to your topology.
i=2
for pid in $(pgrep oifs_43r3_model | sort -n); do
        taskset -apc $i,$(($i+8)) $pid
        i=$(($i+1))
done
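The crontab entry is just the usual every-10-minutes schedule; run it from a crontab that has permission to change the affinity of the boinc processes (root's, or the boinc user's). The path below is only an example:
# example path, adjust to wherever you saved the script
*/10 * * * * /home/boinc/affine_oifs.sh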

If your host is native Linux, you can stop here. For Windows, the "Details" tab in Task Manager will let you set affinity. I use that to set affinity for the VMware process, while also pinning threads inside the Linux guest as above. It seems to be a 1:1 mapping, at least for VMware Workstation 17. (I can't figure out how to verify this other than looking at per-core usage on the Windows host, which roughly matches the Linux guest for the affected cores. This is a bit handwavy.)
Meanwhile, to bind everything else away, I use a PowerShell loop. `$names` are the other BOINC process names I want to bind away from the cores used by the VM. `$cpumask` is the CPU mask in decimal. Make sure you change these for your needs.
$names = @('milkyway_nbody_orbit_fitting_1.87_windows_x86_64__mt','wcgrid_mcm1_map_7.61_windows_x86_64','einstein_O3AS_1.07_windows_x86_64__GW-opencl-nvidia-2')
$cpumask = 4294950915  # 0xFFFFC003

While ($true) {
        @(Get-Process $names) | ForEach-Object { $_.ProcessorAffinity = $cpumask }
        Start-Sleep -Seconds 300
}
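For reference, the decimal value is just the hex mask converted. A quick way to do that conversion from a Linux (or WSL) shell, since I couldn't get PowerShell to take the hex directly (see the PS below):
printf '%d\n' 0xFFFFC003    # prints 4294950915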

PS: I don't really do any programming on Windows. Can someone please tell me how to get PowerShell to accept hex? The `0x` prefix is supposed to work according to the documentation, but I get a `SetValueInvocationException` if I use hex.
ID: 71000
Glenn Carver

Joined: 29 Oct 17
Posts: 1051
Credit: 16,656,265
RAC: 10,640
Message 71002 - Posted: 17 Jun 2024, 11:23:07 UTC - in response to Message 71000.  
Last modified: 17 Jun 2024, 11:24:08 UTC

Worth adding that hugepages are beneficial because it can reduce TLB misses (translation lookaside buffer); essentially a TLB miss means accessing data from next level down storage (whatever that might be).

Have you tried RAMdisks? They can make a huge improvement at the increased risk you lose everything if the machine goes down unexpectedly.

What size hugepages are you using? We would normally test enabling hugepages on HPC jobs. However, just on the batch jobs, not on the entire machine. Also, setting it too high could slow the code down. It has to be tested as you've done. I'd want to be sure it's not adversely affecting the rest of the machine though.

I have played with task affinity using task manager on Windows 11 but it made no difference (unless Task Manager was lying to me). When I get more time I'll have another go.
---
CPDN Visiting Scientist
ID: 71002
SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 71004 - Posted: 17 Jun 2024, 14:21:01 UTC - in response to Message 71002.  

Worth adding that hugepages are beneficial because it can reduce TLB misses (translation lookaside buffer); essentially a TLB miss means accessing data from next level down storage (whatever that might be).


Not really - it's not having to access data from the next level down storage, it's having to do a (probably partial) page table walk, which is memory accesses that aren't really accomplishing anything (other than working out the virtual to physical mappings... which kind of have to happen for any other accesses to be able to happen). Depending on how the processor's TLB is arranged, and if it has a fixed number of mappings for large pages, it may or may not help a lot, but it's certainly worth trying, and I'd expect some performance improvements, as have been shown. But one can build a processor TLB design where enabling large pages hurts. I'm just not sure how modern x86 chips are doing things these days...


Have you tried RAMdisks? They can make a huge improvement at the increased risk you lose everything if the machine goes down unexpectedly.


Are any of the CPDN tasks actually disk bound with even a slow SSD? I don't see much in the way of disk accesses that strike me as "something improved by a ramdisk," though chewing up the RAM I have would certainly reduce the number of tasks I can run. I need to upgrade the RAM in a few of my boxes...
ID: 71004
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 71006 - Posted: 17 Jun 2024, 14:42:32 UTC - in response to Message 71004.  

Are any of the CPDN tasks actually disk bound with even a slow SSD? I don't see much in the way of disk accesses that strike me as "something improved by a ramdisk


This is what my main Linux machine is doing. No CPDN tasks available at the moment. All my BOINC data are on a 7200 rpm spinning hard drive, on a partition all its own. The other partitions on that drive are seldom used (mainly videos).

Notice there are 14 total tasks running: 13 Boinc tasks, and the boinc client. The machine is also running Firefox where I am typing this, but I do not type fast enough to put a noticeable load on the 16-core machine.

top - 10:29:07 up 11 days, 22:57,  2 users,  load average: 13.15, 13.43, 14.00
Tasks: 473 total,  14 running, 459 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us,  0.1 sy, 80.9 ni, 18.3 id,  0.0 wa,  0.2 hi,  0.0 si,  0.0 st
MiB Mem : 128086.0 total,   1816.4 free,   6809.1 used, 119460.6 buff/cache
MiB Swap:  15992.0 total,  15848.5 free,    143.5 used. 117967.0 avail Mem 

    PID    PPID USER      PR  NI S    RES  %MEM  %CPU  P     TIME+ COMMAND                                                                   
2250093    5542 boinc     39  19 R 377464   0.3  99.3  2 383:13.99 ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.05_x86_64-pc-li+ 
2279983    5542 boinc     39  19 R 376732   0.3  99.6 15 130:59.06 ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.05_x86_64-pc-li+ 
2286174    5542 boinc     39  19 R 376492   0.3  99.5  6  87:23.78 ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.05_x86_64-pc-li+ 
2288140    5542 boinc     39  19 R 213000   0.2  99.5 13  72:19.73 ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP4G_1.33_x86_64-pc+ 
2288142    5542 boinc     39  19 R 212912   0.2  99.6  2  72:19.53 ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP4G_1.33_x86_64-pc+ 
   5542       1 boinc     30  10 S  46240   0.0   0.1  5 233487:01 /usr/bin/boinc                                                            
2295746    5542 boinc     39  19 R  40780   0.0  99.5  7  16:07.92 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ 
2286177    5542 boinc     39  19 R  40200   0.0  99.5  0  87:34.88 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ 
2295175    5542 boinc     39  19 R  39228   0.0  99.5  1  20:11.83 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ 
2287722    5542 boinc     39  19 R  38920   0.0  99.5  3  74:56.89 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ 
2289384    5542 boinc     39  19 R  38900   0.0  99.5  4  67:00.71 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ 
2287207    5542 boinc     39  19 R   2528   0.0  99.7  9  67:00.94 ../../projects/denis.usj.es_denisathome/HuVeMOp_0.02_x86_64-pc-linux-gnu  
2294141    5542 boinc     39  19 R   2508   0.0  99.5 12  28:41.28 ../../projects/denis.usj.es_denisathome/HuVeMOp_0.02_x86_64-pc-linux-gnu  
2294649    5542 boinc     39  19 R   2508   0.0  99.7  5  25:35.86 ../../projects/denis.usj.es_denisathome/HuVeMOp_0.02_x86_64-pc-linux-gnu  


ID: 71006
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 43,261,443
RAC: 72,712
Message 71007 - Posted: 17 Jun 2024, 17:05:33 UTC - in response to Message 71002.  
Last modified: 17 Jun 2024, 17:05:50 UTC

Worth adding that hugepages are beneficial because it can reduce TLB misses (translation lookaside buffer); essentially a TLB miss means accessing data from next level down storage (whatever that might be).

SolarSyonyk had it right. The benefit is not necessarily from reducing next-level accesses. The entire page walk can hit in cache and still hurt performance a lot. A TLB miss effectively means that specific memory access is blocked because it needs the physical address first, so whatever latency the page walk incurs is on top of the normal hit or miss for the data once the address is available. If the page walk itself also misses in cache, that compounds and destroys performance quickly. Modern microarchitectures have hardware page walkers that try to get ahead and hide the latency too. Still, TLB misses are to be avoided as much as possible for memory-intensive applications, which is how huge pages help: each TLB entry covers a much larger area of memory. The kernel doc page explains it succinctly if anyone is interested: https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html
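If anyone wants to see the TLB behavior on their own box, a rough sketch (assuming `perf` is installed and the CPU exposes the generic dTLB events; the `pgrep -o` just picks one running oifs task) would be:
sudo perf stat -e dTLB-loads,dTLB-load-misses -p $(pgrep -o oifs_43r3_model) sleep 60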

Have you tried RAMdisks? They can make a huge improvement at the increased risk you lose everything if the machine goes down unexpectedly.

This is the opposite of what one wants to do here. The allocation this workload does is anonymous memory, and a ramdisk can only potentially help with file-backed memory. And in any scenario where memory capacity or bandwidth is already the bottleneck, the last thing you want is to force files, or even pages not in the hot path, into memory.

What size hugepages are you using? We would normally test enabling hugepages on HPC jobs. However, just on the batch jobs, not on the entire machine. Also, setting it too high could slow the code down. It has to be tested as you've done. I'd want to be sure it's not adversely affecting the rest of the machine though.

I'm enabling the transparent huge page (THP) feature in the kernel, and AFAIK it only uses 2MB huge pages. For applications we control, we use a combination of 2MB and 1GB pages in production because we can ensure the application only requests the sizes it needs. Here, however, I have no control over the application's memory allocation calls, so THP is the only thing I can do. Another concern with THP is wasted memory within huge pages causing additional OOMs, which I didn't observe even when I only had about 1GB of headroom going by 5GB per job. Empirically that makes sense: the memory swings of oifs are hundreds of MB at a time, so 2MB pages shouldn't result in many partially used pages.
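(For reference, the explicit hugepage sizes the kernel exposes can be listed as below; on a typical x86-64 box you'd see hugepages-2048kB and, if the CPU supports 1G pages, hugepages-1048576kB. This is separate from the THP mechanism discussed here.)
ls /sys/kernel/mm/hugepages/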

FWIW, these are the current stats on my 7950X3D VM with 32G of memory. More than half of the memory is covered by 2MB pages, and the low split counts should mean the huge pages were actively put to good use over its lifetime.
$ egrep 'trans|thp' /proc/vmstat
nr_anon_transparent_hugepages 9170
thp_migration_success 0
thp_migration_fail 0
thp_migration_split 0
thp_fault_alloc 190586370
thp_fault_fallback 12973323
thp_fault_fallback_charge 0
thp_collapse_alloc 8711
thp_collapse_alloc_failed 1
thp_file_alloc 0
thp_file_fallback 0
thp_file_fallback_charge 0
thp_file_mapped 0
thp_split_page 13881
thp_split_page_failed 0
thp_deferred_split_page 12984
thp_split_pmd 27158
thp_scan_exceed_none_pte 18
thp_scan_exceed_swap_pte 23689
thp_scan_exceed_share_pte 0
thp_split_pud 0
thp_zero_page_alloc 2
thp_zero_page_alloc_failed 0
thp_swpout 0
thp_swpout_fallback 13872

$ grep Huge /proc/meminfo 
AnonHugePages:  18757632 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
ID: 71007
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 43,261,443
RAC: 72,712
Message 71008 - Posted: 17 Jun 2024, 17:23:19 UTC - in response to Message 71002.  
Last modified: 17 Jun 2024, 17:29:02 UTC

Have you tried RAMdisks? They can make a huge improvement at the increased risk you lose everything if the machine goes down unexpectedly.

Oh, I think I get what you were getting at now. Are you referring to the data/checkpoints written to disk? I assumed they are not application blocking, since the application writes just get buffered by the page cache and flushed to disk asynchronously by the kernel.

If the oifs job actually waits for the flush like database applications do, then it could matter in some cases. As far as I can tell, each oifs task writes ~50GB to disk. Assuming a 5 hour runtime on a fast machine, that's ~3MB/s on average, but it all happens as periodic spikes of large sequential writes. On spinning rust with 100MB/s of write bandwidth, I guess that could be ~3% of the time spent on disk writes, and worse with multiple tasks if they are not splayed. It's likely not worth considering for SSDs (especially NVMe ones) even if the writes are synchronous, and all my hosts use SSDs for applications...
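Back-of-the-envelope, for anyone plugging in their own numbers (assuming the ~50GB per task and 5 hour runtime above):
awk 'BEGIN { printf "%.1f MB/s average\n", 50*1024/(5*3600) }'    # ~2.8 MB/s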
ID: 71008
Glenn Carver

Joined: 29 Oct 17
Posts: 1051
Credit: 16,656,265
RAC: 10,640
Message 71009 - Posted: 17 Jun 2024, 19:20:01 UTC - in response to Message 71008.  
Last modified: 17 Jun 2024, 19:25:50 UTC

Have you tried RAMdisks? They can make a huge improvement at the increased risk you lose everything if the machine goes down unexpectedly.
Oh, I think I get what you were getting at now. Are you referring to the data/checkpoints written to disk? I assumed they are not application blocking, since the application writes just get buffered by the page cache and flushed to disk asynchronously by the kernel. ...

OIFS will wait programmatically until the write completes in the configuration we use for CPDN. That includes the model output and the restart/checkpoint files. In tests I've found the model can slow down between 5-10% depending on exactly how much is written in model results. That's compared to a test that doesn't write anything. I've not tested using RAMdisk on the desktop, only when I was working in HPC.

p.s. forgot to add that we usually used 4MB for hugepages when I was employed!

pp.s. thx for correcting me on tlb misses!
---
CPDN Visiting Scientist
ID: 71009
wujj123456

Joined: 14 Sep 08
Posts: 127
Credit: 43,261,443
RAC: 72,712
Message 71010 - Posted: 17 Jun 2024, 19:50:25 UTC - in response to Message 71009.  

OIFS will wait programmatically until the write completes in the configuration we use for CPDN. That includes the model output and the restart/checkpoint files. In tests I've found the model can slow down between 5-10% depending on exactly how much is written in model results. That's compared to a test that doesn't write anything. I've not tested using RAMdisk on the desktop, only when I was working in HPC.

Thanks for the details. Suddenly splaying tasks at their initial start seems worth the hassle, especially if I play with those cloud instances again next time. I guess this could be one of the reasons why running larger VMs off the same disk slowed oifs down, since the network disk had pretty low fixed bandwidth. :-(

p.s. forgot to add that we usually used 4MB for hugepages when I was employed!

Must be one of those interesting non-x86 architectures back then. AFAIK, x86 only supports 4K, 2M and 1G pages. Was that SPARC? :-P

I more or less feel x86 is held back a bit by the 4K base page. Apple M* chips use 16K pages, and a lot of aarch64 benchmarks are published with a 64K page size. One vendor we work with for data center workloads refused to support 4KB pages for their aarch64 implementation at all, for performance reasons. ¯\_(ツ)_/¯
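(The base page size a given box runs with is easy to check; on a typical x86-64 Linux install this prints 4096.)
getconf PAGESIZE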
ID: 71010
SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 71011 - Posted: 17 Jun 2024, 21:58:55 UTC - in response to Message 71010.  

Must be one of those interesting non-x86 architectures back then. AFAIK, x86 only supports 4K, 2M and 1G pages. Was that SPARC? :-P


No, 32-bit x86 had 4MB large pages. A page directory entry maps 4MB on 32-bit x86 (a 4K page table holds 1024 32-bit entries), but only 2MB on 64-bit x86 (512 entries per 4K page table, because they're 64-bit entries instead of 32-bit entries).


I more or less feel x86 is held back by the 4K pages a bit. Apple M* is using 16K pages, and a lot of aarch64 benchmarks are published with 64K page size. Some vendor we work with for data center workload refused to support 4KB pages for their aarch64 implementation at all due to performance reason. ¯\_(ツ)_/¯


It just depends on what you're doing. I doubt x86 will ever move away from 4KB pages; there's too much that assumes that implicitly. Large pages get you a lot, but I'm not sure how much it matters anymore, with some of the (probably leaky...) TLB optimizations on x86 chips.

ARMv8/AArch64 is a whole heck of a lot more flexible, though.
ID: 71011
