Posts by wujj123456

1) Message boards : Number crunching : New Work Announcements 2024 (Message 71048)
Posted 16 days ago by wujj123456
Just wondering whether to abort the one that I've got. It's survived 2 restarts so far, so if there is still a problem, its luck must run out soon.

From my experience, either the WUs quickly fail on restart, or they are just fine. In my previous 8.24 batches, every WU that survived an unexpected restart completed successfully. So I see no reason to abort them while they are happily crunching.
2) Message boards : Number crunching : Thread affinity and transparent huge pages benefits (Message 71010)
Posted 24 days ago by wujj123456
OIFS will wait programmatically until the write completes in the configuration we use for CPDN. That includes the model output and the restart/checkpoint files. In tests I've found the model can slow down between 5-10% depending on exactly how much is written in model results. That's compared to a test that doesn't write anything. I've not tested using RAMdisk on the desktop, only when I was working in HPC.

Thanks for the details. Suddenly splaying tasks at initial start seems worth the hassle, especially if I play with those cloud instances again next time. I guess this could also be one of the reasons why running larger VMs off the same disk slowed down oifs, since the network disk had a pretty low fixed bandwidth. :-(

p.s. forgot to add that we usually used 4Mb for hugepages when I was employed!

Must be one of those interesting non-x86 architectures back then. AFAIK, x86 only supports 4K, 2M and 1G pages. Is that SPARC? :-P

I more or less feel x86 is held back a bit by its 4K base pages. Apple M* uses 16K pages, and a lot of aarch64 benchmarks are published with a 64K page size. One vendor we work with for data center workloads refused to support 4KB pages in their aarch64 implementation at all, for performance reasons. ¯\_(ツ)_/¯
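For anyone curious, the base page size a system is actually running with is a one-liner to check (just a sanity check I'm adding here, not something from the original post):

```shell
# Print the base page size in bytes: 4096 on x86, 16384 on Apple M* machines
getconf PAGESIZE
```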
3) Message boards : Number crunching : Thread affinity and transparent huge pages benefits (Message 71008)
Posted 24 days ago by wujj123456
Have you tried RAMdisks? They can make a huge improvement at the increased risk you lose everything if the machine goes down unexpectedly.

Oh, I think I get what you were getting at now. Are you referring to the data/checkpoint files written to disk? I assumed they are not application-blocking, since application writes just get buffered by the page cache and flushed to disk asynchronously by the kernel.

If an oifs job actually waits for the flush like database applications do, then it could matter in some cases. AFAICT, each oifs task writes ~50GB to disk. Assuming a 5 hour runtime on a fast machine, that's ~3MB/s on average, but it all happens as periodic spikes of large sequential writes. If it's spinning rust with 100MB/s write bandwidth, I guess it could be 3% of time spent on disk writes, and worse with multiple tasks if they are not splayed. Likely not worth considering for SSDs (especially NVMe ones) even if the writes are synchronous, and all my hosts use SSDs for applications...
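The back-of-envelope number above can be reproduced in one line, using the rough figures from the post (~50GB written over a 5 hour run):

```shell
# Average write rate in MB/s for ~50 GB spread over 5 hours
awk 'BEGIN { printf "%.1f\n", 50 * 1024 / (5 * 3600) }'
```

which comes out just under the ~3MB/s quoted above.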
4) Message boards : Number crunching : Thread affinity and transparent huge pages benefits (Message 71007)
Posted 24 days ago by wujj123456
Worth adding that hugepages are beneficial because it can reduce TLB misses (translation lookaside buffer); essentially a TLB miss means accessing data from next level down storage (whatever that might be).

SolarSyonyk had it right. The benefit is not necessarily from reducing next-level accesses. The entire page walk can hit in cache and still hurt performance a lot. A TLB miss effectively means that a specific memory access is blocked because it needs the physical address first. Whatever latency the page walk incurs is added on top of the normal hit or miss for the data once the address is available. If the page walk itself also misses in cache, the effects compound and destroy performance quickly. Modern micro-architectures have hardware page walkers that try to get ahead and hide the latency too. Still, TLB misses are to be avoided as much as possible for memory-intensive applications, and huge pages help by covering a much larger area of memory per TLB entry. The kernel doc page explains it succinctly if anyone is interested:
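To put a rough number on "covering a much larger area per TLB entry", here's a toy calculation; the 1536-entry TLB size is a hypothetical figure picked purely for illustration, not something from the post:

```shell
# TLB reach = number of entries x page size, for a hypothetical 1536-entry TLB
entries=1536
echo "4K pages: $(( entries * 4 / 1024 )) MB of reach"
echo "2M pages: $(( entries * 2 )) MB of reach"
```

Same TLB, roughly 500x more memory covered before a miss.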

Have you tried RAMdisks? They can make a huge improvement at the increased risk you lose everything if the machine goes down unexpectedly.

This is the opposite of what one wants to do here. The allocation this workload does is anonymous memory, and a ramdisk can only potentially help with file-backed memory. Moreover, in any scenario where memory capacity or bandwidth is already the bottleneck, the last thing you want is forcing files, or any pages not on the hot path, into memory.

What size hugepages are you using? We would normally test enabling hugepages on HPC jobs. However, just on the batch jobs, not on the entire machine. Also, setting it too high could slow the code down. It has to be tested as you've done. I'd want to be sure it's not adversely affecting the rest of the machine though.

I'm enabling the transparent huge page (THP) feature in the kernel, and AFAIK it only uses 2MB huge pages. For applications we control, we use a combination of 2MB and 1GB pages in production, because we can ensure the application only requests the sizes it needs. Here, however, I have no control over the application's memory allocation calls, so THP is the only thing I can do. Another concern with THP is wasted memory within huge pages causing additional OOMs, which I didn't observe even when I had only about 1GB of headroom, going by 5GB per job. Empirically that makes sense given that the memory swing of oifs is hundreds of MB at a time, so 2MB pages shouldn't result in many partially used pages.
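In case anyone wants to check which THP mode their kernel is currently in before changing anything (standard sysfs path; the bracketed word is the active mode):

```shell
# Shows e.g. "always [madvise] never" - madvise is the usual distro default
cat /sys/kernel/mm/transparent_hugepage/enabled
```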

FWIW, these are the current stats on my 7950X3D VM with 32G of memory. More than half of the memory is covered by 2MB pages, and the low split stats should mean the huge pages were actively put to good use during their lifetime.
$ egrep 'trans|thp' /proc/vmstat
nr_anon_transparent_hugepages 9170
thp_migration_success 0
thp_migration_fail 0
thp_migration_split 0
thp_fault_alloc 190586370
thp_fault_fallback 12973323
thp_fault_fallback_charge 0
thp_collapse_alloc 8711
thp_collapse_alloc_failed 1
thp_file_alloc 0
thp_file_fallback 0
thp_file_fallback_charge 0
thp_file_mapped 0
thp_split_page 13881
thp_split_page_failed 0
thp_deferred_split_page 12984
thp_split_pmd 27158
thp_scan_exceed_none_pte 18
thp_scan_exceed_swap_pte 23689
thp_scan_exceed_share_pte 0
thp_split_pud 0
thp_zero_page_alloc 2
thp_zero_page_alloc_failed 0
thp_swpout 0
thp_swpout_fallback 13872

$ grep Huge /proc/meminfo 
AnonHugePages:  18757632 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
5) Message boards : Number crunching : Thread affinity and transparent huge pages benefits (Message 71000)
Posted 25 days ago by wujj123456
Finally got enough WUs to make some nice plots while messing with optimizations. The benefits of affining threads and enabling transparent huge pages are expected for memory-intensive workloads, so this is just putting some quantitative numbers on empirical expectations for this specific oifs batch.

The vertical axis is task runtime in seconds. The horizontal axis is one point per sample, ordered by return time from oldest to latest.

One is the 7950X, and you can easily see when I started doing the optimization, around sample 60. It reduced runtime by ~7-8%. (Samples ~44-58 are when I gambled with 13 tasks on the 64G host. While nothing errored out, it was not a bright idea for performance either. These points were excluded from the percentage calculation.)

This one is more complicated. It's 7950X3D, but Linux VM on Windows. I run 6 oifs tasks in the VM, and 16 tasks from other projects on Windows. The dots around sample 30, 42, 58 are peak hours where I paused Windows boinc but didn't pause the VM.

The setup is a 6C/12T VM bound to the X3D cluster, cores [2,14) on Windows. The first drop, around sample 40, is from enabling huge pages and affining threads inside the VM. That's about a 10% improvement, whether I compare non-peak or peak samples. The second drop, around sample 70, is when I started affining Windows boinc tasks away from the VM cores. Now it's getting pretty close to the peak samples where I only run the 6 oifs tasks inside the VM without the 16 Windows tasks.

Appendix - Sharing the simple commands and code

Enabling huge pages at run time. Only effective for the current boot. You can set `transparent_hugepage=always` on the kernel cmdline to make it persist across boots, but how you do that is distro-dependent, so I'm leaving that out.
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

Verify huge page usage. Some huge pages may exist even before this due to the default madvise mode, but the number will increase significantly once you set it to `always`, which allows the kernel to combine pages as it sees fit. AFAICT, most oifs usage is covered by huge pages just going by the rough numbers.
grep Huge /proc/meminfo

Affinitizing threads on Linux is done through `taskset`.
# Set pid 1234 to core 0-1
sudo taskset -apc 0,1 1234

To find out your CPU topology, such as which CPU number belongs to which L3 or SMT sibling, use `lstopo` and check the `P#`. This is important because we don't want to bind two tasks onto SMT siblings. I bind each task to a pair of sibling threads.
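If `lstopo` isn't available, the same sibling information can be read straight from sysfs (these are standard Linux paths, so this should work on any recent kernel):

```shell
# One line per physical core, e.g. "0,8" means CPU 0 and CPU 8 are
# SMT siblings of the same core.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    cat "$cpu/topology/thread_siblings_list"
done | sort -un
```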

Putting it all together, I have a script invoked by crontab every 10 minutes. Make sure you tune the `i=2`, `$i,$(($i+8))` and `i=$(($i+1))` to match your topology. They control how each task gets assigned to cores.


i=2
for pid in $(pgrep oifs_43r3_model | sort); do
        taskset -apc $i,$(($i+8)) $pid
        i=$(($i+1))
done
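For completeness, the crontab entry driving it could look like this; the script path is hypothetical, adjust it to wherever you saved the script:

```shell
# m h dom mon dow  command
*/10 * * * * /home/boinc/affine-oifs.sh
```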

If your host is native Linux, you can stop here. For Windows, the "details" tab in Task Manager lets you set affinity per process. I use that to set affinity for the vmware process, while also affining threads inside the Linux guest as above. It seems to be a 1:1 mapping, at least for VMware Workstation 17. (I can't figure out how to verify this other than looking at per-core usage on the Windows host, which roughly matches the Linux guest for the affected cores. This is a bit handwavy.)
Meanwhile, to bind everything else away, I use a PowerShell script loop. `$names` are the other boinc process names I want to bind away from the cores used by the VM. `$cpumask` is the decimal value of the CPU mask. Make sure you change these for your needs.
$names = @('milkyway_nbody_orbit_fitting_1.87_windows_x86_64__mt','wcgrid_mcm1_map_7.61_windows_x86_64','einstein_O3AS_1.07_windows_x86_64__GW-opencl-nvidia-2')
$cpumask = 4294950915  # 0xFFFFC003

While ($true) {
        @(Get-Process $names) | ForEach-Object { $_.ProcessorAffinity = $cpumask }
        Start-Sleep -Seconds 300
}
PS: I don't really do any programming on Windows. Could someone tell me how to get PowerShell to accept hex? The `0x` prefix is supposed to work according to the documentation, but I get `SetValueInvocationException` if I use hex.
6) Message boards : Number crunching : One of my oifs_43r3_bl_1018 taskss errored out. (Message 70998)
Posted 25 days ago by wujj123456
I have a couple of interesting ones that I had to abort. Upon reaching 99.98% or so, they just never finished, with the time left continuing to count hours into negative territory. For one of them, I checked `ps` and the oifs process had actually already exited. I originally thought it was specific to one host, until another host got a similar result. However, the resends were successful. It's unclear to me what went wrong with them. Perhaps something in the wrapper that handles the final results?

It's pretty rare though, affecting ~1% of my WUs so far. Just a bit annoying to babysit because I need to abort them manually...
7) Message boards : Number crunching : Batch 1017 Errors (Message 70985)
Posted 28 days ago by wujj123456
I'd run more WUs but I get this mysterious missive and WUs stop coming: "This computer has finished a daily quota of 1 tasks"

AFAIK, this is the server-side work-issuing logic trying to protect against faulty hosts that always error out. If a host returns error results, the quota is reduced until it becomes 1. Once a task finishes successfully, the quota is lifted and you can get more WUs.

This happened to me when this fixed batch initially started, because a few days prior every result had been an error. All my hosts that took part in that round had to finish 1 WU first before getting more tasks as usual. Meanwhile, I happened to have one host that didn't get any WUs last time, and it was able to fetch more work right off the bat.
8) Message boards : Number crunching : New Work Announcements 2024 (Message 70971)
Posted 8 Jun 2024 by wujj123456
Probably because there are no more linux tasks available, according to the server status. I have stopped resends for batch 1017, otherwise we'll be swamped by always failing tasks.

Thanks. Oops, I read the wrong column and thought tasks were still available. Guess I'll wait for the next batch of fun while figuring out how not to be upload-bandwidth-limited next time... :-)
9) Message boards : Number crunching : New Work Announcements 2024 (Message 70968)
Posted 8 Jun 2024 by wujj123456
A different topic: are there any criteria gating which clients can get new tasks? Most of my Linux machines are happily crunching, except one host that I've migrated from a physical disk to a VM. I've since reset the project and waited out the 1 hour update interval many times, but each time I still get a reply of no new tasks. I also tried uninstalling boinc, clearing the data directory and installing again. That didn't help either, though the new client got associated with the same host id, so if it's some server-side filtering it won't make a difference anyway.
10) Message boards : Number crunching : New Work Announcements 2024 (Message 70967)
Posted 8 Jun 2024 by wujj123456
Ah sorry, I should have explained. It's not a time series but a histogram. It samples the RSS usage over 10 minutes at a rate of one sample per second and groups the samples into buckets. RSS is whatever is shown by `ps`. The numbers on the left are the recorded RSS values, divided into equal buckets. The number on the right of each bar is the number of samples that fall into that bucket. The percentage is the cumulative share of samples that fall into this bucket and below. The stars are just visualization. You can think of this graph as a CDF rotated by 90 degrees.

Yes, the actual memory allocation pattern is as you described. My goal with this little script is to figure out the range of RSS this task actually uses over time, so that I can set the concurrent task limit correctly.
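The original script isn't posted here, so below is a much-simplified sketch of the idea; the `sample_rss` name and the min/max summary are mine (the real script buckets the samples into a histogram instead):

```shell
# sample_rss PID SECONDS: poll a process's RSS (in KB, as reported by ps)
# once per second for SECONDS seconds, then print the min and max observed.
sample_rss() {
    pid=$1
    secs=${2:-600}   # default: 10 minutes
    i=0
    while [ "$i" -lt "$secs" ]; do
        ps -o rss= -p "$pid" || break   # stop if the process exits
        i=$((i + 1))
        sleep 1
    done | sort -n | awk 'NR == 1 { min = $1 }
                          { max = $1 }
                          END { printf "min=%s KB max=%s KB\n", min, max }'
}
```

Usage would be something like `sample_rss "$(pgrep oifs_43r3_model | head -1)"` to watch the first running oifs task.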
11) Message boards : Number crunching : New Work Announcements 2024 (Message 70965)
Posted 8 Jun 2024 by wujj123456
Ouch, I just read the "Batch 1017 Errors" post. I didn't know batch numbers are shared across apps and thought it must be a continuation of the WAH batches, so I skipped the post... Sorry for the duplicates.

On the other hand, same as observed by Jean-David Beyer, the RSS usage is not capped at 3.5GB. This is what a normal OIFS task looks like when I collect its RSS every second for 10 minutes.
2311604 - 2488436: ************** (82, 13.8%)
2488437 - 2665269: **************** (15, 16.3%)
2665270 - 2842101: ******************** (19, 19.5%)
2842102 - 3018934: *********************** (20, 22.9%)
3018935 - 3195766: ************************** (21, 26.4%)
3195767 - 3372599: ****************************** (20, 29.8%)
3372600 - 3549431: ******************************** (16, 32.5%)
3549432 - 3726264: ********************************** (10, 34.2%)
3726265 - 3903097: ************************************ (11, 36.0%)
3903098 - 4079929: ********************************************************************************** (272, 81.8%)
4079930 - 4256762: *********************************************************************************** (8, 83.2%)
4256763 - 4433594: ************************************************************************************* (9, 84.7%)
4433595 - 4610427: ************************************************************************************** (8, 86.0%)
4610428 - 4787259: **************************************************************************************** (12, 88.0%)
4787260 - 4964092: ****************************************************************************************** (13, 90.2%)
4964093 - 5140925: **************************************************************************************************** (58, 100.0%)
12) Message boards : Number crunching : New Work Announcements 2024 (Message 70957)
Posted 8 Jun 2024 by wujj123456
I got the same failure too:

It seems that the calculation happily finished at but the result is expecting more? This is on a machine that has enough memory, runs no other projects, and has never paused the WU.
13) Message boards : Number crunching : A performance oddity. (Message 70642)
Posted 12 Mar 2024 by wujj123456
Does CPDN use system libraries extensively for calculation? If so, it could simply be that 24.04 is actually faster. IIRC, Ubuntu 24.04 is experimenting with the x86-64-v3 target while anything older uses baseline x86-64. AVX2 can make this kind of difference under the right conditions, though 20% does sound a bit too good to be true. However, I haven't tried 24.04, so I'm not sure if they've rolled x86-64-v3 out to test images already.
14) Message boards : Number crunching : WaH v8.29 bug leaves files behind in BOINC/data/projects/climateprediction -- please delete by hand (Message 70553)
Posted 24 Feb 2024 by wujj123456
This seems to be minimal compared to the hundreds of MB that crashed tasks leave behind. It's probably easier to just reset the project once I'm out of work, unless my disk space is running short. Glad more improvements are coming too.
15) Message boards : Number crunching : Trickles stop new work arriving (Message 70308)
Posted 3 Feb 2024 by wujj123456
The Boinc client's scheduling leaves a lot to be desired, honestly. The trick I use in this situation is to set a low-priority project's share to 0 whenever work shows up for high-priority projects. That way, the boinc client will only fetch the minimal number of tasks to fill all the cores, not the full buffer. The next time CPDN updates, it will request new work. It's not perfect, but at least I only need to manage the project shares occasionally, given how sporadic CPDN work is.
16) Message boards : Number crunching : New Work Announcements 2024 (Message 70293)
Posted 2 Feb 2024 by wujj123456
Setting the defaults to 1-2 and resetting all current preferences initially is reasonable to me, but I really hope overrides will be honored afterwards. This solves the problem of people never reading the forums, while allowing people who pay attention to use more cores on bigger machines once they have app_config updated.
One caveat is that the setting is global, so it would also negatively affect WAH and HadAM4 even though they don't face the same memory problem. Other than WCG, I haven't seen per-app max job settings. I suppose it won't be a trivial change to implement on the server side, but if we could, that would be the best IMO.
17) Message boards : Number crunching : Batches closed (Message 70292)
Posted 2 Feb 2024 by wujj123456
This is a bad policy, and needs to be changed. If the project is not going to use the results of any tasks out in the wild, they should be aborted from the server side. Otherwise electricity and time are just being wasted. Seems contrary to the goal of CPDN, right?

Agreed. I'd rather the server abort these immediately. Given the trickling mechanism, even if it's about credits, it won't affect people's contributions. I can't think of a reason for not aborting them...
18) Message boards : Number crunching : New Work Announcements 2024 (Message 70266)
Posted 2 Feb 2024 by wujj123456
What kind of preference setting were you thinking of?

I read "there will be a limit of either one or two from the server" as: even if someone's cpdn project preference sets max # of jobs to "no limit", they will still be limited to 1 or 2 OpenIFS tasks per host. So I wonder how one can get more tasks on a host with a lot of memory, without resorting to multi-client or VMs.

If I remembered wrong, and the default max # of jobs in preferences is 1 or 2 but you will continue to honor that setting when it's "no limit", then the setting I was asking for already exists.
19) Message boards : Number crunching : New Work Announcements 2024 (Message 70263)
Posted 1 Feb 2024 by wujj123456
The major problems as always will come not from those who read the noticeboards but from the set and forget brigade.

Will there be a preference setting that one can override for people that actively monitor the output and have bigger machines?
20) Message boards : Number crunching : Multithread - why not? (Message 70243)
Posted 31 Jan 2024 by wujj123456
Shame no one is doing a fork of the code. Beyond me but making the client respect the maximum memory usage rather than some sort of average can't be the most difficult of jobs. Maybe putting it in as a request on git-hub rather than going direct to David? (I am probably showing my ignorance of the politics of this but hey ho?)

I'm also naive about the politics, but even technically, once forked, the new code base has to be maintained by someone forever. It's probably not a great investment to fork an entire code base just for a single feature request...
