|
1)
Message boards :
Number crunching :
How to optimize credit production with OpenIFS tasks
(Message 68109)
Posted 29 Jan 2023 by wujj123456 Post: This article has an interesting discussion about codes, including weather models, running on HPC machines and what kind of hardware is desirable. (Open)IFS was designed from the ground-up for HPC and ECMWF provide it as a benchmark model for HPC vendors. The article is generally reasonable, but seems to have misunderstood the server market line-ups provided by both vendors. 1. Those high core count SKUs were not for HPC at first place. For scale-out workload that Bergamo is targeting, higher core count is the king. There are separate high frequency lower core count parts designed for HPC that still provides full memory size and bandwidth. AMD makes it clear in marketing material, though you likely have to sort through Intel's SKU list to find the right parts and hope they didn't gut the memory capability for lower core count ones. Specifically, for this “It’d be much nicer to have an 8 core or 16-core part at a 4-plus GHz kind of frequency”. On AMD side, parts with F suffix directly tackle this market. This has been the case since Zen 2 and Genoa is no exception. Those 9*74F are meant for this. Intel has similar offerings. 2. On bandwidth side, it's a universal pain shared by not only HPC, but all workloads, including the scale-out ones our company mostly runs. Whatever you see, whether it's HBM or DDR5 on top parts are the max configuration manufacturing can achieve for now. Everyone is asking for more memory bandwidth since we've hit the bandwidth wall, but we are all limited by physics. CXL memory is the answer industry is pursing, but we likely need another generation or two for CXL3 ecosystem to mature. I suspect HPC applications that are no stranger to NUMA could easily reap the benefit. In a few years, we probably would end up having SRAM (L1/L2/L3) -> HBM -> DDR -> DDR on CXL. Each step going down the hierarchy would likely be an order of magnitude of more capacity at the expense of almost doubling latency. |
2)
Message boards :
Number crunching :
The uploads are stuck
(Message 68100)
Posted 28 Jan 2023 by wujj123456 Post: Reminder to reset <ncpus> tag in cc_config.xml if you changed it This reminds me of some other project's trick. They grant additional credits if tasks are returned within X days to disincentive excessive hoarding. No matter how people fake it, whether it's bogus core count or multiple clients per machine, they can't fake the actual compute throughput of the machine. Given CPDN has credit granting script run once a week instead of continuously, it might even be possible to adjust for upload server downtime if necessary. |
3)
Message boards :
Number crunching :
One of my computers is hoarding tasks and I don't know why
(Message 68072)
Posted 27 Jan 2023 by wujj123456 Post: Well, I probably should have debugged this first. <work_fetch_debug> did the trick and this likely has nothing to do with version. For the top priority project that was picked, I saw this repeatedly showing up in every fetch cycle. [work_fetch] deferring work fetch; upload active I was able to observe the same issue happening with LHC when it's slowly uploading after a batch of tasks finish. Turns out if the current project picked for fetching is constantly uploading, the fetch will just be deferred and boinc won't try the next project. It's the right behavior if we assume upload should be quick, but sometimes uploading is going to take forever... Guess I need to build up a job cache big enough for the entire period of uploads just in case the project uploading ends up getting picked for fetching. |
4)
Message boards :
Number crunching :
One of my computers is hoarding tasks and I don't know why
(Message 68070)
Posted 27 Jan 2023 by wujj123456 Post: For people that downloaded the version in PPA (7.20.5), have you run into the issue of boinc not fetching tasks even when the host is idle? My general setup is: CPDN, share 1000, max concurrent < total CPU. Universe, WCG, Asteriods, LHC, share 100 or so, max concurrent for LHC on some hosts. When the issue happens, if I manually trigger a project update for non-CPDN projects, they would all refuse to fetch work due to not being top priority for CPU project, even when there was no WU running. I have to set no new work for CPDN before other projects would fetch during updates. For the two hosts that hit the problem, I set min buffer days to 0.3 and additional buffer days to 0.2 for one but only a min buffer day of 0.01 with no additional buffer for the other. My shares for projects haven't changed for a few weeks and this only happened on 7.20.5 twice so far. However, I am only running 7.20.5 at this point, so I am not sure it's really the newer version to blame. Edit: Changed my mind since I can't stay on old version forever. Will just enable the debug flags instead. Is work_fetch_debug the right flag, or do I also need priority_debug and cpu_sched_debug? |
5)
Message boards :
Number crunching :
The uploads are stuck
(Message 68024)
Posted 24 Jan 2023 by wujj123456 Post: USB3 external drives are a thing. You probably should just give up this idea no matter how frustrated you are. The server is clearly hosted in real data centers or hosting facility and none of them AFAIK will ever take non-rackable hardware. If they allow any random customer to add funny hardware they have to help maintain later, the whole operation will fall apart very fast. It's not that hard to procure proper server gear and deploy quickly. It's likely the process or funding stopping the team from doing that. Trying to prevent people from adding random USB drives might be one of the good reasons why such process exists. :-) |
6)
Message boards :
Number crunching :
The uploads are stuck
(Message 68019)
Posted 24 Jan 2023 by wujj123456 Post: In an ideal world that might be true and when things are working it mostly is. However, the project is I suspect not financed well enough to have the level of redundancy to achieve that. Pretty much summed it up. 24/7 support and multiple levels of redundancy to move failure out of critical path cost a lot of money. At least when I was in grad school, I've never heard of research projects run systems that way. Their funding is better off putting into actual research, while tolerating some downtime or even data loss when the unfortunate happens. On the other hand, for the past month or so, the bottleneck for OpenIFS might actually be the server side infrastructure, not volunteers' compute power. If such failure keeps happening, it could justify changing where money is spent to address the real bottleneck that's slowing down the progress. Or next iteration of the app can consider tuning the compute/network ratio if possible. Guess all we can do is to wait and it would probably take a while for these improvements to materialize even if the team is working on them already. |
7)
Message boards :
Number crunching :
The uploads are stuck
(Message 67957)
Posted 21 Jan 2023 by wujj123456 Post: I've got a dual core EPYC VM running with 10GB RAM at $11.63/mo. It's running about 20h per task, with two going at any given time: https://www.cpdn.org/results.php?hostid=1538282 Thanks. That's around similar number I come up with and all three generally fall into similar $0.4-$0.5/WU range with their cheapest instance types. They are all pretty competitive against each other, but far more expensive than my own setup. Guess this should be totally expected given their machines are loaded with all other cool stuff I don't need plus better network, uptime, etc and they still need to make money. My upload link isn't great either, but enough for now. Hopefully next versions of OpenIFS would have higher compute to bandwidth ratio to make it easier. |
8)
Message boards :
Number crunching :
The uploads are stuck
(Message 67947)
Posted 21 Jan 2023 by wujj123456 Post: I've been playing with some GCP "Spot" instances (like preemptible, but won't power off automatically at 24h, especially if they're small) to add some cycles, and the AMD boxes are churning along hard. I suppose I'll shut those off, they're not exactly long on disk space. :/ Curious what's your $ per WU. I've also recently checked EC2, GCP or Azure and they all have that nice catch of bandwidth cost. Their bandwidth costs around $0.08-0.1 per GB and that would mean around $0.15 - $0.2 per WU. That alone already exceeds cost per WU for whatever I can get with my own equipment, electricity and home network. Azure covers first 100GB and others' free usage is negligible. I honestly wonder if I missed some great deals hidden in their pages of pages of pricing list. Would be nice to cross check. Otherwise, until the bandwidth to compute ratio significantly drops, OpenIFS probably makes no sense in major cloud vendors. |
9)
Message boards :
Number crunching :
The uploads are stuck
(Message 67946)
Posted 21 Jan 2023 by wujj123456 Post: Perhaps it's casual weekend crunchers turning on their computers and finally started to offload their backlog after three weeks. Hopefully the transfer process can eventually win out... |
10)
Message boards :
Number crunching :
The uploads are stuck
(Message 67917)
Posted 20 Jan 2023 by wujj123456 Post: Finally cleared all of my backlog. Got decent speed for the past 24 hours, especially during the last 12 hours that maxed out my upload link. Yay! |
11)
Message boards :
Number crunching :
The uploads are stuck
(Message 67911)
Posted 19 Jan 2023 by wujj123456 Post: @wujj123456 - problem solved, no need to follow up. But for the record - you also have to add the new disk to fstab, and if using UUIDs, use the UUID of the formatted partition, not the UUID of the underlying hardware Congrats! Yeah, mounting on boot and sometimes permission could be a problem when migrating to a new disk and glad you sorted it out. |
12)
Message boards :
Number crunching :
OpenIFS Discussion
(Message 67879)
Posted 19 Jan 2023 by wujj123456 Post: Turns out if all we care about is just getting RSS of the OpenIFS app for some period, it's much faster to just write a script instead of trying to observe it live or from history. I meant to do this for a while just to understand how much the memory usage swings, and guess the discussion finally pushed me to do that. Shitty script here: https://pastebin.com/GtAiv5XB. One RSS sample per second and total count is in the parentheses. --help has some flags you can tune. Probably lots of rough edges for corner cases and it's Linux only. That's what I got for the current public app after running it for 5 minutes. $ ./boinc_task_memory.py --slot 15 2023-01-18 16:17:51,760 [INFO] pid of slot 15: 495869 2568212 - 2714144: ***************** (51) 2714145 - 2860076: (2) 2860077 - 3006008: (0) 3006009 - 3151940: (0) 3151941 - 3297872: (2) 3297873 - 3443804: ****** (19) 3443805 - 3589736: * (3) 3589737 - 3735668: * (4) 3735669 - 3881600: ** (8) 3881601 - 4027532: ************************** (78) 4027533 - 4173464: *********************************** (107) 4173465 - 4319396: * (5) 4319397 - 4465328: (2) 4465329 - 4611260: ** (6) 4611261 - 4757192: * (5) 4757193 - 4903125: ** (8) |
13)
Message boards :
Number crunching :
The uploads are stuck
(Message 67873)
Posted 18 Jan 2023 by wujj123456 Post: For example: Your disc-swap under windows would have been very ease: Stop BOINC, copy whole BOINC-DIR to new location, (re-) install BOINC if needed and tell the "new" location being the Data-Section and that's it. You could continue at your last point and nothing will change. Under Linux this won't work It's actually even simpler in Linux. Stop the client, mv the whole directory to the new location, create a symlink pointing to the new location with the name of previous directory and then start the client. The client will continue to operate on the old directory name except that's now just a link to the new directory. (Of course you can go the other route of changing boinc client config to use new directory name, similar to the Windows setup you described, but involves config editing. ) The story is still the same though. One has to be aware of how boinc data directory and client identifier assignment works. The data has to be migrated properly before next time boinc instance contacts the server. That's not intuitive, which is what caused the loss of work here. |
14)
Message boards :
Number crunching :
OpenIFS Discussion
(Message 67865)
Posted 18 Jan 2023 by wujj123456 Post: below requires cgroupv2 though it should be default now in most distros. Both are useful to look at per process stats too, like finding out peak RSS for each OpenIFS task. For that we probably need short intervals to be recorded. To look at history, you want to start with `atop -r <timestamp>` though, otherwise the top rows aren't any more useful than top or other tools if you are monitoring live. atop man page explains the meaning, except it doesn't cover shrss either. Guess it's small enough we can just ignore. |
15)
Message boards :
Number crunching :
OpenIFS Discussion
(Message 67859)
Posted 18 Jan 2023 by wujj123456 Post: Working set size reported by boinc client is smoothed and on my system a measurement is taken every 10 seconds. This would systematically under-estimate the peak memory usage, which is what matters if we want to make sure the hosts don't ever run out of memory. Even worse, boinc client uses that smoothed working set size for scheduling, which is causing all kinds issues for OpenIFS and forcing us to use app_config, instead of relying on client to handle memory properly. For folks interested in debugging memory usage, I would recommend installing atop or below, both will give you historical snapshot whenever you want to check back. atop is widely available, though you might need to tune the default window to be shorter to be useful. Also note that atop captures per-thread information which could be a lot and can wear out SSDs really fast. I personally use below which doesn't have these problems, but not many distros have them so you might have to install rustc, build and install unit files yourself. There is probably an Ubuntu PPA though. |
16)
Message boards :
Number crunching :
The uploads are stuck
(Message 67856)
Posted 18 Jan 2023 by wujj123456 Post: If FAQ thread opens up for the specific disk usage issues, we should definitely move there. Otherwise, at least I don't mind continue discussing here since all these problem are triggered by upload being stuck and this will likely be the thread people check. I totally learnt the 100G limit from discussion here. PS: For context, it is morning here, so I am just catching up and replying to comments I find relevant. Not intentionally breaking your discussions or trying to move that away. |
17)
Message boards :
Number crunching :
The uploads are stuck
(Message 67854)
Posted 18 Jan 2023 by wujj123456 Post: Funny... Going back to the topic. I am seeing changing behaviors too. Before 2023-01-16 11:00 there were minimal upload for me. Then it started ramping up and saturated my upload on 2023-01-17 0:00 until it drops off on 16:00 the same day. I've been getting some uploads for around 1/3 of my link ever since. (Timestamps are accurate to hour in UTC.) From this comment, it's clear the team is still tweaking the connections to improve uploads while trying not to run out of disk space on upload server. Seems like the current bottleneck is moving data off the upload server. Hopefully it would clear soon to enable more upload connections. I remember from other comments that the usual limit is 300 so we still have a long way to go before full recovery. [/img] |
18)
Message boards :
Number crunching :
The uploads are stuck
(Message 67851)
Posted 18 Jan 2023 by wujj123456 Post: 1) 100GB UI Limit I also hit the first two but not the last one, though I know the behavior of swapping out disks since I upgrade pretty often and I'm lazy so I usually attempt to boot the same disk first. None of these are obvious for people not familiar with BOINC client, and none of them come with any meaningful diagnostic information. Logs for 1) is even misleading. I do want to note that one complexity here is that while all these problems are triggered by CPDN upload server being down, none of them are within CPDN team's control or code. They are all BOINC client behavior and it's a bit unfair to expect CPDN team (rather any project team) to educate every subtlety of BOINC client when they themselves may not be the expert. On the other hand, as a volunteer who lost weeks of work trying to contribute, that's totally frustrating. I feel we should take the opportunity to raise the issues (probably again) in BOINC's forum or github. At least 1) and 2) can use much better logging or notice in UI to inform users. In addition, IMO, 1) is really a bug needs to be fixed. I have no idea how to solve or even communicate 3) though. BOINC client is supposed to require minimal user maintenance, so the current behavior is desirable for most people casually wiping to re-install or upgrading their disk while dropping all WUs on the floor. Edit: Looks like fix for 1) is merged. Guess we just need to wait for clients to upgrade. |
19)
Message boards :
Number crunching :
The uploads are stuck
(Message 67813)
Posted 17 Jan 2023 by wujj123456 Post: I'd argue against doing this or that there's even a need. It seems to me that BOINC upload is for the most part a background process that does its job relatively well. Pretty much the only times uploading generates user complaints are when upload servers aren't working right. The length of this upload outage is rather unique but even so the progress has been very good so far. Even the users who've had a hard time getting a connection slot are starting to report completed uploads. Even though CPDN put in a due date grace period I suspect it'll hardly be needed, which I believe was also Glenn's position in an earlier post. Unfortunately not everyone's upload is that fast relative to their compute and I will very likely need the grace period. I got good connection in past one day and half. So far, I've uploaded around 60 with 170 pending. I have a few WUs due in 2-3 days and they seem to be determined to be the very last to go. If boinc client ordered uploads properly, they would have all been reported by now, removing the need for extending the grace period. On the other hand, I don't agree this should require user intervention either. Boinc client should simply order this correctly by itself, just like how it prioritizes compute deadlines. After all, the goal that matters is to get the WUs reported by deadline, and upload is part of the process to get there. Edit: Thinking more, I realized it's possible boinc would order this properly as the deadline approaches, just like how it does with compute. Perhaps I will learn whether that's the case in two days... |
20)
Message boards :
Number crunching :
One of my computers is hoarding tasks and I don't know why
(Message 67780)
Posted 16 Jan 2023 by wujj123456 Post: However, I do control my job cache by setting: Store at least 2 days of work with a 0.01 day's work, to keep a small, oft-reporting cache. Maybe that makes a difference. I think you are spot on with the buffer settings. From the description of the fix, I suspect one has to have a decently large work_buf_additional_days set. Otherwise, boinc will aggressively fetch each time buffer drops below work_buf_min_days, and any newly fetched work will likely bring the buffer above work_buf_min_days + work_buf_additional_days which immediately stops all fetches. If there is a big enough work_buf_additional_days, now it opens up the opportunity for boinc to repeatedly fetch from the top priority but concurrent limited project while never running the simulation to realize the work buffer is full. This could also mean setting a very small work_buf_additional_days can be a workaround for clients using web configs. I have hit this twice with WCG when I set concurrent limit for one of the apps, after running the same app_config for months. They have server side limiting which was how I solved the problem. I have a 0.2 work_buf_min_days and 0.3 work_buf_additional_days. Guess need to upgrade my client too now that a fix is available. |
©2023 climateprediction.net