Posts by alanb1951

21) Message boards : Number crunching : UK Met Office HadAM4 at N216 resolution (Message 61444)
Posted 2 Nov 2019 by alanb1951
Post:
Jim1348 - re your Ryzen tasks...

I followed the link in your post above to see how you might be getting on with tasks that didn't abort(!) and was intrigued to see tasks taking well over 20 seconds per time step. So far I've finished two (each took about 7 days, 7 hours) and in both cases the average per time step is under 18 seconds...

So I wondered how many you are running at a time and, perhaps, what your overall workload is on that system. On mine I only allow 2 CPDN at a time (and I also only allow 2 WCG MIP1 (cache-killers!)) - I also only let BOINC have 14 out of 16 "CPUs".

Fun machines, aren't they! I could write an essay about the machine turning the CPU fan off with 14 tasks running, but I won't...

Cheers - Al.
22) Message boards : Number crunching : UK Met Office HadAM4 at N216 resolution (Message 61392)
Posted 26 Oct 2019 by alanb1951
Post:
In reply to Jim1348 in re Ryzen 3700X

I'm about to take delivery of a Ryzen 3700X (32MB L3 cache, though I gather access is constrained to 8MB per 2 cores (4 threads)); I'll be interested to see how that behaves as and when it gets some CPDN work to do (and will probably do some bulk tests with WCG MIP1 to get an idea if there's no CPDN work available!)

Cheers - Al.

[1] Someone over at WCG seemed to think 5MB cache was what a MIP1 job would like. The user offered no justification for that number but 4MB probably isn't enough for near-optimum performance.

Thanks a lot for the cache info. I was beginning to think that the issues were deeper than I had found.
I just happen to have a Ryzen 3700x, and was wondering what its large L3 cache would do here. But I would need to add more memory. So let us know, and I could do it.


I have finally got the beast up and running on Ubuntu 18.04.3 (kernel 5.0.0-32-generic). It has 32GB of 3200MHz RAM, boots from an NVMe SSD, and I've put /var on HDD RAID 1 so that logs and checkpoint files aren't hammering the SSD. (/home is on RAID as well - all my non-laptop builds are done like that...) I haven't done any tuning apart from making sure that the memory clock and fabric clock are fixed at 3200 and 1600 respectively.

It has taken until now to get a decent work-load built up; I'm currently running 12 WCG tasks (with a check to stop MIP1 from running more than two at a time) and 2 CPDN HadAM4h tasks at a time, along with one GPU task from SETI@Home, Einstein@Home or MilkyWay@Home, so the system is getting a fair work-out. It seems to have all clocks at about 3.95GHz, and the machine is drawing about 140W not counting the GPU.

As regards checkpointing and completion times, after the first checkpoint (which seems to cover a few more time steps) it seems to checkpoint about every 60 minutes. I haven't had one generate a trickle yet but at current rate of progress I expect a trickle at about 33 hours 20 minutes, and the tasks to finish in about 5 days 13 hours.

I'm going to let the machine run with that sort of work load for a while to make sure it's behaving consistently, and I plan on doing some experiments with more HadAM4h tasks running at once when the current two have finished. It might be interesting to find out how many I can run at once without serious degradation if the only other work on the machine is WCG MCM1 (which is very cache-friendly!)

I'll try to do some task-level performance stats at some point, but on AMD CPUs there's no direct way of getting a count of L3 cache misses (I think it counts them at the cache level rather than the CPU level...) so one key stat isn't available. Ah, well...

Hope this was of interest - Al.
23) Message boards : Number crunching : UK Met Office HadAM4 at N216 resolution (Message 61303)
Posted 21 Oct 2019 by alanb1951
Post:
TL;DR - you probably don't want to run more than one of these per 4MB+ of L3 cache...

Jim1348's time estimates for an i7-8700 prompted me to do some tests (see below) as my experience with the Microbiome application (MIP1) at WCG, which is also a memory hog, suggests that one should only run one instance of that per 4MB (or more[1]) of L3 cache; running more results in significant increases in cache misses, with a corresponding drop in overall CPU effectiveness (for any BOINC tasks running, not just the hogs!) -- indeed, running 4 at a time on a machine with 8MB cache resulted in CPU temperatures dropping by 10C or more and run times nearly double that of a single task (which I restricted using the max_concurrent mechanism).
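For anyone who hasn't used the max_concurrent mechanism: the limit lives in an app_config.xml file in the project's own directory under the BOINC data directory. Here is a minimal sketch that generates such a file. The helper name is made up for illustration, and the app name "mip1" and the limit of 1 are placeholders, so check the application's real short name before using anything like this.

```python
import xml.etree.ElementTree as ET


def build_app_config(limits):
    """Return app_config.xml text that caps concurrent tasks per application.

    `limits` maps an application's short name to its max_concurrent value.
    """
    root = ET.Element("app_config")
    for app_name, max_tasks in limits.items():
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = app_name
        ET.SubElement(app, "max_concurrent").text = str(max_tasks)
    return ET.tostring(root, encoding="unicode")


# "mip1" is a placeholder short name, not confirmed from the project.
print(build_app_config({"mip1": 1}))
```

The client only picks up a new or changed app_config.xml after it is told to re-read its config files (or is restarted).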

Testing on an i5-7600 (6MB L3 cache, 4 cores, no hyper-threading, 8GB RAM, 3.5GHz clock) has shown HadAM4@N216 to be a cache-wrecker as well (no surprise there). I did tests with 1 HadAM4 task, 2 HadAM4 tasks, 3 HadAM4 tasks, and my normal workload if I have a CPDN task - 1 CPDN, 2 WCG.

Running a single HadAM4 task with no company yields a checkpoint every 81 minutes; running two at once yields checkpoints every 91 minutes; running three, checkpoints are about 110 minutes apart. This is consistent with changes in the number of instructions run in a fixed time interval, which I monitored with the perf stat command. As checkpoints seem to be taken once per model day and there are about 120 days per 4-month model I'd reckon these would complete in about 6.8 days (running 1 at a time), 7.6 days (running 2 at a time) or 9.2 days (3 at a time).
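The completion estimates are just checkpoint interval times number of model days. A small sketch of that arithmetic, assuming (as above) one checkpoint per model day and about 120 model days per task; the function name is only for illustration:

```python
MODEL_DAYS = 120  # approx. model days in a 4-month run, one checkpoint each


def estimated_run_days(minutes_per_checkpoint, model_days=MODEL_DAYS):
    """Wall-clock days for one task, given the time between checkpoints."""
    return minutes_per_checkpoint * model_days / (60 * 24)


# (tasks running at once, observed minutes between checkpoints)
for concurrent, interval in [(1, 81), (2, 91), (3, 110)]:
    days = estimated_run_days(interval)
    print(f"{concurrent} at a time: {days:.1f} days per task")
```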

By the way, under my usual workload [avoiding MIP1 tasks as they mess up the cache too!], checkpoints are about 83 minutes apart, so it can be seen that the WCG tasks aren't really getting in the way. (If MIP1 tasks get in there, the checkpoints are about 86 minutes apart.)

There's one thing in favour of running lots of these on a multi-core machine - your power draw will drop (as evidenced by CPU temperatures!) as the cores end up waiting for memory accesses more and more often! But I suspect there comes a point where each task takes so long to run that it's just not worth it - I, for one, will continue to treat CPDN as minority work on my Intel machines in order to maximize throughput.

I'm about to take delivery of a Ryzen 3700X (32MB L3 cache, though I gather access is constrained to 8MB per 2 cores (4 threads)); I'll be interested to see how that behaves as and when it gets some CPDN work to do (and will probably do some bulk tests with WCG MIP1 to get an idea if there's no CPDN work available!)

Cheers - Al.

[1] Someone over at WCG seemed to think 5MB cache was what a MIP1 job would like. The user offered no justification for that number but 4MB probably isn't enough for near-optimum performance.
24) Message boards : Number crunching : Excessive checkpointing on new Linux hadcm3s tasks? (Message 61147)
Posted 2 Oct 2019 by alanb1951
Post:
Some follow-up information...

I looked at changing some of the cache control values as per Jim1348's notes; however, I don't think it makes any difference because I'm using ext4 filesystems and (as I understand it) they effectively force regular synchronization (5 seconds by default!). That would explain why iostat and cat /proc/diskstats didn't report any difference in amounts written when I tried (and might explain why some others aren't seeing changes...)

(Apparently, iostat and friends are supposed to report actual device activity, not user write requests...)
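For anyone wanting to repeat the measurement, this is roughly how the write total can be pulled out of /proc/diskstats. It is a sketch only: the device name "sda" and the sample line are made up, and the function name is mine. The one thing it relies on is the documented layout, where the tenth whitespace-separated field is sectors written and a sector is always counted as 512 bytes.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts 512-byte sectors regardless of device


def bytes_written(diskstats_text, device):
    """Return total bytes written to `device` according to diskstats text."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # fields: major, minor, name, then counters; index 9 is sectors written
        if len(fields) >= 10 and fields[2] == device:
            return int(fields[9]) * SECTOR_BYTES
    raise KeyError(device)


# Made-up sample line in /proc/diskstats format.
sample = "   8       0 sda 1000 10 20000 500 2000 30 40960 900 0 1200 1400\n"
print(bytes_written(sample, "sda"))
```

On a live system, read /proc/diskstats before and after a known interval and subtract the two totals to get the amount written in that interval.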

I'm actually measuring output to a spinning disk, not an SSD, so I can't use smartctl to confirm how much data is actually being written. Perhaps someone who is using an SSD and has adjusted those sysctl parameters could have a look at that?

If it turns out not to be possible to alter the checkpoint interval, I certainly won't be letting BOINC use an SSD if I intend to continue doing CPDN work! Each HadCM3s task writes about 383GB of checkpoint data during a 20-year model run, so 3 jobs -> 1 Terabyte!

By the way, the current HadAM4 jobs seem to checkpoint about once every 20 minutes on my machine (as against the once a minute of HadCM3s) but the checkpoint file is nearly 4 times the size of the pair of files written by HadCM3s tasks. It comes out at about 71GB of checkpoint data per 12-month model task.
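As a sanity check on those totals, here is the per-checkpoint arithmetic. It assumes one checkpoint per model day and a 360-day model year; the calendar length is my assumption, not something I measured.

```python
DAYS_PER_MODEL_YEAR = 360  # assumption: 360-day model calendar


def mb_per_checkpoint(total_gb, model_years):
    """Average checkpoint size in MB, given total GB written over a run."""
    checkpoints = model_years * DAYS_PER_MODEL_YEAR
    return total_gb * 1000 / checkpoints


hadcm3s = mb_per_checkpoint(383, 20)  # 20-year HadCM3s run
hadam4 = mb_per_checkpoint(71, 1)     # 12-month HadAM4 run
print(f"HadCM3s: {hadcm3s:.0f} MB, HadAM4: {hadam4:.0f} MB, "
      f"ratio {hadam4 / hadcm3s:.1f}")
```

Under those assumptions the two figures are consistent with the HadAM4 checkpoint being nearly 4 times the size of the HadCM3s one, and three HadCM3s tasks come to 3 x 383GB, or a little over 1.1TB.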

Cheers - Al.
25) Message boards : Number crunching : Excessive checkpointing on new Linux hadcm3s tasks? (Message 61029)
Posted 27 Sep 2019 by alanb1951
Post:
This has been raised with the project. Check points are each model day. When this model type was introduced, computers were a lot slower and solid state disks didn't even exist or if they did cost an arm and a leg for even a 40GB one. So the problem has sort of crept up on us. I don't know how quick a fix it is to change checkpoints to every ten days or even monthly giving 12 checkpoints per zip file?

Dave,

Thanks for this!

I hope they can (and do) change this because I will not be running CPDN on my next system (Ryzen 3700X, I hope) if it's going to be hammering the discs like that if I get hadcm3s work units. (Thanks for the numbers and cache tuning stuff, Jim1348!)

I presume we aren't ever going to get the facility to deselect certain applications back; if that's the case they ought to try to make details like checkpoint frequency as consistent as they can across all applications available on a given platform (at least as far as the most frequent checkpointing is concerned).

I can understand an application ignoring the user's checkpoint guidelines by not checkpointing as often as the user allows, but checkpointing more often ought to be a no-no, as this situation demonstrates!... (It was theoretically possible to determine that limit and if the limit was reasonable enable some code on the checkpoint logic saying "how long since the last one? If longer than limit, do another..." -- I don't think that has changed.)
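To illustrate the kind of check I mean, here is a toy sketch of "how long since the last one?" logic. It is not CPDN's code or the BOINC API, and the class and method names are invented; it only shows that honouring a minimum interval is a few lines of bookkeeping.

```python
class CheckpointThrottle:
    """Allow a checkpoint only if enough time has passed since the last one."""

    def __init__(self, min_interval_s):
        self.min_interval_s = min_interval_s
        self.last_checkpoint_s = None  # time of the last checkpoint written

    def should_checkpoint(self, now_s):
        """Return True (and record the time) if a checkpoint is due."""
        if (self.last_checkpoint_s is None
                or now_s - self.last_checkpoint_s >= self.min_interval_s):
            self.last_checkpoint_s = now_s
            return True
        return False


# Model reaches a safe point every 60 s; the user's limit is 600 s.
throttle = CheckpointThrottle(600)
written = [t for t in range(0, 1260, 60) if throttle.should_checkpoint(t)]
print(written)
```

With a 600-second limit and a safe point once a minute, only the safe points at least 600 seconds apart result in a write.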

By the way, do we know what the checkpoint behaviour of HadAM4 and OpenIFS is/will be???

Cheers - Al.
26) Message boards : Number crunching : Excessive checkpointing on new Linux hadcm3s tasks? (Message 61008)
Posted 26 Sep 2019 by alanb1951
Post:
I recently landed a few of the new Linux tasks (batch 835). I don't usually turn on checkpoint debug in BOINC-Manager, but I had cause to do so on one of my machines and was surprised to see that these tasks were checkpointing about once a minute! I turned the logging on on my other machine that had some CPDN work and it was the same! (Turned logging off again!)

Now, on one machine I've got the checkpointing limit set to 600 seconds and on the other 240 seconds; it's obviously not respecting that!

As I said, I don't normally monitor this, so for all I know this could have been standard behaviour for as long as hadcm3s tasks have been available. Alternatively, it might only be doing this on my machines (though I suspect that's unlikely).

This is not exactly disc-I/O friendly if it's deliberate, so I wonder: has it always been like this, is this a side-effect of them trying to make Linux tasks more crash-proof, or is it a bug?

Any insight appreciated - Al.
27) Questions and Answers : Unix/Linux : *** Running 32bit CPDN from 64bit Linux - Discussion *** (Message 52902)
Posted 17 Nov 2015 by alanb1951
Post:
cn96 wes:wes /scratch/wes/BOINC/projects 65> cd /scratch/wes/BOINC/projects/climateprediction.net/
cn96 wes:wes /scratch/wes/BOINC/projects/climateprediction.net 66> ls -l hadam3* hadrm3*
-rwxr-xr-x 1 wes wes 2355035 Nov 11 01:33 hadam3p_eu_um_7.01_i686-pc-linux-gnu.zip
-rwxr-xr-x 1 wes wes 2664840 Nov 11 01:33 hadam3prm3pm2t_eu_7.01_i686-pc-linux-gnu
-rwxr-xr-x 1 wes wes 75730 Nov 11 01:33 hadam3prm3pm2t_eu_data_7.01_i686-pc-linux-gnu.zip
-rwxr-xr-x 1 wes wes 3771315 Nov 11 01:33 hadam3prm3pm2t_eu_se_7.01_i686-pc-linux-gnu.zip
-rwxr-xr-x 1 wes wes 2359300 Nov 11 01:33 hadrm3p_eu_um_7.01_i686-pc-linux-gnu.zip


The files are there, and the permissions seem to be OK. Any other idea?


Are any other projects currently running successfully on the two problem systems? If so, are there any differences in permissions or ownership of the project directories for those projects??

I notice that these files are owned by user:group wes:wes rather than by boinc:boinc as would be the case on a default install of BOINC from an Ubuntu repository. Given that the default install also goes in /var/lib/boinc-client, a non-repository install may explain these differences.

Is the location and ownership of the various directories exactly the same on the machines you've got that are running CPDN successfully? If it is, I'm at a loss to explain what you're seeing.
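If it helps with comparing machines, a quick way to list anything not owned by the expected user and group is sketched below. The function names are mine, and the expected owner depends entirely on how BOINC was installed (boinc:boinc for the repository package, presumably wes:wes in your case).

```python
import grp
import os
import pwd


def _name(lookup, ident, attr):
    """Resolve a numeric id to a name, falling back to the number itself."""
    try:
        return getattr(lookup(ident), attr)
    except KeyError:
        return str(ident)


def owner_and_group(path):
    """Return (owner, group) names for a file or directory."""
    st = os.stat(path)
    return (_name(pwd.getpwuid, st.st_uid, "pw_name"),
            _name(grp.getgrgid, st.st_gid, "gr_name"))


def mismatched(paths, expected=("boinc", "boinc")):
    """Return the paths whose owner and group differ from `expected`."""
    return [p for p in paths if owner_and_group(p) != expected]
```

Running mismatched() over the project directory listing on a working machine and on a failing one should show quickly whether ownership differs between them.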

Good luck with solving this - Al.
28) Message boards : Number crunching : HadCM3 short errors (Message 51947)
Posted 12 May 2015 by alanb1951
Post:
There seems to be another batch of "No resubmission" jobs (originally from 22nd December 2014) - I'd had several of these fail (memory allocation error) before I realized...

Since the first of these turned up, I've not seen a single hadcm3s job that isn't from that bad batch, though I presume not all 35,000+ jobs available according to the server status page are bad jobs. So I'm left wondering whether to babysit BOINC/CPDN to watch for bad jobs or to [temporarily] stop taking hadcm3s jobs at all...

Ah, well, as a Linux user at least I can do MOSES+Triffid jobs instead...
29) Questions and Answers : Unix/Linux : Multiple CP task management (Message 51416)
Posted 15 Feb 2015 by alanb1951
Post:
Following up from what Les said...

On my main machine, an i7-4770s (3.1GHz clock) running Ubuntu 14.04 and using hyperthreading, the longest jobs I have seen ran for under 300 hours - these were hadcm3n.

hadcm3s jobs typically took just under a day, the longer hadam3p_eu jobs about 67 hours, the short hadam3p_eu jobs about 17 hours. All quite quick...

The ones that really mess with BOINC Manager's Tasks display are the "original" Moses II jobs - hadam3pm2 - there seems to be a problem in the way they communicate status, and when they report 8.3% progress they are, in fact, nearly finished!!! (I think there's a "factor of 12" problem in there somewhere!)
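If my "factor of 12" guess is right, the real progress can be read off by scaling the displayed figure. A throwaway sketch (the factor is my guess, not anything confirmed by the project):

```python
def corrected_progress(reported_percent, factor=12):
    """Scale a reported progress figure by the suspected factor, capped at 100."""
    return min(reported_percent * factor, 100.0)


print(f"{corrected_progress(8.3):.1f}%")
```

So a task showing 8.3% would actually be within half a percent of completion.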

Note, however, that Moses+Triffid tasks (hadam3prm3pm2t_eu) seem to show an accurate progress rate.

In practice, on my main machine a Moses II (hadam3pm2) job will run for about 175 hours, whilst a Moses+Triffid job will run for about 180 hours.

As I don't know what hardware you have, I can't guess how long a job might take on your machine; a slower machine of mine (an i3-2100 at the same clock rate) typically takes about 50% longer to run. I've never run CP jobs on my laptop, so I've no idea how they'd go on a 2GHz clock machine...

Hope the above might reassure you somewhat. As Les says, you might as well just leave it to it!
30) Questions and Answers : Unix/Linux : Second task doesn't update percentage and time elapsed unless I switch windows. (Message 50822)
Posted 14 Nov 2014 by alanb1951
Post:
I can confirm the above observations about both 7.2.x and 7.4.x clients, but on Ubuntu 14.04 in my case.

I have 7.2.42 from the official Ubuntu repository on two of my machines, and the displays work as one might expect.

I have 7.4.8 from costamagnagianfranco's PPA on my other machine (I have not yet updated to any later version offered there...), and that shows the same strange display update behaviour.

(I'm sticking with it despite the odd display behaviour - it's an inconvenience rather than a blocker.)

On the system with the issue there are 8 processes running at once and if I am only displaying active tasks it is only the last one that doesn't update unless given a helping hand (or mouse click) - 7 lines update o.k.

However, if I show all tasks the number of active task lines that update seems to depend on what I've sorted on! (But it's never more than 7 out of 8...)

By the way, I can get it to update by selecting any task in the list (or by doing anything else that seems to cause a major refresh). Slightly quicker than switching away!
31) Message boards : Number crunching : HadCM3 short - errors galore (Message 50328)
Posted 27 Sep 2014 by alanb1951
Post:
Interested to know if anyone else is getting this with the short models. I noticed disk usage seemed to be getting a bit high for BOINC and on checking, the last 4 short models hadn't cleaned up after themselves though they had sent all zips and cleared from Tasks In Progress view. If others have had this is it only on nix boxen or a global issue?

Edit: Just to be completely clear, this is not crashed tasks leaving their detritus on my disk which I know is a problem but models that have finished without a hitch other than having to wait to report/upload zips on some occasions.


Just to confirm Eirik Redd's earlier reply to your post...

I run CPDN on two Ubuntu machines, and I don't think any of the HADCM3S tasks that downloaded successfully have ever cleared up after themselves - it's been a regular task to recover the disc space!

I'm seeing this with both BOINC 7.2.33 (on 12.04) and BOINC 7.4.8 (on 14.04) which I got from costamagnagianfranco's PPA. So if it is the client rather than the application causing the clean-up problem, it's in the latest release candidate as well as the current production version.

Hopefully someone will chip in with something definitive about why it might be happening (or perhaps we need a separate thread to motivate that?) The current behaviour is certainly a nuisance!

Al.


©2024 climateprediction.net