Thread 'UK Met Office HadAM4 at N216 resolution'

Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 61259 - Posted: 18 Oct 2019, 7:21:59 UTC

My 3.50GHz Haswell looks like taking about 14 days for these, even though BOINC is saying about 3.3 days.
ID: 61259
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 61260 - Posted: 18 Oct 2019, 7:30:47 UTC - in response to Message 61259.  

My 3.50GHz Haswell looks like taking about 14 days for these, even though BOINC is saying about 3.3 days.

Similar percentage difference here. If the figure in the task files that determines the estimate is the same one that determines credit, we may need to mention this to the project.
ID: 61260
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 61269 - Posted: 18 Oct 2019, 19:34:36 UTC - in response to Message 61260.  

The estimates for my i7-8700 are a bit strange. If you just add the Elapsed Time and Time Left, you get about 5 days. But if you look at the % completed (only 6.1%), it comes out to about 26 days. Normally that means the "Time Left" is wrong, and will adjust itself in due course by slowing down. But at the moment, it is still decreasing in real time. Eventually, one or the other will change to more consistent values. The "% completed" could be non-linear, and the final result somewhere in between.
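
For anyone wanting to repeat the two estimates, here is a minimal Python sketch of the arithmetic. The function names and the elapsed/time-left figures are made up for illustration; only the 6.1% comes from the task above. The second function assumes progress is linear in time, which may not be true for these models.

```python
def total_from_time_left(elapsed_days, time_left_days):
    """BOINC-style estimate: time already run plus what the client says remains."""
    return elapsed_days + time_left_days


def total_from_progress(elapsed_days, fraction_done):
    """Projection from % completed, assuming progress is linear in time."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_days / fraction_done


# Hypothetical reading (not real data): 1.5 days elapsed, 3.5 days "Time Left",
# and the 6.1% completed mentioned in the post.
elapsed, time_left, done = 1.5, 3.5, 0.061
print(f"elapsed + time left: {total_from_time_left(elapsed, time_left):.1f} days")
print(f"from % completed:    {total_from_progress(elapsed, done):.1f} days")
```

If the two numbers disagree by a large factor, as they do here, at least one of the client's inputs is off.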
ID: 61269
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 61270 - Posted: 18 Oct 2019, 20:03:05 UTC - in response to Message 61269.  

If you're running on the hyper cores, then it may be that.
One of the researchers said some time back that doing that results in a lot of switching in the processor.
I guess the code has something that likes/needs "real" cores.
ID: 61270
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 61271 - Posted: 18 Oct 2019, 20:08:26 UTC - in response to Message 61270.  
Last modified: 18 Oct 2019, 20:13:51 UTC

Yes, it is on hyper cores. I can do real cores next, with a bit of memory juggling. I wanted to use my i7-9700 (8 real cores) anyway, but found that it was not stable with 64 GB of memory, at least not at the rated speed. But I now have new memory that might be more compatible. Or at least I can run the i7-8700 on real cores if need be. It should be ready by Christmas.

EDIT: That much memory is not needed now, but I am planning for the OpenIFS.
ID: 61271
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 61287 - Posted: 20 Oct 2019, 7:39:15 UTC

I've just noticed something interesting.
One of the 4 models running, which are batch 842, is now 35 minutes behind the others. Also about 0.15% behind.
It was the last to start, about 1 minute behind the 3rd one to start.

This is my "general use computer", and I've noticed it's slow to react, or even frozen for a few seconds.

11.5 hours until the first lot of zips.
ID: 61287
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 61289 - Posted: 20 Oct 2019, 9:33:33 UTC - in response to Message 61287.  

This is my "general use computer", and I've noticed it's slow to react, or even frozen for a few seconds.


I have noticed this on my slow general use computer. But mine only has 2GB/core which really isn't enough if much else is running at the same time with these tasks. I have restricted it to just one of the two cores which has sorted that out.
ID: 61289
WB8ILI

Joined: 1 Sep 04
Posts: 161
Credit: 81,522,141
RAC: 1,164
Message 61290 - Posted: 20 Oct 2019, 13:50:15 UTC

Dave and Les -

I stumbled across a Linux package called xosview which shows some cool information about memory usage, paging, cache, and if a cpu is in a wio (waiting for I/O) state.

Maybe you already knew of it.
ID: 61290
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 61291 - Posted: 20 Oct 2019, 16:19:06 UTC - in response to Message 61290.  

I stumbled across a Linux package called xosview which shows some cool information about memory usage, paging, cache, and if a cpu is in a wio (waiting for I/O) state.


https://sourceforge.net/projects/xosview/

I have not run this one.
http://xosview.sourceforge.net/
ID: 61291
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 61295 - Posted: 20 Oct 2019, 19:36:00 UTC - in response to Message 61271.  
Last modified: 20 Oct 2019, 19:37:00 UTC

One thing I have learned by monitoring the writes is that enabling or disabling hyper-threading on my i7-8700 has no effect. The write rate stays exactly the same at 33.5 GB/day, on either six full cores or twelve virtual cores. So the total work output would be the same over a period of time in either case. So you might as well save memory and operate on six full cores, or in other words just set BOINC to run on 50% of the available cores.

As for the times, that is still a bit of a mystery and I won't know until I complete some under a given set of circumstances, but probably around 13 days on full cores and twice that on virtual cores.

My i7-9700 should do better, but it is still early.
ID: 61295
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 61296 - Posted: 20 Oct 2019, 20:10:32 UTC

Finally! My first lot of zips have shown up. A bit over 137 Megs.
ID: 61296
22

Joined: 14 Mar 15
Posts: 1
Credit: 970,308
RAC: 12,438
Message 61297 - Posted: 20 Oct 2019, 20:42:15 UTC

Do you have a figure for time between checkpoints to disk? I guess 2 and a half hours? Preferable to keep those tasks in memory and not shut down too often...
ID: 61297
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 61298 - Posted: 20 Oct 2019, 21:07:54 UTC - in response to Message 61297.  

You can work out the figure for yours from the BOINC Properties list.

Click on a model in the Tasks tab, then click on Properties to the left.
A third of the way down the list is the time of the last checkpoint.
Start writing down/watching, and soon you'll get what you want.
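
Once a few times are written down, the interval is just the gap between consecutive notes. A minimal Python sketch, assuming you jot down the wall-clock time whenever the checkpoint entry changes; the timestamp format and the sample values are hypothetical, not real data:

```python
from datetime import datetime
from statistics import mean


def checkpoint_intervals_minutes(stamps):
    """Minutes between consecutive checkpoint times noted by hand."""
    times = sorted(datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in stamps)
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]


# Hypothetical notes (illustrative only).
noted = ["2019-10-20 08:00:00", "2019-10-20 09:46:00", "2019-10-20 11:32:00"]
gaps = checkpoint_intervals_minutes(noted)
print(f"mean interval: {mean(gaps):.0f} minutes over {len(gaps)} gaps")
```

Three or four notes are usually enough to see whether the interval is steady.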

******************

Yes, these models are big, so the longer a computer can be left running the better.
ID: 61298
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 61299 - Posted: 20 Oct 2019, 21:25:44 UTC - in response to Message 61290.  

WB8ILI

No, but then I haven't gone looking for anything.
I just leave them to get on with it. Mostly, anyway.

I'll have a look at that program later.

Jean
Thanks for the link.
ID: 61299
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 61301 - Posted: 21 Oct 2019, 0:24:56 UTC - in response to Message 61297.  

Do you have a figure for time between checkpoints to disk? I guess 2 and a half hours? Preferable to keep those tasks in memory and not shut down too often...

You can enable "Checkpoint debug" under "Event Log Diagnostic Flags" or "Event Log Options" (depending on the version of BOINC). You can get at that from the "Advanced" or "Options" menu of BOINC Manager (also depending on the version). Of course this is probably useless for the hadcm3s models, which checkpoint much, much more frequently. Keeping these big models in memory and not interrupting them frequently is definitely the right idea.

My Ryzen 3600X running 4 models checkpoints about every 66 minutes per task. My i7-4790K does so every 106 minutes, also running 4 at a time.
ID: 61301
alanb1951

Joined: 31 Aug 04
Posts: 37
Credit: 9,581,380
RAC: 3,853
Message 61303 - Posted: 21 Oct 2019, 1:57:25 UTC

TL;DR - you probably don't want to run more than one of these per 4MB+ of L3 cache...

Jim1348's time estimates for an i7-8700 prompted me to do some tests (see below). My experience with the Microbiome application (MIP1) at WCG, which is also a memory hog, suggests that one should only run one instance of that per 4MB (or more[1]) of L3 cache. Running more results in significant increases in cache misses, with a corresponding drop in overall CPU effectiveness (for any BOINC tasks running, not just the hogs!). Indeed, running 4 at a time on a machine with 8MB cache resulted in CPU temperatures dropping by 10C or more and run times nearly double that of a single task (which I restricted using the max_concurrent mechanism).

Testing on an i5-7600 (6MB L3 cache, 4 cores, no hyper-threading, 8GB RAM, 3.5GHz clock) has shown HadAM4@N216 to be a cache-wrecker as well (no surprise there). I did tests with 1 HadAM4 task, 2 HadAM4 tasks, 3 HadAM4 tasks, and my normal workload if I have a CPDN task - 1 CPDN, 2 WCG.

Running a single HadAM4 task with no company yields a checkpoint every 81 minutes; running two at once yields checkpoints every 91 minutes; running three, checkpoints are about 110 minutes apart. This is consistent with changes in the number of instructions run in a fixed time interval, which I monitored with the perf stat command. As checkpoints seem to be taken once per model day and there are about 120 days per 4-month model I'd reckon these would complete in about 6.8 days (running 1 at a time), 7.6 days (running 2 at a time) or 9.2 days (3 at a time).
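
The day counts above follow from the checkpoint intervals. A minimal Python sketch of that arithmetic, assuming (as stated) one checkpoint per model day and about 120 model days per task; the tasks-per-week figure is an extra derived here for comparison, not something measured:

```python
MODEL_DAYS = 120           # about 120 model days in a 4-month model (from the post)
MINUTES_PER_DAY = 24 * 60

# Concurrent HadAM4 tasks -> observed minutes between checkpoints (from the post).
observed = {1: 81, 2: 91, 3: 110}


def wall_days(minutes_per_checkpoint, model_days=MODEL_DAYS):
    """Wall-clock days per task, assuming one checkpoint per model day."""
    return minutes_per_checkpoint * model_days / MINUTES_PER_DAY


for n_tasks, minutes in observed.items():
    days = wall_days(minutes)
    per_week = n_tasks / days * 7   # HadAM4 tasks finished per week on the machine
    print(f"{n_tasks} at a time: {days:.1f} days each, {per_week:.2f} tasks/week")
```

Note that this only counts HadAM4 output; it ignores any slowdown imposed on other projects' tasks sharing the same cache.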

By the way, under my usual workload [avoiding MIP1 tasks as they mess up the cache too!], checkpoints are about 83 minutes apart, so it can be seen that the WCG tasks aren't really getting in the way. (If MIP1 tasks get in there, the checkpoints are about 86 minutes apart.)

There's one thing in favour of running lots of these on a multi-core machine - your power draw will drop (as evidenced by CPU temperatures!) as the cores end up waiting for memory accesses more and more often! But I suspect there comes a point where each task takes so long to run that it's just not worth it - I, for one, will continue to treat CPDN as minority work on my Intel machines in order to maximize throughput.

I'm about to take delivery of a Ryzen 3700X (32MB L3 cache, though I gather access is constrained to 8MB per 2 cores (4 threads)); I'll be interested to see how that behaves as and when it gets some CPDN work to do (and will probably do some bulk tests with WCG MIP1 to get an idea if there's no CPDN work available!)

Cheers - Al.

[1] Someone over at WCG seemed to think 5MB cache was what a MIP1 job would like. The user offered no justification for that number but 4MB probably isn't enough for near-optimum performance.
ID: 61303
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 61304 - Posted: 21 Oct 2019, 4:26:56 UTC - in response to Message 61303.  
Last modified: 21 Oct 2019, 4:28:08 UTC

The cache size definitely makes a difference as to how much the model speed slows down when loading more on.

My 4790K has 8 MB of L3 cache and can run 1 N216 model at 13.9 sec/TS and 4 at 22 sec/TS. (58% slower)

My 3600X has 32 MB of L3 cache and can run 1 N216 model at 11.2 sec/TS and 4 at 13.6 sec/TS. (21% slower)
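
The percentages come straight from the sec/TS figures. A minimal Python sketch of the calculation; the combined-output ratio is an extra derived from the same four numbers, and the function names are just for illustration:

```python
def slowdown_pct(single_s_per_ts, loaded_s_per_ts):
    """Percentage increase in seconds per timestep when more models are loaded."""
    return (loaded_s_per_ts / single_s_per_ts - 1) * 100


def throughput_gain(single_s_per_ts, loaded_s_per_ts, n_models):
    """Combined timesteps per second with n_models running, relative to one alone."""
    return n_models * single_s_per_ts / loaded_s_per_ts


# sec/TS figures quoted above: (one model alone, each of four models together)
machines = {
    "4790K, 8 MB L3": (13.9, 22.0),
    "3600X, 32 MB L3": (11.2, 13.6),
}

for name, (single, loaded) in machines.items():
    print(f"{name}: {slowdown_pct(single, loaded):.0f}% slower per model, "
          f"{throughput_gain(single, loaded, 4):.1f}x the output of one model")
```

So even with the slowdown, four at a time still produces more total work on both machines; the larger cache just loses much less per model.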
ID: 61304
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 61305 - Posted: 21 Oct 2019, 5:43:56 UTC

There's also this page: Xosview for downloading it in a terminal window.
ID: 61305
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 61306 - Posted: 21 Oct 2019, 5:49:48 UTC

It looks like the cache is the culprit.

This will slow down those 64 and 128 core machines.
Unless they're just crashing them because of the missing lib.
ID: 61306
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 61308 - Posted: 21 Oct 2019, 7:08:42 UTC - in response to Message 61303.  
Last modified: 21 Oct 2019, 7:28:06 UTC

I'm about to take delivery of a Ryzen 3700X (32MB L3 cache, though I gather access is constrained to 8MB per 2 cores (4 threads)); I'll be interested to see how that behaves as and when it gets some CPDN work to do (and will probably do some bulk tests with WCG MIP1 to get an idea if there's no CPDN work available!)

Cheers - Al.

[1] Someone over at WCG seemed to think 5MB cache was what a MIP1 job would like. The user offered no justification for that number but 4MB probably isn't enough for near-optimum performance.

Thanks a lot for the cache info. I was beginning to think that the issues were deeper than I had found.
I just happen to have a Ryzen 3700X, and was wondering what its large L3 cache would do here. But I would need to add more memory. So let us know, and I could do it.
EDIT: I have found that as I add more N216 to my i7-9700, the run time estimates increase, as manually calculated. The first one was 5.5 days, and the last one is now 15.5 days. So the cache is implicated, since they are all full cores and so hyper-threading is not an issue.

(As for MIP1, I have found that I need to limit it to two running at a time on any of my machines - Intel or AMD. Cache could certainly play a role, or how it is accessed.)
ID: 61308

©2024 cpdn.org