UK Met Office HadAM4 at N144 resolution

DoctorNow
Joined: 27 Aug 04
Posts: 15
Credit: 1,132,481
RAC: 0
Message 61103 - Posted: 30 Sep 2019, 8:17:36 UTC - in response to Message 61100.  

Usually there's some clue, however subtle, in the message log when work should be sent but isn't.

Hm, I doubt that, since the debug option isn't/wasn't activated by default, and as far as I know the debug messages are more informative than the usual BOINC Manager messages.
As already said, the only message that appeared was "No work sent". If there had been anything else, I would have said so.
Besides, I can't copy/paste anything here anymore - I turned the whole thing off and am posting from work now.
And since there is no work anymore, it's useless later anyway...
Dammit - I would've liked to crunch again, it's been so long since I got CPDN work...
It was far easier when they all ran on Windows... :-(
Life is Science, and Science rules. To the universe and beyond
Proud member of BOINC@Heidelberg

DoctorNow
Joined: 27 Aug 04
Posts: 15
Credit: 1,132,481
RAC: 0
Message 61126 - Posted: 1 Oct 2019, 6:48:33 UTC

Well, I was quite surprised when I saw the new work this morning - and I finally got some WUs from the N144 ones.
Thanks a lot! :-)

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7233
Credit: 23,154,247
RAC: 341
Message 61544 - Posted: 16 Nov 2019, 6:52:47 UTC

Finally into the batch 848, which the researcher is keen to get run and returned.
They're going to be about a fifth the run time of the N216, but the zips are about the same size as batch 842.

Jim1348
Joined: 15 Jan 06
Posts: 503
Credit: 22,294,062
RAC: 866
Message 61545 - Posted: 16 Nov 2019, 8:20:30 UTC - in response to Message 61544.  

Finally into the batch 848, which the researcher is keen to get run and returned.

I will do what I can to give them priority. But if we could select the projects, I could do a better job of it.

bernard_ivo
Joined: 18 Jul 13
Posts: 380
Credit: 14,057,320
RAC: 2,676
Message 61546 - Posted: 16 Nov 2019, 13:57:01 UTC - in response to Message 61545.  

Would using HT with 848 benefit the output, or should I keep on using real cores only?

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2738
Credit: 3,388,112
RAC: 2,679
Message 61547 - Posted: 16 Nov 2019, 14:46:12 UTC - in response to Message 61546.  

Others who have actual experience will correct me if I am wrong, but I would guess that hyperthreading will increase total throughput on these.

Jim1348
Joined: 15 Jan 06
Posts: 503
Credit: 22,294,062
RAC: 866
Message 61548 - Posted: 16 Nov 2019, 15:49:20 UTC - in response to Message 61546.  

Would using HT with 848 benefit the output, or should I keep on using real cores only?

Hummm. Good question. In general, hyper-threading helps as Dave said. So I would try it.

But on the larger ones (N216), it hurt. That was not because there was anything wrong with HT itself, but you were running out of cache memory (on the CPU) with so many work units running at once.
So in that case, it helped to limit the number running. I will try it both ways shortly.

Jean-David Beyer
Joined: 5 Aug 04
Posts: 381
Credit: 3,690,501
RAC: 20
Message 61551 - Posted: 16 Nov 2019, 19:15:37 UTC - in response to Message 61548.  

In general, hyper-threading helps as Dave said. So I would try it.

But on the larger ones (N216), it hurt. That was not because there was anything wrong with HT itself, but you were running out of cache memory (on the CPU) with so many work units running at once.


I used to have a machine with two 32-bit 3.06 GHz Xeon processors that could be hyperthreaded, so it appeared to have 4 processors. I do not recall how much cache those two chips had. I used to run Seti@home, climateprediction, Rosetta, and WCG. I tended to run three climatepredictions and one other. Running four logical processors (i.e., with hyperthreading turned on) turned out more work than two, but not twice as much. So each task proceeded more slowly that way, but the total number of tasks per day was higher.

My current (slow 1.800 GHz) processor has four 64-bit cores, but 10240K of cache. I cannot hyperthread them. I run Linux. I am currently running four N216 processes and they are getting 92%, 92%, 96%, and 97% of a processor. The average is 53.6570 sec/TS, but it runs so slowly that I do not wish to stop two of the processes to see if this would improve the cache hit ratio. It seems to take me almost three weeks to do an N216 task. The N144s ran faster: an average of 25.8696 sec/TS, taking me about two weeks.

I suspect that, since these processes are in a big loop, they are probably running the same code, so the instructions in the cache may only be in there once (once the program gets started, say after a few hours). So cache misses may be less of a problem than at first appears. This would not apply if one were running different applications (such as WCG, or even hadcm3s). Of course the data will be different, and that will increase the probability of a cache miss.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7233
Credit: 23,154,247
RAC: 341
Message 61552 - Posted: 16 Nov 2019, 19:48:45 UTC

I suspect that we have to start the research again, but I'll see if I can find out anything from The Man himself. AFTER the weekend.

In the meantime, I'm running 3 on each computer.
The 4th on each machine is an N216, which I suspended while downloading.
I have a horrible feeling that I'm going to have to do a lot of fiddling with the pref settings to fend off N216 downloads while I try and get some N244.

Jim1348
Joined: 15 Jan 06
Posts: 503
Credit: 22,294,062
RAC: 866
Message 61553 - Posted: 16 Nov 2019, 21:02:03 UTC - in response to Message 61552.  

I have a horrible feeling that I'm going to have to do a lot of fiddling with the pref settings to fend off N216 downloads while I try and get some N244.

I think we are out of N216 at the moment, but that is not a long-term solution. If we can't choose work units, then we will have to find some sort of compromise setting that works most of the time, whatever that is.
I will be curious to see what happens when OpenIFS then comes along.

Jean-David Beyer
Joined: 5 Aug 04
Posts: 381
Credit: 3,690,501
RAC: 20
Message 61554 - Posted: 16 Nov 2019, 21:03:29 UTC - in response to Message 61552.  

I have a horrible feeling that I'm going to have to do a lot of fiddling with the pref settings to fend off N216 downloads while I try and get some N244.


Not right now; I assume you mean N144. There seem to be no N216 tasks at the moment.

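Application                              Unsent  In progress  Runtime of last 100 tasks in hours: avg (min - max)  Users in last 24 hours
UK Met Office HadAM4 at N144 resolution    1808         1640  126.25 (64.58 - 329.88)                              12
UK Met Office HadAM4 at N216 resolution       0         3862  392.06 (128.25 - 655.58)                             29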

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2738
Credit: 3,388,112
RAC: 2,679
Message 61558 - Posted: 17 Nov 2019, 7:23:27 UTC

I have a horrible feeling that I'm going to have to do a lot of fiddling with the pref settings to fend off N216 downloads while I try and get some N244.


As I type, the server status page is showing no N216 tasks waiting, but 3842 of them on various computers. Doubtless some of these will fail for the reasons we all know about, and as the maximum number of attempts has gone up to 5, some will reappear.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7233
Credit: 23,154,247
RAC: 341
Message 61566 - Posted: 18 Nov 2019, 9:39:07 UTC - in response to Message 61548.  

Jim1348

Apparently the N144 should have lower memory requirements than the N216, because of the lower resolution.

Jim1348
Joined: 15 Jan 06
Posts: 503
Credit: 22,294,062
RAC: 866
Message 61567 - Posted: 18 Nov 2019, 13:36:10 UTC - in response to Message 61566.  

Since you asked, I can give you my results thus far.
I am running the N144 on two identical Ryzen 2600 machines (Ubuntu 18.04.3).

Machine 1
is running six cores (50% of the total): at 48% complete, it is estimating 3.15 days total.

Machine 2
is running all 12 cores (100% of the total): at 19% complete, it is estimating 7.56 days total.

So you are better off running with "full" cores (half the total), especially considering the reduced memory requirements will allow for more N216 and also (gasp!) OpenIFS.

That will vary somewhat on different machines, but 50% of the total number of cores is a good guess for machines with hyper-threading.
For machines with full cores, you could probably use more, but I still need to cut it down on my i7-9700, which has 8 full cores. YMMV.
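
As a rough check on the two estimates above, the comparison can be expressed as tasks finished per day. This is only a sketch: it assumes BOINC's estimated total run times (3.15 and 7.56 days) hold until the tasks finish and that every task on a machine takes the same time. The function name `tasks_per_day` is made up for illustration.

```python
def tasks_per_day(concurrent_tasks: int, days_per_task: float) -> float:
    """Steady-state throughput, assuming every task takes days_per_task."""
    return concurrent_tasks / days_per_task


# Figures taken from the two Ryzen 2600 machines described above.
half = tasks_per_day(6, 3.15)    # Machine 1: six tasks (50% of the threads)
full = tasks_per_day(12, 7.56)   # Machine 2: twelve tasks (100% of the threads)

print(f"6 at once:  {half:.2f} tasks/day")
print(f"12 at once: {full:.2f} tasks/day")
print(f"12-at-once throughput relative to 6-at-once: {full / half:.2f}")
```

If the estimates are accurate, the machine running 12 at once finishes fewer tasks per day than the one running 6, which is the point made above.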

Jean-David Beyer
Joined: 5 Aug 04
Posts: 381
Credit: 3,690,501
RAC: 20
Message 61568 - Posted: 18 Nov 2019, 13:39:36 UTC - in response to Message 61566.  
Last modified: 18 Nov 2019, 13:44:21 UTC

Apparently the N144 should have lower memory requirements than the N216, because of the lower resolution.


They do. On my Linux 64-bit machine with 16 GBytes RAM, they each required about 4% of my RAM*; the N216 ones each require between 8.5% and 8.6% (1.3 GBytes).
_____
* I just realized, I do not remember if I was running the N144 tasks with 8 GBytes RAM or 16, but if it was 8, the difference in size would be even greater.
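
For anyone wanting to turn those percentages into a task limit, here is a minimal sketch. It assumes the 16 GByte case, and the 2 GByte reserve for the operating system and other programs is an arbitrary assumption, not a project recommendation. The function names are made up for illustration.

```python
def task_memory_gb(total_ram_gb: float, percent_of_ram: float) -> float:
    """Convert a per-task share of RAM (in percent) to GBytes."""
    return total_ram_gb * percent_of_ram / 100.0


def max_tasks_by_ram(total_ram_gb: float, per_task_gb: float,
                     reserve_gb: float = 2.0) -> int:
    """How many tasks fit in RAM after setting aside reserve_gb (assumed value)."""
    usable = max(0.0, total_ram_gb - reserve_gb)
    return int(usable // per_task_gb)


n144_gb = task_memory_gb(16, 4.0)   # about 0.64 GBytes per N144 task
n216_gb = task_memory_gb(16, 8.5)   # about 1.36 GBytes per N216 task

print(f"N144: {n144_gb:.2f} GB each, up to {max_tasks_by_ram(16, n144_gb)} by RAM alone")
print(f"N216: {n216_gb:.2f} GB each, up to {max_tasks_by_ram(16, n216_gb)} by RAM alone")
```

RAM is only one limit; as discussed earlier in the thread, CPU cache and core count are likely to cap the useful number of tasks well before memory does.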

bernard_ivo
Joined: 18 Jul 13
Posts: 380
Credit: 14,057,320
RAC: 2,676
Message 61572 - Posted: 18 Nov 2019, 19:10:04 UTC - in response to Message 61567.  

Ok, I will experiment on my i7-4790 at 75% or 6 cores.
Currently I have two N216 and four N144, so I would not push it to 100%. I will monitor how sec/TS changes.
On 4 cores only, N144 runs for 3d22h at 13 sec/TS, while N216 is ready in 12 days at 30-31 sec/TS.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7233
Credit: 23,154,247
RAC: 341
Message 61581 - Posted: 19 Nov 2019, 19:23:42 UTC

Had my first failure of one of these big ones.

hadam4_a1yz_201410_6_848_011925730_1

Model crashed: ATM_DYN : NEGATIVE THETA DETECTED.

So those starting parameters aren't viable. :(

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 1931
Credit: 41,487,636
RAC: 4,640
Message 61582 - Posted: 20 Nov 2019, 0:53:11 UTC - in response to Message 61581.  

Had my first failure of one of these big ones.

hadam4_a1yz_201410_6_848_011925730_1

Model crashed: ATM_DYN : NEGATIVE THETA DETECTED.

So those starting parameters aren't viable. :(

I've had a couple of those from this batch. One right near the beginning, which also happened to another task in this work unit. The other was after two trickles. Still waiting to see if the wingman will crash at the same progress in that one.

bernard_ivo
Joined: 18 Jul 13
Posts: 380
Credit: 14,057,320
RAC: 2,676
Message 61595 - Posted: 22 Nov 2019, 18:59:47 UTC - in response to Message 61572.  
Last modified: 22 Nov 2019, 19:00:53 UTC

Ok, I will experiment on my i7-4790 at 75% or 6 cores.
Currently I have two N216 and four N144, so I would not push it to 100%. I will monitor how sec/TS changes.
On 4 cores only, N144 runs for 3d22h at 13 sec/TS, while N216 is ready in 12 days at 30-31 sec/TS.


So the two N216 run differently, as expected:
the old one (started on 4 real cores) still runs at around 30 sec/TS after 3 trickles, might drop for the 3rd
the new one (6 with HT) runs at 39 sec/TS and will finish in 16.4 days (versus 12 on 4 cores)

The four N144 also run differently, as expected:
the two old ones started at 13 sec/TS and are now at 18, so 28% slower
the two new ones have been at 20 sec/TS from the start, so >35% slower

Not sure whether it is worth running HT.
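
The N144 percentages above correspond to the loss in per-task speed (timesteps per second). Measured as extra time per timestep, the same figures look larger. A small sketch of both conventions, using the sec/TS values quoted above (the function names are made up for illustration):

```python
def speed_loss(sec_per_ts_before: float, sec_per_ts_after: float) -> float:
    """Fraction of per-task speed (timesteps per second) that was lost."""
    return 1.0 - sec_per_ts_before / sec_per_ts_after


def time_increase(sec_per_ts_before: float, sec_per_ts_after: float) -> float:
    """Fractional increase in wall-clock time per timestep."""
    return sec_per_ts_after / sec_per_ts_before - 1.0


for label, before, after in [("N144 old tasks", 13, 18),
                             ("N144 new tasks", 13, 20),
                             ("N216 new task", 30, 39)]:
    print(f"{label}: {speed_loss(before, after):.0%} less speed, "
          f"{time_increase(before, after):.0%} more time per timestep")
```

Neither number alone settles whether HT is worthwhile: that depends on whether six slower tasks finish more work per day than four faster ones, and the mix of N144 and N216 here makes a direct comparison difficult.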

wolfman1360
Joined: 18 Feb 17
Posts: 71
Credit: 4,946,480
RAC: 451
Message 61742 - Posted: 19 Dec 2019, 23:56:42 UTC

Hello.
A few questions regarding this very interesting topic.
So I've got a Ryzen 1700x here that has just attached to the project and is currently only crunching one N144.
From all I've gathered, I need 4 MB of L3 cache per WU, whether it is the N144 or N216, for more efficient runtimes? So I should create my app_config to only allow 4 at once on this machine, since it has 16 total. Or should I try for 8, one per real core, since I'm also reading that simply using all real cores instead of hyperthreading helps?
With that being said, am I able to let Rosetta and/or WCG use the remaining 8-12 threads without penalizing these work units? The machine has lots of RAM available, 64 GB. I'm assuming the N144 have shorter runtimes than the N216? This is my first time crunching these.

How about an i7-4770, with only 8 MB total? Can I go slightly over the 4 MB each without huge repercussions, e.g. running 3 or 4 at once and letting Rosetta take over the other threads?
Oddly enough, my 2600 Sandy Bridge seems to be handling itself well enough, with the CPU seeming to peg itself at 99+% per task, according to BoincTasks, running 8 at once. Unless this will end up dropping off steeply or I'm not reading it correctly.
All of these just got tasks today, so of course BOINC and its wildly fluctuating estimates won't settle for a while. Still, interesting reading. How do the new WCG rainfall tasks use a CPU, and should I be limiting them, too? Not that there are many to grab.
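
The 4 MB per task rule of thumb and the app_config idea can be sketched as below. Several things here are assumptions: the 4 MB figure is only the rule of thumb mentioned in this thread, not a measured requirement; the helper names are made up; and `APP_NAME_HERE` is a placeholder, since the real application name has to be looked up in the client's own files (for example client_state.xml). The `<app_config>`, `<app>`, `<name>` and `<max_concurrent>` elements are the standard BOINC app_config.xml ones.

```python
from typing import Optional
from xml.sax.saxutils import escape


def tasks_by_cache(l3_cache_mb: float, mb_per_task: float = 4.0,
                   physical_cores: Optional[int] = None) -> int:
    """Tasks to run at once under the cache rule of thumb (assumed 4 MB each)."""
    limit = max(1, int(l3_cache_mb // mb_per_task))
    if physical_cores is not None:
        limit = min(limit, physical_cores)   # never exceed the real cores
    return limit


def build_app_config(app_name: str, max_concurrent: int) -> str:
    """Return the text of a minimal app_config.xml limiting one application."""
    return (
        "<app_config>\n"
        "  <app>\n"
        f"    <name>{escape(app_name)}</name>\n"
        f"    <max_concurrent>{int(max_concurrent)}</max_concurrent>\n"
        "  </app>\n"
        "</app_config>\n"
    )


# Ryzen 1700X: 16 MB L3, 8 real cores. i7-4770: 8 MB L3, 4 real cores.
print("1700X:", tasks_by_cache(16, physical_cores=8))
print("i7-4770:", tasks_by_cache(8, physical_cores=4))
print(build_app_config("APP_NAME_HERE", tasks_by_cache(16, physical_cores=8)))
```

The resulting file would go in the climateprediction.net project folder inside the BOINC data directory, and the client needs to re-read its config files (or be restarted) before the limit takes effect. Since the limit applies only to the named application, other projects would still be free to use the remaining threads.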