climateprediction.net home page
New work discussion - 2

Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7628
Credit: 24,240,330
RAC: 0
Message 66038 - Posted: 2 Sep 2022, 22:54:19 UTC
Last modified: 6 Sep 2022, 5:27:49 UTC

Please be patient.
The new models will get here when they're ready.
ID: 66038
Richard Haselgrove

Joined: 1 Jan 07
Posts: 531
Credit: 15,649,050
RAC: 1,962
Message 66068 - Posted: 7 Sep 2022, 17:30:52 UTC

A couple of unfinished points, continuing from the previous thread - specifically in reply to Glenn Carver's message 66054.

1) I don't think MilkyWay has a separate app version for each core count. Something like that would normally be handled by the plan_class mechanism, and the MilkyWay applications page only shows two application versions for the N-Body simulation - one for Windows, and the other for Linux. Both have the same simple [mt] plan_class.

2) I've set up a basic machine to run MilkyWay nbody tasks, and tracked the messages passing between the machine, the server, and the running science app. I think I've got a possible explanation of how they've done it.

a) the machine is a small 4-core Intel, no hyperthreading, running Windows 10. I've set it, via local preferences, to use 80% of the available CPUs. That calculation is done in integer maths (80% of 4 is 3.2, truncated to 3), so the machine has three cores available.
b) the request file from the machine to the server contains these lines:
<working_global_preferences>
<global_preferences>
   <max_ncpus_pct>80.000000</max_ncpus_pct>
</global_preferences>
</working_global_preferences>
<host_info>
    <p_ncpus>4</p_ncpus>
</host_info>
- so the local settings are reported to the server: "use 80% of 4 CPUs".
c) The reply from the server, when new work is allocated, contains these lines:
<app_version>
    <app_name>milkyway_nbody</app_name>
    <avg_ncpus>3.000000</avg_ncpus>
</app_version>
d) The allocated tasks are shown in BOINC Manager, and marked (3 CPUs)
3) When the BOINC client starts a new task, it populates an empty slot directory with the required files, and also creates its own file called "init_data.xml". That contains the lines:
<ncpus>3.000000</ncpus>
<host_info>
    <p_ncpus>4</p_ncpus>
</host_info>

"init_data.xml" can be read by the BOINC API library linked into a project app at compile time. I think that's how the app must be getting its threading instructions: "Although the processor has 4 cores (p_ncpus), only use 3 of them (ncpus)." I can't find any other way it can be passed, and I've eyeballed every single occurrence of the digit '3' in these files. And still it reports "Using OpenMP 3 max threads on a system with 4 processors".
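
For anyone who wants to check the arithmetic in (a), here is a minimal Python sketch of the "80% of 4 CPUs gives 3" calculation. It is only an illustration of the truncation described above, not BOINC's actual source code; the function name `usable_cpus` and the floor of one CPU are assumptions made for the example.

```python
def usable_cpus(p_ncpus: int, max_ncpus_pct: float) -> int:
    """Apply the 'use at most X% of the CPUs' preference.

    The result is truncated to a whole number of CPUs.
    The minimum of 1 is an assumption for this sketch.
    """
    return max(int(p_ncpus * max_ncpus_pct / 100), 1)


# Values from the request file above: p_ncpus = 4, max_ncpus_pct = 80.0
print(usable_cpus(4, 80.0))
```

With the values from my request file this prints 3, which matches the <avg_ncpus> the server sent back.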
ID: 66068
Aurum

Joined: 15 Jul 17
Posts: 75
Credit: 17,409,387
RAC: 0
Message 66069 - Posted: 7 Sep 2022, 17:41:38 UTC
Last modified: 7 Sep 2022, 18:00:59 UTC

Will the new work have user-friendly checkpointing?
I sure would love to run climate & weather models. I searched for "checkpoint" and found nothing about it.
As I recall checkpoints were 4 hours or so apart. That makes it very difficult to deal with heatwaves and TOU metering.
Also looks like CPDN wants to use every CPU thread on your computer. Hopefully they'll fix that bug too.

Edit: Found some info but not sure how old it is: https://www.climateprediction.net/getting-started/support/technical-faq/#no_tasks_available
How long does a Timestep take in real time?
"A Timestep represents a 1/2 hour of model time (not realtime)."
"Climateprediction.net checkpoints every 144 Timesteps..."
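
Taking the two FAQ figures above at face value (they may be out of date), the checkpoint interval in model time can be worked out as in this small Python sketch. The constant and function names are made up for the example, and the real-time interval depends on how fast a given machine runs a timestep, which the FAQ excerpt does not say.

```python
# Figures quoted from the FAQ; they may not reflect current models.
MODEL_HOURS_PER_TIMESTEP = 0.5
TIMESTEPS_PER_CHECKPOINT = 144


def checkpoint_interval_model_hours(
    timesteps: int = TIMESTEPS_PER_CHECKPOINT,
    hours_per_step: float = MODEL_HOURS_PER_TIMESTEP,
) -> float:
    """Model time (in hours) that passes between two checkpoints."""
    return timesteps * hours_per_step


hours = checkpoint_interval_model_hours()
print(f"{hours:g} model hours = {hours / 24:g} model days between checkpoints")
```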

How do we make backups of a WU in-progress?
"More worrying is that a computation error loses more work. What is the appropriate reaction to this? Complaining is unlikely to be useful as trying to make the Work Unit smaller has been considered and rejected as not practical. A better reaction would be to decide to make a backup from time to time so if you do suffer an error, you can recover without losing too much work."
ID: 66069
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7628
Credit: 24,240,330
RAC: 0
Message 66070 - Posted: 7 Sep 2022, 20:48:43 UTC - in response to Message 66069.  

Aurum

These are things that we'll learn about when they arrive.
ID: 66070
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 3560
Credit: 10,761,686
RAC: 5,528
Message 66072 - Posted: 7 Sep 2022, 20:58:39 UTC

How do we make backups of a WU in-progress?
Making backups is a hangover from when tasks often took 9 months to complete. The time taken to back up individual tasks really isn't worth it these days.
ID: 66072
Jean-David Beyer

Joined: 5 Aug 04
Posts: 694
Credit: 10,133,158
RAC: 6,135
Message 66073 - Posted: 8 Sep 2022, 1:33:32 UTC - in response to Message 66069.  

Also looks like CPDN wants to use every CPU thread on your computer. Hopefully they'll fix that bug too.


I am running Red Hat Enterprise Linux release 8.6 (Ootpa) on my Linux box:

Computer 1511241

CPU type 	GenuineIntel
Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors 	16

Operating System 	Linux Red Hat Enterprise Linux
Red Hat Enterprise Linux 8.6 (Ootpa) [4.18.0-372.19.1.el8_6.x86_64|libc 2.28 (GNU libc)]
BOINC version 	7.16.11
Memory 	62.28 GB
Cache 	16896 KB
Swap space 	15.62 GB
Total disk space 	488.04 GB
Free Disk Space 	482.7 GB
Measured floating point speed 	6.58 billion ops/sec
Measured integer speed 	30.58 billion ops/sec
Average upload rate 	738.83 KB/sec
Average download rate 	25591.7 KB/sec
Average turnaround time 	2.47 days

With this app_config.xml in place, it does not use all the processors for CPDN, but only the 4 specified.
[/var/lib/boinc/projects/climateprediction.net]$ cat app_config.xml 
<app_config>
    <project_max_concurrent>4</project_max_concurrent>
</app_config>

ID: 66073
Glenn Carver

Joined: 29 Oct 17
Posts: 148
Credit: 1,883,629
RAC: 252
Message 66075 - Posted: 8 Sep 2022, 9:56:37 UTC - in response to Message 66068.  
Last modified: 8 Sep 2022, 9:56:48 UTC

"init_data.xml" can be read by the BOINC API library linked into a project app at compile time. I think that's how the app must be getting its threading instructions: "Although the processor has 4 cores (p_ncpus), only use 3 of them (ncpus)." I can't find any other way it can be passed, and I've eyeballed every single occurrence of the digit '3' in these files. And still it reports "Using OpenMP 3 max threads on a system with 4 processors".
Richard, thanks. That suggests MilkyWay will always run an app that fits into the available CPUs (which I think I've seen it do on my machine).

For OpenIFS, that approach may not work. We will have 1-4 core versions available. If the init_data.xml tells the client I'm making 8 cores out of 16 total on my machine available, then the client will give OpenIFS wrapper code the wrong number. We'll probably have to use a different approach then to encode the correct number of threads to use. There's also the project preferences to consider when CPDN add in the ability for the user to restrict apps to below a certain core count. I am not sure how that mechanism works. Quite a few boinc issues to deal with before we can get the multicore work out to everyone.

I don't want to populate this thread with a technical discussion, perhaps we can take this offline if need be.

Many thanks for digging into MilkyWay's setup - that's useful.

Cheers, Glenn
ID: 66075
Glenn Carver

Joined: 29 Oct 17
Posts: 148
Credit: 1,883,629
RAC: 252
Message 66076 - Posted: 8 Sep 2022, 9:59:55 UTC - in response to Message 66072.  
Last modified: 8 Sep 2022, 10:00:02 UTC

How do we make backups of a WU in-progress?
Making backups is a hangover from when tasks often took 9 months to complete. The time taken to back up individual tasks really isn't worth it these days.
The CPDN models create restart dumps at frequent intervals (which we configure on the server side). If the machine is powered down or BOINC is shut down, the model restarts from these dumps when the client is restarted. There's absolutely no need to create your own backups of the work units.
ID: 66076
Glenn Carver

Joined: 29 Oct 17
Posts: 148
Credit: 1,883,629
RAC: 252
Message 66077 - Posted: 8 Sep 2022, 15:50:42 UTC

Planned OpenIFS configurations and memory

Some info on memory requirements for the upcoming OpenIFS forecasts. As mentioned previously, we're aiming to increase the model resolution to make the results more scientifically valuable. These resolutions come with higher memory requirements:

    N80 grid, 125 km spacing. Peak RAM = 8 GB
    O96 grid, 100 km spacing. Peak RAM = 10 GB

    N128 grid, 78 km spacing. Peak RAM = 19 GB
    O160 grid, 61 km spacing. Peak RAM = 24 GB

All the above use 91 model levels. Previously CPDN has only used the 125 km version with 60 model levels. Obviously these will be significantly more demanding than anything seen previously (as I mentioned, there will be additional credit for these and we'll use multicore for the higher resolutions). I would hope the first two will fit in 16 GB machines; the others will need 32 GB minimum (assuming of course people want to run these). Only machines which specify enough resource will get workunits. The timescale for testing is the next couple of months.
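
As a rough guide to which of these configurations a machine might take, here is a minimal Python sketch that compares the peak RAM figures above with a machine's memory. It is only an illustration: the names `PEAK_RAM_GB` and `grids_that_fit` are invented for the example, it leaves no headroom for the operating system or other tasks, and it is not how the server actually decides who gets workunits.

```python
# Peak RAM per grid in GB, taken from the list above.
PEAK_RAM_GB = {"N80": 8, "O96": 10, "N128": 19, "O160": 24}


def grids_that_fit(ram_gb: float) -> list[str]:
    """Return the grids whose peak RAM does not exceed ram_gb."""
    return [grid for grid, peak in PEAK_RAM_GB.items() if peak <= ram_gb]


for ram in (16, 32):
    print(f"{ram} GB machine: {', '.join(grids_that_fit(ram))}")
```

On these numbers a 16 GB machine is limited to the N80 and O96 grids, while a 32 GB machine covers all four.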

This only refers to OpenIFS which runs globally at these resolutions and not the Hadley Centre models.

Explanation of resolutions
OpenIFS has 3 resolutions at play: the vertical resolution - number of discrete model levels; the grid spacing between discrete points on the globe; and the number of retained waves in 'spectral space'. We can alter these numbers (within some limits) to achieve a balance between model efficiency and scientific performance.

The N number. The globe is cut into squares, with a rectangular grid. The number of points around a latitude is always double the number of points from pole to pole. The N number refers to the number of points between a pole and the equator. So an N80 grid will have 160 latitude points and 320 longitude points. The grid spacing is then just the circumference of the Earth divided by 4*N.
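
The N-grid arithmetic can be checked with a short Python sketch. The Earth's circumference is taken as roughly 40,075 km (the equatorial value, an assumption for this example), and the function name `n_grid` is made up. It applies only to the N grids; the O grids are laid out differently.

```python
EARTH_CIRCUMFERENCE_KM = 40_075.0  # approximate equatorial value (assumed)


def n_grid(n: int) -> tuple[int, int, float]:
    """Return (latitude points, longitude points, spacing in km) for an N grid."""
    lat_points = 2 * n   # pole to pole
    lon_points = 4 * n   # around a latitude circle
    spacing_km = EARTH_CIRCUMFERENCE_KM / lon_points
    return lat_points, lon_points, spacing_km


for n in (80, 128):
    lat, lon, spacing = n_grid(n)
    print(f"N{n}: {lat} x {lon} points, about {spacing:.0f} km spacing")
```

This gives about 125 km for N80 and about 78 km for N128, in line with the spacings listed for those grids.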

The O number. Also specifies the number of points between pole & equator but in this case it implies what's called a 'cubic octahedral' grid. This is a different arrangement of the grid cells on the globe and it also implies fewer retained spectral waves in the model.

Spectral resolution. This refers to the number of retained waves. It's analogous to a Fourier wave transform for, say, a sound spectrum, except on a globe. By modelling the wind & temperature as waves on the globe rather than at discrete grid points we get a more accurate solution. You might also see the resolution expressed as T159L91. The T number '159' refers to the number of waves solved by the model, and the 91 is the number of levels. 'O' grids are written as Tco159.

Hope that's a useful reference.


---
CPDN Visiting Scientist

ID: 66077
SolarSyonyk

Joined: 7 Sep 16
Posts: 128
Credit: 20,896,832
RAC: 24,045
Message 66079 - Posted: 8 Sep 2022, 18:50:33 UTC - in response to Message 66069.  

As I recall checkpoints were 4 hours or so apart. That makes it very difficult to deal with heatwaves and TOU metering.


Task suspend/resume, with the task remaining in memory, seems to work just fine. I've not had issues with this.

Suspending the entire machine also works fine. My compute nodes are solar powered in my office, so they all sleep, every night, and I power them back on every morning. This doesn't cause any problems either - machine suspend/resume is invisible to tasks. The downside is that my stuff takes longer to complete than if it were running 24/7, but it's run entirely on surplus generation from an off grid system.

I just try very hard not to crash the machines or tasks... "Suspend from the last checkpoint" has about a 50-75% success rate in my experience.
ID: 66079
Jean-David Beyer

Joined: 5 Aug 04
Posts: 694
Credit: 10,133,158
RAC: 6,135
Message 66080 - Posted: 9 Sep 2022, 6:04:25 UTC - in response to Message 66075.  

Here is some more data on my Linux machine running an nbody milkyway task. This is part of the associated init_data.xml file in the slots directory.

<app_init_data>
<ncpus>4.000000</ncpus>  <!-- the number of CPUs this task may use -->
<host_info>
    <p_ncpus>16</p_ncpus>  <!-- the number of processors the machine has -->
</host_info>
<app_file>milkyway_nbody_1.82_x86_64-pc-linux-gnu__mt</app_file>
</app_init_data>


For a more usual single-processor milkyway task, it says



<app_init_data>
<ncpus>1.000000</ncpus>
<host_info>
    <p_ncpus>16</p_ncpus>
</host_info>
<app_file>milkyway_1.46_x86_64-pc-linux-gnu</app_file>
</app_init_data>


So that is how they do it.
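
To show how those two values could be pulled out of the file, here is a minimal Python sketch that parses a cut-down init_data.xml like the excerpt above. It is an illustration only: a real project app would normally get these values through the BOINC API library rather than parsing the file itself, and the function name `thread_budget` is made up for the example.

```python
import xml.etree.ElementTree as ET

# Cut-down init_data.xml, reduced to the elements discussed in this thread.
INIT_DATA = """<app_init_data>
<ncpus>4.000000</ncpus>
<host_info>
    <p_ncpus>16</p_ncpus>
</host_info>
</app_init_data>"""


def thread_budget(xml_text: str) -> tuple[int, int]:
    """Return (CPUs the task may use, processors on the host)."""
    root = ET.fromstring(xml_text)
    ncpus = int(float(root.findtext("ncpus")))           # stored as "4.000000"
    p_ncpus = int(root.findtext("host_info/p_ncpus"))
    return ncpus, p_ncpus


ncpus, p_ncpus = thread_budget(INIT_DATA)
print(f"Using {ncpus} threads on a system with {p_ncpus} processors")
```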
ID: 66080
Glenn Carver

Joined: 29 Oct 17
Posts: 148
Credit: 1,883,629
RAC: 252
Message 66081 - Posted: 9 Sep 2022, 14:33:02 UTC - in response to Message 66080.  

Yep, thanks. From Richard's earlier message, the server sends the app_version data with <avg_ncpus> set to > 1 (that's the key part) and the client creates the init_data.xml from this information when it starts the task, where the value from <avg_ncpus> is copied into <ncpus> (would be nice if the naming was consistent).

I also need to understand how credit is worked out with multithreaded apps, i.e. is it just 4x the 1-core credit, or does it take the scaling efficiency into account? For example, if 4 threads give a 3.5x speedup, is the credit then 3.5x the 1-core credit or still 4x?
ID: 66081
Richard Haselgrove

Joined: 1 Jan 07
Posts: 531
Credit: 15,649,050
RAC: 1,962
Message 66082 - Posted: 9 Sep 2022, 15:18:41 UTC - in response to Message 66081.  

According to ye ancient scrolls of yore, a BOINC credit is also known as a cobblestone, defined as:

By definition, 200 cobblestones are awarded for one day of work on a computer that can meet either of two benchmarks:

    * 1,000 double-precision MFLOPS based on the Whetstone benchmark
    * 1,000 VAX MIPS based on the Dhrystone benchmark

That's all. Nothing else. Pure CPU grunt. No brownie points for complexity, cleverness, memory usage, disk usage, efficiency of execution, artistic merit, ..., ...
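
As a worked example of that definition, here is a small Python sketch. It assumes credit scales linearly with benchmark speed and with run time, which is my reading of the definition rather than anything a project is obliged to follow; the function name `cobblestones` is invented for the example. Note that 1,000 MFLOPS is 1 GFLOPS.

```python
SECONDS_PER_DAY = 86_400


def cobblestones(gflops: float, seconds: float) -> float:
    """Credit under the original definition: 200 per day at 1 GFLOPS.

    Linear scaling with speed and run time is assumed.
    """
    return 200.0 * gflops * seconds / SECONDS_PER_DAY


# One day on the 1,000 MFLOPS reference machine
print(cobblestones(1.0, SECONDS_PER_DAY))
# Half a day on a core that benchmarks at 4 GFLOPS (hypothetical figures)
print(cobblestones(4.0, SECONDS_PER_DAY / 2))
```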

In reality, the purity of that definition was abandoned over 10 years ago. Projects can, and do, award whatever credit they wish. Some of us old crusties hanker after the old days, when credits really meant something scientifically measurable, but those days have gone.
ID: 66082
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 3560
Credit: 10,761,686
RAC: 5,528
Message 66083 - Posted: 9 Sep 2022, 15:31:59 UTC - in response to Message 66082.  

In reality, the purity of that definition was abandoned over 10 years ago. Projects can, and do, award whatever credit they wish. Some of us old crusties hanker after the old days, when credits really meant something scientifically measurable, but those days have gone.
You found it more quickly than I did, Richard. These days, I think Andy normally estimates the credit on the testing site here, and if the credits awarded are significantly high or low for the amount of crunching time on the same computer, George, or less often one of the rest of us, lets Andy know and he adjusts accordingly. There was a time when the testing branch on CPDN gave double credits, but that was before I joined the testing side of things.
ID: 66083
Glenn Carver

Joined: 29 Oct 17
Posts: 148
Credit: 1,883,629
RAC: 252
Message 66085 - Posted: 9 Sep 2022, 19:11:39 UTC - in response to Message 66083.  
Last modified: 9 Sep 2022, 19:11:47 UTC

In reality, the purity of that definition was abandoned over 10 years ago. Projects can, and do, award whatever credit they wish. Some of us old crusties hanker after the old days, when credits really meant something scientifically measurable, but those days have gone.
You found it more quickly than I did, Richard. These days, I think Andy normally estimates the credit on the testing site here, and if the credits awarded are significantly high or low for the amount of crunching time on the same computer, George, or less often one of the rest of us, lets Andy know and he adjusts accordingly. There was a time when the testing branch on CPDN gave double credits, but that was before I joined the testing side of things.
Hmm. I'm used to a supercomputer environment where I would pay for the number of compute nodes (CPU & memory), storage & archive I use. I see no reason why the same shouldn't apply to a BOINC project using my machines. If it wants the faster CPU it should 'pay' more (i.e. give more credit). If it wants multiple cores & a lot more memory, it should 'pay' by awarding more credit. My 2p worth.

I know credit always gives Andy headaches. I'll have a chat to him. Whatever we do, it should be broadly consistent with the credits awarded for the Hadley Centre models (OpenIFS is a faster model).
ID: 66085
Jean-David Beyer

Joined: 5 Aug 04
Posts: 694
Credit: 10,133,158
RAC: 6,135
Message 66086 - Posted: 9 Sep 2022, 20:29:40 UTC - in response to Message 66082.  

In reality, the purity of that definition was abandoned over 10 years ago. Projects can, and do, award whatever credit they wish. Some of us old crusties hanker after the old days, when credits really meant something scientifically measurable, but those days have gone.


I must be an old crusty. I do not care what a credit means, but universe and milky-way award way too much credit for the work done. My three other projects (CPDN, Rosetta, WCG) award a somewhat "reasonable" amount of credit for each work unit. I think that milky-way awards credits for the time * number of cores effectively used. Since my machine is set up to run the multiprocessor tasks with four cores, it credits about 3.65 cores for each work unit.
ID: 66086
Daniel

Joined: 16 Feb 12
Posts: 2
Credit: 154,366
RAC: 0
Message 66109 - Posted: 16 Sep 2022, 18:12:01 UTC

When will we see more work for our computers?
ID: 66109
Glenn Carver

Joined: 29 Oct 17
Posts: 148
Credit: 1,883,629
RAC: 252
Message 66110 - Posted: 16 Sep 2022, 21:13:51 UTC - in response to Message 66109.  
Last modified: 16 Sep 2022, 21:14:53 UTC

When will we see more work for our computers?
Hi Daniel, my Windows machine has just picked up a Weather@Home task (I thought they had all gone out but maybe some are being sent out again).

If you have Linux (bare metal or virtual box), then I will be sending out some OpenIFS project work in October. Just waiting for some code updates on the BOINC side and tests. There are also two other projects I know of with OpenIFS that will be submitting work later in the year. It's hard to give more exact dates because it's a small team who have other project commitments.
---
CPDN Visiting Scientist

ID: 66110
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 3560
Credit: 10,761,686
RAC: 5,528
Message 66112 - Posted: 17 Sep 2022, 7:07:33 UTC

Hi Daniel, my Windows machine has just picked up a Weather@Home task (I thought they had all gone out but maybe some are being sent out again).


The Windows task you got will be a resend with _1 or _2 at the end of the task name meaning it is on its second or third try after failing on one or two machines, or possibly being aborted.
ID: 66112
Jean-David Beyer

Joined: 5 Aug 04
Posts: 694
Credit: 10,133,158
RAC: 6,135
Message 66113 - Posted: 17 Sep 2022, 13:35:01 UTC - in response to Message 66110.  

If you have Linux (bare metal or virtual box), then I will be sending out some OpenIFS project work in October.


Oh goody? Can you let us know a day ahead of time so I can tell my boinc client to allow new tasks from ClimatePrediction? As it is, I have new tasks refused because otherwise my boinc client will get no tasks from my other projects, so determined it is to get something from ClimatePrediction.
ID: 66113

©2022 climateprediction.net