OpenIFS Discussion

Message boards : Number crunching : OpenIFS Discussion

wateroakley

Joined: 6 Aug 04
Posts: 185
Credit: 27,123,458
RAC: 3,218
Message 68050 - Posted: 25 Jan 2023, 20:11:11 UTC - in response to Message 68039.  
Last modified: 25 Jan 2023, 20:39:20 UTC

Thank you for the update, Glenn.

EDIT: PS. I've doubled the Ubuntu VM disk to 200 GB. That should give the VM enough disk headroom for all zips from the task backlog to be kept locally and uploaded at some point.
Update:
Data backup has been reduced sufficiently that the batch & upload servers will be restarted today, if not tomorrow (depending on some last checks).

ID: 68050
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,254,274
RAC: 10,513
Message 68051 - Posted: 25 Jan 2023, 20:14:43 UTC - in response to Message 68048.  

I'm not sure I can go along with that statement completely. Prior to this most recent upload outage, I was computing 10 tasks simultaneously on two machines, and uploading the intermediate files over a single internet link. And my uploads were fully up-to-date, with no backlog - I was uploading in real time.

I think the key to running a project like CPDN is, as far as possible, to have a well-balanced overall system. I think the key components are:

    * internet connection speed
    * CPU power
    * Memory (RAM)
    * Disk space

If any single item from that list is below the balance point, the system as a whole will be less than ideally efficient.

You say, CPDN "_never_ was able to take our result data as fast as we were able to compute them". Compared with my system, that suggests that something was out of balance: may I ask how many tasks you were processing simultaneously? It may be that the total throughput was limited by something other than the speed at which upload11 could accept incoming files.

ID: 68051
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 68052 - Posted: 25 Jan 2023, 22:13:05 UTC - in response to Message 68051.  
Last modified: 25 Jan 2023, 22:16:28 UTC

I did not make this statement after looking at one or two hosts.
I looked at recorded server_status.php history. (Sum of 'tasks ready to send' and 'tasks in progress', plotted over time, oifs_43r3_ps only. grafana.kiska.pw has got the record.)
In other words, by "we" I don't mean myself alone, but everyone combined who is, or has been, computing oifs_43r3_ps.

We had three modes of progress in January:
– upload11 was down. Progress rate was 0. (There have been four periods of this in January so far.)
– upload11 was online and ran without a notable connection limit. Progress rate was ~3,300 results in 14 hours, followed by upload11 going down again. (This mode occurred twice in January.)
– upload11 was online but ran with a throttled connection limit. Progress rate was a fairly constant ~1,500–2,000 results/day. (There was a single period in this mode; it lasted 8d 3h, until the tape storage issue.)

The latter, constant progress rate cannot be the rate at which we are actually producing results. If it were, there would have been noticeably steeper progress at the start of that period, when everybody still had previously stuck files to upload.
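
For anyone who wants to reproduce this kind of estimate, a minimal sketch follows (Python). The ?xml=1 status URL and the element names results_ready_to_send / results_in_progress are assumptions about what CPDN's server_status.php exposes (and they give project-wide totals, not oifs_43r3_ps alone), so check the actual output and adjust; note too that any new work entering the queue during the interval will mask completions.

    import time
    import urllib.request
    import xml.etree.ElementTree as ET

    STATUS_URL = "https://www.cpdn.org/server_status.php?xml=1"  # assumed URL form

    def outstanding_tasks():
        # Sum of 'tasks ready to send' and 'tasks in progress' from the status page.
        with urllib.request.urlopen(STATUS_URL, timeout=30) as resp:
            root = ET.fromstring(resp.read())
        ready = int(root.findtext(".//results_ready_to_send", default="0"))
        in_progress = int(root.findtext(".//results_in_progress", default="0"))
        return ready + in_progress

    first = outstanding_tasks()
    time.sleep(6 * 3600)          # take the second snapshot a few hours later
    second = outstanding_tasks()

    # A positive rate means the combined backlog is shrinking.
    rate_per_day = (first - second) * 24 / 6
    print(f"approximate completion rate: {rate_per_day:.0f} results/day")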
ID: 68052
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,254,274
RAC: 10,513
Message 68057 - Posted: 26 Jan 2023, 8:25:44 UTC - in response to Message 68052.  
Last modified: 26 Jan 2023, 8:31:21 UTC

Interesting figures, but I don't think you're comparing like with like.

Looking at my own machines:

In 'recovery' mode, with a backlog of files after an outage, I can upload a file on average in about 10 seconds - file after file after file, continuously without a break.
In 'production' mode, uploading files as they're produced, I can generate about one file every minute and a half on average.

So, in very round figures, my production bandwidth needs are roughly 10% of my recovery bandwidth.

I think that the server can cope OK with 'production' levels, but struggles with 'recovery' levels. Provided we can raise the MTBF (mean time between failures) to a more comfortable level, we'll be OK.
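
In code form, the same round-figure arithmetic, using the ~10-second and ~90-second per-file figures above:

    # One file uploaded every ~10 s in recovery mode versus one file generated
    # every ~90 s in production mode (figures as quoted above).
    recovery_interval_s = 10
    production_interval_s = 90

    ratio = recovery_interval_s / production_interval_s
    print(f"production bandwidth is about {ratio:.0%} of recovery bandwidth")  # ~11%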
ID: 68057
Glenn Carver

Joined: 29 Oct 17
Posts: 805
Credit: 13,593,584
RAC: 7,495
Message 68059 - Posted: 26 Jan 2023, 10:44:36 UTC - in response to Message 68051.  
Last modified: 26 Jan 2023, 10:44:58 UTC

I think the key to running a project like CPDN is, as far as possible, to have a well-balanced overall system. I think the key components are:

    * internet connection speed
    * CPU power
    * Memory (RAM)
    * Disk space

If any single item from that list is below the balance point, the system as a whole will be less than ideally efficient.

I'm going to add more flesh to that list. Like most computational fluid dynamics codes, OpenIFS moves a lot of data around in memory, so 'memory bandwidth' is really the key to throughput, not just 'memory' or 'L3 cache size'. Unless you are lucky enough to have an octa-channel EPYC, dual-channel memory motherboards saturate pretty quickly when running multiple OIFS tasks (which is why I keep saying don't run 1 task per thread even if you have the RAM for it). For 'CPU power', read 'single-core speed', not 'core count' (we haven't got multicore apps yet). I overclock to get that extra 5-10% if I have the motherboard for it. Had the upload server been working fine, the internet connection should have been less of an issue, even for slower lines, as the uploads were designed with slower connections in mind.
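
To put rough numbers on the bandwidth point, here is a back-of-envelope sketch; every figure in it is an illustrative assumption (the per-task demand in particular is a placeholder, not a measurement of the CPDN app):

    # Why memory bandwidth, rather than core count, caps useful concurrency.
    channels = 2               # dual-channel desktop board
    per_channel_gbs = 25.6     # DDR4-3200: 3200 MT/s x 8 bytes per transfer
    per_task_gbs = 10.0        # assumed sustained demand of one OpenIFS task

    total_bandwidth_gbs = channels * per_channel_gbs        # 51.2 GB/s
    bandwidth_limited_tasks = total_bandwidth_gbs / per_task_gbs
    print(f"tasks before memory saturates: ~{bandwidth_limited_tasks:.0f}")
    # Beyond that point, extra tasks mostly stall on memory, which is why one
    # task per thread can lower overall throughput even with plenty of RAM.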

There's one key item missing from that list - human resources. CPDN gets by with a skeleton crew in Oxford. It would not survive without the support of volunteers and all the help from forum folk.
ID: 68059
Glenn Carver

Joined: 29 Oct 17
Posts: 805
Credit: 13,593,584
RAC: 7,495
Message 68061 - Posted: 26 Jan 2023, 10:52:40 UTC

Reminder to reset <ncpus> tag in cc_config.xml if you changed it

If you altered the <ncpus> tag in cc_config.xml from -1 to a large number, as a way of bypassing the 'no more tasks too many uploads in progress' problem when upload11 was down, could I please remind everyone to change that tag back to <ncpus>-1</ncpus>.

There are some more OpenIFS batches coming soon and we don't want 100+ tasks landing on volunteer machines that really don't have 100 cores: e.g. https://www.cpdn.org/show_host_detail.php?hostid=1524863.

It would save CPDN trawling through their database to find these hosts and contact their owners.
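
For reference, the default looks like this in cc_config.xml (in the BOINC data directory); after editing, re-read the config files from the BOINC Manager or restart the client:

    <cc_config>
      <options>
        <ncpus>-1</ncpus>   <!-- -1 = use the number of CPUs actually detected -->
      </options>
    </cc_config>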

Thanks!
ID: 68061
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4345
Credit: 16,523,697
RAC: 5,963
Message 68062 - Posted: 26 Jan 2023, 10:54:05 UTC

internet connection should have been less of an issue, even for slower lines as the uploads were designed for a slower connection.
For those of us with very slow speeds, it is an issue. Mine cannot quite keep up with running two tasks at a time, but I accept that my situation is far from the norm now, at least in the UK. I check regularly whether I am due an upgrade, but no hints so far.
ID: 68062
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,254,274
RAC: 10,513
Message 68063 - Posted: 26 Jan 2023, 12:09:55 UTC - in response to Message 68062.  

For those of us with very slow speeds, it is an issue.
Understood. Like you, I'm already at the maximum speed easily and affordably available in my location - luckily, BT reached me before it reached you (though it did take them a long while to work out how to cross the canal). Getting anything faster would involve moving house to a new location, and I don't think either of us is likely to consider doing that for BOINC!
ID: 68063
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 68086 - Posted: 27 Jan 2023, 19:03:09 UTC - in response to Message 68061.  
Last modified: 27 Jan 2023, 19:12:08 UTC

@Richard Haselgrove,
note that CPDN's overall oifs_43r3_ps progress _right now_ is likely not in any of the three modes I described, because a few things were apparently changed after the tape storage disaster.

________

Glenn Carver wrote:
If you altered the <ncpus> tag in cc_config.xml from -1 to a large number, as a way of bypassing the 'no more tasks too many uploads in progress' problem when upload11 was down, could I please remind everyone to change that tag back to <ncpus>-1</ncpus>.

There are some more OpenIFS batches coming soon and we don't want 100+ tasks landing on volunteer machines that really don't have 100 cores: e.g. https://www.cpdn.org/show_host_detail.php?hostid=1524863.
This host (not one of mine) had ncpus set to 100 when I looked at the link just now.
This *may* have been done out of a desire to download new work while lots of uploads were pending. (Fetching new work in that situation is risky, though, given the history of upload11's operations.) However, there is another possible explanation for why the user did this:

The BOINC client and its control interfaces (web preferences, boincmgr, global_prefs_override.xml, you name it) only let you control the number of CPUs usable by BOINC as a percentage, not as an absolute number of CPUs. Hence some users apply this simple trick: set <ncpus> to 100, et voilà, <max_ncpus_pct> suddenly becomes equal to the absolute CPU count which BOINC shall use.

So if you see such a host and wonder whether its operator is doing something silly or undesirable: it is quite possible that the host is in fact configured properly. (I guess project admins could check the scheduler logs; <max_ncpus_pct> is sent by the host in each scheduler request.)
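
As a concrete illustration of that trick (whether this particular host is actually set up this way is, of course, an assumption on my part):

    <!-- cc_config.xml: make the client believe it has 100 CPUs -->
    <cc_config>
      <options>
        <ncpus>100</ncpus>
      </options>
    </cc_config>

Combined with a preference of, say, 'Use at most 12% of the CPUs', the client then runs at most 12 tasks, so the percentage effectively becomes an absolute core count.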
ID: 68086
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,254,274
RAC: 10,513
Message 68087 - Posted: 27 Jan 2023, 19:14:10 UTC - in response to Message 68086.  

We also need to consider, and check, how a <max_concurrent> setting in app_config.xml is reported back to the server - probably not at all, because at work-fetch time the emphasis is on allocation rather than progress.

I did point out to Glenn privately (yesterday) that CPDN has the data from the regular BOINC trickles available on the scheduling server. It would take some effort, but they could yield precise information on the number of tasks actually being processed at a given time on a given host - and that data continues to be transferred even when the upload server is baulked.
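
For context, a client-side limit of that kind would typically look like the sketch below (the app name matches the current perturbed-surface app; the value 4 is only an example):

    <app_config>
      <app>
        <name>oifs_43r3_ps</name>
        <max_concurrent>4</max_concurrent>   <!-- run at most 4 of these at once -->
      </app>
    </app_config>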
ID: 68087
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4345
Credit: 16,523,697
RAC: 5,963
Message 68089 - Posted: 27 Jan 2023, 19:38:31 UTC

Interestingly, I had a testing OpenIFS perturbed-surface task that got stuck at 99.990% for over an hour today. I copied the slot directory in case any information might prove useful, but after stopping and restarting BOINC the task completed successfully. I have no idea what the issue was.
ID: 68089
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 68090 - Posted: 27 Jan 2023, 19:53:55 UTC - in response to Message 68061.  

About the next OpenIFS batches:
One or another regular on this board has already mentioned it: consider increasing the "max # of error tasks" workunit parameter (and the total tasks, of course). 3, as in the current WUs, isn't a lot. (Too high a "max # of error tasks" would of course be bad if crashes were highly repeatable on independent hosts, such as with bad input parameters, but that's evidently not a problem, at least currently.)
ID: 68090
Glenn Carver

Joined: 29 Oct 17
Posts: 805
Credit: 13,593,584
RAC: 7,495
Message 68092 - Posted: 27 Jan 2023, 22:22:56 UTC - in response to Message 68086.  

So if you see such hosts and wonder if their operator is doing something silly or undesirable: It's very well possible that this host is in fact configured well and proper. (I guess project admins could check the scheduler logs; <max_ncpus_pct> is sent by the host in each scheduler request.)
I appreciate that; I also find percentage CPUs a pain (why wasn't it just a plain number?). But there are other cases where that isn't what's going on.

About next OpenIFS batches:
One or another frequenter of this board mentioned it already: Consider to increase the "max # of error tasks" workunit parameter (and total tasks of course). 3 as in the current WUs isn't a lot. (Too high "max # of error tasks" would of course be bad if crashes were highly repeatable on independent hosts, such as with bad input parameters, but that's evidently not a problem at least currently.)
It's staying at 3. Some of the model perturbations lead to the model aborting: negative theta, levels crossing, too short a timestep, etc. We don't want to send out too many repeats of tasks which will always fail. 3 is usually enough to get past any wobbles, and if necessary another small batch of reruns can be sent out.
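
(For anyone curious where these limits live: they are per-workunit parameters set when a batch is generated, for example via BOINC's create_work tool. The flags below are the stock BOINC names; how CPDN's own submission scripts set them may well differ.)

    create_work --appname oifs_43r3_bl \
                --target_nresults 1 \
                --max_error_results 3 \
                --max_total_results 4 \
                --max_success_results 1 \
                input_file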
ID: 68092
Yeti

Joined: 5 Aug 04
Posts: 171
Credit: 10,234,196
RAC: 31,591
Message 68093 - Posted: 27 Jan 2023, 22:40:43 UTC - in response to Message 68092.  

I appreciate that, I also find %age cpus a pain (why wasn't it just a plain number). But there are other cases where that's not the case.
Nope. With only a plain number, but boxes with different core counts, it would be a real pain.

I have all my boxes set to "use only 75% of the actually existing cores", and I really want exactly this behaviour as a maximum for BOINC.


Supporting BOINC, a great concept !
ID: 68093
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4345
Credit: 16,523,697
RAC: 5,963
Message 68097 - Posted: 28 Jan 2023, 8:01:48 UTC - in response to Message 68093.  

Nope, having only a plain number, but boxes with different Core-Counts, it is a real pain.

I have all my boxes set to "Use only 75% of the real existing Cores, and I really want exactly this behaviour as a maximum for BOINC


This is one of the many issues where those who write the code are never going to please everyone. I personally would have gone for a plain number, but it isn't a biggie. Currently I only have one machine and, unless dementia sets in, working out what percentage I need for a particular number of cores isn't arduous - but then doing it the other way around wouldn't be either!
ID: 68097
Glenn Carver

Joined: 29 Oct 17
Posts: 805
Credit: 13,593,584
RAC: 7,495
Message 68123 - Posted: 30 Jan 2023, 14:34:38 UTC

Forthcoming batches

Just out of a meeting this morning, 30/1/23. There will be some 6,500 workunits coming for the OpenIFS Baroclinic Lifecycle app (oifs_43r3_bl), for an experiment run by the University of Helsinki, hopefully in 2 weeks' time. They will go out as soon as we complete testing of some code changes that fix issues thrown up by the last batches (so we should see fewer task failures). These runs will be shorter, with runtimes roughly half those of the PS OpenIFS app (YMMV). Further scientific & technical details, as requested by forum folk, are being prepared and will be made available soon. Expect less total I/O and smaller upload sizes, as the runs are shorter. The memory requirement will be the same, as the model resolution is unchanged.
ID: 68123
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 68127 - Posted: 30 Jan 2023, 16:55:28 UTC - in response to Message 68092.  

Glenn Carver wrote:
xii5ku wrote:
About next OpenIFS batches:
One or another frequenter of this board mentioned it already: Consider to increase the "max # of error tasks" workunit parameter (and total tasks of course). 3 as in the current WUs isn't a lot. (Too high "max # of error tasks" would of course be bad if crashes were highly repeatable on independent hosts, such as with bad input parameters, but that's evidently not a problem at least currently.)
It's staying at 3. Some of the model perturbations lead to the model aborting, negative theta, levels crossing, too short timestep etc. We don't want to send out too many repeats of tasks which will always fail. 3 is usually enough to get past any wobbles and if necessary another small batch to rerun can be sent out.
Thanks, sounds good! If it's feasible to filter out those triple-errors which were not repeats of one and the same reproducible model failure, and turn them back into extra workunits, then that's obviously a lot better than a higher "max # of error tasks" setting.
ID: 68127
Glenn Carver

Joined: 29 Oct 17
Posts: 805
Credit: 13,593,584
RAC: 7,495
Message 68135 - Posted: 30 Jan 2023, 18:50:21 UTC - in response to Message 68127.  

Glenn Carver wrote:
xii5ku wrote:
Consider to increase the "max # of error tasks" workunit parameter (and total tasks of course). 3 as in the current WUs isn't a lot.
Some of the model perturbations lead to the model aborting, negative theta, levels crossing, etc. 3 is usually enough to get past any wobbles, if necessary another small batch to rerun can be sent out.
Thanks, sounds good! If it's feasible to filter out such triple-errors which were not repeats of one and the same reproducible model failure, and turn these back into extra workunits, then that's obviously a lot better than a higher "max # of error tasks" setting.
It's not possible to 'filter out' the triple errors (if I understand what you mean). We don't know a priori what the model will do with the applied perturbations until it has run. Also, we can get different answers from identical runs on different hardware, so a run that fails on an old Intel chip (for example) might work on a newer AMD machine. I have seen examples like this in the recent batches, though I can't show you one.

One day, when I get more time, I intend to look at the model perturbations arising from running identical tasks across the range of different hardware connected to CPDN. This was done in the early days of CPDN, when they ran the very long, big-batch climate simulations.
ID: 68135
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1059
Credit: 16,533,368
RAC: 1,871
Message 68137 - Posted: 30 Jan 2023, 19:42:18 UTC - in response to Message 68135.  

Also we can get different answers from identical runs from different hardware, so that a run that fails on an old Intel chip (for example), might work on a newer AMD machine. I have seen examples like this from the recent batches though can't show you an example.


I have to agree. I have received tasks that had failed for 4 previous users but completed successfully on my machine. With these OIFS tasks I have had no trouble at all. In recent memory I have been the third attempt on a bunch of tasks and completed them successfully, and usually the ones before me died of different problems.
ID: 68137
AndreyOR

Joined: 12 Apr 21
Posts: 247
Credit: 11,944,739
RAC: 23,061
Message 68140 - Posted: 31 Jan 2023, 9:03:14 UTC - in response to Message 68123.  

Forthcoming batches

Just out of a meeting this morning 30/1/23. There will be some 6500 workunits coming for the OpeniFS Baroclinic Lifecycle app (oifs_43r3_bl) for an experiment run by the University of Helsinki, hopefully in 2 weeks time. ....

Aren't there still at least 12,000 tasks from the current run to be processed by the end of February? I believe that was the number when the sending of new work was turned off a week or so ago. Any idea how close things are to it being turned back on?
ID: 68140