OpenIFS Discussion

Message boards : Number crunching : OpenIFS Discussion

wateroakley

Joined: 6 Aug 04
Posts: 185
Credit: 27,123,458
RAC: 3,218
Message 68050 - Posted: 25 Jan 2023, 20:11:11 UTC - in response to Message 68039.  
Last modified: 25 Jan 2023, 20:39:20 UTC

Thank you for the update, Glenn.

EDIT: PS. I've doubled the Ubuntu VM disk to 200 GB. That should give the VM enough disk headroom for all zips from the task backlog to be kept locally and uploaded at some point.
Update:
Data backup has been reduced sufficiently that the batch & upload servers will be restarted today, if not tomorrow (depending on some last checks).

ID: 68050
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,254,274
RAC: 10,513
Message 68051 - Posted: 25 Jan 2023, 20:14:43 UTC - in response to Message 68048.  

I'm not sure I can go along with that statement completely. Prior to this most recent upload outage, I was computing 10 tasks simultaneously on two machines, and uploading the intermediate files over a single internet link. And my uploads were fully up-to-date, with no backlog - I was uploading in real time.

I think the key to running a project like CPDN is, as far as possible, to have a well-balanced overall system. I think the key components are:

    * internet connection speed
    * CPU power
    * Memory (RAM)
    * Disk space

If any single item from that list is below the balance point, the system as a whole will be less than ideally efficient.

You say, CPDN "_never_ was able to take our result data as fast as we were able to compute them". Compared with my system, that suggests that something was out of balance: may I ask how many tasks you were processing simultaneously? It may be that the total throughput was limited by something other than the speed at which upload11 could accept incoming files.

ID: 68051
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 68052 - Posted: 25 Jan 2023, 22:13:05 UTC - in response to Message 68051.  
Last modified: 25 Jan 2023, 22:16:28 UTC

I did not make this statement after looking at one or two hosts.
I looked at recorded server_status.php history. (Sum of 'tasks ready to send' and 'tasks in progress', plotted over time, oifs_43r3_ps only. grafana.kiska.pw has got the record.)
In other words, by "we" I don't mean myself alone, but everyone combined who is, or has been, computing oifs_43r3_ps.

We had three modes of progress in January:
– upload11 was down. Progress rate was 0. (There have been four periods of this in January so far.)
– upload11 was online and ran without a notable connection limit. Progress rate was ~3,300 results in 14 hours, followed by upload11 going down again. (This mode occurred twice in January.)
– upload11 was online but ran with a throttled connection limit. Progress rate was a fairly constant ~1,500–2,000 results/day. (There was a single period in this mode; it lasted 8d 3h, until the tape storage issue.)

The latter, constant progress rate cannot be the rate at which we are actually producing results. If it were, there would have been noticeably steeper progress at the start of that period, when everybody still had previously stuck files to upload.
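
For anyone who wants to reproduce this kind of estimate, a minimal sketch follows (Python). The ?xml=1 status URL and the element names results_ready_to_send / results_in_progress are assumptions about what CPDN's server_status.php exposes (and they give project-wide totals, not oifs_43r3_ps alone), so check the actual output and adjust; note too that any new work entering the queue during the interval will mask completions.

    import time
    import urllib.request
    import xml.etree.ElementTree as ET

    STATUS_URL = "https://www.cpdn.org/server_status.php?xml=1"  # assumed URL form

    def outstanding_tasks():
        # Sum of 'tasks ready to send' and 'tasks in progress' from the status page.
        with urllib.request.urlopen(STATUS_URL, timeout=30) as resp:
            root = ET.fromstring(resp.read())
        ready = int(root.findtext(".//results_ready_to_send", default="0"))
        in_progress = int(root.findtext(".//results_in_progress", default="0"))
        return ready + in_progress

    first = outstanding_tasks()
    time.sleep(6 * 3600)          # take the second snapshot a few hours later
    second = outstanding_tasks()

    # A positive rate means the combined backlog is shrinking.
    rate_per_day = (first - second) * 24 / 6
    print(f"approximate completion rate: {rate_per_day:.0f} results/day")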
ID: 68052
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,254,274
RAC: 10,513
Message 68057 - Posted: 26 Jan 2023, 8:25:44 UTC - in response to Message 68052.  
Last modified: 26 Jan 2023, 8:31:21 UTC

Interesting figures, but I don't think you're comparing like with like.

Looking at my own machines:

In 'recovery' mode, with a backlog of files after an outage, I can upload a file on average in about 10 seconds - file after file after file, continuously without a break.
In 'production' mode, uploading files as they're produced, I can generate about one file every minute and a half on average.

So, in very round figures, my production bandwidth needs are roughly 10% of my recovery bandwidth.

I think that the server can cope OK with 'production' levels, but struggles with 'recovery' levels. Provided we can raise the MTBF (mean time between failures) to a more comfortable level, we'll be OK.
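
In code form, the same round-figure arithmetic, using the ~10-second and ~90-second per-file figures above:

    # One file uploaded every ~10 s in recovery mode versus one file generated
    # every ~90 s in production mode (figures as quoted above).
    recovery_interval_s = 10
    production_interval_s = 90

    ratio = recovery_interval_s / production_interval_s
    print(f"production bandwidth is about {ratio:.0%} of recovery bandwidth")  # ~11%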
ID: 68057
Glenn Carver

Joined: 29 Oct 17
Posts: 805
Credit: 13,593,584
RAC: 7,495
Message 68059 - Posted: 26 Jan 2023, 10:44:36 UTC - in response to Message 68051.  
Last modified: 26 Jan 2023, 10:44:58 UTC

I think the key to running a project like CPDN is, as far as possible, to have a well-balanced overall system. I think the key components are:

    * internet connection speed
    * CPU power
    * Memory (RAM)
    * Disk space

If any single item from that list is below the balance point, the system as a whole will be less than ideally efficient.

I'm going to add more flesh to that list. Like most computational fluid dynamics codes, OpenIFS moves a lot of data around in memory, so 'memory bandwidth' is really the key to throughput, not just 'memory' or 'L3 cache size'. Unless you are lucky enough to have an octa-channel EPYC, dual-channel memory motherboards saturate pretty quickly when running multiple OIFS tasks (which is why I keep saying don't run 1 task per thread even if you have the RAM for it). For 'CPU power', read 'single-core speed', not 'core count' (we haven't got multicore apps yet). I overclock to get that extra 5-10% if I have the motherboard for it. Had the upload server been working fine, the internet connection should have been less of an issue, even for slower lines, as the uploads were designed with slower connections in mind.
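
To put rough numbers on the bandwidth point, here is a back-of-envelope sketch; every figure in it is an illustrative assumption (the per-task demand in particular is a placeholder, not a measurement of the CPDN app):

    # Why memory bandwidth, rather than core count, caps useful concurrency.
    channels = 2               # dual-channel desktop board
    per_channel_gbs = 25.6     # DDR4-3200: 3200 MT/s x 8 bytes per transfer
    per_task_gbs = 10.0        # assumed sustained demand of one OpenIFS task

    total_bandwidth_gbs = channels * per_channel_gbs        # 51.2 GB/s
    bandwidth_limited_tasks = total_bandwidth_gbs / per_task_gbs
    print(f"tasks before memory saturates: ~{bandwidth_limited_tasks:.0f}")
    # Beyond that point, extra tasks mostly stall on memory, which is why one
    # task per thread can lower overall throughput even with plenty of RAM.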

There's one key item missing from that list - human resources. CPDN gets by with a skeleton crew in Oxford. It would not survive without the support of volunteers and all the help from forum folk.
ID: 68059
Glenn Carver

Joined: 29 Oct 17
Posts: 805
Credit: 13,593,584
RAC: 7,495
Message 68061 - Posted: 26 Jan 2023, 10:52:40 UTC

Reminder to reset <ncpus> tag in cc_config.xml if you changed it

If you altered the <ncpus> tag in cc_config.xml from -1 to a large number, as a way of bypassing the 'no more tasks too many uploads in progress' problem when upload11 was down, could I please remind everyone to change that tag back to <ncpus>-1</ncpus>.

There are some more OpenIFS batches coming soon and we don't want 100+ tasks landing on volunteer machines that really don't have 100 cores: e.g. https://www.cpdn.org/show_host_detail.php?hostid=1524863.

It would save CPDN trawling through their database to find these hosts and contact their owners.
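
For reference, the default looks like this in cc_config.xml (in the BOINC data directory); after editing, re-read the config files from the BOINC Manager or restart the client:

    <cc_config>
      <options>
        <ncpus>-1</ncpus>   <!-- -1 = use the number of CPUs actually detected -->
      </options>
    </cc_config>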

Thanks!
ID: 68061
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4345
Credit: 16,523,697
RAC: 5,963
Message 68062 - Posted: 26 Jan 2023, 10:54:05 UTC

internet connection should have been less of an issue, even for slower lines as the uploads were designed for a slower connection.
For those of us with very slow speeds, it is an issue. Mine cannot quite keep up with running two tasks at a time, but I accept that my situation is far from the norm now, at least in the UK. I check regularly whether I am due an upgrade, but no hints so far.
ID: 68062
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,254,274
RAC: 10,513
Message 68063 - Posted: 26 Jan 2023, 12:09:55 UTC - in response to Message 68062.  

For those of us with very slow speeds, it is an issue.
Understood. Like you, I'm already at the maximum speed easily and affordably available in my location - luckily, BT reached me before it reached you (though it did take them a long while to work out how to cross the canal). Getting anything faster would involve moving house to a new location, and I don't think either of us is likely to consider doing that for BOINC!
ID: 68063
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 68086 - Posted: 27 Jan 2023, 19:03:09 UTC - in response to Message 68061.  
Last modified: 27 Jan 2023, 19:12:08 UTC

@Richard Haselgrove,
note that CPDN's overall oifs_43r3_ps progress _right now_ is likely not in any of the three modes I described, because a few things were apparently changed after the tape storage disaster.

________

Glenn Carver wrote:
If you altered the <ncpus> tag in cc_config.xml from -1 to a large number, as a way of bypassing the 'no more tasks too many uploads in progress' problem when upload11 was down, could I please remind everyone to change that tag back to <ncpus>-1</ncpus>.

There are some more OpenIFS batches coming soon and we don't want 100+ tasks landing on volunteer machines that really don't have 100 cores: e.g. https://www.cpdn.org/show_host_detail.php?hostid=1524863.
This host (not one of mine) had ncpus set to 100 when I looked at the link just now.
This *may* have been done out of a desire to download new work while lots of uploads were pending. (Fetching new work in that situation is risky, though, given the history of upload11's operations.) However, there is another possible explanation for why the user did this:

The BOINC client and its control interfaces (web preferences, boincmgr, global_prefs_override.xml, you name it) only let you control the number of CPUs usable by BOINC as a percentage, not as an absolute number of CPUs. Hence some users apply this simple trick: set <ncpus> to 100, et voilà, <max_ncpus_pct> suddenly becomes equal to the absolute CPU count which BOINC shall use.

So if you see such a host and wonder whether its operator is doing something silly or undesirable: it is quite possible that the host is in fact configured properly. (I guess project admins could check the scheduler logs; <max_ncpus_pct> is sent by the host in each scheduler request.)
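
As a concrete illustration of that trick (whether this particular host is actually set up this way is, of course, an assumption on my part):

    <!-- cc_config.xml: make the client believe it has 100 CPUs -->
    <cc_config>
      <options>
        <ncpus>100</ncpus>
      </options>
    </cc_config>

Combined with a preference of, say, 'Use at most 12% of the CPUs', the client then runs at most 12 tasks, so the percentage effectively becomes an absolute core count.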
ID: 68086
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,254,274
RAC: 10,513
Message 68087 - Posted: 27 Jan 2023, 19:14:10 UTC - in response to Message 68086.  

We also need to consider, and check, how a <max_concurrent> setting in app_config.xml is reported back to the server - probably not at all, because at work-fetch time the emphasis is on allocation rather than progress.

I did point out to Glenn privately (yesterday) that CPDN has the data from the regular BOINC trickles available on the scheduling server. It would take some effort, but they could yield precise information on the number of tasks actually being processed at a given time on a given host - and that data continues to be transferred even when the upload server is baulked.
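
For context, a client-side limit of that kind would typically look like the sketch below (the app name matches the current perturbed-surface app; the value 4 is only an example):

    <app_config>
      <app>
        <name>oifs_43r3_ps</name>
        <max_concurrent>4</max_concurrent>   <!-- run at most 4 of these at once -->
      </app>
    </app_config>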
ID: 68087
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4345
Credit: 16,523,697
RAC: 5,963
Message 68089 - Posted: 27 Jan 2023, 19:38:31 UTC

Interestingly, I had a testing OpenIFS perturbed-surface task that got stuck at 99.990% for over an hour today. I copied the slot directory in case any information might prove useful, but after stopping and restarting BOINC the task completed successfully. I have no idea what the issue was.
ID: 68089
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 68090 - Posted: 27 Jan 2023, 19:53:55 UTC - in response to Message 68061.  

About the next OpenIFS batches:
One or another regular on this board has already mentioned it: consider increasing the "max # of error tasks" workunit parameter (and the total tasks, of course). 3, as in the current WUs, isn't a lot. (Too high a "max # of error tasks" would of course be bad if crashes were highly repeatable on independent hosts, such as with bad input parameters, but that's evidently not a problem, at least currently.)
ID: 68090
Glenn Carver

Joined: 29 Oct 17
Posts: 805
Credit: 13,593,584
RAC: 7,495
Message 68092 - Posted: 27 Jan 2023, 22:22:56 UTC - in response to Message 68086.  

So if you see such hosts and wonder if their operator is doing something silly or undesirable: It's very well possible that this host is in fact configured well and proper. (I guess project admins could check the scheduler logs; <max_ncpus_pct> is sent by the host in each scheduler request.)
I appreciate that; I also find percentage CPUs a pain (why wasn't it just a plain number?). But there are other cases where that isn't what's going on.

About next OpenIFS batches:
One or another frequenter of this board mentioned it already: Consider to increase the "max # of error tasks" workunit parameter (and total tasks of course). 3 as in the current WUs isn't a lot. (Too high "max # of error tasks" would of course be bad if crashes were highly repeatable on independent hosts, such as with bad input parameters, but that's evidently not a problem at least currently.)
It's staying at 3. Some of the model perturbations lead to the model aborting: negative theta, levels crossing, too short a timestep, etc. We don't want to send out too many repeats of tasks which will always fail. 3 is usually enough to get past any wobbles, and if necessary another small batch of reruns can be sent out.
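
(For anyone curious where these limits live: they are per-workunit parameters set when a batch is generated, for example via BOINC's create_work tool. The flags below are the stock BOINC names; how CPDN's own submission scripts set them may well differ.)

    create_work --appname oifs_43r3_bl \
                --target_nresults 1 \
                --max_error_results 3 \
                --max_total_results 4 \
                --max_success_results 1 \
                input_file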
ID: 68092
Yeti

Joined: 5 Aug 04
Posts: 171
Credit: 10,234,196
RAC: 31,591
Message 68093 - Posted: 27 Jan 2023, 22:40:43 UTC - in response to Message 68092.  

I appreciate that, I also find %age cpus a pain (why wasn't it just a plain number). But there are other cases where that's not the case.
Nope. With only a plain number, but boxes with different core counts, it would be a real pain.

I have all my boxes set to "use only 75% of the actually existing cores", and I really want exactly this behaviour as a maximum for BOINC.


Supporting BOINC, a great concept !
ID: 68093
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4345
Credit: 16,523,697
RAC: 5,963
Message 68097 - Posted: 28 Jan 2023, 8:01:48 UTC - in response to Message 68093.  

Nope, having only a plain number, but boxes with different Core-Counts, it is a real pain.

I have all my boxes set to "Use only 75% of the real existing Cores, and I really want exactly this behaviour as a maximum for BOINC


This is one of the many issues where those who write the code are never going to please everyone. I personally would have gone for a plain number, but it isn't a biggie. Currently I only have one machine and, unless dementia sets in, working out what percentage I need for a particular number of cores isn't arduous - but then doing it the other way around wouldn't be either!
ID: 68097
Glenn Carver

Joined: 29 Oct 17
Posts: 805
Credit: 13,593,584
RAC: 7,495
Message 68123 - Posted: 30 Jan 2023, 14:34:38 UTC

Forthcoming batches

Just out of a meeting this morning, 30/1/23. There will be some 6,500 workunits coming for the OpenIFS Baroclinic Lifecycle app (oifs_43r3_bl), for an experiment run by the University of Helsinki, hopefully in 2 weeks' time. They will go out as soon as we complete testing of some code changes that fix issues thrown up by the last batches (so we should see fewer task failures). These runs will be shorter, with runtimes roughly half those of the PS OpenIFS app (YMMV). Further scientific & technical details, as requested by forum folk, are being prepared and will be made available soon. Expect less total I/O and smaller upload sizes, as the runs are shorter. The memory requirement will be the same, as the model resolution is unchanged.
ID: 68123
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 68127 - Posted: 30 Jan 2023, 16:55:28 UTC - in response to Message 68092.  

Glenn Carver wrote:
xii5ku wrote:
About next OpenIFS batches:
One or another frequenter of this board mentioned it already: Consider to increase the "max # of error tasks" workunit parameter (and total tasks of course). 3 as in the current WUs isn't a lot. (Too high "max # of error tasks" would of course be bad if crashes were highly repeatable on independent hosts, such as with bad input parameters, but that's evidently not a problem at least currently.)
It's staying at 3. Some of the model perturbations lead to the model aborting, negative theta, levels crossing, too short timestep etc. We don't want to send out too many repeats of tasks which will always fail. 3 is usually enough to get past any wobbles and if necessary another small batch to rerun can be sent out.
Thanks, sounds good! If it's feasible to filter out those triple-errors which were not repeats of one and the same reproducible model failure, and turn them back into extra workunits, then that's obviously a lot better than a higher "max # of error tasks" setting.
ID: 68127
Glenn Carver

Joined: 29 Oct 17
Posts: 805
Credit: 13,593,584
RAC: 7,495
Message 68135 - Posted: 30 Jan 2023, 18:50:21 UTC - in response to Message 68127.  

Glenn Carver wrote:
xii5ku wrote:
Consider to increase the "max # of error tasks" workunit parameter (and total tasks of course). 3 as in the current WUs isn't a lot.
Some of the model perturbations lead to the model aborting, negative theta, levels crossing, etc. 3 is usually enough to get past any wobbles, if necessary another small batch to rerun can be sent out.
Thanks, sounds good! If it's feasible to filter out such triple-errors which were not repeats of one and the same reproducible model failure, and turn these back into extra workunits, then that's obviously a lot better than a higher "max # of error tasks" setting.
It's not possible to 'filter out' the triple errors (if I understand what you mean). We don't know a priori what the model will do with the applied perturbations until it has run. Also, we can get different answers from identical runs on different hardware, so a run that fails on an old Intel chip (for example) might work on a newer AMD machine. I have seen examples like this in the recent batches, though I can't show you one.

One day, when I get more time, I intend to look at the model perturbations arising from running identical tasks across the range of different hardware connected to CPDN. This was done in the early days of CPDN, when they ran the very long, big-batch climate simulations.
ID: 68135
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1059
Credit: 16,533,368
RAC: 1,871
Message 68137 - Posted: 30 Jan 2023, 19:42:18 UTC - in response to Message 68135.  

Also we can get different answers from identical runs from different hardware, so that a run that fails on an old Intel chip (for example), might work on a newer AMD machine. I have seen examples like this from the recent batches though can't show you an example.


I have to agree. I have received tasks that had failed for 4 previous users but completed successfully on my machine. With these OIFS tasks I have had no trouble at all. In recent memory I have been the third attempt on a bunch of tasks and completed them successfully, and usually the ones before me died of different problems.
ID: 68137
AndreyOR

Joined: 12 Apr 21
Posts: 247
Credit: 11,944,739
RAC: 23,061
Message 68140 - Posted: 31 Jan 2023, 9:03:14 UTC - in response to Message 68123.  

Forthcoming batches

Just out of a meeting this morning 30/1/23. There will be some 6500 workunits coming for the OpeniFS Baroclinic Lifecycle app (oifs_43r3_bl) for an experiment run by the University of Helsinki, hopefully in 2 weeks time. ....

Aren't there still at least 12,000 tasks from the current run to be processed by the end of February? I believe that was the number when the sending of new work was turned off a week or so ago. Any idea how close things are to it being turned back on?
ID: 68140