OpenIFS Discussion

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4346
Credit: 16,539,436
RAC: 5,891
Message 66690 - Posted: 1 Dec 2022, 11:57:48 UTC - in response to Message 66687.  
Last modified: 1 Dec 2022, 12:45:24 UTC

I think what's happened is CPDN have not set the memory usage limit high enough and, depending on what process does what when, it can blow past the limit. It's a working theory I want them to test.
I have had one of those failures,

  06:37:27 STEP 2509 H=2509:00 +CPU= 16.937
  06:37:44 STEP 2510 H=2510:00 +CPU= 16.658
  06:38:11 STEP 2511 H=2511:00 +CPU= 24.246
Suspend request received from the BOINC client, suspending the child process
double free or corruption (out) 
So, if I am understanding you correctly, CPDN specify a maximum amount of memory for the application to use and you get problems when (if) it goes above that? It clearly isn't lack of memory on the host machine, as this one has 32GB and only one task was running at the time.
ID: 66690
Glenn Carver

Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 66691 - Posted: 1 Dec 2022, 13:42:01 UTC - in response to Message 66690.  

So, if I am understanding you correctly, CPDN specify a maximum amount of memory for the application to use and you get problems when (if) it goes above that? It clearly isn't lack of memory on the host machine, as this one has 32GB and only one task was running at the time.
It may be an incorrect assumption, but I am presuming that the client either puts the processes in a 'sandbox' (chroot to a slot & restricts memory), or it's killing the process because it exceeds the memory limit, but then I would expect to see a message in the log that it's done that. Anyway, the limits are wrong so let's try the low-hanging fruit first before we try other things on volunteer machines. I'll be doing more testing on my machine in the meantime.

A lot of the failed tasks with double free happened right after the trickle files were zipped, so I was beginning to suspect that was a clue, but further checking showed that's not as common as I thought. Unfortunately there is not enough information coming back from the controlling wrapper when something goes wrong - something else I hope they will change.

I think I've also convinced Andy that we needed to do a more realistic batch test on the dev site, a much bigger batch with more volunteers, to test it as it would go out on the production site. We could have picked up these problems earlier had that been done. Typically the dev test site is used to check the server config as the model has already been tested to run standalone. To that end, the active users on this forum might get an invite soon to join the dev site.
ID: 66691
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,290,785
RAC: 11,182
Message 66692 - Posted: 1 Dec 2022, 13:58:55 UTC - in response to Message 66691.  

What many projects do is to create special short-running tasks for evaluation on their test sites. These would exercise all the major loops in the code, but cover a shorter time simulation. That way, you would start to see the results more quickly, and you might capture the totality of stderr within the 64 KB limit. Would there be any scope for that here?
ID: 66692
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1060
Credit: 16,541,651
RAC: 2,188
Message 66693 - Posted: 1 Dec 2022, 14:02:15 UTC - in response to Message 66691.  

Typically the dev test site is used to check the server config as the model has already been tested to run standalone. To that end, the active users on this forum might get an invite soon to join the dev site.


Would that invite be in our Inbox? Or some other way?

I assume those invited would be given instructions on how to participate.
ID: 66693
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4346
Credit: 16,539,436
RAC: 5,891
Message 66694 - Posted: 1 Dec 2022, 14:49:40 UTC - in response to Message 66693.  

Typically the dev test site is used to check the server config as the model has already been tested to run standalone. To that end, the active users on this forum might get an invite soon to join the dev site.


Would that invite be in our Inbox? Or some other way?

I assume those invited would be given instructions on how to participate.

Yes, it would come via in-box with instructions on how to join the dev site - It will show up as another project cpdn_boinc once anyone invited has joined.
ID: 66694
Glenn Carver

Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 66695 - Posted: 1 Dec 2022, 15:24:27 UTC - in response to Message 66692.  
Last modified: 1 Dec 2022, 15:27:30 UTC

What many projects do is to create special short-running tasks for evaluation on their test sites. These would exercise all the major loops in the code, but cover a shorter time simulation. That way, you would start to see the results more quickly, and you might capture the totality of stderr within the 64 KB limit. Would there be any scope for that here?
Absolutely. No need to run for the full 3 months. I'm more interested in capturing the way volunteers run the tasks on their machine (stuffed to the limit in some cases from what I've read!). I think that's the problem, we haven't tested at the scale we're running on the production site, so the first batch effectively becomes that test.
ID: 66695
Glenn Carver

Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 66696 - Posted: 1 Dec 2022, 15:34:05 UTC
Last modified: 1 Dec 2022, 15:35:52 UTC

Assume we get instructions...
It works much the same as climateprediction.net: you get tasks as usual & credit, but they may not always work.
ID: 66696
Steven

Joined: 28 Jun 14
Posts: 4
Credit: 8,570,955
RAC: 6
Message 66698 - Posted: 1 Dec 2022, 17:10:58 UTC
Last modified: 1 Dec 2022, 17:21:24 UTC

I'm getting all sorts of errors here. Been trying to budget 8GB of RAM per OpenIFS workunit.

This workunit ran to the end and then aborted? Did BOINC crash? I was running two at a time on this system with 16GB of RAM.
https://www.cpdn.org/result.php?resultid=22247140

<message>
Process still present 5 min after writing finish file; aborting</message>


This one failed with an upload error. Running one at a time since it only has 8GB of RAM.
https://www.cpdn.org/result.php?resultid=22246386

<message>
upload failure: <file_xfer_error>
  <file_name>oifs_43r3_ps_1304_2021050100_123_946_12164393_0_r264053712_122.zip</file_name>
  <error_code>-161 (not found)</error_code>
</file_xfer_error>
</message>


This one failed with code 9. Same machine as previous, but this one may have been running more than one for a while before I noticed the OpenIFS tasks were being sent out.
https://www.cpdn.org/result.php?resultid=22247027

<message>
process exited with code 9 (0x9, -247)</message>

double free or corruption (out)


This one ran for 15 hours and somehow has no output file?
https://www.cpdn.org/result.php?resultid=22245680

Same machine as previous, ran to the end and then had an upload failure.
https://www.cpdn.org/result.php?resultid=22245367

<message>
upload failure: <file_xfer_error>
  <file_name>oifs_43r3_ps_0334_2021050100_123_945_12163423_0_r1586639697_122.zip</file_name>
  <error_code>-161 (not found)</error_code>
</file_xfer_error>
</message>


Meanwhile, this old computer has been running two at a time with no trouble.
https://www.cpdn.org/show_host_detail.php?hostid=1526772

These are all on ethernet, sharing a switch. Don't think I'm running into bandwidth issues. Checked a couple of the machines for disk usage. BOINC has 100GB to play with with ~90GB free.

Quick edit: Another one just failed. Received this morning on a machine with 8GB of RAM. Running just one workunit. Ran for about 5 hours before failing. "Trickle up message pending" in BOINC manager. Hasn't been reported to the server yet. No output file in the folder, but there was this progress file, if it helps:

https://www.cpdn.org/result.php?resultid=22248845
<?xml version="1.0" encoding="utf-8"?>
<running_values>
  <last_cpu_time>19262.910000</last_cpu_time>
  <upload_file_number>44</upload_file_number>
  <last_iter>1059</last_iter>
  <last_upload>3801600</last_upload>
  <model_completed>0</model_completed>
</running_values>
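
For anyone who wants to read a progress file like the one above without squinting at the XML, here is a minimal Python sketch. The units are assumptions on my part, not something the wrapper documents here: I am guessing that last_cpu_time is CPU seconds and that last_upload is model time in seconds. The helper name read_progress is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The progress file exactly as pasted in this post.
PROGRESS_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<running_values>
  <last_cpu_time>19262.910000</last_cpu_time>
  <upload_file_number>44</upload_file_number>
  <last_iter>1059</last_iter>
  <last_upload>3801600</last_upload>
  <model_completed>0</model_completed>
</running_values>"""


def read_progress(xml_bytes):
    """Return the child elements of <running_values> as a tag -> text dict."""
    root = ET.fromstring(xml_bytes)
    return {child.tag: child.text for child in root}


values = read_progress(PROGRESS_XML)

# ASSUMPTION: last_cpu_time is in seconds of CPU time.
cpu_hours = float(values["last_cpu_time"]) / 3600
# ASSUMPTION: last_upload is model time in seconds.
upload_days = int(values["last_upload"]) / 86400

print(f"CPU time so far: {cpu_hours:.2f} h")
print(f"Last upload at model day: {upload_days:.1f}")
print(f"Last iteration: {values['last_iter']}")
print(f"Model completed flag: {values['model_completed']}")
```

If those unit guesses are right, the CPU time is consistent with the "about 5 hours" of run time, and the last upload lines up with upload file number 44.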
ID: 66698
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4346
Credit: 16,539,436
RAC: 5,891
Message 66699 - Posted: 1 Dec 2022, 18:29:16 UTC - in response to Message 66698.  

The
double free or corruption (out)
error is a problem with the model or the wrapper code. The failed uploads are because the model has crashed before producing the final upload(s), so they are missing when BOINC tries to upload them to the server once the task has finished. These errors are happening on machines of known good pedigree. Glenn is on the case and we may be doing a larger than normal batch over on the testing site to try and resolve this.
ID: 66699
wateroakley

Joined: 6 Aug 04
Posts: 185
Credit: 27,123,458
RAC: 3,218
Message 66700 - Posted: 1 Dec 2022, 19:48:21 UTC

To get the openIFS tasks to run on this VirtualBox ubuntu host: https://www.cpdn.org/results.php?hostid=1512045, I increased the ubuntu VM disc partition from 40GB to 100GB (gparted).

After five early openIFS successes, the subsequent tasks have crashed with one error or another. The event log has reported a lot of 'file absent' records, with no obvious local reason that I can see.
This afternoon I've increased the memory allocated to the ubuntu VM from 28GB to 32GB and reduced cpus (tasks running) from six to four.

On a positive note, after the reboot all the suspended tasks started up successfully!
ID: 66700
klepel

Joined: 9 Oct 04
Posts: 77
Credit: 68,010,357
RAC: 8,527
Message 66701 - Posted: 1 Dec 2022, 20:34:15 UTC - in response to Message 66689.  

Update.
After meeting yesterday with CPDN, the disk and memory requirements for these tasks need revising: memory requirement up & disk down. What was not taken into account when setting the memory was the additional amount required by the wrapper code & all the boinc functions it uses (such as zipping). Hopefully this will eliminate some of the memory errors.
The plan is to put out a repeat of the first batch with corrected limits to check how it performs before sending out the rest of this experiment.
Sure this will help!
On trickles, agree these longer (3 month) runs are producing too many trickle files which I'll adjust. However, I looked at the output filesize per output instance and it's reasonable and at the lower limit of what the scientist needs. I am reluctant to change it.
Understood! Hope fewer trickles might help with smoother uploads.
Question for ADSL people: knowing your bottleneck is network, are you happy just reducing the no. of tasks running concurrently? What's your sustainable data-flow rate you would be happy with (give me a number to work with).
I have no problem reducing the number of tasks running on my computers to fit into my ADSL bandwidth. I have to remind myself, I offer the scientist a certain amount of compute power, but they have to accept the offer – there are a lot of other worthy BOINC projects! (Hopefully I will remind myself of it when I next go out shopping for computer parts for climateprediction.net that I do not need for my personal daily computer requirements!) However, I am still concerned about how many climateprediction.net participants are reading the Forums and how many users are out there who have installed BOINC and attached to climateprediction.net, but never check their machines. You might end up with a lot of OpenIFS results piling up on computers with slow internet connections, wasting energy and resources and never helping science. I will send you a PM with my ADSL speed, so you have a number for the WUs I am likely to contribute each day. It is not much!
ID: 66701
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1060
Credit: 16,541,651
RAC: 2,188
Message 66703 - Posted: 1 Dec 2022, 20:54:56 UTC - in response to Message 66701.  

However, I am still concerned about how many climateprediction.net participants are reading the Forums and how many users are out there who have installed BOINC and attached to climateprediction.net, but never check their machines. You might end up with a lot of OpenIFS results piling up on computers with slow internet connections, wasting energy and resources and never helping science.


I notice, with favor, that these Oifs work units come with about a one-month expiry date instead of a one-year one the traditional work units come with.
ID: 66703
wateroakley

Joined: 6 Aug 04
Posts: 185
Credit: 27,123,458
RAC: 3,218
Message 66704 - Posted: 1 Dec 2022, 21:10:10 UTC - in response to Message 66701.  

ADSL people: knowing your bottleneck is network, are you happy just reducing the no. of tasks running concurrently? What's your sustainable data-flow rate you would be happy with (give me a number to work with).
The broadband uplink here is 12Mbps and downlink at 40Mbps. It's pretty consistent at that speed. The event log showed that uploads from six concurrent tasks over the past few days are taking 12-15 seconds each, which is not giving a network headache. A single new task download (3 jf_c... files) is less than two minutes.
ID: 66704
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4346
Credit: 16,539,436
RAC: 5,891
Message 66705 - Posted: 1 Dec 2022, 21:38:11 UTC

On bored band here with a max upload speed of about 100KB/s it can just about keep up with 2 tasks running at a time. Not a problem for me, as if they do build up I can just cut down to 1 task running till it catches up. For the lower numbers of tasks in testing runs, I sometimes tether my phone to get four times the throughput, but with a 15GB/month limit I won't be doing that for main site batches of these!
ID: 66705
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 66706 - Posted: 1 Dec 2022, 21:51:31 UTC - in response to Message 66689.  
Last modified: 1 Dec 2022, 21:58:24 UTC

Glenn Carver wrote:
On trickles, agree these longer (3 month) runs are producing too many trickle files which I'll adjust. However, I looked at the output filesize per output instance and it's reasonable and at the lower limit of what the scientist needs. I am reluctant to change it.

Question for ADSL people: knowing your bottleneck is network, are you happy just reducing the no. of tasks running concurrently? What's your sustainable data-flow rate you would be happy with (give me a number to work with).
If the scientist needs 1.72 GB result data per workunit, then that's what I'll be happily producing. After all, it's the data which the scientist desires, not the CPU cycles which produce them.

Going by the task properties of those in the first 3000s batch:

    Based on the CPUs, RAM and disk space which I have available, I could produce >330 results/day = 570 GB/day.
    If I switched on some older gear and let the flat become uncomfortably warm, it'd be >460 results/day = 790 GB/day.

    But based on my Internet uplink, I can deliver at most 8 Mbit/s = 84 GB/day in steady state, minus outages. (That's at most 48 results/day, minus outages.)

I have no trouble partitioning my currently running computers such that I produce ≤48 r/d for CPDN and have the rest of computer capacity busy at other projects.

If everyone had a narrow uplink like me (there are lesser links which they still call "broadband" here), and if you want >42,000 results done until X-Mas 2022, you would obviously need >36 people like me if they manage to nearly saturate their uplink the whole time. server_status.php claims there were 95 users at OpenIFS 43r3 Perturbed Surface during the last 24 hours, so that looks good. OTOH it seems only 1000 of the first 3000 tasks are done yet, so that does not look as good.
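
The arithmetic above can be redone in a few lines of Python, as a rough sketch rather than anything authoritative. The 84 GB/day, 1.72 GB per result and 42,000 results are the figures from this post; the 24 days (1 Dec to 25 Dec 2022) is my own assumption about what "until X-Mas" means, and the function names are made up for illustration. Note that a raw 8 Mbit/s works out to 86.4 GB/day, so the 84 GB/day used here presumably leaves a little room for overhead.

```python
import math

SECONDS_PER_DAY = 86_400


def raw_gb_per_day(uplink_mbit_s):
    """Theoretical ceiling: Mbit/s -> MB/s -> MB/day -> GB/day (decimal units)."""
    return uplink_mbit_s / 8 * SECONDS_PER_DAY / 1000


def results_per_day(gb_per_day, result_gb):
    """Whole results that fit through the uplink in one day."""
    return math.floor(gb_per_day / result_gb)


def contributors_needed(total_results, days, per_person_per_day):
    """How many uplink-limited contributors are needed to finish in time."""
    return total_results / (days * per_person_per_day)


ceiling = raw_gb_per_day(8)               # raw line rate, no overhead
per_day = results_per_day(84, 1.72)       # figures quoted in the post
# ASSUMPTION: 24 days between 1 Dec and 25 Dec 2022.
people = contributors_needed(42_000, 24, per_day)

print(f"Raw ceiling at 8 Mbit/s: {ceiling:.1f} GB/day")
print(f"Results per day at 84 GB/day: {per_day}")
print(f"Uplink-limited contributors needed: {people:.1f}")
```

Changing the uplink or the per-result size shows quickly how sensitive the ">36 people" estimate is to those two numbers.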

Obviously, from the comments in this thread, we have folks here who are bottlenecked by CPU, others by RAM, and others by transfer bandwidth.

ID: 66706
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 66708 - Posted: 1 Dec 2022, 22:34:27 UTC
Last modified: 1 Dec 2022, 22:40:21 UTC

I'll say one more thing about vboxwrapper, and then I'll stay away from this subject: If you look at LHC@home, the highest producers there run the native Linux ATLAS application, not any of the virtualized applications. And that's no coincidence. One of the reasons is a lot lower RAM requirement by the native application. (Also check out the "average computing" column at apps.php. Or anybody who ever took part in a contest at LHC@home knows very well that the native application is the way to go if computing throughput is of any concern at all.)
ID: 66708
SolarSyonyk

Joined: 7 Sep 16
Posts: 257
Credit: 31,952,725
RAC: 36,445
Message 66709 - Posted: 1 Dec 2022, 22:39:15 UTC
Last modified: 1 Dec 2022, 22:40:52 UTC

I've had several OOM, despite Boinc being set to use 90% of system RAM, on dedicated hardware (well, a VM dedicated to BOINC tasks in the winter).

https://www.cpdn.org/result.php?resultid=22247094 is one - the rest look identical, just a child task exited.

It's a hex-core VM with 12GB RAM - I would have assumed that BOINC would limit processes based on memory use, but that doesn't seem to be happening. I'll pull a couple cores out of it for future units, but however the math is happening, OpenIFS tasks are OOMing easily.
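
As a back-of-envelope check on how many tasks a box like this can hold, here is a small Python sketch. It is only a budgeting aid under stated assumptions: the per-task peak is a placeholder (the 8 GB figure is borrowed from the per-workunit budget mentioned earlier in this thread, and the 5 GB figure is purely hypothetical), and the function name is invented for illustration. It does not reflect how the BOINC client itself schedules work.

```python
import math


def max_concurrent_tasks(total_ram_gb, boinc_ram_fraction, per_task_peak_gb, cores):
    """Tasks that fit if every task hits its peak memory at the same moment."""
    usable_gb = total_ram_gb * boinc_ram_fraction
    by_memory = math.floor(usable_gb / per_task_peak_gb)
    # Never more tasks than cores, never a negative count.
    return max(0, min(cores, by_memory))


# The VM described above: 6 cores, 12 GB RAM, BOINC allowed 90% of RAM.
# ASSUMPTION: 8 GB peak per task, taken from the budget used earlier in the thread.
conservative = max_concurrent_tasks(12, 0.9, 8, 6)
# ASSUMPTION: 5 GB peak per task, a hypothetical lower figure for comparison.
optimistic = max_concurrent_tasks(12, 0.9, 5, 6)

print(f"8 GB per task: run at most {conservative} at once")
print(f"5 GB per task: run at most {optimistic} at once")
```

Either way the answer is well below six, which fits with tasks being killed when all cores are loaded with OpenIFS work.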
ID: 66709
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1060
Credit: 16,541,651
RAC: 2,188
Message 66710 - Posted: 1 Dec 2022, 22:53:52 UTC - in response to Message 66706.  

Obviously, from the comments in this thread, we have folks here who are bottlenecked by CPU, others by RAM, and others by transfer bandwidth.


I think I am bottle-necked by the size of my processor cache. My CPU is pretty fast, with 64 GBytes RAM, and I get 75 Megabits per second on my fiber-optic Internet connection. My other computer is a little one running Windows 10, and it spends most of its life doing Boinc, but not OpenIFS.

Memory 	62.28 GB
Cache 	16896 KB
Swap space 	15.62 GB
Total disk space 	488.04 GB
Free Disk Space 	477.76 GB
Measured floating point speed 	6.13 billion ops/sec
Measured integer speed 	26.09 billion ops/sec
Average upload rate 	4480.76 KB/sec
Average download rate 	45235.53 KB/sec


(I think the data rates reported by Boinc-CPDN are really Kilobits per second, not KiloBytes per second.)
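
A quick plausibility check of that suspicion, as a Python sketch. It only compares the two reported averages against the 75 Mbit/s line speed under each reading of the unit; it assumes decimal units (1 KB = 1000 bytes) and says nothing about how BOINC actually measures the rates. The names are made up for illustration.

```python
LINE_MBIT_S = 75.0  # stated speed of the fibre connection

# Averages as shown on the host details page.
reported = {"upload": 4480.76, "download": 45235.53}


def as_mbit_s(value, unit):
    """Convert a reported 'K.../sec' figure to Mbit/s under a given reading."""
    if unit == "kilobytes":
        return value * 8 / 1000
    if unit == "kilobits":
        return value / 1000
    raise ValueError(f"unknown unit: {unit}")


for direction, value in reported.items():
    for unit in ("kilobytes", "kilobits"):
        rate = as_mbit_s(value, unit)
        verdict = "fits" if rate <= LINE_MBIT_S else "exceeds the line speed"
        print(f"{direction:8s} read as {unit:9s}: {rate:7.2f} Mbit/s ({verdict})")
```

Read as kilobytes, the download average would be several times faster than the line can carry, so the kilobits reading looks more believable.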

Right now I am running three Oifs tasks, three Rosetta tasks, three WCG tasks, two Einstein tasks, and one (single-processor) MilkyWay task. This shows my machine's cache-miss ratio, so the hit ratio would be 50.45%. Not too bad, but not wonderful either. Other than the 12 boinc processes, the machine is not doing much else at the moment (apart from following my typing into Firefox, which is doing nothing else).

# perf stat -aB -e cache-references,cache-misses
 Performance counter stats for 'system wide':

    20,626,539,435      cache-references                                            
    10,220,773,584      cache-misses              #   49.552 % of all cache refs    

      61.867007273 seconds time elapsed
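
For the record, the hit ratio follows directly from the two counters in the perf output above. A minimal Python sketch (the function name is just for illustration):

```python
def cache_ratios(references, misses):
    """Return (miss %, hit %) from perf's cache-references and cache-misses."""
    miss_pct = misses / references * 100
    return miss_pct, 100 - miss_pct


# Counter values copied from the perf stat output.
miss_pct, hit_pct = cache_ratios(20_626_539_435, 10_220_773_584)

print(f"miss ratio: {miss_pct:.3f} %")
print(f"hit ratio:  {hit_pct:.2f} %")
```

Keep in mind these are system-wide counters over about a minute, so they cover all 12 BOINC processes together rather than the OpenIFS tasks alone.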

ID: 66710
Vato

Joined: 4 Oct 19
Posts: 13
Credit: 7,300,561
RAC: 14,819
Message 66711 - Posted: 2 Dec 2022, 1:21:07 UTC - in response to Message 66691.  

i will happily run tests on the dev server if invited.
so far i have 9 tasks that appear to run well - no credit though
ID: 66711
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4346
Credit: 16,539,436
RAC: 5,891
Message 66712 - Posted: 2 Dec 2022, 5:39:51 UTC

It's a hex-core VM with 12GB RAM - I would have assumed that BOINC would limit processes based on memory use, but that doesn't seem to be happening. I'll pull a couple cores out of it for future units, but however the math is happening, OpenIFS tasks are OOMing easily.
During early stages of these on the testing site, I was able to run 4 tasks on a box that had only 8GB RAM. That laptop is now dead, but it did it, albeit at a massive hit on speed because it was swapping to disk every time two or more tasks peaked in memory usage at the same time. There wasn't much of a hit when only running 2 at once. But sadly, the client will not limit how many tasks it will run based on memory. I have the whole of the laptop SSD boot disk that I salvaged as swap on this machine, so 128GB, but am not trying to run 16 or even 8 tasks at once because connection bandwidth is my bottleneck.
ID: 66712

©2024 climateprediction.net