OpenIFS Discussion

Message boards : Number crunching : OpenIFS Discussion

Glenn Carver

Joined: 29 Oct 17
Posts: 815
Credit: 13,663,992
RAC: 8,399
Message 66981 - Posted: 21 Dec 2022, 0:54:33 UTC - in response to Message 66980.  

The 2min sleep is there just to let all the file operations complete and make sure everything is flushed from memory.

I don't understand why you suspended it a few times? It would have worked without? The trickle file is quite small.
New day, new task!
This one had 90 minutes left to run but would finish while I was out; it also had a trickle file that had been sitting there for 20 minutes or so.
I suspended the task (the event log shows this), and when I resumed it after getting back in, the trickle file had disappeared.

Expecting the final trickle file to also take a while to go, I clicked Suspend task as soon as that trickle file appeared after the last zip file.
The status of the task changed to suspended, but I'm aware of what the critical_section code does on these occasions.
After a few minutes, probably the 2-minute sleep in the wrapper, the status of the task changed to Completed even though the final trickle file was still present in the project folder.
A few more minutes passed before I clicked on Update.
The task disappeared from the list of tasks, the trickle file was deleted, and the task now shows as completed in my list of tasks on the server!
ID: 66981
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 66982 - Posted: 21 Dec 2022, 6:29:40 UTC - in response to Message 66978.  
Last modified: 21 Dec 2022, 7:24:56 UTC

Jean-David Beyer wrote:
Glenn Carver wrote:
the server has been moved onto a newer, managed cloud service with better resources.
It sure has. I am now running four Oifs _ps tasks at a time and was watching the Transfers tab as things went up. Often I got 3 to 4 megabytes/second upload speeds each when two tasks were uploading at once, and I notice one that got over 7 megabytes/second when uploading alone. Once in a while, I get one considerably slower. My Internet connection at my end is 75 megabits/second or 9.375 Megabytes/second.

These trickles normally go up in about 6 seconds, but the slowest one I remember took almost two minutes.
As noted before, I still get occasional HTTP transfer timeouts. Not particularly frequent, but at the moment they have nevertheless a considerable negative effect:

– I am located in Germany.
– I have got a 1 MByte/s uplink, which would allow me to run 33 OIFS tasks in parallel (based on 16.5 h average task duration).
– Yet I cautiously set up my currently two connected computers to run only 20 + 2 tasks at once.
– Even so, since new work became available ~11 hours ago, the two computers have built up a backlog of a few files to be retried and many new files pending upload.

That is, the timeouts are frequent enough that the client spends a considerable time not uploading, but waiting for timeout.

On the computer with 20 concurrently running tasks, I now increased cc_config::options::max_file_xfers_per_project from 2 to 4 in hope that the probability of such idle time is much reduced, and that my upload backlog clears eventually. [Edit: Upped it to 6 now.]

(The files always upload at my full bandwidth of 1 MByte/s. It's just that a random file will eventually stop transferring at a random point in time, with the client waiting on it until the default transfer timeout. As I mentioned in New work discussion 2, #66939, the client's first several upload retries of such a file will then fail because the server's upload handler still has the file locked. Only a long while later will there eventually be a successful retry. In the meantime, the client will of course successfully transfer many other files. But even so, 2 max_file_xfers_per_project were evidently not enough to get me anywhere near my theoretical upload capacity of 1 MByte/s sustained.)
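For reference, the relevant cc_config.xml fragment looks like this (a minimal sketch only; the file location assumes the default Linux data directory, e.g. /var/lib/boinc-client/cc_config.xml, and the value shown is just my current setting):

<cc_config>
  <options>
    <max_file_xfers_per_project>6</max_file_xfers_per_project>
  </options>
</cc_config>

A running client can re-read the file without a restart:
% boinccmd --read_cc_config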
ID: 66982
Conan

Joined: 6 Jul 06
Posts: 141
Credit: 3,511,752
RAC: 144,072
Message 66983 - Posted: 21 Dec 2022, 7:43:58 UTC

These Oifs _ps tasks really test your system out.

Running 9 at once, each using from 2.7 to 4.2 GB of RAM, after 2 hours run time they have written 11.3 GB of data to disk each (101.7 GB), which is huge.
Hitting 50 GB of RAM in use out of 64 GB, but I am also running LODA tasks which each use 1 GB of RAM. All 24 threads are running.
12% in and running fine so far.

Conan
ID: 66983
PDW

Joined: 29 Nov 17
Posts: 55
Credit: 6,519,993
RAC: 1,703
Message 66984 - Posted: 21 Dec 2022, 7:48:48 UTC - in response to Message 66981.  

I don't understand why you suspended it a few times? It would have worked without? The trickle file is quite small.

I wanted to be around for when it finished to see what happened, part of that was to give the pending trickle a chance to go.
I was under the impression we were thinking that the trickles failing to upload was a cause for concern and might be why some valid work was being marked invalid.
I've now seen tasks both fail and succeed with trickle files still present after the tasks have completed.
ID: 66984
Glenn Carver

Joined: 29 Oct 17
Posts: 815
Credit: 13,663,992
RAC: 8,399
Message 66986 - Posted: 21 Dec 2022, 9:57:06 UTC - in response to Message 66984.  

Ah great, for a second I thought something else was going wrong! I think the trickles are ok. From what I can see the zip file to be trickled gets moved to the main ./projects/climateprediction.net directory and handed over to the client to deal with.

I don't understand why you suspended it a few times? It would have worked without? The trickle file is quite small.

I wanted to be around for when it finished to see what happened, part of that was to give the pending trickle a chance to go.
I was under the impression we were thinking that the trickles failing to upload was a cause for concern and might be why some valid work was being marked invalid.
I've now seen tasks both fail and succeed with trickle files still present after the tasks have completed.
ID: 66986
Glenn Carver

Joined: 29 Oct 17
Posts: 815
Credit: 13,663,992
RAC: 8,399
Message 66987 - Posted: 21 Dec 2022, 10:15:17 UTC - in response to Message 66983.  
Last modified: 21 Dec 2022, 10:21:48 UTC

Conan,

I'm puzzled by your file sizes. 11.3 GB per task is too high. Exactly how did you come up with that figure?

For example, if I go into my slot/0 directory with an oifs task running:
% cd slots/0
% du -hs .     #  note the '.'
1.2G    .

and then do the same in the projects/climateprediction.net directory where the trickle uploads live (this will also include all the Hadley models, so it's an overestimate, but just for illustration):
% cd projects/climateprediction.net
% du -hs .
1.4G    .
So that's a total of 2.6 GB; I'm nowhere near your value of 11 GB, and my model has also been running for 2 hrs. Something is not right there.

Can you please check something for me? Go back into your slot directory and type:
du -hs srf*
and report back with the output. The 'srf' file is one of the model's restart (checkpoint) files, the biggest one. If you have more than one of these, it's a sign the model is restarting frequently. In that case, check that you have enabled "Leave non-GPU tasks in memory while suspended" under the 'Disk & Memory' tab in 'Computing preferences' for boincmgr. This setting is important. If it is NOT selected, every time the boinc client suspends the task the model is at risk of being pushed from memory, which effectively shuts it down and it has to restart again (I think).

The model can accumulate restart files in the slot directory if it is frequently restarted. The model will normally delete the old restart/checkpoint files as it runs, but if it has to restart, it leaves the old one behind as a safeguard. The problem, of course, is that if it restarts frequently these files accumulate. I have seen several task reports with disk-full errors, which makes me think this is happening.

So, bottom line:
1/ Please check how many srf files you have in the slot directory and report back.
2/ You can safely delete the OLDER srf files to recover some space. By old, I mean the oldest dated files as shown by 'ls -l'. But you MUST leave the most recent one (one safe way to do this is sketched just below).
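For illustration only, a sketch of one way to do step 2/ from the slot directory (assumes GNU coreutils; check the output of the first two commands before running the last one):
% ls -lt srf*                                # newest first
% ls -t srf* | tail -n +2                    # everything except the newest
% ls -t srf* | tail -n +2 | xargs -r rm --   # delete only the older files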

Let me know. Hope that makes sense.

Cheers, Glenn

p.s. I did have a smile at your 'test your system' comment. This is the model at its smallest configuration with minimal output. You haven't seen anything yet... :)

These Oifs _ps tasks really test your system out.

Running 9 at once, each using from 2.7 to 4.2 GB of RAM, after 2 hours run time they have written 11.3 GB of data to disk each (101.7 GB), which is huge.
Hitting 50 GB of RAM in use out of 64 GB, but I am also running LODA tasks which each use 1 GB of RAM. All 24 threads are running.
12% in and running fine so far.

Conan
ID: 66987
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,352,910
RAC: 10,281
Message 66988 - Posted: 21 Dec 2022, 10:47:33 UTC

Glenn, can we please be careful to distinguish between 'trickles', which are tiny administrative fragments, and 'uploads', which are substantial scientific data? They each have their own quirks and foibles.

I suspect the query about file write aggregate sizes may be a confusion between 'process' and 'resultant' sizes. If the app writes a group of individual data files to disk, and then compresses them into a single zip, then the process as a whole writes far more to disk than the final size of the uploaded zip.

I'm running the first four from this batch, and finalising the other work which has been keeping them warm between batches. I should be able to do the first full analysis of a file completion later this afternoon.
ID: 66988
Glenn Carver

Joined: 29 Oct 17
Posts: 815
Credit: 13,663,992
RAC: 8,399
Message 66989 - Posted: 21 Dec 2022, 11:01:24 UTC - in response to Message 66988.  

Richard, perhaps there's a misunderstanding. CPDN use 'trickles' to upload model results. Unpack the trickle zip files and it will contain the model output files from the previous steps since the last trickle upload. The final upload file just contains the last remaining model outputs. I don't know how other projects use trickles but they are not administrative fragments, they contain actual model (scientific) data.

Glenn, can we please be careful to distinguish between 'trickles', which are tiny administrative fragments, and 'uploads', which are substantial scientific data? They each have their own quirks and foibles.

I suspect the query about file write aggregate sizes may be a confusion between 'process' and 'resultant' sizes. If the app writes a group of individual data files to disk, and then compresses them into a single zip, then the process as a whole writes far more to disk than the final size of the uploaded zip.

I'm running the first four from this batch, and finalising the other work which has been keeping them warm between batches. I should be able to do the first full analysis of a file completion later this afternoon.
ID: 66989
Conan

Joined: 6 Jul 06
Posts: 141
Credit: 3,511,752
RAC: 144,072
Message 66990 - Posted: 21 Dec 2022, 11:17:05 UTC

G'Day Glenn,

You may have misread what I wrote, I think.

The 11.3 GB was not a file size but the amount of disk writes made in that first 2 hours (now after 5 hours well over 30 Gb).
The 2.7 to 4.6 GB were RAM amounts that each work unit was using.

This was all taken from System Monitor.

I did what you have asked and

% cd slots/26
% du -hs . # note the '.'
1.2G .

This is the same as your example.

% cd projects/climateprediction.net
% du -hs .
1.2G .

This is similar to your example.

du -hs srf*

768 MB srf00370000.0001

So all running fine, so maybe just a bit of a misunderstanding I think with data amounts and RAM usage.

Thanks
Conan
ID: 66990
Glenn Carver

Joined: 29 Oct 17
Posts: 815
Credit: 13,663,992
RAC: 8,399
Message 66992 - Posted: 21 Dec 2022, 13:12:57 UTC - in response to Message 66990.  

Ah, excellent. That's clearer.

Ok, so what you're seeing there is the I/O from the restart/checkpoint files (I call them restarts, boinc calls them checkpoints). Each set of restart files is just under 1 GB. The model writes out these files once per model day, every 24 model steps. As it deletes the previous restart files, you only ever see the most recent ones. These PS tasks are quite long and will run for 2952 hrs (steps), so you'll get 123 sets of restart files, or about 123 GB in write I/O from the restart files alone. The model output that goes back to CPDN is very much smaller by comparison. If that I/O proves too much, let me know. The model has to write these files in full precision to allow an exact restart (64-bit floating point, not to be confused with 64-bit/32-bit operating systems).

This is one of the things we have to balance when bringing any meteorological model to boinc: reduce the I/O, which means the model has to repeat more steps and takes longer to finish after a restart, or minimise task run time at the expense of more I/O.

I recall an earlier thread in which someone (I forget who) was asking why volunteers can't adjust the checkpointing frequency. Well, you have the explanation. If you turned up the checkpointing frequency so it happened every model step, it would thrash the system and slow down the task as the model blocks until the I/O is complete.
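To put rough numbers on that, using the figures above (each restart set taken as roughly 1 GB):
  123 restart sets x ~1 GB ≈ 123 GB of checkpoint writes per task (one set per model day, the current setting)
 2952 restart sets x ~1 GB ≈ 2.9 TB of checkpoint writes per task (one set per model step)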

G'Day Glenn,

You may have misread what I wrote, I think.

The 11.3 GB was not a file size but the amount of disk writes made in that first 2 hours (now after 5 hours well over 30 Gb).
The 2.7 to 4.6 GB were RAM amounts that each work unit was using.

This was all taken from System Monitor.

I did what you have asked and

% cd slots/26
% du -hs . # note the '.'
1.2G .

This is the same as your example.

% cd projects/climateprediction.net
% du -hs .
1.2G .

This is similar to your example.

du -hs srf*

768 MB srf00370000.0001

So all running fine, so maybe just a bit of a misunderstanding I think with data amounts and RAM usage.

Thanks
Conan
ID: 66992
PDW

Joined: 29 Nov 17
Posts: 55
Credit: 6,519,993
RAC: 1,703
Message 66993 - Posted: 21 Dec 2022, 13:20:27 UTC - in response to Message 66989.  

Richard, perhaps there's a misunderstanding. CPDN use 'trickles' to upload model results. Unpack the trickle zip files and it will contain the model output files from the previous steps since the last trickle upload. The final upload file just contains the last remaining model outputs. I don't know how other projects use trickles but they are not administrative fragments, they contain actual model (scientific) data.

The trickle files (named as trickle files) and the zip files are being treated differently.
The zip files appear in the Transfer tab in BOINC Manager whereas the trickle files do not.
The zip files are large and contain model data, the trickle files are tiny and look like this...

<variety>orig</variety>
<wu>oifs_43r3_ps_1351_2021050100_123_946_12164440</wu>
<result>oifs_43r3_ps_1351_2021050100_123_946_12164440_2_r1535625001</result>
<ph></ph>
<ts>10368000</ts>
<cp>60877</cp>
<vr></vr>
ID: 66993
Glenn Carver

Joined: 29 Oct 17
Posts: 815
Credit: 13,663,992
RAC: 8,399
Message 66994 - Posted: 21 Dec 2022, 15:19:47 UTC - in response to Message 66993.  

The trickle & zip upload files are initiated at the same time by a single boinc function call in the code. They are not separate steps. The trickle file contains the filename of the zip to be uploaded. It's terminology largely, but I think of the 'trickle' as this initiation of transfers.

Richard, perhaps there's a misunderstanding. CPDN use 'trickles' to upload model results. Unpack the trickle zip files and it will contain the model output files from the previous steps since the last trickle upload. The final upload file just contains the last remaining model outputs. I don't know how other projects use trickles but they are not administrative fragments, they contain actual model (scientific) data.

The trickle files (named as trickle files) and the zip files are being treated differently.
The zip files appear in the Transfer tab in BOINC Manager whereas the trickle files do not.
The zip files are large and contain model data, the trickle files are tiny and look like this...

<variety>orig</variety>
<wu>oifs_43r3_ps_1351_2021050100_123_946_12164440</wu>
<result>oifs_43r3_ps_1351_2021050100_123_946_12164440_2_r1535625001</result>
<ph></ph>
<ts>10368000</ts>
<cp>60877</cp>
<vr></vr>
ID: 66994
Glenn Carver

Joined: 29 Oct 17
Posts: 815
Credit: 13,663,992
RAC: 8,399
Message 66995 - Posted: 21 Dec 2022, 17:05:25 UTC

Adjusting write I/O from OpenIFS tasks

Further to Conan's point about the amount of write I/O: it can be adjusted, but only once a task has started running. The adjustment reduces the checkpoint frequency, meaning that if the model does have to restart after a shutdown, it will have to repeat more steps. This change does NOT affect the model's scientific output, as that is controlled separately.

ONLY make this change if you leave the model running near-continuously with minimal possibility of a restart. Do NOT do it if you often shut down the PC or the boinc client, otherwise it will hamper the progress of the task. If in doubt, just leave it.

To make the change:
1/ shut down the boinc client and make sure all processes with 'oifs' in their name have gone.
2/ change to the slot directory.
3/ make a backup copy of the fort.4 file (just in case): cp fort.4 fort.4.old
4/ edit the text file fort.4, locate the line:
NFRRES=-24,
and change it to:
NFRRES=-72,
Preserve the minus sign and the comma. This will reduce the checkpoint frequency from 1 day (24 model hrs) to 3 days (72 model hrs). But, it will mean the model might have to repeat as many as 3 model days if it has to restart.
5/ restart the boinc client.

The changes can only be made once the model has started in a slot directory, not before.
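For those comfortable with a terminal, the steps above might look something like this (a sketch only; paths and the service name assume a default Linux package install, and slots/0 must be replaced with the slot of the task you want to change):
% sudo systemctl stop boinc-client                 # 1/ stop the client
% pgrep -a oifs                                    #    should print nothing
% cd /var/lib/boinc-client/slots/0                 # 2/ the task's slot directory
% sudo cp fort.4 fort.4.old                        # 3/ backup
% sudo sed -i 's/NFRRES=-24,/NFRRES=-72,/' fort.4  # 4/ checkpoint every 72 model hours
% grep NFRRES fort.4                               #    confirm the change
% sudo systemctl start boinc-client                # 5/ restart the client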
ID: 66995
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 66996 - Posted: 21 Dec 2022, 19:15:12 UTC - in response to Message 66982.  

Half a day ago xii5ku wrote:
As noted before, I still get occasional HTTP transfer timeouts. [And this caused a quick buildup of an upload backlog, even though I am running much fewer tasks at once than my Internet connection could sustain if transfers always succeeded.] On the computer with 20 concurrently running tasks, I now increased cc_config::options::max_file_xfers_per_project from 2 to 4 in hope that the probability of such idle time is much reduced, and that my upload backlog clears eventually. [Edit: Upped it to 6 now.]
As desired, the upload backlog is now cleared. Just a certain amount of files to be retried after previous timeouts is still left.
ID: 66996
wateroakley

Joined: 6 Aug 04
Posts: 187
Credit: 27,124,308
RAC: 1,166
Message 66997 - Posted: 21 Dec 2022, 21:07:41 UTC - in response to Message 66966.  
Last modified: 21 Dec 2022, 21:12:41 UTC

It's possible AMD chips are triggering memory bugs in the code depending on what else happens to be in memory at the same time (hence the seemingly random nature of the fail). Hard to say exactly at the moment, but it could also be something system/hardware related specific to Ryzens. I have never seen the model fail like this before on the processors I've worked with in the past (none of which were AMD unfortunately). I am tempted to turn down the optimization and see what happens....

I did a little bit of searching and found 3 tasks that failed with the errors you described on Intel processors. I think it might be too early to attribute these errors specifically to Ryzen.
https://www.cpdn.org/result.php?resultid=22245369
Exit status 5 (0x00000005) Unknown error code
double free or corruption (out)

Ah thanks. That's useful, and the first time I've seen an error code 5 (I/O error), but it is consistent with something file related. Frustrating that there is no traceback from the model - which is why I didn't think it was the model in the first place.

I can spend time looking at this but it's the law of diminishing returns. The OpenIFS batch error rate is now 10%, HadSM4's is about 5%. It would be nice to get it lower but I also need to move the higher resolution work on which I've not been able to start yet. This is quite important to attract scientists to the platform.

Hello Glenn,

https://www.cpdn.org/result.php?resultid=22252269

This task crashed earlier today with double free or corruption (out). It's an IFS task running in an Ubuntu 20.04 VM under VirtualBox on an Intel i7-8700 with a Win10 host (https://www.cpdn.org/show_host_detail.php?hostid=1512045). The VM has 32 GB RAM assigned (40 GB physical) and about 100 GB of disc (2 TB physical).


The only touches today around 6:00am:
a) In the Win10 host, I updated our daily energy usage in excel and saved the two files.
b) In the ubuntu VM, I looked to see what had happened overnight in the BOINC event log.

No changes to the ubuntu host or BOINC manager. No stops or restarts, no config changes.

The other five IFS tasks have about two hours to go.


Stderr



06:00:54 STEP 1054 H=1054:00 +CPU= 22.385
06:01:16 STEP 1055 H=1055:00 +CPU= 21.665
06:01:57 STEP 1056 H=1056:00 +CPU= 39.203
Moving to projects directory: /var/lib/boinc-client/slots/0/ICMGGhq0f+001056
Moving to projects directory: /var/lib/boinc-client/slots/0/ICMSHhq0f+001056
Moving to projects directory: /var/lib/boinc-client/slots/0/ICMUAhq0f+001056
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12168039/ICMGGhq0f+001032
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12168039/ICMSHhq0f+001032
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12168039/ICMUAhq0f+001032
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12168039/ICMGGhq0f+001044
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12168039/ICMSHhq0f+001044
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_12168039/ICMUAhq0f+001044
Zipping up the intermediate file: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_0395_1982050100_123_951_12168039_0_r1962384054_43.zip
Uploading the intermediate file: upload_file_43.zip
06:02:19 STEP 1057 H=1057:00 +CPU= 20.970
double free or corruption (out)


If there is anything particular you would like me to look at and report, please let me know.


ID: 66997
Glenn Carver

Joined: 29 Oct 17
Posts: 815
Credit: 13,663,992
RAC: 8,399
Message 66998 - Posted: 21 Dec 2022, 22:22:08 UTC - in response to Message 66997.  
Last modified: 21 Dec 2022, 22:35:28 UTC

06:02:19 STEP 1057 H=1057:00 +CPU= 20.970
double free or corruption (out)
If there is anything particular you would like me to look at and report, please let me know.
Yes, I've been watching the returned tasks for the errored tasks (seeing error codes 1, 5 & 9 mostly).

If you could kindly check your /var/log/syslog file for an entry around the time the task failed: there should be mention of an 'oifs_43r3_' process being killed. Let me know what you find.

If you don't have a syslog file you might have a /var/log/messages file instead. If you have neither, it means the syslog service hasn't started (often an issue on WSL); run:
sudo service rsyslog start
which will create the /var/log/syslog file.
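For example, something along these lines (a sketch only; log file names vary by distribution, and you may need sudo to read them):
% grep -i 'oifs_43r3' /var/log/syslog | tail -n 20
% grep -iE 'out of memory|oom-kill|killed process' /var/log/syslog | tail -n 20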

Out of interest, how many tasks did you have running compared to how many cores? I have an 11th gen and a 3rd gen Intel i7 and the model has never crashed like this for me. The only suggestion I can make is not to put too many tasks on the machine. Random memory issues like this can depend on how busy memory is. I run one fewer task than I have cores (note cores, not threads), i.e. 3 tasks max on a 4-core machine. So far, touch wood, it's never crashed and I'm nowhere near my total RAM. I was going to do a test by letting more tasks run to see what happens, once I've done a few successfully. It's quite tough to debug without being able to reproduce it.

thx.
ID: 66998
Conan

Joined: 6 Jul 06
Posts: 141
Credit: 3,511,752
RAC: 144,072
Message 66999 - Posted: 22 Dec 2022, 0:05:21 UTC
Last modified: 22 Dec 2022, 0:07:03 UTC

All 9 work units that I had running overnight have completed successfully.

Running on an AMD Ryzen 9 5900x, 64GB RAM, all 24 threads used to run BOINC programmes at the same time as the ClimatePrediction models.
All took around 17 hours 10 minutes run time.

Conan
ID: 66999
Glenn Carver

Joined: 29 Oct 17
Posts: 815
Credit: 13,663,992
RAC: 8,399
Message 67000 - Posted: 22 Dec 2022, 0:54:29 UTC - in response to Message 66999.  
Last modified: 22 Dec 2022, 0:57:14 UTC

All 9 work units that I had running overnight have completed successfully.

Running on an AMD Ryzen 9 5900x, 64GB RAM, all 24 threads used to run BOINC programmes at the same time as the ClimatePrediction models.
All took around 17 hours 10 minutes run time.

Conan
The memory requirement for the OpenIFS PS app is 5 GB, so 9 x 5 = 45 GB RAM max; with other boinc apps running as well, maybe not a lot of memory headroom? With a 5900X I would expect runtimes nearer 12 hrs, so maybe there is memory contention (though that depends on what the %age CPU usage in boincmgr is set to).

Personally, I wouldn't advise running like this. There's a memory issue with the OpenIFS PS app, and the higher the memory pressure, the more likely you are to have a failure at some point, I suspect. On the other hand, if you never get a failure in all the tasks you run, that will be interesting (and a puzzle!). Let me know in the New Year!
ID: 67000
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1067
Credit: 16,546,621
RAC: 2,321
Message 67001 - Posted: 22 Dec 2022, 1:26:30 UTC - in response to Message 67000.  

My machine is running four oifs_43r3_ps_1.05 tasks at a time. If some other tasks oifs_43r3_BaroclinicLifecycle_v1.07 showed up, I could run them at the same time. That happened only once.

The three with around 14 hours on them will complete within about an hour. The last one, in slot 2, started later, so it will need about an additional hour. I have had no failures with any of the OIFS model tasks (so far, anyway).

My machine is Computer 1511241. It is running eight more Boinc processes, but little else. I am nowhere near running out of RAM. It is true that there are only 4.337 GBytes of RAM listed as free, but the number that really matters is the 43.454 GBytes listed in avail Mem.

top - 20:00:30 up 5 days, 11:38,  1 user,  load average: 12.37, 12.35, 12.47
Tasks: 469 total,  14 running, 455 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us,  6.4 sy, 68.5 ni, 24.3 id,  0.1 wa,  0.2 hi,  0.1 si,  0.0 st
MiB Mem :  63772.8 total,   4337.9 free,  19480.7 used,  39954.2 buff/cache
MiB Swap:  15992.0 total,  15756.7 free,    235.2 used.  43454.1 avail Mem 

    PID    PPID USER      PR  NI S    RES  %MEM  %CPU  P     TIME+ COMMAND                                                                   
 477613  477610 boinc     39  19 R   3.9g   6.2  98.9  1 746:00.42 /var/lib/boinc/slots/2/oifs_43r3_model.exe                                
 472285  472281 boinc     39  19 R   3.8g   6.2  98.9  5 853:39.76 /var/lib/boinc/slots/10/oifs_43r3_model.exe                               
 472215  472212 boinc     39  19 R   3.7g   5.9  98.9 10 855:28.23 /var/lib/boinc/slots/9/oifs_43r3_model.exe                                
 472332  472329 boinc     39  19 R   3.3g   5.3  98.9  9 852:20.58 /var/lib/boinc/slots/11/oifs_43r3_model.exe   
 

ID: 67001
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 67002 - Posted: 22 Dec 2022, 6:15:42 UTC
Last modified: 22 Dec 2022, 6:58:05 UTC

Typical RAM usage is quite a bit less than peak RAM usage. The peak happens briefly and, I presume, periodically during every nth(?) timestep. Concurrently running tasks are not very likely to reach such a peak simultaneously.

From this follows:
– On hosts with not a lot of RAM, the number of concurrently running tasks should be sized with the peak RAM demand in mind.
– On hosts with a lot of RAM, the number of concurrently running tasks can be sized for a figure somewhere between average and peak RAM demand per task.

The boinc client watches overall RAM usage and puts work into a waiting state if the configured RAM limit is exceeded, but from what I understand, this built-in control mechanism has difficulty coping with fast fluctuations of RAM usage like OIFS's.

(Edit: And let's not forget about disk usage per task. Sometimes, computers or VMs which are dedicated to distributed computing have comparably small mass storage attached. Though I suppose the workunit parameter of rsc_disk_bound = 7 GiB is already set suitably. From what I understand, this covers a worst case of a longer period of disconnected network, i.e. accumulated result data. I am not sure though in which way the boinc client takes rsc_disk_bound into account when starting tasks.)

(Edit 2: Exactly 7.0 GiB is not enough, though, when all of the 122 intermediate result files – already compressed – haven't been uploaded yet and the final result data is written and needs to be compressed into the 123rd file, plus the input data etc. are still around. Though that's quite a theoretical case.)
ID: 67002