OpenIFS Discussion

AndreyOR

Joined: 12 Apr 21
Posts: 247
Credit: 11,831,258
RAC: 20,177
Message 67003 - Posted: 22 Dec 2022, 10:11:14 UTC - in response to Message 66995.  
Last modified: 22 Dec 2022, 10:11:37 UTC

Adjusting write I/O from OpenIFS tasks

Glenn,
Do you think doing this would be beneficial in any way for older PCs that still have an HDD and not an SSD, provided of course that BOINC restarts are rare? I have an older PC (i7-4790, 16GB RAM, 1TB HDD, Win10 WSL2 Ubuntu) running 2 OIFS tasks at a time. Your post got me wondering if reducing write I/O a little, or even a lot, could provide a significant speed-up and maybe be a little easier on the disk of an older PC. Restarts are not a concern, and BOINC is set to not task switch, i.e. to run a task from start to finish before starting another one.
Bryn Mawr

Joined: 28 Jul 19
Posts: 147
Credit: 12,814,088
RAC: 261,385
Message 67004 - Posted: 22 Dec 2022, 11:18:26 UTC - in response to Message 66995.  

Adjusting write I/O from OpenIFS tasks

Further to Conan's point about the amount of write I/O: it can be adjusted, but only once a task has started running. The adjustment reduces the checkpoint frequency, meaning that if the model does have to restart after a shutdown, it will have to repeat more steps. This change does NOT affect the model's scientific output, as that's controlled separately.

ONLY make this change if you leave the model running near-continuously with minimal possibility of a restart. Do NOT do it if you often shut down the PC or the boinc client, otherwise it will hamper the task's progress. If in doubt, just leave it.

To make the change:
1/ shut down the boinc client & make sure all processes with 'oifs' in their name have gone.
2/ change to the slot directory.
3/ make a backup copy of the fort.4 file (just in case): cp fort.4 fort.4.old
4/ edit the text file fort.4, locate the line:
NFRRES=-24,
and change it to:
NFRRES=-72,
Preserve the minus sign and the comma. This reduces the checkpoint frequency from 1 day (24 model hrs) to 3 days (72 model hrs), but it means the model might have to repeat as many as 3 model days if it has to restart.
5/ restart the boinc client.

The changes can only be made once the model has started in a slot directory, not before.
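(Putting those steps together as a shell sketch — the slot number and the BOINC data directory below are illustrative and vary per install:)

sudo systemctl stop boinc-client                # 1/ stop the client (use your usual method)
pgrep -a oifs                                   #    confirm no 'oifs' processes remain
cd /var/lib/boinc-client/slots/0                # 2/ change to the task's slot directory
cp fort.4 fort.4.old                            # 3/ backup copy, just in case
sed -i 's/NFRRES=-24,/NFRRES=-72,/' fort.4      # 4/ checkpoint every 72 model hrs instead of 24
sudo systemctl start boinc-client               # 5/ restart the client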


Is there any way that this could be made a user-selectable option, to set the default value before it is downloaded? I would want this on every WU I process, and I can imagine so would all the other volunteers who process 24/7.
Glenn Carver

Joined: 29 Oct 17
Posts: 803
Credit: 13,568,734
RAC: 7,025
Message 67005 - Posted: 22 Dec 2022, 11:32:25 UTC - in response to Message 67002.  

Typical RAM usage is quite a bit less than peak RAM usage. The peak happens briefly and, I presume, periodically during every nth(?) timestep. Concurrently running tasks are not very likely to reach such a peak simultaneously.
I've got another post somewhere which explains the model's use of memory, but I can't find it now. There are a couple of points in the model timestep which cause the peak RAM; one is the radiation step, when it has to recompute some large matrices every nth step. The second is in the dynamics part of the code, where it uses a large 'stencil' of gridpoints to compute the wind trajectories at every step.

But the memory_bound task value is the total of the memory requirement of the controlling wrapper process oifs_43r3_ps_1.05_x86_64-pc-linux-gnu and the model. Don't forget that.

From this follows:
– On hosts with not a lot of RAM, the number of concurrently running tasks should be sized with the peak RAM demand in mind.
– On hosts with a lot of RAM, the number of concurrently running tasks can be sized for a figure somewhere between average and peak RAM demand per task.
The boinc client watches overall RAM usage and puts work into a waiting state if the configured RAM limit is exceeded, but from what I understand, this built-in control mechanism has difficulty coping with fast fluctuations of RAM usage like OIFS's.
I disagree with your second bullet there, for exactly the reason you give below it. I tested this myself, and indeed the client was too slow to respond by pausing the models: when all 4 (test) tasks, started simultaneously, hit their high-water RAM, several tasks crashed due to lack of memory. This was on my WSL test container. So my advice would be to honour the peak RAM set by the task; if you try to play games you may come unstuck.
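To put made-up numbers on it: if a task's peak (the memory_bound value) is 5 GiB and 24 GiB is free for BOINC, honouring the peak caps you at 4 concurrent tasks. Sizing on a 3.5 GiB average would allow 6, but if several tasks hit their radiation/dynamics peaks together, that's exactly the out-of-memory crash described above.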

(Edit: ........I am not sure though in which way the boinc client takes rsc_disk_bound into account when starting tasks.)
The server takes it into account when deciding whether to give the task to a machine or not. I'm also not sure if the client is actively monitoring disk usage. I've never seen any tasks fail with an error message that indicates that.

(Edit 2: Exactly 7.0 GiB is not enough, though, when all of the 122 intermediate result files – already compressed – aren't uploaded yet and the final result data is written and needs to be compressed into the 123rd file, plus the input data etc. are still around. Though that's quite a theoretical case.)
It should be; it's calculated on the server side with additional headroom added just for good measure.
Glenn Carver

Joined: 29 Oct 17
Posts: 803
Credit: 13,568,734
RAC: 7,025
Message 67006 - Posted: 22 Dec 2022, 11:38:51 UTC - in response to Message 67004.  

Is there any way that this could be made a user-selectable option, to set the default value before it is downloaded? I would want this on every WU I process, and I can imagine so would all the other volunteers who process 24/7.
We've had this discussion about adjusting the checkpointing already in this (or another) thread - if I wasn't supposed to be wrapping Christmas presents I'd find it.

This is never going to be a user-selectable option, because it requires an understanding of how the model works, and if you get it wrong it could both seriously thrash your filesystem and delay your task. The model is capable of generating very large amounts of output, which has to be tuned carefully to run on home PCs. We might tweak it after these batches if it proves to be causing problems, which is why feedback is always welcome.
Glenn Carver

Joined: 29 Oct 17
Posts: 803
Credit: 13,568,734
RAC: 7,025
Message 67007 - Posted: 22 Dec 2022, 11:55:50 UTC - in response to Message 67003.  
Last modified: 22 Dec 2022, 11:57:39 UTC

Adjusting write I/O from OpenIFS tasks
Do you think doing this would be beneficial in any way for older PCs that still have an HDD and not an SSD, provided of course that BOINC restarts are rare? I have an older PC (i7-4790, 16GB RAM, 1TB HDD, Win10 WSL2 Ubuntu) running 2 OIFS tasks at a time. Your post got me wondering if reducing write I/O a little, or even a lot, could provide a significant speed-up and maybe be a little easier on the disk of an older PC. Restarts are not a concern, and BOINC is set to not task switch, i.e. to run a task from start to finish before starting another one.

It's a good question. I've got an old 3rd-gen i7 with /var/lib/boinc symlinked to an internal HDD. I did test leaving it on a SATA SSD (which is the system disk) and only saw a 5% difference in model runtime. As the model runs slower on the older chip, the extra bit of time spent on I/O barely made any difference, and because of the amount of model output I was happier keeping the model off the SATA drive. Of course, if the model has to do a restart, it will take longer to rerun the extra steps on the slower chips, so I opted to keep the restart frequency as it was. We assumed that most volunteers are running on older hardware anyway, so aimed to minimize the cost of a restart. For my faster machines, I leave /var/lib/boinc on the M.2 drives. Maybe test it; just be aware of the caveats of making changes (which I hope I've made clear).

When I look through the logs, there's definitely a good percentage of tasks that get restarted (I restart mine on my power-hungry PC to avoid paying for electricity and to keep on day-time solar). It's very hard to come up with something that works best for everyone; we settled on something that seemed reasonable, but there's always scope for some tweaking. But I won't do it based on someone's personal experience; it has to come from looking across all the tasks on the range of volunteer machines in CPDN. I also won't provide options for people to adjust model parameters that I know to be potentially risky, but if someone has a serious problem then I'm happy to help.
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1056
Credit: 16,520,115
RAC: 1,176
Message 67008 - Posted: 22 Dec 2022, 12:59:03 UTC - in response to Message 67003.  

Do you think doing this would be beneficial in any way for older PCs that still have an HDD and not an SSD, provided of course that BOINC restarts are rare?


My machine has a 512 GByte SSD drive, but I do not keep my Boinc stuff on it. I have two 4 TByte 7200rpm SATA hard drives, with a 512 GByte partition for Boinc on one of them. (I did not want to run out of space for Boinc.) It does not seem to slow down my processing much.
According to the system monitor, most of the time my network connection is not doing anything, although it does show very high peaks (6 GByte) once in a while.

Memory 	62.28 GB
Cache 	16896 KB
Swap space 	15.62 GB
Total disk space 	488.04 GB
Free Disk Space 	477.29 GB
Measured floating point speed 	6.05 billion ops/sec
Measured integer speed 	24.32 billion ops/sec
Average upload rate 	3082.89 KB/sec
Average download rate 	3956.48 KB/sec
Average turnaround time 	1.22 days

OpenIFS 43r3 Perturbed Surface 1.05 x86_64-pc-linux-gnu
Number of tasks completed 	21
Max tasks per day 	25
Number of tasks today 	4
Consecutive valid tasks 	21
Average processing rate 	30.61 GFLOPS
Average turnaround time 	0.98 days

Glenn Carver

Joined: 29 Oct 17
Posts: 803
Credit: 13,568,734
RAC: 7,025
Message 67009 - Posted: 22 Dec 2022, 14:19:35 UTC

At last I've found a task that gives more output in the task report about these crashes we're seeing.
https://www.cpdn.org/cpdnboinc/result.php?resultid=22250857
Note it fails with 'disk limit exceeded'; I'm guessing that's ultimately the client stopping the process because the model is constantly restarting and creating too many restart/checkpoint files. So that (maybe) answers the previous question about what the client does with regard to disk_bound.

Now I have a more detailed stack trace and can see better what's happening. It looks like it's the control code failing with the 'double free' error we keep seeing, not the model. Because the model is a child process, when the control wrapper process oifs_43r3_ps_1.01 dies it takes the model with it, and there's a lovely stack trace from the model in the above result (it had been puzzling me why there wasn't one in previous tasks I'd looked at). The model restarts, runs on for a bit, then we get another double free error and the model gets killed again (in a different place), etc.

I have been able to get a double free error by corrupting the progress_file.xml in the slot directory, so this will be the focus of debugging next year.

Because it's the control wrapper dying, we probably lose the stderr output from the model, but in the above case, because of the restarts, we get a couple of fails before the final 'disk limit exceeded'.

It's possible this person did not have 'Keep non-GPU in memory' enabled, that might explain the restarts (but not the double free errors). Anyway, I'm fairly confident now the model itself is fine and we (CPDN) need to take a good look at the controlling wrapper, in particular the code reading/writing the XML file.
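(That's the computing preference "Leave non-GPU tasks in memory while suspended" in BOINC Manager; if it's off, a preempted task is removed from RAM and has to restart from its last checkpoint.)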

If anyone does get a result with similar output to the above link, do please let me know. It's a pain trying to debug without much to go on.

Many thanks and best wishes to all for the holiday season and 2023.
Bryn Mawr

Joined: 28 Jul 19
Posts: 147
Credit: 12,814,088
RAC: 261,385
Message 67010 - Posted: 22 Dec 2022, 18:08:53 UTC - in response to Message 67006.  

Is there any way that this could be made a user-selectable option, to set the default value before it is downloaded? I would want this on every WU I process, and I can imagine so would all the other volunteers who process 24/7.
We've had this discussion about adjusting the checkpointing already in this (or another) thread - if I wasn't supposed to be wrapping Christmas presents I'd find it.

This is never going to be a user-selectable option, because it requires an understanding of how the model works, and if you get it wrong it could both seriously thrash your filesystem and delay your task. The model is capable of generating very large amounts of output, which has to be tuned carefully to run on home PCs. We might tweak it after these batches if it proves to be causing problems, which is why feedback is always welcome.


So don’t make it infinitely variable; just give the users the choice between 2 or 3 “safe” values?
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 67011 - Posted: 22 Dec 2022, 19:12:53 UTC
Last modified: 22 Dec 2022, 19:23:48 UTC

On peak vs. average RAM use: In post 67002, I was thinking of hosts with, as a ballpark, 128 GB or more when I wrote "On hosts with a lot of RAM…".

On the discussion about disk writes: Unless a really huge number of OIFS tasks is run simultaneously on the same filesystem, reducing the checkpointing frequency is unlikely to have a real performance impact. Disk writes can be cached very easily by the operating system, and typical Linux installations are set up to do so.

Edit: Here is a simple check: On a completed task, compare Run time and CPU time. If Run time is only a little more than CPU time, then only a small part of the Run time could have been spent waiting on disk I/O. (Besides I/O waits, there are several other reasons why the Run time of a single-threaded process exceeds its CPU time, e.g. preemption when another process is scheduled on the same logical CPU.)
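As an illustration with made-up figures: a Run time of 36,000 s against a CPU time of 35,700 s leaves at most 300 s (under 1%) that could have been spent waiting on disk, so reducing checkpoint I/O could recover at most that much.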

BTW @Glenn Carver, thanks a lot for your frequent and detailed explanations.
wateroakley

Joined: 6 Aug 04
Posts: 185
Credit: 27,110,205
RAC: 2,542
Message 67012 - Posted: 22 Dec 2022, 19:36:45 UTC - in response to Message 66998.  
Last modified: 22 Dec 2022, 19:41:50 UTC

If you could kindly check your /var/log/syslog file for an entry around the time the task finished: there should be mention of 'oifs_43r3_' something being killed. Let me know what there is.

Out of interest, how many tasks did you have running compared to how many cores? I have an 11th & 3rd gen Intel i7 and the model has never crashed like this for me. The only suggestion I can make is not to put too many tasks on the machine; random memory issues like this can depend on how busy memory is. I run one fewer task than I have cores (note cores, not threads), i.e. 3 tasks max for a 4-core machine. So far, touch wood, it's never crashed and I'm nowhere near my total RAM. I was going to do a test by letting more tasks run to see what happens, once I've done a few successfully. It's quite tough to debug without being able to reproduce.

thx.
Glenn, as requested, I've pasted the syslog file below, from the time that the WU 12168039 task (https://www.cpdn.org/result.php?resultid=22252269) appears to behave normally (at 05:51am, with the zip 42 upload) to when it failed (immediately after zip 43) and decided that it had finished at 06:02am. Nothing being killed, but other things do start to go wrong around 05:55am.

The i7-8700 host has six physical cores (12 virtual). The VirtualBox Ubuntu VM is configured with 6 CPUs and runs 6 CPDN tasks. Boinc preferences are set to use 100% of the CPUs and to 'Suspend' when non-BOINC CPU is above 75%. The VM is only used for boinc/cpdn work. Should I drop the VM back to 5 CPUs?

Interestingly, today we had a power brownout at 1pm and the PC unceremoniously crashed. After restarting, all six IFS tasks restarted successfully :)) That's a first - I've rarely had 100% restart success for non-IFS tasks after a crash.

syslog
Dec 21 05:51:27 ih2-VirtualBox boinc[834]: 21-Dec-2022 05:51:27 [climateprediction.net] Started upload of oifs_43r3_ps_0395_1982050100_123_951_12168039_0_r1962384054_42.zip
Dec 21 05:51:27 ih2-VirtualBox boinc[834]: 21-Dec-2022 05:51:27 [climateprediction.net] Started upload of oifs_43r3_ps_0401_1982050100_123_951_12168045_0_r692153421_41.zip
Dec 21 05:51:47 ih2-VirtualBox boinc[834]: 21-Dec-2022 05:51:47 [climateprediction.net] Finished upload of oifs_43r3_ps_0395_1982050100_123_951_12168039_0_r1962384054_42.zip
Dec 21 05:51:49 ih2-VirtualBox boinc[834]: 21-Dec-2022 05:51:49 [climateprediction.net] Finished upload of oifs_43r3_ps_0401_1982050100_123_951_12168045_0_r692153421_41.zip
Dec 21 05:52:52 ih2-VirtualBox boinc[834]: 21-Dec-2022 05:52:52 [climateprediction.net] Started upload of oifs_43r3_ps_0425_1982050100_123_951_12168069_0_r777869862_42.zip
Dec 21 05:53:04 ih2-VirtualBox boinc[834]: 21-Dec-2022 05:53:04 [climateprediction.net] Finished upload of oifs_43r3_ps_0425_1982050100_123_951_12168069_0_r777869862_42.zip
Dec 21 05:54:52 ih2-VirtualBox boinc[834]: 21-Dec-2022 05:54:52 [climateprediction.net] Started upload of oifs_43r3_ps_0422_1982050100_123_951_12168066_0_r1003053545_42.zip
Dec 21 05:55:03 ih2-VirtualBox boinc[834]: 21-Dec-2022 05:55:03 [climateprediction.net] Finished upload of oifs_43r3_ps_0422_1982050100_123_951_12168066_0_r1003053545_42.zip
Dec 21 05:55:08 ih2-VirtualBox dbus-daemon[992]: [session uid=1000 pid=992] Activating via systemd: service name='org.freedesktop.Tracker1' unit='tracker-store.service' requested by ':1.3' (uid=1000 pid=970 comm="/usr/libexec/tracker-miner-fs " label="unconfined")
Dec 21 05:55:08 ih2-VirtualBox systemd[917]: Starting Tracker metadata database store and lookup manager...
Dec 21 05:55:08 ih2-VirtualBox dbus-daemon[992]: [session uid=1000 pid=992] Successfully activated service 'org.freedesktop.Tracker1'
Dec 21 05:55:08 ih2-VirtualBox systemd[917]: Started Tracker metadata database store and lookup manager.
Dec 21 05:55:39 ih2-VirtualBox tracker-store[11393]: OK
Dec 21 05:55:39 ih2-VirtualBox systemd[917]: tracker-store.service: Succeeded.
Dec 21 05:55:54 ih2-VirtualBox dbus-daemon[660]: [system] Activating via systemd: service name='net.reactivated.Fprint' unit='fprintd.service' requested by ':1.44' (uid=1000 pid=1289 comm="/usr/bin/gnome-shell " label="unconfined")
Dec 21 05:55:54 ih2-VirtualBox systemd[1]: Starting Fingerprint Authentication Daemon...
Dec 21 05:55:54 ih2-VirtualBox dbus-daemon[660]: [system] Successfully activated service 'net.reactivated.Fprint'
Dec 21 05:55:54 ih2-VirtualBox systemd[1]: Started Fingerprint Authentication Daemon.
Dec 21 05:55:58 ih2-VirtualBox gnome-shell[1289]: cr_parser_new_from_buf: assertion 'a_buf && a_len' failed
Dec 21 05:55:58 ih2-VirtualBox gnome-shell[1289]: cr_declaration_parse_list_from_buf: assertion 'parser' failed
Dec 21 05:55:58 ih2-VirtualBox NetworkManager[663]: <info>  [1671602158.6148] agent-manager: agent[a34c130ad9aefeac,:1.44/org.gnome.Shell.NetworkAgent/1000]: agent registered
Dec 21 05:55:58 ih2-VirtualBox dbus-daemon[992]: [session uid=1000 pid=992] Activating service name='org.freedesktop.FileManager1' requested by ':1.37' (uid=1000 pid=1289 comm="/usr/bin/gnome-shell " label="unconfined")
Dec 21 05:55:58 ih2-VirtualBox gnome-shell[1289]: cr_parser_new_from_buf: assertion 'a_buf && a_len' failed
Dec 21 05:55:58 ih2-VirtualBox gnome-shell[1289]: cr_declaration_parse_list_from_buf: assertion 'parser' failed
Dec 21 05:55:58 ih2-VirtualBox dbus-daemon[992]: [session uid=1000 pid=992] Activating service name='org.gnome.Nautilus' requested by ':1.37' (uid=1000 pid=1289 comm="/usr/bin/gnome-shell " label="unconfined")
Dec 21 05:55:58 ih2-VirtualBox gnome-shell[1289]: cr_parser_new_from_buf: assertion 'a_buf && a_len' failed
Dec 21 05:55:58 ih2-VirtualBox gnome-shell[1289]: cr_declaration_parse_list_from_buf: assertion 'parser' failed
Dec 21 05:55:58 ih2-VirtualBox dbus-daemon[992]: [session uid=1000 pid=992] Successfully activated service 'org.gnome.Nautilus'
Dec 21 05:55:58 ih2-VirtualBox org.gnome.Nautilus[11423]: Failed to register: Unable to acquire bus name 'org.gnome.Nautilus'
Dec 21 05:55:59 ih2-VirtualBox dbus-daemon[992]: [session uid=1000 pid=992] Successfully activated service 'org.freedesktop.FileManager1'
Dec 21 05:55:59 ih2-VirtualBox gnome-shell[1289]: cr_parser_new_from_buf: assertion 'a_buf && a_len' failed
Dec 21 05:55:59 ih2-VirtualBox gnome-shell[1289]: cr_declaration_parse_list_from_buf: assertion 'parser' failed
Dec 21 05:55:59 ih2-VirtualBox gnome-shell[1289]: Window manager warning: Overwriting existing binding of keysym 31 with keysym 31 (keycode a).
Dec 21 05:55:59 ih2-VirtualBox gnome-shell[1289]: Window manager warning: Overwriting existing binding of keysym 38 with keysym 38 (keycode 11).
Dec 21 05:55:59 ih2-VirtualBox gnome-shell[1289]: Window manager warning: Overwriting existing binding of keysym 39 with keysym 39 (keycode 12).
Dec 21 05:55:59 ih2-VirtualBox gnome-shell[1289]: Window manager warning: Overwriting existing binding of keysym 32 with keysym 32 (keycode b).
Dec 21 05:55:59 ih2-VirtualBox gnome-shell[1289]: Window manager warning: Overwriting existing binding of keysym 33 with keysym 33 (keycode c).
Dec 21 05:55:59 ih2-VirtualBox gnome-shell[1289]: Window manager warning: Overwriting existing binding of keysym 34 with keysym 34 (keycode d).
Dec 21 05:55:59 ih2-VirtualBox gnome-shell[1289]: Window manager warning: Overwriting existing binding of keysym 35 with keysym 35 (keycode e).
Dec 21 05:55:59 ih2-VirtualBox gnome-shell[1289]: Window manager warning: Overwriting existing binding of keysym 36 with keysym 36 (keycode f).
Dec 21 05:55:59 ih2-VirtualBox gnome-shell[1289]: Window manager warning: Overwriting existing binding of keysym 37 with keysym 37 (keycode 10).
Dec 21 05:56:25 ih2-VirtualBox systemd[1]: fprintd.service: Succeeded.
Dec 21 05:57:38 ih2-VirtualBox systemd[1]: Starting Ubuntu Advantage Timer for running repeated jobs...
Dec 21 05:57:39 ih2-VirtualBox systemd[1]: ua-timer.service: Succeeded.
Dec 21 05:57:39 ih2-VirtualBox systemd[1]: Finished Ubuntu Advantage Timer for running repeated jobs.
Dec 21 05:58:52 ih2-VirtualBox boinc[834]: 21-Dec-2022 05:58:52 [climateprediction.net] Started upload of oifs_43r3_ps_0645_1981050100_123_950_12167289_1_r1207618552_42.zip
Dec 21 05:58:56 ih2-VirtualBox boinc[834]: 21-Dec-2022 05:58:56 [climateprediction.net] Started upload of oifs_43r3_ps_0248_1981050100_123_950_12166892_1_r549240307_42.zip
Dec 21 05:59:04 ih2-VirtualBox boinc[834]: 21-Dec-2022 05:59:04 [climateprediction.net] Finished upload of oifs_43r3_ps_0645_1981050100_123_950_12167289_1_r1207618552_42.zip
Dec 21 05:59:13 ih2-VirtualBox boinc[834]: 21-Dec-2022 05:59:13 [climateprediction.net] Finished upload of oifs_43r3_ps_0248_1981050100_123_950_12166892_1_r549240307_42.zip
Dec 21 06:00:08 ih2-VirtualBox dbus-daemon[992]: [session uid=1000 pid=992] Activating via systemd: service name='org.freedesktop.Tracker1' unit='tracker-store.service' requested by ':1.3' (uid=1000 pid=970 comm="/usr/libexec/tracker-miner-fs " label="unconfined")
Dec 21 06:00:08 ih2-VirtualBox systemd[917]: Starting Tracker metadata database store and lookup manager...
Dec 21 06:00:08 ih2-VirtualBox dbus-daemon[992]: [session uid=1000 pid=992] Successfully activated service 'org.freedesktop.Tracker1'
Dec 21 06:00:08 ih2-VirtualBox systemd[917]: Started Tracker metadata database store and lookup manager.
Dec 21 06:00:39 ih2-VirtualBox tracker-store[11464]: OK
Dec 21 06:00:39 ih2-VirtualBox systemd[917]: tracker-store.service: Succeeded.
Dec 21 06:02:22 ih2-VirtualBox boinc[834]: 21-Dec-2022 06:02:22 [climateprediction.net] Started upload of oifs_43r3_ps_0395_1982050100_123_951_12168039_0_r1962384054_43.zip
Dec 21 06:02:31 ih2-VirtualBox boinc[834]: 21-Dec-2022 06:02:31 [climateprediction.net] Started upload of oifs_43r3_ps_0401_1982050100_123_951_12168045_0_r692153421_42.zip
Dec 21 06:02:34 ih2-VirtualBox boinc[834]: 21-Dec-2022 06:02:34 [climateprediction.net] Finished upload of oifs_43r3_ps_0395_1982050100_123_951_12168039_0_r1962384054_43.zip
Dec 21 06:02:37 ih2-VirtualBox boinc[834]: 21-Dec-2022 06:02:37 [climateprediction.net] Computation for task oifs_43r3_ps_0395_1982050100_123_951_12168039_0 finished
Dec 21 06:02:37 ih2-VirtualBox boinc[834]: 21-Dec-2022 06:02:37 [climateprediction.net] Output file oifs_43r3_ps_0395_1982050100_123_951_12168039_0_r1962384054_44.zip for task oifs_43r3_ps_0395_1982050100_123_951_12168039_0 absent
Dec 21 06:02:37 ih2-VirtualBox boinc[834]: 21-Dec-2022 06:02:37 [climateprediction.net] Output file oifs_43r3_ps_0395_1982050100_123_951_12168039_0_r1962384054_45.zip for task oifs_43r3_ps_0395_1982050100_123_951_12168039_0 absent
….. File absent ... 46-120
Dec 21 06:02:37 ih2-VirtualBox boinc[834]: 21-Dec-2022 06:02:37 [climateprediction.net] Output file oifs_43r3_ps_0395_1982050100_123_951_12168039_0_r1962384054_121.zip for task oifs_43r3_ps_0395_1982050100_123_951_12168039_0 absent
Dec 21 06:02:37 ih2-VirtualBox boinc[834]: 21-Dec-2022 06:02:37 [climateprediction.net] Output file oifs_43r3_ps_0395_1982050100_123_951_12168039_0_r1962384054_122.zip for task oifs_43r3_ps_0395_1982050100_123_951_12168039_0 absent
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1056
Credit: 16,520,115
RAC: 1,176
Message 67013 - Posted: 22 Dec 2022, 22:56:22 UTC - in response to Message 67011.  

Disk writes can be cached very easily by the operating system, and typical Linux installations are set up to do so.


They sure are. First, they are put by the OS into an in-RAM output buffer, from which they are later dispatched to the hard drive(s). This dispatching is usually done not in the order in which the program wrote them, but using an elevator algorithm or a shortest-seek-time-first algorithm, which minimizes the delays involved in seeking the heads to the proper cylinders of the drive. And once the data get to the drive itself, the drive buffers things too.

In the past, the OS would also notice sequential writes and combine them into one big write to minimize rotational latency. These days that is probably not done, because the buffering in the drive can do it better than the OS can.
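On Linux you can watch this caching in action with standard kernel interfaces:

grep -E '^(Dirty|Writeback):' /proc/meminfo   # data accepted into the page cache but not yet on disk
vmstat 5                                      # the 'bo' column shows blocks actually being written out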

One of my HDDs has a 256 Megabyte cache buffer in it and a spindle speed of 7200 rpm. The other has a 64 Megabyte buffer and a spindle speed of 5400 rpm. It is the second one where the Boinc partition is located.
alanb1951

Joined: 31 Aug 04
Posts: 32
Credit: 9,526,696
RAC: 109,831
Message 67014 - Posted: 23 Dec 2022, 5:56:00 UTC
Last modified: 23 Dec 2022, 5:58:59 UTC

I happened to try a couple of these tasks to see what effect they would have on the rest of my BOINC work-load. No problems there, but...

I usually have checkpoint debug turned on if I'm running certain WCG tasks or if I'm doing perf stat analyses (trying to dodge genuine checkpoints!). Imagine my surprise when I found that my BOINC log was being "spammed" with a checkpoint message once a second (or, more accurately, 9 or 10 times in every 10 or 11 seconds), with gaps of a few seconds whenever it was consolidating files or arranging an upload. Given that the BOINC standard checkpoint mechanism is apparently not being used by the application, this seems a bit strange :-)

[If this has already been discussed I missed it; that said, I don't suppose that many people do checkpoint debug most of the time!...]
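For reference, the flag I'm talking about lives in cc_config.xml in the BOINC data directory (1 turns the logging on, 0 off):

<cc_config>
  <log_flags>
    <checkpoint_debug>1</checkpoint_debug>
  </log_flags>
</cc_config>

The client picks up a change via "Options / Read config files" in the Manager, or boinccmd --read_cc_config, without a restart.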

Here's the front of the first task starting up...

22-Dec-2022 03:24:54 [climateprediction.net] Starting task oifs_43r3_ps_0497_1993050100_123_962_12179141_0
22-Dec-2022 03:24:54 [climateprediction.net] [cpu_sched] Starting task oifs_43r3_ps_0497_1993050100_123_962_12179141_0 using oif
s_43r3_ps version 105 in slot 6
22-Dec-2022 03:25:00 [climateprediction.net] [checkpoint] result oifs_43r3_ps_0497_1993050100_123_962_12179141_0 checkpointed
22-Dec-2022 03:25:01 [climateprediction.net] [checkpoint] result oifs_43r3_ps_0497_1993050100_123_962_12179141_0 checkpointed
22-Dec-2022 03:25:03 [climateprediction.net] [checkpoint] result oifs_43r3_ps_0497_1993050100_123_962_12179141_0 checkpointed
22-Dec-2022 03:25:04 [climateprediction.net] [checkpoint] result oifs_43r3_ps_0497_1993050100_123_962_12179141_0 checkpointed

and here's one of the intervals where I believe it was doing file movement/uploading...

22-Dec-2022 03:36:40 [climateprediction.net] [checkpoint] result oifs_43r3_ps_0497_1993050100_123_962_12179141_0 checkpointed
22-Dec-2022 03:36:41 [climateprediction.net] [checkpoint] result oifs_43r3_ps_0497_1993050100_123_962_12179141_0 checkpointed
22-Dec-2022 03:37:00 [World Community Grid] [checkpoint] result MCM1_0193439_9713_3 checkpointed
22-Dec-2022 03:37:04 [climateprediction.net] [checkpoint] result oifs_43r3_ps_0497_1993050100_123_962_12179141_0 checkpointed
22-Dec-2022 03:37:05 [climateprediction.net] [checkpoint] result oifs_43r3_ps_0497_1993050100_123_962_12179141_0 checkpointed

Now, the writing of these lines isn't a major I/O nuisance, but it is a space-consuming one! So eventually I got fed up and turned off checkpoint debug logging :-) -- fortunately, I'm not running WCG work that I want to monitor at present, and I would quite like to see what happens to throughput with one of these in a machine at the same time as a WCG ARP1 task (though there aren't any at present...) so I'll carry on with my [infinitely small] contribution for now.

If this shouldn't be happening, I hope it can be stopped... If, however, it's a natural part of how the programs are designed, I'd be interested to know why it happens.

Cheers - Al.
Alan K

Joined: 22 Feb 06
Posts: 484
Credit: 29,599,157
RAC: 2,031
Message 67015 - Posted: 23 Dec 2022, 9:58:56 UTC

Interesting segment from my event log this morning

Fri 23 Dec 2022 08:32:17 GMT | climateprediction.net | Started upload of oifs_43r3_ps_0852_1981050100_123_950_12167496_0_r1643238008_63.zip
Fri 23 Dec 2022 08:33:05 GMT | climateprediction.net | Finished upload of oifs_43r3_ps_0852_1981050100_123_950_12167496_0_r1643238008_63.zip
Fri 23 Dec 2022 08:33:15 GMT | climateprediction.net | [task] task_state=QUIT_PENDING for oifs_43r3_ps_0852_1981050100_123_950_12167496_0 from request_exit()
Fri 23 Dec 2022 08:33:15 GMT | | request_exit(): PID 5839 has 1 descendants
Fri 23 Dec 2022 08:33:15 GMT | | PID 5842
Fri 23 Dec 2022 08:34:15 GMT | climateprediction.net | [task] Process for oifs_43r3_ps_0852_1981050100_123_950_12167496_0 exited, status 256, task state 8
Fri 23 Dec 2022 08:34:15 GMT | climateprediction.net | [task] task_state=UNINITIALIZED for oifs_43r3_ps_0852_1981050100_123_950_12167496_0 from handle_exited_app
Fri 23 Dec 2022 08:34:15 GMT | climateprediction.net | [task] ACTIVE_TASK::start(): forked process: pid 7134
Fri 23 Dec 2022 08:34:15 GMT | climateprediction.net | [task] task_state=EXECUTING for oifs_43r3_ps_0852_1981050100_123_950_12167496_0 from start
Fri 23 Dec 2022 08:35:55 GMT | climateprediction.net | Started upload of oifs_43r3_ps_0872_1981050100_123_950_12167516_0_r1281185810_77.zip
Fri 23 Dec 2022 08:36:06 GMT | climateprediction.net | Finished upload of oifs_43r3_ps_0872_1981050100_123_950_12167516_0_r1281185810_77.zip

Running 3 tasks, but a fourth apparently started.
Any comments?
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,180,510
RAC: 6,451
Message 67016 - Posted: 23 Dec 2022, 10:26:06 UTC - in response to Message 67015.  

I hadn't seen those task states before. The full list from source code (in boinc_db.py) is

PROCESS_UNINITIALIZED = 0
PROCESS_EXECUTING = 1
PROCESS_SUSPENDED = 9
PROCESS_ABORT_PENDING = 5
PROCESS_QUIT_PENDING = 8
PROCESS_COPY_PENDING = 10
PROCESS_EXITED = 2
PROCESS_WAS_SIGNALED = 3
PROCESS_EXIT_UNKNOWN = 4
PROCESS_ABORTED = 6
PROCESS_COULDNT_START = 7
(don't ask me why they're not in numerical order)

But it looks as if the same task - _ps_0852_ - has exited and then restarted, and it's jumped from upload file 63 to 77. Have you been having communications problems - did upload 63 fail earlier and get retried here?
Helmer Bryd

Joined: 16 Aug 04
Posts: 147
Credit: 7,934,429
RAC: 10,120
Message 67017 - Posted: 23 Dec 2022, 17:45:36 UTC

Hi
One of mine just ended but not accepted:
https://www.cpdn.org/result.php?resultid=22272660



<core_client_version>7.17.0</core_client_version>
<![CDATA[
<message>
Process still present 5 min after writing finish file; aborting</message>
<stderr_txt>


Looks silly
Bryn Mawr

Joined: 28 Jul 19
Posts: 147
Credit: 12,814,088
RAC: 261,385
Message 67018 - Posted: 23 Dec 2022, 20:46:26 UTC - in response to Message 67017.  

Hi
One of mine just ended but not accepted:
https://www.cpdn.org/result.php?resultid=22272660



<core_client_version>7.17.0</core_client_version>
<![CDATA[
<message>
Process still present 5 min after writing finish file; aborting</message>
<stderr_txt>


Looks silly


I’m glad I’m not alone - I just came in to report the same error.
Alan K

Joined: 22 Feb 06
Posts: 484
Credit: 29,599,157
RAC: 2,031
Message 67019 - Posted: 23 Dec 2022, 23:49:55 UTC - in response to Message 67016.  

I hadn't seen those task states before. The full list from source code (in boinc_db.py) is

PROCESS_UNINITIALIZED = 0
PROCESS_EXECUTING = 1
PROCESS_SUSPENDED = 9
PROCESS_ABORT_PENDING = 5
PROCESS_QUIT_PENDING = 8
PROCESS_COPY_PENDING = 10
PROCESS_EXITED = 2
PROCESS_WAS_SIGNALED = 3
PROCESS_EXIT_UNKNOWN = 4
PROCESS_ABORTED = 6
PROCESS_COULDNT_START = 7
(don't ask me why they're not in numerical order)

But it looks as if the same task - _ps_0852_ - has exited and then restarted, and it's jumped from upload file 63 to 77. Have you been having communications problems - did upload 63 fail earlier and get retried here?


Unfortunately I don't know. The computer has been "misbehaving" in that it unexpectedly freezes so I have to do a hard restart, and this has become more frequent recently; however, the tasks seem to restart OK. I had to do a restart earlier this evening, so I lost the event log, and I'm not sure whether the details will be elsewhere. I'm not sure whether I am pushing the RAM to its limit (24GB) by running 3 tasks at once (75% core usage). I'll go back to 2 cores.
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4342
Credit: 16,498,761
RAC: 5,627
Message 67020 - Posted: 24 Dec 2022, 8:02:56 UTC

Got this on a failed task:
oifs_43r3_ps_0472_1998050100_123_967_12184116_0 exited, status 2304, task state 1
Dec 23 07:44:00 swarm boinc[2906]: 23-Dec-2022 07:44:00 [climateprediction.net] [task] process exited with status 9
Dec 23 07:44:00 swarm boinc[2906]: 23-Dec-2022 07:44:00 [climateprediction.net] [task] task_state=EXITED for oifs_43r3_ps_0472_1998050100_123_967_12184116_0 from handle_exited_app
Dec 23 07:44:00 swarm boinc[2906]: 23-Dec-2022 07:44:00 [climateprediction.net] [task] result state=COMPUTE_ERROR for oifs_43r3_ps_0472_1998050100_123_967_12184116_0 from CS::report_result_error
Dec 23 07:44:00 swarm boinc[2906]: 23-Dec-2022 07:44:00 [climateprediction.net] Computation for task oifs_43r3_ps_0472_1998050100_123_967_12184116_0 finished
There followed the expected messages about the subsequent output files missing. This was Task 22269337, which finished with
double free or corruption (out)
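(For what it's worth, 2304 looks like a raw wait status: 2304 = 9 × 256, i.e. an exit code of 9 in the high byte, which matches the "process exited with status 9" line above.)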
[AF] Kalianthys

Joined: 20 Dec 20
Posts: 11
Credit: 33,546,183
RAC: 25,576
Message 67022 - Posted: 24 Dec 2022, 8:48:26 UTC
Last modified: 24 Dec 2022, 8:48:44 UTC

Hello,

I have a task that is marked "Error", but I have the impression that it actually finished fine.

Do you have an explanation?

Log here : https://www.cpdn.org/result.php?resultid=22268451

Kali.
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4342
Credit: 16,498,761
RAC: 5,627
Message 67023 - Posted: 24 Dec 2022, 9:40:48 UTC - in response to Message 67022.  

The task will get full credit. Glenn did post an explanation a few days ago about tasks that finish successfully but appear to fail - I will see if I can find it later. From what I recall, I didn't read it carefully enough to fully understand it.

Hello,

I have a task that is marked "Error", but I have the impression that it actually finished fine.

Do you have an explanation?

Log here : https://www.cpdn.org/result.php?resultid=22268451

Kali.