Message boards :
Number crunching :
New work discussion - 2
Joined: 5 Aug 04 Posts: 1120 Credit: 17,194,632 RAC: 2,780
I doubt using all 32 virtual cores will give a good throughput, but I'm Team Blue with little experience of 'the other side' :) I am currently running 5 hadsm4_um_8.02_i686-pc-lin... tasks, one wcgrid_arp1_wrf_7.32_x86_64-pc... and 5 less important ones. With those, I get a so-so processor cache hit ratio.

Computer 1511241
CPU type: GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors: 16
Memory: 62.28 GB
Cache: 16896 KB

# perf stat -aB -e cache-references,cache-misses

Performance counter stats for 'system wide':

    41,547,407,402  cache-references
    23,459,791,917  cache-misses    # 56.465 % of all cache refs

    62.202908556 seconds time elapsed
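The hit/miss percentage that perf prints is simply the ratio of the two counters. A quick sanity check of the arithmetic in Python, using the figures from the run above:

```python
# Recompute the cache-miss percentage perf reports: misses / references * 100.
# Counter values taken from the perf output quoted above.
cache_references = 41_547_407_402
cache_misses = 23_459_791_917

miss_pct = cache_misses / cache_references * 100
hit_pct = 100 - miss_pct

print(f"miss rate: {miss_pct:.3f} %")  # matches perf's "56.465 % of all cache refs"
print(f"hit rate:  {hit_pct:.3f} %")
```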
Joined: 15 May 09 Posts: 4531 Credit: 18,679,839 RAC: 15,318
"There'll be a steady stream of HadSM4 tasks now with OpenIFS tasks appearing soon."
Yes, I saw there were two more batches in the queue already.
Joined: 29 Oct 17 Posts: 1044 Credit: 16,196,312 RAC: 12,647
"With those, I get a so-so processor cache hit ratio."
But that's system wide? So it's difficult to know how much the boinc tasks are affected? I think it's easier to monitor wall-clock run times to judge how well the tasks are running when altering preferences, and to quieten the machine as much as possible by killing or suspending any process that will cause system jitter.
Joined: 5 Aug 04 Posts: 1120 Credit: 17,194,632 RAC: 2,780
"But that's system wide? So difficult to know how much the boinc tasks are affected?"
Yes, but the rest of my system was pretty much idle. I did have Firefox up, but I was not doing anything with it. My processor has 16 cores and 11 of them were saturated with Boinc work. Pretty much like this:

PID    PPID   USER  PR NI S RES    %MEM %CPU P  TIME+     COMMAND
273653 16165  boinc 39 19 R 765984 1.2  99.0 2  362:55.38 ../../projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_x86_64-pc-linux-+
278785 16165  boinc 39 19 R 765936 1.2  99.3 7  504:23.35 ../../projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_x86_64-pc-linux-+
272462 16165  boinc 39 19 R 761424 1.2  99.1 9  624:27.90 ../../projects/www.worldcommunitygrid.org/wcgrid_arp1_wrf_7.32_x86_64-pc+
279494 279405 boinc 39 19 R 675768 1.0  98.8 6  491:05.20 /var/lib/boinc/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-lin+
280047 280043 boinc 39 19 R 675452 1.0  99.1 4  479:16.71 /var/lib/boinc/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-lin+
279909 279896 boinc 39 19 R 675372 1.0  98.8 11 481:47.97 /var/lib/boinc/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-lin+
281984 281979 boinc 39 19 R 674972 1.0  99.1 5  459:16.00 /var/lib/boinc/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-lin+
186760 186753 boinc 39 19 R 669356 1.0  98.9 1  1986:07   /var/lib/boinc/projects/climateprediction.net/hadsm4_um_8.02_i686-pc-lin+
304836 16165  boinc 39 19 R 213132 0.3  99.2 10 111:13.14 ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP4G_1.33_x86_64-pc+
306110 16165  boinc 39 19 R 161796 0.2  99.0 0  85:47.34  ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_autodock_7.21_x86_+
307781 16165  boinc 39 19 R 143168 0.2  99.1 13 47:31.30  ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_autodock_7.21_x86_+
16165  1      boinc 30 10 S 44264  0.1  0.1  8  43401:04  /usr/bin/boinc
186753 16165  boinc 39 19 S 11228  0.0  0.1  14 1:14.01   ../../projects/climateprediction.net/hadsm4_8.02_i686-pc-linux-gnu hadsm+
279896 16165  boinc 39 19 S 10468  0.0  0.1  14 0:18.02   ../../projects/climateprediction.net/hadsm4_8.02_i686-pc-linux-gnu hadsm+
281979 16165  boinc 39 19 S 10280  0.0  0.1  13 0:16.89   ../../projects/climateprediction.net/hadsm4_8.02_i686-pc-linux-gnu hadsm+
280043 16165  boinc 39 19 S 10244  0.0  0.0  14 0:17.99   ../../projects/climateprediction.net/hadsm4_8.02_i686-pc-linux-gnu hadsm+
279405 16165  boinc 39 19 S 10216  0.0  0.1  14 0:17.71   ../../projects/climateprediction.net/hadsm4_8.02_i686-pc-linux-gnu hadsm+

S is state (R is running; S is sleeping); %CPU is how busy that CPU is; P is the processor number (0-15). Since these are just my Boinc tasks, the ones much less than 98% might be working on other things.
Joined: 12 Apr 21 Posts: 314 Credit: 14,567,328 RAC: 18,257
"There'll be a steady stream of HadSM4 tasks now with OpenIFS tasks appearing soon."
How are they related? Does the data from HadSM4 serve as input for OpenIFS?
Joined: 15 May 09 Posts: 4531 Credit: 18,679,839 RAC: 15,318
"How are they related? Does the data from HadSM4 serve as input for OpenIFS?"
Not in this instance; they are separate research projects. It is not unusual for HadSM4 batches to provide inputs for future batches of the same type, though. I don't remember seeing this done with any testing batches of OpenIFS, but I couldn't say for certain that no batch ever provided input for subsequent ones.
Joined: 8 Jan 22 Posts: 9 Credit: 1,780,471 RAC: 3,152
I am getting a lot of invalid units. They produce some calculation error somewhere along the way. That usually happens when I restart my hosts in the morning. In the evening I always pause all work, then wait 30 seconds before I shut them down, to make sure all data is written to the SSD correctly. The next morning I get some aborts. This happens on two AMD hosts running Linux Mint 20 and Ubuntu 20. I presume there is some issue with the checkpoints. Did anybody else notice that too?
Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918
"I am getting a lot of invalid units. They produce some calculation error somewhere along the way. That usually happens when I restart my hosts in the morning. In the evening I always pause all work, then wait 30 seconds before I shut them down, to make sure all data is written to the SSD correctly. The next morning I get some aborts. This happens on two AMD hosts running Linux Mint 20 and Ubuntu 20. I presume there is some issue with the checkpoints. Did anybody else notice that too?"
Why on earth are you shutting them down? I leave everything on 24/7; otherwise why bother with Boinc?
Joined: 29 Oct 17 Posts: 1044 Credit: 16,196,312 RAC: 12,647
"There'll be a steady stream of HadSM4 tasks now with OpenIFS tasks appearing soon."
No, they are completely different models being used for completely different experiments. There will be a lot of OpenIFS tasks coming soon; we're just going slowly to make sure everything is correct. They will be single core, with a 5-6 GB RAM requirement (about the same as LHC Atlas).
Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918
"There will be a lot of OpenIFS tasks coming soon; we're just going slowly to make sure everything is correct. They will be single core, with a 5-6 GB RAM requirement (about the same as LHC Atlas)."
Except LHC Atlas is up to 8 cores, so it uses a lot less RAM overall.
Joined: 29 Oct 17 Posts: 1044 Credit: 16,196,312 RAC: 12,647
"I am getting a lot of invalid units. They produce some calculation error somewhere along the way. That usually happens when I restart my hosts in the morning. In the evening I always pause all work, then wait 30 seconds before I shut them down, to make sure all data is written to the SSD correctly. The next morning I get some aborts. This happens on two AMD hosts running Linux Mint 20 and Ubuntu 20. I presume there is some issue with the checkpoints. Did anybody else notice that too?"
I shut down, or sometimes suspend, my machines too (as I prefer to use free solar electricity during the day and not pay for boinc at night - to answer P.Hucker). I have not normally seen this behaviour, and I don't bother to suspend the tasks first; the client should know how to do that. The only issue I've had is with HadAM (if that's the right one on Windows), which errored on a PC reboot. But that only happened once.

I'm not entirely sure what 'checkpointing' really means. It may only be the client that's checkpointing. OpenIFS does its own checkpointing and doesn't know anything about any checkpoint frequency set in the client. I can't say how the Hadley models behave. Checkpointing is relatively expensive: it's a big I/O dump of the model's internal state, so we wanted to control that ourselves and not let boinc attempt it.
Joined: 5 Aug 04 Posts: 1120 Credit: 17,194,632 RAC: 2,780
"I am getting a lot of invalid units. They produce some calculation error somewhere along the way. That usually happens when I restart my hosts in the morning. In the evening I always pause all work, then wait 30 seconds before I shut them down, to make sure all data is written to the SSD correctly. The next morning I get some aborts. This happens on two AMD hosts running Linux Mint 20 and Ubuntu 20. I presume there is some issue with the checkpoints. Did anybody else notice that too?"
I run Red Hat Enterprise Linux release 8.6 (Ootpa) on my Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7] machine. While I only reboot my machine every week or two (and sometimes a little longer), I do not get problems like you describe. Right now my machine has been up only 3 days, 19:43, but I am getting no errors. My most recent "errors" are all like this, which were not really crashes at all.

Task 22238436
Name: hadsm4_a08x_201402_1_939_012156968_0
Workunit: 12156968
Created: 14 Nov 2022, 10:37:54 UTC
Sent: 14 Nov 2022, 12:23:58 UTC
Report deadline: 27 Oct 2023, 17:43:58 UTC
Received: 15 Nov 2022, 2:23:18 UTC

<core_client_version>7.20.2</core_client_version>
<![CDATA[
<message>
process exited with code 22 (0x16, -234)</message>
<stderr_txt>
Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy
Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy
Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy
Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy
Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy
Model crashed: ATM_DYN : NEGATIVE THETA DETECTED. tmp/xnnuj.pipe_dummy
Sorry, too many model crashes! :-(
20:33:46 (322618): called boinc_finish(22)
</stderr_txt>
]]>

When I get ready to accept updates to my machine that require a reboot, I stop all new tasks and let most Boinc tasks run to completion. I usually cannot get CPDN tasks to complete, because most recent tasks (except in the last few days) take around a week to finish. So I suspend those that have not even started, then those that are running, one at a time. I do not think I have had any crashes with this procedure in perhaps a year (but I cannot remember exactly). I did get a few crashes, nothing to do with rebooting, from a bad batch of tasks with segmentation violations, but those seem to have been bad batches of the tasks that are no longer supplied to Linux systems.
Joined: 1 Jan 07 Posts: 1058 Credit: 36,592,247 RAC: 15,721
"I'm not entirely sure what 'checkpointing' really means."
'Checkpointing' as a concept applies primarily to the scientific data being processed by a project's scientific application. The idea is to record a complete and consistent state of the application's internal processes on non-volatile storage, in a form that the same application can read back and use as a starting point after a pause.

BOINC itself is aware of the process, but in general can't control it: it can't demand that a checkpoint is taken at a particular moment. But it can set some constraints on the process. For example, some people are concerned about the longevity of their SSDs in terms of lifetime write cycles. They may choose to extend the time interval between checkpoints, on the basis that they will only shut down their machine very rarely, and they are content to accept the risk that computing effort will be wasted in the event of an unplanned power outage. BOINC also takes account of the length of time that has elapsed since the last checkpoint when deciding to pause one project's application and give a turn at the trough to a different one. If a task has never checkpointed, BOINC will try to avoid pausing it unless absolutely necessary.

CPDN has a particular problem with checkpoints. The amount of data that has to be recorded to catch the complete internal state of the model so far is much greater than for most other projects. In some cases - slower drives or interfaces, heavily contended devices, or cached 'lazy write' drives - it can take a significant amount of time before the stored data is complete and usable. I think the majority of problems in the past will have been caused by one or more of these delays causing the image on disk to be incomplete and unreadable on restart.
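The incomplete-image failure mode described here is usually defended against with a write-temp-then-atomic-rename pattern: the new checkpoint is only allowed to replace the old one once it is fully on disk. A minimal generic sketch of that pattern (this is not CPDN's actual implementation; the JSON "state" is a stand-in for a real model dump):

```python
import json, os, tempfile

def write_checkpoint(state, path):
    """Write a checkpoint so a crash mid-write never leaves a half-written
    file at `path`: write to a temp file, force it to disk, then atomically
    rename it over the previous checkpoint."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".ckpt.tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())   # make sure the bytes reach the device
        os.replace(tmp, path)      # atomic on POSIX: reader sees old or new, never partial
    except BaseException:
        os.unlink(tmp)
        raise

# toy usage: a real application would dump its full internal state here
ckpt_dir = tempfile.mkdtemp()
ckpt_path = os.path.join(ckpt_dir, "model.ckpt")
write_checkpoint({"step": 1200, "sim_time": "1984-02-01T06:00"}, ckpt_path)
with open(ckpt_path) as f:
    print(json.load(f)["step"])  # -> 1200
```

On restart, the application either finds the previous complete checkpoint or the new complete one, which is exactly the guarantee a plain overwrite-in-place cannot give.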
Joined: 29 Oct 17 Posts: 1044 Credit: 16,196,312 RAC: 12,647
"I'm not entirely sure what 'checkpointing' really means."
"CPDN has a particular problem with checkpoints. The amount of data that has to be recorded to catch the complete internal state of the model so far is much greater than for most other projects. In some cases - slower drives or interfaces, heavily contended devices, or cached 'lazy write' drives - it can take a significant amount of time before the stored data is complete and usable. I think the majority of problems in the past will have been caused by one or more of these delays causing the image on disk to be incomplete and unreadable on restart."
Thanks Richard. I should have said that I understand the concept of checkpointing, but not how the implementation is applied for CPDN. OpenIFS, when it's running, has no knowledge of whatever checkpointing is set on the boinc client side, and I don't intend to implement it, exactly for the reasons you describe.

I had to alter the checkpointing in OpenIFS to keep only one set of checkpoint files on the machine. Normally we would not delete the older checkpoint files, so that if the most recent one is corrupt we can fall back to an earlier one. Unfortunately, that puts too much data in the slot directory. So if the single checkpoint does corrupt, that's the end of the task. I've never seen it happen to date, though, with OIFS under boinc.
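The "keep only the newest checkpoint set" policy described here amounts to pruning older checkpoint files after each successful dump. A sketch of that housekeeping step; the `ckpt_<step>` naming scheme and the `prune_checkpoints` helper are invented for illustration and are not what OpenIFS actually uses:

```python
import os, re, tempfile

def prune_checkpoints(slot_dir, keep=1):
    """Delete all but the `keep` most recent checkpoint sets in a slot
    directory, where 'ckpt_<step>' files are ordered by their step number.
    Returns the names of the checkpoints kept."""
    pat = re.compile(r"^ckpt_(\d+)$")
    ckpts = sorted(
        (f for f in os.listdir(slot_dir) if pat.match(f)),
        key=lambda f: int(pat.match(f).group(1)),
    )
    for old in ckpts[:-keep]:
        os.remove(os.path.join(slot_dir, old))
    return ckpts[-keep:]

# toy usage: three checkpoint sets exist, only the newest survives
slot = tempfile.mkdtemp()
for step in (100, 200, 300):
    open(os.path.join(slot, f"ckpt_{step}"), "w").close()

print(prune_checkpoints(slot, keep=1))  # -> ['ckpt_300']
```

With `keep=2` the fall-back-to-older-checkpoint behaviour is preserved, at the cost of the extra slot-directory space the post describes.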
Joined: 29 Oct 17 Posts: 1044 Credit: 16,196,312 RAC: 12,647
I'm hoping someone can help me make sense of this. What appears to be happening is a resource backoff which counts down, but when it reaches zero I get no tasks, even though I know there are plenty in the queue today, and there are no tasks running on my machine. I've browsed the forums but not found a satisfactory answer. If I turn on 'work_fetch_debug' (thanks Richard), I see this sequence (I've deleted a few lines for brevity):

...
Tue 15 Nov 2022 16:05:53 GMT | climateprediction.net | [work_fetch] share 0.000 project is backed off (resource backoff: 44.33, inc 9600.00) <<< about to go to zero
Tue 15 Nov 2022 16:05:53 GMT | climateprediction.net | can't fetch CPU: project is backed off
...
Tue 15 Nov 2022 16:06:54 GMT | climateprediction.net | choose_project: scanning
Tue 15 Nov 2022 16:06:54 GMT | climateprediction.net | can fetch CPU
Tue 15 Nov 2022 16:06:54 GMT | climateprediction.net | CPU needs work - buffer low
Tue 15 Nov 2022 16:06:54 GMT | climateprediction.net | checking CPU
Tue 15 Nov 2022 16:06:54 GMT | climateprediction.net | [work_fetch] set_request() for CPU: ninst 4 nused_total 0.00 nidle_now 1.00 fetch share 1.00 req_inst 4.00 req_secs 51840.00
Tue 15 Nov 2022 16:06:54 GMT | climateprediction.net | CPU set_request: 51840.000000
Tue 15 Nov 2022 16:06:54 GMT | climateprediction.net | [work_fetch] request: CPU (51840.00 sec, 4.00 inst)
Tue 15 Nov 2022 16:06:54 GMT | climateprediction.net | Sending scheduler request: To fetch work.
Tue 15 Nov 2022 16:06:54 GMT | climateprediction.net | Requesting new tasks for CPU
Tue 15 Nov 2022 16:06:55 GMT | climateprediction.net | Scheduler request completed: got 0 new tasks << server says no!
Tue 15 Nov 2022 16:06:55 GMT | climateprediction.net | No tasks sent
Tue 15 Nov 2022 16:06:55 GMT | climateprediction.net | Project requested delay of 3636 seconds << and we go around again.

This is a Mint 21 machine which has successfully run HadSM4 before. I've checked the resources given to boinc. The only disk limit is to leave 100 GB free on the disk (there's 204 GB free). It has 32 GB RAM and boinc is allowed to use 75%. Any thoughts from the experts? Thanks. It would be nice if the server returned an error code I could look up, instead of just saying 'no'.
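The "resource backoff ... inc 9600.00" line in a log like this is the client's growing retry interval after failed work requests: each empty reply roughly doubles the interval up to a cap, and a successful fetch resets it. A toy model of that behaviour; the doubling rule is the general shape of BOINC's client-side backoff, but the constants here are illustrative, not BOINC's exact tuning:

```python
class ResourceBackoff:
    """Toy model of the client-side 'resource backoff' seen in
    work_fetch_debug output: each failed request grows the interval
    (the 'inc' value) up to a cap; a successful fetch resets it."""

    def __init__(self, base=600.0, cap=86400.0):
        self.base, self.cap = base, cap
        self.inc = 0.0  # current backoff interval in seconds

    def request_failed(self):
        # first failure starts at `base`, later ones double, capped
        self.inc = min(self.cap, self.base if self.inc == 0 else self.inc * 2)
        return self.inc

    def request_succeeded(self):
        self.inc = 0.0

b = ResourceBackoff()
print([b.request_failed() for _ in range(6)])
# -> [600.0, 1200.0, 2400.0, 4800.0, 9600.0, 19200.0]
```

Note how a value like 9600 appears naturally after a handful of consecutive empty replies, which is consistent with a host that keeps asking and keeps being told 'no'.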
Joined: 1 Jan 07 Posts: 1058 Credit: 36,592,247 RAC: 15,721
It would be appreciated by users - even if only in v1.01 - if you could listen for and obey the setting "Request tasks to checkpoint at most every xxx seconds". In other words, if you've checkpointed within the last xxx seconds, skip the checkpoint loop this time round. That saves wear and tear on the hardware, and (by skipping code) might even speed things up slightly. Every little saving helps.
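The behaviour being requested, checkpoint only if the last one is at least xxx seconds old, is a small gate in front of the checkpoint loop. A sketch under that assumption (hypothetical helper, not CPDN or BOINC code):

```python
import time

class CheckpointGate:
    """Honour a 'checkpoint at most every N seconds' setting: the model's
    checkpoint loop asks due() each timestep and skips the dump when the
    last checkpoint is too recent. `clock` is injectable for testing."""

    def __init__(self, min_interval_s, clock=time.monotonic):
        self.min_interval = min_interval_s
        self.clock = clock
        self.last = -float("inf")  # so the first call always checkpoints

    def due(self):
        now = self.clock()
        if now - self.last < self.min_interval:
            return False  # skip the checkpoint loop this time round
        self.last = now
        return True

# toy usage with a fake clock ticking 0, 30, 61, 90, 130 seconds
times = iter([0, 30, 61, 90, 130])
gate = CheckpointGate(60, clock=lambda: next(times))
print([gate.due() for _ in range(5)])  # -> [True, False, True, False, True]
```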
Joined: 29 Oct 17 Posts: 1044 Credit: 16,196,312 RAC: 12,647
"It would be appreciated by users - even if only in v1.01 - if you could listen for and obey the setting "Request tasks to checkpoint at most every xxx seconds". In other words, if you've checkpointed within the last xxx seconds, skip the checkpoint loop this time round. That saves wear and tear on the hardware, and (by skipping code) might even speed things up slightly. Every little saving helps."
The model takes 30 seconds to 2 minutes to complete a timestep, depending on resolution and machine speed. Suppose the user decides to set a checkpoint every 2 minutes: that will make the model dump its memory every timestep, which will kill performance and result in a lot of unnecessary I/O to the hardware. It will also trigger extra code to be run (not less). We did tests in the early days to find the best balance between I/O load and the cost of repeating a few extra model steps should a restart be needed. It's not something I want volunteers to be altering without a good understanding of how the model works. The model checkpoint frequency is fixed at run start; it's not dynamic as you suggest.
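The balance described here, I/O cost of frequent checkpoints versus timesteps repeated after a restart, can be put into a rough cost model. All the numbers below are illustrative, not measurements from the actual model:

```python
def overhead(step_s, ckpt_interval_steps, ckpt_cost_s, total_steps, restarts):
    """Total overhead of a checkpointing policy: time spent writing
    checkpoints plus time spent re-running timesteps after restarts.
    On average a restart loses half a checkpoint interval of work."""
    n_ckpts = total_steps // ckpt_interval_steps
    io_cost = n_ckpts * ckpt_cost_s
    rework = restarts * (ckpt_interval_steps / 2) * step_s
    return io_cost + rework

# 60 s timesteps, a 10 s checkpoint dump, 720 steps (12 h), one restart:
print(overhead(60, 1, 10, 720, 1))   # checkpoint every step: 7230.0 s overhead
print(overhead(60, 60, 10, 720, 1))  # checkpoint every 60 steps: 1920.0 s overhead
```

Even this crude model shows why checkpointing every timestep is the worst of both worlds: the I/O cost dwarfs the tiny amount of rework it saves.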
Joined: 1 Jan 07 Posts: 1058 Credit: 36,592,247 RAC: 15,721
Thoughts on work fetch.

The first obvious point is: "if you don't ask, you can't get". The main reason for not asking is "resource backoff": BOINC applies this, ever more aggressively, when you ask but the server gives you nothing, without any thought given to the reason for the failure. Sometimes a reason is given in the server reply, but usually not, as you've found out. The other main reason is "not highest priority project", which only appears if you have multiple projects active at the same time. You've dodged that one.

Next point: how do you maximise your chances of receiving work, once you're asking? I always suggest that asking for a smaller amount at a time helps. The more you ask for, the more work the server has to do, looping through the lists of available tasks and doing quite complicated tests on each one to see if it's suitable. BOINC servers tend to have multiple scheduler instances running at the same time, all querying the same cached list of maybe 200 tasks offered by a process called the 'feeder'. The quicker you can nip in and out of that melee, the better.

The server does log its activity, with reasons. The only project that exposes that information is Einstein, so far as I know. For each host, the 'computer details' page on their website has a live link for 'Last time contacted server'. That leads to the server log for that transaction. Try my https://einsteinathome.org/host/12808716/log.

I'm still working with David and Laurence to track down that 'MT task oversupply' bug. This morning, I herded all the cats into line, and managed to send David this evidence and analysis: for the deadline check, the units are "wallclock time per task", so the calculation is correct. But for the 'need more work' check, the units are "cpu-core time per task", which for MT tasks is six times larger.
2022-11-15 09:50:33.2143 [PID=189622] [send_job] [WU#2242433] est delay 0, skipping deadline check
2022-11-15 09:50:33.2376 [PID=189622] [send] Sending app_version ATLAS 4 125 native_mt; projected 17.42 GFLOPS
2022-11-15 09:50:33.2378 [PID=189622] [send_job] est. duration for WU 2242433: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2378 [PID=189622] [send] [HOST#4741] sending [RESULT#3147325 SsXLDmBy5E2n7Olcko1bjSoqABFKDmABFKDmFymXDmk7MKDmQ02bSo_0] (est. dur. 621.10s (0h10m21s09)) (max time 344509795.71s (95697h09m55s71))
2022-11-15 09:50:33.2382 [PID=189622] [send_job] est. duration for WU 2242442: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2382 [PID=189622] [send_job] [WU#2242442] meets deadline: 621.10 + 621.10 < 604800
2022-11-15 09:50:33.2447 [PID=189622] [send] Sending app_version ATLAS 4 125 native_mt; projected 17.42 GFLOPS
2022-11-15 09:50:33.2449 [PID=189622] [send_job] est. duration for WU 2242442: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2449 [PID=189622] [send] [HOST#4741] sending [RESULT#3147334 BOMLDmLy5E2n7Olcko1bjSoqABFKDmABFKDmFymXDmt7MKDm6ueZmn_0] (est. dur. 621.10s (0h10m21s09)) (max time 344509795.71s (95697h09m55s71))
2022-11-15 09:50:33.2454 [PID=189622] [send_job] est. duration for WU 2242460: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2454 [PID=189622] [send_job] [WU#2242460] meets deadline: 1242.19 + 621.10 < 604800
2022-11-15 09:50:33.2527 [PID=189622] [send] Sending app_version ATLAS 4 125 native_mt; projected 17.42 GFLOPS
2022-11-15 09:50:33.2530 [PID=189622] [send_job] est. duration for WU 2242460: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2530 [PID=189622] [send] [HOST#4741] sending [RESULT#3147352 LriLDmfy5E2n7Olcko1bjSoqABFKDmABFKDmFymXDmB8MKDmQpCRXm_0] (est. dur. 621.10s (0h10m21s09)) (max time 344509795.71s (95697h09m55s71))
2022-11-15 09:50:33.2536 [PID=189622] [send_job] est. duration for WU 2242454: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2536 [PID=189622] [send_job] [WU#2242454] meets deadline: 1863.29 + 621.10 < 604800
2022-11-15 09:50:33.2604 [PID=189622] [send] Sending app_version ATLAS 4 125 native_mt; projected 17.42 GFLOPS
2022-11-15 09:50:33.2605 [PID=189622] [send_job] est. duration for WU 2242454: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2605 [PID=189622] [send] [HOST#4741] sending [RESULT#3147346 648LDmZy5E2n7Olcko1bjSoqABFKDmABFKDmFymXDm57MKDm3ZnlUm_0] (est. dur. 621.10s (0h10m21s09)) (max time 344509795.71s (95697h09m55s71))
2022-11-15 09:50:33.2607 [PID=189622] [send_job] est. duration for WU 2242451: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2607 [PID=189622] [send_job] [WU#2242451] meets deadline: 2484.38 + 621.10 < 604800
2022-11-15 09:50:33.2670 [PID=189622] [send] Sending app_version ATLAS 4 125 native_mt; projected 17.42 GFLOPS
2022-11-15 09:50:33.2674 [PID=189622] [send_job] est. duration for WU 2242451: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2674 [PID=189622] [send] [HOST#4741] sending [RESULT#3147343 t98LDmWy5E2n7Olcko1bjSoqABFKDmABFKDmFymXDm27MKDmi4tacn_0] (est. dur. 621.10s (0h10m21s09)) (max time 344509795.71s (95697h09m55s71))
2022-11-15 09:50:33.2684 [PID=189622] [send_job] est. duration for WU 2242457: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2684 [PID=189622] [send_job] [WU#2242457] meets deadline: 3105.48 + 621.10 < 604800
2022-11-15 09:50:33.2757 [PID=189622] [send] Sending app_version ATLAS 4 125 native_mt; projected 17.42 GFLOPS
2022-11-15 09:50:33.2758 [PID=189622] [send_job] est. duration for WU 2242457: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2758 [PID=189622] [send] [HOST#4741] sending [RESULT#3147349 hYXMDmcy5E2n7Olcko1bjSoqABFKDmABFKDmFymXDm87MKDmmC5GUn_0] (est. dur. 621.10s (0h10m21s09)) (max time 344509795.71s (95697h09m55s71))
2022-11-15 09:50:33.2760 [PID=189622] [send_job] est. duration for WU 2242461: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2760 [PID=189622] [send_job] [WU#2242461] meets deadline: 3726.58 + 621.10 < 604800
2022-11-15 09:50:33.2836 [PID=189622] [send] Sending app_version ATLAS 4 125 native_mt; projected 17.42 GFLOPS
2022-11-15 09:50:33.2837 [PID=189622] [send_job] est. duration for WU 2242461: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2837 [PID=189622] [send] [HOST#4741] sending [RESULT#3147353 IUJMDmgy5E2n7Olcko1bjSoqABFKDmABFKDmFymXDmC8MKDmEJVcBn_0] (est. dur. 621.10s (0h10m21s09)) (max time 344509795.71s (95697h09m55s71))
2022-11-15 09:50:33.2839 [PID=189622] [send_job] est. duration for WU 2242444: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2839 [PID=189622] [send_job] [WU#2242444] meets deadline: 4347.67 + 621.10 < 604800
2022-11-15 09:50:33.2896 [PID=189622] [send] Sending app_version ATLAS 4 125 native_mt; projected 17.42 GFLOPS
2022-11-15 09:50:33.2897 [PID=189622] [send_job] est. duration for WU 2242444: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2898 [PID=189622] [send] [HOST#4741] sending [RESULT#3147336 egONDmNy5E2n7Olcko1bjSoqABFKDmABFKDmFymXDmv7MKDmCPtgMn_0] (est. dur. 621.10s (0h10m21s09)) (max time 344509795.71s (95697h09m55s71))
2022-11-15 09:50:33.2900 [PID=189622] [send_job] est. duration for WU 2242446: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.2900 [PID=189622] [send_job] [WU#2242446] meets deadline: 4968.77 + 621.10 < 604800
2022-11-15 09:50:33.2992 [PID=189622] [send] Sending app_version ATLAS 4 125 native_mt; projected 17.42 GFLOPS
2022-11-15 09:50:33.2999 [PID=189622] [send_job] est. duration for WU 2242446: unscaled 620.12 scaled 621.10
2022-11-15 09:50:33.3000 [PID=189622] [send] [HOST#4741] sending [RESULT#3147338 rWONDmQy5E2n7Olcko1bjSoqABFKDmABFKDmFymXDmx7MKDm2Qitwm_0] (est. dur. 621.10s (0h10m21s09)) (max time 344509795.71s (95697h09m55s71))
2022-11-15 09:50:33.3018 [PID=189622] [send] don't need more work
2022-11-15 09:50:33.3064 [PID=189622] Sending reply to [HOST#4741]: 9 results, delay req 61.00

Most of those are debug elements which David had asked for, chosen from https://boinc.berkeley.edu/trac/wiki/ProjectOptions#Logging. If you ask Andy nicely ...
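The units mismatch analysed above, wall-clock versus cpu-core seconds for MT tasks, can be illustrated with a toy calculation (not the scheduler's real code; the request size is made up):

```python
def tasks_sent(req_secs, est_wall_s, ncpus):
    """Illustrate the MT oversupply: the client asks for `req_secs` of
    work, but if the server's 'need more work' check counts each MT task
    as only its wall-clock duration instead of wall-clock * ncpus
    (cpu-core seconds), it keeps sending roughly ncpus times too many.
    Returns (tasks with the wrong units, tasks with the right units)."""
    per_task_wall = est_wall_s            # wall-clock time per task
    per_task_cpu = est_wall_s * ncpus     # cpu-core time per task
    wrong = -(-req_secs // per_task_wall)  # ceiling division
    right = -(-req_secs // per_task_cpu)
    return int(wrong), int(right)

# 1 hour of work requested, ~621 s wall-clock per task, 6-core MT tasks:
print(tasks_sent(3600, 621, 6))  # -> (6, 1)
```

With the wall-clock accounting the server fills the request with six tasks; counted in cpu-core seconds, a single 6-core task already covers it.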
Joined: 1 Jan 07 Posts: 1058 Credit: 36,592,247 RAC: 15,721
"The model checkpoint frequency is fixed at run start, it's not dynamic as you suggest."
I was describing the generic schema for BOINC projects as a whole, which in general can checkpoint very quickly and easily. Your case is different, and you quite understandably want to set a longer minimum interval to allow the main app to get on with it. The default minimum is 60 seconds, but some users might want to increase even your extended minimum. I was only suggesting that the minimum might be extensible outwards, not reduced.
Joined: 29 Oct 17 Posts: 1044 Credit: 16,196,312 RAC: 12,647
"Your case is different, and you quite understandably want to set a longer minimum interval to allow the main app to get on with it. The default minimum is 60 seconds. But some users might want to increase even your extended minimum. I was only suggesting that the minimum might be extensible outwards, not reduced."
In practice we've found few instances where restarts are needed if the model task completes in 10-12 hrs. If the checkpoint interval is increased by the user, that will result in longer runtimes, as the model will have to repeat more timesteps from the previous checkpoint after a restart. As each step is relatively time-consuming, that needs to be balanced against reducing I/O. It also duplicates model output, resulting in bigger output files. This is why we spent time experimenting with different checkpointing options, to get one as near optimal as possible for different model resolutions.
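For background, the classical first-order rule of thumb for this trade-off is Young's approximation for the checkpoint interval that minimises expected lost time. It is offered here as textbook context, not as the tuning CPDN actually arrived at, and the example numbers are invented:

```python
import math

def young_interval(ckpt_cost_s, mtbf_s):
    """Young's first-order approximation: the checkpoint interval that
    minimises expected time lost to checkpoint I/O plus rework after a
    failure is sqrt(2 * C * MTBF), where C is the cost of writing one
    checkpoint and MTBF the mean time between interruptions."""
    return math.sqrt(2 * ckpt_cost_s * mtbf_s)

# e.g. a 30 s state dump and one interruption a day (86400 s):
print(round(young_interval(30, 86400)))  # -> 2277 seconds, about 38 minutes
```

The shape of the formula matches the experimental finding above: cheaper checkpoints or more frequent interruptions pull the optimal interval down, expensive dumps on reliable machines push it up.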
©2024 cpdn.org