Message boards : Number crunching : Why does this task fail ?
Author | Message |
---|---|
Joined: 7 Jun 17 Posts: 23 Credit: 44,434,789 RAC: 2,600,991 |
Hello. Could I please ask for clarification? I am generating several 'Error while computing' results per day per host. Here is a typical one:
22286062 12199858 1534812 12 Jan 2023, 5:38:05 UTC 18 Jan 2023, 15:26:09 UTC Error while computing 73,196.55 73,196.55 --- OpenIFS 43r3 Perturbed Surface v1.05 x86_64-pc-linux-gnu
The last few lines of the stderr output for this task are these:
[...]
Zipping up the final file: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_0214_2014050100_123_983_12199858_0_r1018304785_122.zip
Uploading the final file: upload_file_122.zip
Uploading trickle at timestep: 10623600
14:47:17 (32691): called boinc_finish(0)
</stderr_txt>
]]>
Could someone please explain why the model finishes with a 'final file' ***_122.zip? Are these errors only detectable after the model has run to completion and been uploaded? I'm not sure where to start trying to reduce this high rate of errors. I have restricted all hosts with app_configs that allow 5.5G memory per task, leaving ~10G free for the system. Where should I start unpicking these errors?
Many thanks
fraser |
Joined: 5 Aug 04 Posts: 178 Credit: 20,265,870 RAC: 32,121 |
And once again this task has failed: https://www.cpdn.org/result.php?resultid=22262774
Here is the log snippet from the end of the run:
18-Jan-2023 10:55:53 [climateprediction.net] Computation for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 finished
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_115.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_116.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_117.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_118.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_119.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_120.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_121.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_122.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
Before starting this task, the box was completely rebooted, and I never paused or suspended the WU or BOINC.
I have the complete BOINC stdoutdae.txt from the run, so if anyone is interested in looking through it, I can provide it. I have checked syslog but couldn't find anything related. I kept a copy of the syslog and can provide that too.
Some more thoughts: At the moment, I'm running OpenIFS on three virtual boxes. They are all clones of one single master Ubuntu 22.04 LTS and sit on different (hardware) hosts. All machines should have more memory than needed for their tasks and, although there is enough free space on HD, "Leave application in memory" is selected / activated. BOINC may use 100%/100% of available RAM.
ATLAS1_L1 works fine, 8 OpenIFS successful, 0 failed, runs 1 OpenIFS and 1x4-Core Atlas-Native, 16 GB RAM
ATLAS5_L1 works fine, 7 OpenIFS successful, 0 failed, runs 1 OpenIFS and 1x4-Core Atlas-Native, 32 GB RAM
ATLAS7_L1 struggles, 3 OpenIFS successful, 7 failed, runs 1 OpenIFS and 2x4-Core Atlas-Native, 32 GB RAM
What I could still test is running only 1x4-Core Atlas-Native on ATLAS7_L1. Another option would be to run OpenIFS in a second instance and check whether this brings any progress.
Any thoughts or ideas?
@Dave: Thanks for your links. I already checked them, but so far could find nothing that helped.
Supporting BOINC, a great concept ! |
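For anyone comparing several failed runs, the "absent" lines in a log excerpt like the one above can be tallied mechanically. The following is a minimal Python sketch, assuming the event-log wording quoted in the post ("Output file ... for task ... absent"); the function name `absent_upload_indices` and the regular expression are illustrative and not part of BOINC.

```python
import re

# Matches event-log lines of the form quoted in the post:
#   "... Output file <stem>_<index>.zip for task <task> absent"
ABSENT_RE = re.compile(r"Output file (\S+?)_(\d+)\.zip for task (\S+) absent")


def absent_upload_indices(log_text):
    """Return {task name: sorted list of upload indices reported as absent}."""
    by_task = {}
    for match in ABSENT_RE.finditer(log_text):
        _stem, index, task = match.groups()
        by_task.setdefault(task, set()).add(int(index))
    return {task: sorted(indices) for task, indices in by_task.items()}


sample = (
    "18-Jan-2023 10:55:53 [climateprediction.net] Computation for task "
    "oifs_43r3_ps_0294_1992050100_123_961_12177938_0 finished\n"
    "18-Jan-2023 10:55:54 [climateprediction.net] Output file "
    "oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_115.zip "
    "for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent\n"
    "18-Jan-2023 10:55:54 [climateprediction.net] Output file "
    "oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_116.zip "
    "for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent\n"
)

print(absent_upload_indices(sample))
```

Feeding it the full stdoutdae.txt would show at a glance which upload numbers never made it into the slot, which may help narrow down when during the run things went wrong.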
Joined: 12 Apr 21 Posts: 319 Credit: 15,031,602 RAC: 4,207 |
Yeti, Are you still getting that system time mismatch issue in the Event log, like you posted last time? |
Joined: 5 Aug 04 Posts: 178 Credit: 20,265,870 RAC: 32,121 |
Yeti,
18-Jan-2023 01:22:30 [climateprediction.net] Started upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_31.zip
18-Jan-2023 01:22:30 [climateprediction.net] Started upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_36.zip
18-Jan-2023 01:22:30 [---] New system time (1674001351) < old system time (1674004963); clearing timeouts
18-Jan-2023 01:26:24 [climateprediction.net] Temporarily failed upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_31.zip: transient HTTP error
18-Jan-2023 01:26:24 [climateprediction.net] Backing off 00:56:55 on upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_31.zip
18-Jan-2023 01:26:51 [climateprediction.net] Finished upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_36.zip
Now CPDN seems to sleep for nearly an hour. Look at the time in relation to the new system time:
18-Jan-2023 02:22:45 [climateprediction.net] Started upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_37.zip
18-Jan-2023 02:23:21 [climateprediction.net] Started upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_31.zip
18-Jan-2023 02:26:13 [climateprediction.net] Finished upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_37.zip
18-Jan-2023 02:26:57 [climateprediction.net] Finished upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_31.zip
And here comes the next trickle:
18-Jan-2023 02:27:22 [climateprediction.net] Started upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_38.zip
18-Jan-2023 02:30:25 [climateprediction.net] Finished upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_38.zip
18-Jan-2023 02:34:14 [climateprediction.net] Started upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_39.zip
18-Jan-2023 02:34:14 [climateprediction.net] Sending scheduler request: To send trickle-up message.
18-Jan-2023 02:34:14 [climateprediction.net] Not requesting tasks: don't need ()
18-Jan-2023 02:34:15 [climateprediction.net] Scheduler request completed
18-Jan-2023 02:34:15 [climateprediction.net] Project requested delay of 3636 seconds
Supporting BOINC, a great concept ! |
Joined: 5 Aug 04 Posts: 178 Credit: 20,265,870 RAC: 32,121 |
18-Jan-2023 01:22:30 [---] New system time (1674001351) < old system time (1674004963); clearing timeouts
At the moment I have no idea why this happens, but the difference seems to be 60.2 minutes. Could it have something to do with UTC versus GMT or similar?
Supporting BOINC, a great concept ! |
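The size of the clock step can be read straight from that log line. The sketch below is a hedged illustration in Python, assuming the two numbers in parentheses are Unix epoch seconds; `clock_step` and `STEP_RE` are hypothetical names, not BOINC functions.

```python
import re
from datetime import datetime, timezone

# Matches the BOINC message quoted in the post.
STEP_RE = re.compile(r"New system time \((\d+)\) < old system time \((\d+)\)")


def clock_step(line):
    """Return (new time in UTC, old time in UTC, seconds stepped back), or None."""
    match = STEP_RE.search(line)
    if match is None:
        return None
    new_s, old_s = int(match.group(1)), int(match.group(2))
    new_utc = datetime.fromtimestamp(new_s, tz=timezone.utc)
    old_utc = datetime.fromtimestamp(old_s, tz=timezone.utc)
    return new_utc, old_utc, old_s - new_s


line = ("18-Jan-2023 01:22:30 [---] New system time (1674001351) "
        "< old system time (1674004963); clearing timeouts")
new_utc, old_utc, stepped_back = clock_step(line)
print(new_utc.isoformat())   # 2023-01-18T00:22:31+00:00
print(old_utc.isoformat())   # 2023-01-18T01:22:43+00:00
print(stepped_back, round(stepped_back / 60, 1))  # 3612 60.2
```

The clock was set back by 3612 seconds, which matches the roughly 60.2 minutes mentioned in the post.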
Joined: 12 Apr 21 Posts: 319 Credit: 15,031,602 RAC: 4,207 |
Yeti, I feel like time mismatch is a VBox issue. Maybe more specifically running CPDN on VBox the way you have it set up? Not exactly sure. I've seen this error multiple times and it crashed Hadley models. I wanted to run Mac-only Hadley models and so set up macOS Mojave on VBox, installed BOINC, and added CPDN as a project. Models ran fine but that time mismatch error kept coming up. I wasn't able to find a reason or a good fix for it. The only fix that let me run the models without that error and subsequent crashes was turning off internet time sync in the macOS operating system. You could try turning off time sync in Ubuntu and see if the errors and crashes go away. Finding out the reason for the time mismatch would be another project. |
Joined: 9 Mar 22 Posts: 30 Credit: 1,065,239 RAC: 556 |
Most time sync issues in a mixed OS environment can be avoided if the hardware clock of each computer (hosts as well as VMs) is set to UTC/GMT and the computer runs an NTP service. Popular OSs set their hardware clock to UTC/GMT by default, except Windows which uses local time. But even Windows can (and should) be told via a registry key to use UTC/GMT. In case of VirtualBox VMs this can be configured while a VM is created. VMs created by vboxwrapper >= 26204 forward the host setting to the guest. See a more detailed explanation here: https://github.com/BOINC/boinc/pull/4631 |
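On a Linux host or guest, one place where the "hardware clock in UTC or local time" choice is commonly recorded is the third line of /etc/adjtime, which reads either UTC or LOCAL. The following Python sketch assumes that file format (as used by hwclock) and treats a missing or short file as UTC; the helper names `rtc_is_utc` and `read_rtc_mode` are hypothetical.

```python
from pathlib import Path


def rtc_is_utc(adjtime_text):
    """Interpret adjtime contents; True if the hardware clock is kept in UTC."""
    lines = adjtime_text.splitlines()
    if len(lines) < 3:
        # Assumption: a missing or incomplete file means the UTC default applies.
        return True
    return lines[2].strip().upper() != "LOCAL"


def read_rtc_mode(path="/etc/adjtime"):
    """Return "UTC" or "LOCAL" for the given adjtime file (UTC if it is absent)."""
    adjtime = Path(path)
    text = adjtime.read_text() if adjtime.exists() else ""
    return "UTC" if rtc_is_utc(text) else "LOCAL"


if __name__ == "__main__":
    print("Hardware clock mode:", read_rtc_mode())
```

Running this on every host and VM involved gives a quick way to spot a machine whose hardware clock is kept in local time while the others use UTC.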
Joined: 5 Aug 04 Posts: 178 Credit: 20,265,870 RAC: 32,121 |
The HOST from ATLAS7_L1 had a wrong time (1 hour difference) and it seems as if this has made it into the VM although this feature was deactivated. We have corrected this, and a new attempt is now under way.
Thanks to all for your help.
Supporting BOINC, a great concept ! |
Joined: 1 Jan 07 Posts: 1066 Credit: 36,887,369 RAC: 1,533 |
In my experience, Linux machines keep better time than Windows machines - I suspect Linux checks more often with an NTP server than Windows, which only does it once a week by default. |
Joined: 12 Apr 21 Posts: 319 Credit: 15,031,602 RAC: 4,207 |
I posted the time issue on the LHC forum too, and computezrmle also pointed me in this direction. This would've been the most advanced solution I'd have tried. I don't think I ever tried it, as turning off time sync fixed the problem of model crashes due to time mismatch, and the only reason I had the macOS VM was to run the Mac-only Hadley models, which are rare and not numerous. For Linux work I much prefer WSL2. There's a VBox setting "Hardware Clock in UTC Time"; I tried it both off and on, and both produced time mismatch BOINC issues. It seemed to me that the setting would compensate for the difference in how Windows and UNIX-based OSs expect the RTC to be set. |
Joined: 12 Apr 21 Posts: 319 Credit: 15,031,602 RAC: 4,207 |
"The HOST from ATLAS7_L1 had a wrong time (1 hour difference) and it seems as if this has made it into the VM although this feature was deactivated."
I wish you good results. For me the time on both host and guest OSs always matched up and was correct but yet the BOINC time mismatch kept showing up. |
Joined: 1 Jan 07 Posts: 1066 Credit: 36,887,369 RAC: 1,533 |
"For me the time on both host and guest OSs always matched up and was correct but yet the BOINC time mismatch kept showing up."
What was the scale of your mis-match? If it was something close to a full hour (or several), it was probably a mis-match between the underlying hardware clock in BIOS (which should always be UTC), and the user-level corrections applied for time zone and daylight saving. If it was only a few seconds, it's probably down to drift between NTP synchronisations. |
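The rule of thumb in this post can be written down as a small classifier. This is only a sketch: the thresholds (`drift_limit` of 60 s, `hour_tolerance` of 300 s) and the function name `classify_clock_step` are assumptions chosen for illustration, not values from the thread.

```python
def classify_clock_step(seconds, hour_tolerance=300, drift_limit=60):
    """Roughly label a clock step by size, following the heuristic above.

    seconds        -- size of the step (sign is ignored)
    hour_tolerance -- assumed slack around a whole number of hours
    drift_limit    -- assumed upper bound for ordinary NTP drift
    """
    seconds = abs(seconds)
    if seconds <= drift_limit:
        return "drift between NTP synchronisations"
    nearest_hours = round(seconds / 3600)
    if nearest_hours >= 1 and abs(seconds - nearest_hours * 3600) <= hour_tolerance:
        return "whole-hour offset (hardware clock vs. time zone / daylight saving)"
    return "other"


# Steps reported in this thread, in seconds (old system time minus new system time).
print(classify_clock_step(1674004963 - 1674001351))  # 3612 s
print(classify_clock_step(1648063920 - 1647911207))  # 152713 s
```

With these assumed thresholds, the 3612 s step lands in the whole-hour category, while the 152713 s step from the other Event log line quoted in this thread fits neither pattern.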
Joined: 12 Apr 21 Posts: 319 Credit: 15,031,602 RAC: 4,207 |
"What was the scale of your mis-match?"
It was significant, about 2 days. One of the Event log lines from back then:
New system time (1647911207) < old system time (1648063920); clearing timeouts
I kept an eye on things for a bit; both OS times matched to the second, and I checked the time via other sources to make sure it was correct, which it was. |
Joined: 9 Mar 22 Posts: 30 Credit: 1,065,239 RAC: 556 |
IIRC most NTP services start with a delay of 8 s (maybe 16 s) between consecutive time queries. After each successful reply they double the delay, up to 1024 s. A delay of a week seems way too long to get a stable time base. In addition, the manuals at ntp.org suggest configuring either 1 time server (or pool) or 4+ time servers (or pools) to get the time from. Details can be found here: http://ntp.org/ |
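The back-off described in this post (each polling interval a power of two, growing to 1024 s) can be illustrated with a few lines of Python. The starting value of 8 s follows the poster's recollection and is an assumption; actual NTP implementations may start at a different interval, and `poll_intervals` is a hypothetical helper name.

```python
def poll_intervals(start=8, maximum=1024):
    """Yield polling intervals in seconds, doubling from start up to maximum."""
    interval = start
    while interval <= maximum:
        yield interval
        interval *= 2


print(list(poll_intervals()))
# [8, 16, 32, 64, 128, 256, 512, 1024]
```

Even at the 1024 s ceiling, the clock is checked about every 17 minutes, which is a very different regime from a once-a-week check.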
Joined: 5 Aug 04 Posts: 178 Credit: 20,265,870 RAC: 32,121 |
"I wish you good results. For me the time on both host and guest OSs always matched up and was correct but yet the BOINC time mismatch kept showing up."
I'm very optimistic because on HOST01 / ATLAS1_L1 and HOST05 / ATLAS5_L1 this never happened and at the moment they have a success-rate of 100%
Supporting BOINC, a great concept ! |
Joined: 1 Jan 07 Posts: 1066 Credit: 36,887,369 RAC: 1,533 |
I got some experience of this in a previous life, when I installed a number of Microsoft 'Small Business Servers' (Windows-based). My eventual technique was to deliberately set up an NTP service that regularly checked a local public NTP authority - I think my nearest one was an atomic clock at Manchester University. After that, workstations attached to the SBS server were (by default, I think) kept in sync with the on-premises server. I developed that habit after an experience at one company. This was back in the days of Windows 98 or Windows 2000: the boss of the company developed the habit of using the Windows clock as a holiday planner - and clicking 'OK' after his session. After that, every customer record that he annotated had timestamps weeks or even months into the future... |
Joined: 5 Aug 04 Posts: 178 Credit: 20,265,870 RAC: 32,121 |
"I'm very optimistic because on HOST01 / ATLAS1_L1 and HOST05 / ATLAS5_L1 this never happened and at the moment they have a success-rate of 100%"
This was really the right solution. Since I fixed this on my side, I have already crunched 7 or more WUs with 100% success
Supporting BOINC, a great concept ! |
Joined: 12 Apr 21 Posts: 319 Credit: 15,031,602 RAC: 4,207 |
"This was really the right solution. Since I fixed this on my side, I have already crunched 7 or more WUs with 100% success"
Good to hear. At least I have a workaround for this issue with the macOS VM even if I don't know where the problem is. |