Why does this task fail ?

leloft

Joined: 7 Jun 17
Posts: 23
Credit: 44,434,789
RAC: 2,600,991
Message 67860 - Posted: 18 Jan 2023, 18:41:50 UTC

Hello. Could I please ask for clarification? I am generating several 'Error while Computing' results per day per host. Here is a typical one:
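Task: 22286062
Work unit: 12199858
Computer: 1534812
Sent: 12 Jan 2023, 5:38:05 UTC
Time reported: 18 Jan 2023, 15:26:09 UTC
Status: Error while computing
Run time (sec): 73,196.55
CPU time (sec): 73,196.55
Credit: ---
Application: OpenIFS 43r3 Perturbed Surface v1.05 x86_64-pc-linux-gnu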

The last few lines of the stderr output for this task are these

[...]
Zipping up the final file: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_0214_2014050100_123_983_12199858_0_r1018304785_122.zip
Uploading the final file: upload_file_122.zip
Uploading trickle at timestep: 10623600
14:47:17 (32691): called boinc_finish(0)

</stderr_txt>
]]>

Could someone please explain why the model finishes with a 'final file' ***_122.zip? Are these errors only detectable after the model has run to completion and been uploaded? I'm not sure where to start trying to reduce this high rate of errors.

I have restricted all hosts with app_configs that allow 5.5G of memory per task, leaving ~10G free for the system. Where should I start unpicking these errors?

Many thanks
fraser
ID: 67860
Yeti
Joined: 5 Aug 04
Posts: 171
Credit: 10,364,481
RAC: 21,716
Message 67861 - Posted: 18 Jan 2023, 20:24:32 UTC
Last modified: 18 Jan 2023, 20:24:50 UTC

And once again this task has failed: https://www.cpdn.org/result.php?resultid=22262774

Here is the log snippet from when it finished:

18-Jan-2023 10:55:53 [climateprediction.net] Computation for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 finished
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_115.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_116.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_117.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_118.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_119.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_120.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_121.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_122.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
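
For anyone wanting to pull these "absent" messages out of a long stdoutdae.txt, here is a minimal Python sketch. It assumes the message wording is exactly as in the lines above; the function name `absent_outputs` and the `sample` list are made up for illustration, and the sample simply reuses three of the lines quoted in this post.

```python
import re

# Matches the client message shown in the log above:
# "Output file <file> for task <task> absent"
ABSENT_RE = re.compile(r"Output file (\S+) for task (\S+) absent")


def absent_outputs(log_lines):
    """Map each task name to the output files the client reported as absent."""
    missing = {}
    for line in log_lines:
        match = ABSENT_RE.search(line)
        if match:
            filename, task = match.groups()
            missing.setdefault(task, []).append(filename)
    return missing


# Three lines copied from the log excerpt above (hypothetical variable name).
sample = [
    "18-Jan-2023 10:55:53 [climateprediction.net] Computation for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 finished",
    "18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_115.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent",
    "18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_116.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent",
]

for task, files in absent_outputs(sample).items():
    print(task, "->", len(files), "absent file(s)")
```

To scan a whole log, the same function can be fed the lines of the file, e.g. `absent_outputs(open("stdoutdae.txt"))`, with the path adjusted to wherever the client keeps its log.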

Before starting this task, the box was completely rebooted, and I never paused or suspended the WU or BOINC. I have the complete BOINC stdoutdae.txt from the run, so if anyone is interested in looking through it, I can provide it.

I have checked syslog but couldn't find anything related. I kept a copy of the syslog and can provide it if someone is interested.

Some more thoughts:

At the moment, I'm running OpenIFS on three virtual boxes. They are all clones of one single master Ubuntu 22.04 LTS machine and sit on different hardware hosts.

All machines should have more memory than needed for their tasks and, although there is enough free space on HD, 'leave Application in memory' is selected / activated. BOINC may use 100%/100% of available RAM.

ATLAS1_L1 works fine, 8 OpenIFS successful, 0 failed, runs 1 OpenIFS and 1x4-Core Atlas-Native, 16 GB RAM
ATLAS5_L1 works fine, 7 OpenIFS successful, 0 failed, runs 1 OpenIFS and 1x4-Core Atlas-Native, 32 GB RAM
ATLAS7_L1 struggles, 3 OpenIFS successful, 7 failed, runs 1 OpenIFS and 2x4-Core Atlas-Native, 32 GB RAM

What I could still test is running only 1x4-Core Atlas-Native on ATLAS7_L1. Another option would be to run OpenIFS in a second instance and check whether this brings any progress.

Any thoughts or ideas?

@Dave: Thanks for your links. I already checked them, but so far could find nothing that helped.


Supporting BOINC, a great concept !
ID: 67861
AndreyOR

Joined: 12 Apr 21
Posts: 247
Credit: 12,049,958
RAC: 13,765
Message 67869 - Posted: 18 Jan 2023, 22:18:13 UTC - in response to Message 67861.  

Yeti,
Are you still getting that system time mismatch issue in the Event log, like you posted last time?
ID: 67869
Yeti
Joined: 5 Aug 04
Posts: 171
Credit: 10,364,481
RAC: 21,716
Message 67872 - Posted: 18 Jan 2023, 22:44:37 UTC - in response to Message 67869.  
Last modified: 18 Jan 2023, 22:44:52 UTC

Yeti,
Are you still getting that system time mismatch issue in the Event log, like you posted last time?
18-Jan-2023 01:22:30 [climateprediction.net] Started upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_31.zip
18-Jan-2023 01:22:30 [climateprediction.net] Started upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_36.zip
18-Jan-2023 01:22:30 [---] New system time (1674001351) < old system time (1674004963); clearing timeouts
18-Jan-2023 01:26:24 [climateprediction.net] Temporarily failed upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_31.zip: transient HTTP error
18-Jan-2023 01:26:24 [climateprediction.net] Backing off 00:56:55 on upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_31.zip
18-Jan-2023 01:26:51 [climateprediction.net] Finished upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_36.zip

Now CPDN seems to sleep for nearly an hour. Look at the time in relation to the new system time:

18-Jan-2023 02:22:45 [climateprediction.net] Started upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_37.zip
18-Jan-2023 02:23:21 [climateprediction.net] Started upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_31.zip
18-Jan-2023 02:26:13 [climateprediction.net] Finished upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_37.zip
18-Jan-2023 02:26:57 [climateprediction.net] Finished upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_31.zip

And here comes the next trickle:

18-Jan-2023 02:27:22 [climateprediction.net] Started upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_38.zip
18-Jan-2023 02:30:25 [climateprediction.net] Finished upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_38.zip
18-Jan-2023 02:34:14 [climateprediction.net] Started upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_39.zip
18-Jan-2023 02:34:14 [climateprediction.net] Sending scheduler request: To send trickle-up message.
18-Jan-2023 02:34:14 [climateprediction.net] Not requesting tasks: don't need ()
18-Jan-2023 02:34:15 [climateprediction.net] Scheduler request completed
18-Jan-2023 02:34:15 [climateprediction.net] Project requested delay of 3636 seconds


Supporting BOINC, a great concept !
ID: 67872
Yeti
Joined: 5 Aug 04
Posts: 171
Credit: 10,364,481
RAC: 21,716
Message 67875 - Posted: 18 Jan 2023, 23:19:06 UTC - in response to Message 67872.  
Last modified: 18 Jan 2023, 23:19:19 UTC

18-Jan-2023 01:22:30 [---] New system time (1674001351) < old system time (1674004963); clearing timeouts
At the moment I have no idea why this happens, but the difference seems to be 60.2 minutes. Could it have something to do with UTC versus GMT or similar?
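
The 60.2 minutes can be checked directly from the two epoch values in the log line. A small Python sketch follows, assuming the message format is exactly as quoted above; `backward_jump_seconds` is a made-up helper name, not part of BOINC.

```python
import re

# Pattern for the client message quoted above.
JUMP_RE = re.compile(r"New system time \((\d+)\) < old system time \((\d+)\)")


def backward_jump_seconds(line):
    """Return how far the clock moved backwards, in seconds, or None if no match."""
    match = JUMP_RE.search(line)
    if match is None:
        return None
    new_time, old_time = int(match.group(1)), int(match.group(2))
    return old_time - new_time


line = ("18-Jan-2023 01:22:30 [---] New system time (1674001351) "
        "< old system time (1674004963); clearing timeouts")
jump = backward_jump_seconds(line)
print(jump, "s =", round(jump / 60, 1), "min")  # 3612 s = 60.2 min
```

So the clock stepped back by one hour plus 12 seconds.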


Supporting BOINC, a great concept !
ID: 67875
AndreyOR

Joined: 12 Apr 21
Posts: 247
Credit: 12,049,958
RAC: 13,765
Message 67884 - Posted: 19 Jan 2023, 6:45:40 UTC - in response to Message 67875.  
Last modified: 19 Jan 2023, 7:20:49 UTC

Yeti,
I feel like the time mismatch is a VBox issue, maybe more specifically an issue with running CPDN on VBox the way you have it set up. I'm not exactly sure. I've seen this error multiple times, and it crashed Hadley models. I wanted to run the Mac-only Hadley models, so I set up macOS Mojave on VBox, installed BOINC, and added CPDN as a project. Models ran fine, but that time mismatch error kept coming up. I wasn't able to find a reason or a good fix for it. The only fix that let me run the models without that error and the subsequent crashes was turning off internet time sync in macOS. You could try turning off time sync in Ubuntu and see if the errors and crashes go away. Finding out the reason for the time mismatch would be another project.
ID: 67884
computezrmle

Joined: 9 Mar 22
Posts: 30
Credit: 963,113
RAC: 46,932
Message 67885 - Posted: 19 Jan 2023, 7:43:00 UTC - in response to Message 67875.  

Most time sync issues in a mixed OS environment can be avoided if the hardware clock of each computer (hosts as well as VMs) is set to UTC/GMT and the computer runs an NTP service.
Popular OSs set their hardware clock to UTC/GMT by default, except Windows which uses local time.
But even Windows can (and should) be told via a registry key to use UTC/GMT.

In case of VirtualBox VMs this can be configured while a VM is created.
VMs created by vboxwrapper >= 26204 forward the host setting to the guest.

See a more detailed explanation here:
https://github.com/BOINC/boinc/pull/4631
ID: 67885
Yeti
Joined: 5 Aug 04
Posts: 171
Credit: 10,364,481
RAC: 21,716
Message 67889 - Posted: 19 Jan 2023, 8:54:45 UTC

The HOST from ATLAS7_L1 had a wrong time (1 hour difference) and it seems as if this has made it into the VM although this feature was deactivated.

We have corrected this and now a new try begins

Thanks to all for your help


Supporting BOINC, a great concept !
ID: 67889
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,358,709
RAC: 9,735
Message 67891 - Posted: 19 Jan 2023, 9:08:03 UTC - in response to Message 67889.  

In my experience, Linux machines keep better time than Windows machines - I suspect Linux checks more often with an NTP server than Windows, which only does it once a week by default.
ID: 67891
AndreyOR

Joined: 12 Apr 21
Posts: 247
Credit: 12,049,958
RAC: 13,765
Message 67892 - Posted: 19 Jan 2023, 9:13:02 UTC - in response to Message 67885.  
Last modified: 19 Jan 2023, 9:15:11 UTC

I posted the time issue on the LHC forum too, and computezrmle also pointed me in this direction. This would've been the most advanced solution I'd have tried. I don't think I ever tried it, as turning off time sync fixed the problem of model crashes due to time mismatch, and the only reason I had the macOS VM was to run the Mac-only Hadley models, which are rare and not numerous. For Linux work I much prefer WSL2.

There's a VBox setting "Hardware Clock in UTC Time"; I tried it both off and on, and both produced time mismatch BOINC issues. It seemed to me that the setting would compensate for the difference between how Windows and UNIX-based OSs expect the RTC to be set.
ID: 67892
AndreyOR

Joined: 12 Apr 21
Posts: 247
Credit: 12,049,958
RAC: 13,765
Message 67893 - Posted: 19 Jan 2023, 9:18:41 UTC - in response to Message 67889.  
Last modified: 19 Jan 2023, 9:22:28 UTC

The HOST from ATLAS7_L1 had a wrong time (1 hour difference) and it seems as if this has made it into the VM although this feature was deactivated.

We have corrected this and now a new try begins

Thanks to all for your help

I wish you good results. For me the time on both host and guest OSs always matched up and was correct but yet the BOINC time mismatch kept showing up.
ID: 67893
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,358,709
RAC: 9,735
Message 67894 - Posted: 19 Jan 2023, 9:27:58 UTC - in response to Message 67893.  

For me the time on both host and guest OSs always matched up and was correct but yet the BOINC time mismatch kept showing up.
What was the scale of your mis-match?

If it was something close to a full hour (or several), it was probably a mis-match between the underlying hardware clock in BIOS (which should always be UTC), and the user-level corrections applied for time zone and daylight saving.

If it was only a few seconds, it's probably down to drift between NTP synchronisations.
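
This rule of thumb can be written down as a rough triage function. The Python sketch below is only an illustration: the name `classify_jump` and both cutoffs (`drift_limit`, `hour_tolerance`) are assumptions chosen for the example, not values taken from BOINC or NTP.

```python
def classify_jump(seconds, drift_limit=30, hour_tolerance=120):
    """Rough triage of a clock jump, following the rule of thumb above.

    drift_limit and hour_tolerance are hypothetical cutoffs in seconds.
    """
    size = abs(seconds)
    if size <= drift_limit:
        # Only a few seconds: probably drift between NTP synchronisations.
        return "drift"
    nearest_hour = round(size / 3600)
    if nearest_hour >= 1 and abs(size - nearest_hour * 3600) <= hour_tolerance:
        # Close to a whole number of hours: suspect a hardware clock
        # (UTC vs local time) or time zone / daylight saving mismatch.
        return "hour-offset"
    # Anything else needs a closer look.
    return "other"


print(classify_jump(3612))  # hour-offset
print(classify_jump(5))     # drift
```

A jump that is neither tiny nor near a whole number of hours falls into "other", which says nothing about the cause and only flags that neither explanation fits well.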
ID: 67894
AndreyOR

Joined: 12 Apr 21
Posts: 247
Credit: 12,049,958
RAC: 13,765
Message 67896 - Posted: 19 Jan 2023, 9:46:05 UTC - in response to Message 67894.  

What was the scale of your mis-match?

It was significant, about 2 days. One of the Event log lines from back then: New system time (1647911207) < old system time (1648063920); clearing timeouts
I kept an eye on things for a bit, and both OS times matched to the second. I also checked the time via other sources to make sure it was correct, which it was.
ID: 67896
computezrmle

Joined: 9 Mar 22
Posts: 30
Credit: 963,113
RAC: 46,932
Message 67897 - Posted: 19 Jan 2023, 10:02:26 UTC - in response to Message 67891.  

IIRC most NTP services start with a delay of 8 s (may be 16 s) between consecutive time queries.
In case of a successful reply they increase the delay by a power of 2 until 1024 s.
A delay of a week seems to be way too much to get a stable time base.

In addition, the manuals at ntp.org suggest configuring either 1 time server (or pool) or 4+ time servers (or pools) to get the time from.
Details can be found here:
http://ntp.org/
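
To make the back-off described above concrete, here is a tiny Python sketch that reads "increase the delay by a power of 2" as doubling after each successful reply. The 8 s start and 1024 s cap are simply the figures recalled in this post, not values checked against the ntp.org manuals, and `poll_intervals` is a made-up name.

```python
def poll_intervals(start=8, maximum=1024):
    """List the delays between time queries, doubling until the cap is reached."""
    intervals = []
    delay = start
    while delay <= maximum:
        intervals.append(delay)
        delay *= 2
    return intervals


print(poll_intervals())  # [8, 16, 32, 64, 128, 256, 512, 1024]
```

Even at the cap, that is a query roughly every 17 minutes, far more frequent than once a week.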
ID: 67897
Yeti
Joined: 5 Aug 04
Posts: 171
Credit: 10,364,481
RAC: 21,716
Message 67898 - Posted: 19 Jan 2023, 10:07:47 UTC - in response to Message 67893.  
Last modified: 19 Jan 2023, 10:07:58 UTC

I wish you good results. For me the time on both host and guest OSs always matched up and was correct but yet the BOINC time mismatch kept showing up.
I'm very optimistic because on HOST01 / ATLAS1_L1 and HOST05 / ATLAS5_L1 this never happened and at the moment they have a success-rate of 100%


Supporting BOINC, a great concept !
ID: 67898
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,358,709
RAC: 9,735
Message 67899 - Posted: 19 Jan 2023, 10:09:23 UTC - in response to Message 67896.  

I got some experience of this in a previous life, when I installed a number of Microsoft 'Small Business Servers' (Windows-based). My eventual technique was to deliberately set up an NTP service regularly checking a local public NTP authority - I think my nearest one was an atomic clock at Manchester University.

After that, workstations attached to the SBS server were (by default, I think) kept in sync with the on-premises server.

I developed that habit after an experience at one company. This was back in the days of Windows 98 or Windows 2000: the boss of the company developed the habit of using the Windows clock as a holiday planner - and clicking 'OK' after his session. After that, every customer record that he annotated had timestamps weeks or even months into the future...
ID: 67899
Yeti
Joined: 5 Aug 04
Posts: 171
Credit: 10,364,481
RAC: 21,716
Message 67953 - Posted: 21 Jan 2023, 21:47:40 UTC - in response to Message 67898.  

I'm very optimistic because on HOST01 / ATLAS1_L1 and HOST05 / ATLAS5_L1 this never happened and at the moment they have a success-rate of 100%
This was really the right solution. Since I fixed this on my side, I have already crunched 7 or more WUs with 100% success


Supporting BOINC, a great concept !
ID: 67953
AndreyOR

Joined: 12 Apr 21
Posts: 247
Credit: 12,049,958
RAC: 13,765
Message 67960 - Posted: 22 Jan 2023, 8:37:28 UTC - in response to Message 67953.  

This was really the right solution. Since I fixed this on my side, I have already crunched 7 or more WUs with 100% success

Good to hear.
At least I have a workaround for this issue with the macOS VM even if I don't know where the problem is.
ID: 67960

©2024 climateprediction.net