My first four tasks failed on new machine.

Jean-David Beyer

Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 63064 - Posted: 2 Dec 2020, 11:45:04 UTC

I just completed my first four tasks on my new machine running Red Hat Enterprise Linux 8. Each completed 5 trickles and did 5 uploads. They seem to have the right libraries:

[root@localhost climateprediction.net]# ldd hadam4_um_8.52_i686-pc-linux-gnu
linux-gate.so.1 (0xf7ef4000)
libdl.so.2 => /lib/libdl.so.2 (0xf7ed4000)
libm.so.6 => /lib/libm.so.6 (0xf7e02000)
libpthread.so.0 => /lib/libpthread.so.0 (0xf7de1000)
libc.so.6 => /lib/libc.so.6 (0xf7c3a000)
/lib/ld-linux.so.2 (0xf7ef6000)
[root@localhost climateprediction.net]# ldd hadam4_se_8.52_i686-pc-linux-gnu.so
linux-gate.so.1 (0xf7f4a000)
libnsl.so.1 => /lib/libnsl.so.1 (0xf7df5000)
libstdc++.so.6 => /lib/libstdc++.so.6 (0xf7c62000)
libm.so.6 => /lib/libm.so.6 (0xf7b90000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xf7b73000)
libc.so.6 => /lib/libc.so.6 (0xf79cc000)
/lib/ld-linux.so.2 (0xf7f4c000)

tasks are 219230, 21967787, 21973109, and 21973090.

They report REPLANCS IO error for the first one and
zipfile not found for the last three.

Is it you, or is it me? I assume it is me, but how should I fix this? I have 12 more work units and four of them are just beginning.

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 63065 - Posted: 2 Dec 2020, 11:55:50 UTC - in response to Message 63064.  

OOPS: first task is 21923056.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,378,503
RAC: 3,632
Message 63066 - Posted: 2 Dec 2020, 12:09:02 UTC

Just looked at the task pages; this is a new one on me, so I'm afraid I can't offer any help. It doesn't look like a missing-library problem to me, though.
Please do not private message myself or other moderators for help. This limits the number of people who are able to help and deprives others who may benefit from the answer.

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 63068 - Posted: 2 Dec 2020, 16:15:20 UTC - in response to Message 63064.  

It looks like some glitch occurred when the three failed models tried to access the .so file for creating the first zip file. Only the first monthly file failed to upload; the other monthly zips for those three models did upload. I can't tell what caused it for only the first month of the 3 tasks, but we can hope it was a one-time deal. These are likely the pertinent lines from stderr on the failed models' task pages:

Unable to load library hadam4_se_8.52_i686-pc-linux-gnu.so
dlopen error: /var/lib/boinc/projects/climateprediction.net/hadam4_se_8.52_i686-pc-linux-gnu.so: cannot restore segment prot after reloc: Permission denied
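
For anyone who wants to check their own failed tasks for the same message, here is a minimal Python sketch. It assumes the stderr text has been copied from the task page into a string; the function name `find_reloc_failures` is made up for illustration and is not part of BOINC or CPDN. On SELinux systems this loader message with "Permission denied" is commonly associated with a denied text relocation, but the sketch only detects the message, it does not diagnose the cause.

```python
import re

# Pattern for the dynamic-loader failure quoted above. The named groups
# capture the library path and the reason reported by dlopen.
DLOPEN_RE = re.compile(
    r"dlopen error: (?P<path>\S+): "
    r"cannot restore segment prot after reloc: (?P<reason>.+)"
)

def find_reloc_failures(stderr_text):
    """Return (library_path, reason) pairs for each matching stderr line."""
    hits = []
    for line in stderr_text.splitlines():
        m = DLOPEN_RE.search(line)
        if m:
            hits.append((m.group("path"), m.group("reason").strip()))
    return hits

# The two stderr lines quoted in this post.
sample = (
    "Unable to load library hadam4_se_8.52_i686-pc-linux-gnu.so\n"
    "dlopen error: /var/lib/boinc/projects/climateprediction.net/"
    "hadam4_se_8.52_i686-pc-linux-gnu.so: "
    "cannot restore segment prot after reloc: Permission denied\n"
)

for path, reason in find_reloc_failures(sample):
    print(f"{path}: {reason}")
```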

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 63070 - Posted: 2 Dec 2020, 19:06:30 UTC - in response to Message 63068.  

1.) Red Hat Enterprise Linux 8 (RHEL8), which I am running, uses an SELinux-enabled kernel that is very fussy about who can do what. If a program tries to violate one of the restrictions, the operation is aborted. On the other hand, so did the RHEL6 that I used to use.

2.) RHEL8 uses a very different approach (systemd) to starting daemon processes such as the BOINC client. In the past, I made a separate disk partition and mounted it as just another user (/home/boinc). But with RHEL8, it goes in /var/lib/boinc.

3.) It seems that the SELinux permissions are a little different, and it may have violated one of the rules. It is a shame that it took about two weeks for the problems to show up. In any case, the system logs indicated a possible problem, so I changed a rule. I will find out sometime whether that fixed it.
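
SELinux denials are recorded as AVC records in the audit log, so one way to read them is to pull out the denied permission and the file involved. The Python sketch below is only an illustration: the sample record, its context labels (`example_domain_t`, `example_file_t`) and the function name `parse_avc` are all made up, not taken from my logs, and the thread does not record which rule was actually changed. The `execmod` permission in the sample is the one usually tied to text relocations, which is an assumption about this case.

```python
import re

# ILLUSTRATIVE record only: the pid, labels and path are hypothetical.
SAMPLE_AVC = (
    'type=AVC msg=audit(1606900000.000:100): avc:  denied  { execmod } for  '
    'pid=1234 comm="example" path="/var/lib/boinc/projects/example.so" '
    'dev="sda1" ino=42 scontext=system_u:system_r:example_domain_t:s0 '
    'tcontext=system_u:object_r:example_file_t:s0 tclass=file permissive=0'
)

# Matches the "denied { ... }" part and captures the permission list.
AVC_RE = re.compile(r"avc:\s+denied\s+\{\s*(?P<perms>[^}]*?)\s*\}")
# Matches key=value pairs, where the value may be double-quoted.
FIELD_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')

def parse_avc(line):
    """Return a dict describing one AVC denial, or None if not a denial."""
    m = AVC_RE.search(line)
    if m is None:
        return None
    fields = {key: value.strip('"') for key, value in FIELD_RE.findall(line)}
    return {
        "permissions": m.group("perms").split(),
        "comm": fields.get("comm"),
        "path": fields.get("path"),
        "scontext": fields.get("scontext"),
        "tcontext": fields.get("tcontext"),
        "tclass": fields.get("tclass"),
    }

denial = parse_avc(SAMPLE_AVC)
print(denial["permissions"], denial["path"])
```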

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,378,503
RAC: 3,632
Message 63071 - Posted: 2 Dec 2020, 19:13:50 UTC - in response to Message 63070.  

Do post the answer if this does turn out to be the issue and I will put something in the BOINC fora as well.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 63072 - Posted: 2 Dec 2020, 19:17:21 UTC

It may have run into memory problems, if the models all tried to do the same thing at once.
You may need double what you have.

Look at the Properties of one that's running, and see how much memory it's using.

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 63073 - Posted: 2 Dec 2020, 22:27:53 UTC - in response to Message 63072.  

I currently have 64 gigabytes of RAM, and the four instances running are taking 1.3 GB each. Once they get going, this goes up to 1.4 GB each.

top - 17:25:01 up 1 day, 14 min, 1 user, load average: 13.71, 13.52, 13.46
Tasks: 482 total, 14 running, 467 sleeping, 0 stopped, 1 zombie
%Cpu(s): 3.0 us, 0.5 sy, 80.6 ni, 15.4 id, 0.1 wa, 0.3 hi, 0.0 si, 0.0 st
MiB Mem : 63944.0 total, 12570.7 free, 11707.1 used, 39666.2 buff/cache
MiB Swap: 15992.0 total, 15992.0 free, 0.0 used. 51386.3 avail Mem

PID PPID USER PR NI S RES SHR %MEM %CPU P TIME+ COMMAND
25133 25130 boinc 39 19 R 1.3g 19972 2.1 6.0 4 753:19.10 /var/lib/boinc/projects/climateprediction.net/hadam4_um_8.52_i68+
22591 22588 boinc 39 19 R 1.3g 19872 2.1 6.2 0 829:50.82 /var/lib/boinc/projects/climateprediction.net/hadam4_um_8.52_i68+
25164 25162 boinc 39 19 R 1.3g 19984 2.1 6.2 15 751:17.98 /var/lib/boinc/projects/climateprediction.net/hadam4_um_8.52_i68+
5527 5508 boinc 39 19 R 1.3g 19932 2.1 6.2 14 1438:55 /var/lib/boinc/projects/climateprediction.net/hadam4_um_8.52_i68+
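
To put numbers on that, here is the arithmetic as a short Python sketch. The totals come from the top output above; treating the "1.4" peak figure as GiB (as top's "g" suffix suggests) is an assumption.

```python
# Figures from the top output and the post above.
TOTAL_MIB = 63944.0        # "MiB Mem : 63944.0 total"
AVAIL_MIB = 51386.3        # "51386.3 avail Mem"
TASKS = 4                  # CPDN instances running
PEAK_GIB_PER_TASK = 1.4    # peak per-task figure (assumed to be GiB)

# Peak memory needed by all CPDN tasks together, in MiB.
peak_mib = TASKS * PEAK_GIB_PER_TASK * 1024

# Fraction of total RAM that the CPDN tasks would use at peak.
share = peak_mib / TOTAL_MIB

print(f"CPDN peak: {peak_mib:.1f} MiB")
print(f"Share of total RAM: {share:.1%}")
print(f"Fits in available memory: {peak_mib < AVAIL_MIB}")
```

Even at peak, the four tasks together need well under a tenth of the installed RAM, so memory exhaustion looks unlikely here.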

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,378,503
RAC: 3,632
Message 63074 - Posted: 2 Dec 2020, 23:02:26 UTC

How much level 3 cache does that CPU have? These tasks take a fraction over 4 MB each at peak, I think. That is another good reason for staggering tasks a bit so they don't peak at the same time, though my main reason is my upload bandwidth!

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 63079 - Posted: 3 Dec 2020, 3:35:37 UTC - in response to Message 63073.  

Your task list shows 12 running, so I got that wrong.
It looks like you're right about an OS setting.
Touchy version of Linux. :(

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 63080 - Posted: 3 Dec 2020, 4:34:09 UTC - in response to Message 63074.  

It has more L3 cache than my first computer, an IBM 704, had main memory. That is why I am limiting my work to only 4 CPDN processes at a time, 4 WCG processes at a time, two Rosetta processes at a time, and three Universe@home processes at a time. This is a little too much, and I propose to lower some of the others a bit after the initial starting transient dies down. This may take several weeks.

Operating System Linux Red Hat Enterprise Linux
Red Hat Enterprise Linux 8.2 (Ootpa) [4.18.0-193.28.1.el8_2.x86_64|libc 2.28 (GNU libc)]
BOINC version 7.16.6
Memory 62.45 GB
Cache 16896 KB <---<<<
Swap space 15.62 GB
Total disk space 117.21 GB
Free Disk Space 68.38 GB
Measured floating point speed 5.84 billion ops/sec
Measured integer speed 22.79 billion ops/sec
Average upload rate 107.97 KB/sec
Average download rate 10660.89 KB/sec
Average turnaround time 11.87 days
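
As a rough check on the cache question, here is the arithmetic as a small Python sketch. It assumes the "Cache" figure BOINC reports is the L3 size, and it rounds the "fraction over 4 MB" per-task estimate down to exactly 4 MB, so the real headroom is slightly smaller.

```python
# Cache size reported by BOINC for this host (assumed to be the L3 cache).
L3_CACHE_KB = 16896

# Per-task peak cache use: "a fraction over 4 MB", rounded down to 4 MB.
PEAK_PER_TASK_KB = 4 * 1024

# How many tasks' peak working sets fit in the cache at once.
ratio = L3_CACHE_KB / PEAK_PER_TASK_KB
tasks_fitting = L3_CACHE_KB // PEAK_PER_TASK_KB

print(f"Cache / per-task peak = {ratio:.3f}")
print(f"Whole tasks that fit at peak: {tasks_fitting}")
```

So four CPDN tasks at peak would roughly fill the cache on their own, before counting the WCG, Rosetta and Universe@home processes that share it.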

ian Dell white
Joined: 6 Dec 15
Posts: 3
Credit: 4,487,311
RAC: 0
Message 65092 - Posted: 6 Feb 2022, 17:20:38 UTC

Problem : Many failures of workloads with similar problems
Computer : Specification
OS - 20.04 Ubuntu
Memory 32GB
Cores 8 - Hyper threading switched off
BOINC Manager - 7.16.6

I think I have copied below what the system is trying to tell me:


Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
Signal 3 received, exiting...
05:56:26 (1444): called boinc_finish(193)

</stderr_txt>
]]>
I have been very careful when switching the system off at night, allowing VirtualBox to exit before switching the computer off.
Does anyone out there have any answers?
Thanks for your help

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,378,503
RAC: 3,632
Message 65094 - Posted: 6 Feb 2022, 17:51:11 UTC - in response to Message 65092.  

You allow VB to exit. I presume you have also stopped the boinc tasks from running and exited boinc first.

Secondly, closing the computer down and restarting increases the failure rate even without using virtualisation. I don't know if running in VB increases the risk of shutting down causing failure?

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 65100 - Posted: 6 Feb 2022, 20:39:45 UTC - in response to Message 63064.  
Last modified: 6 Feb 2022, 21:40:29 UTC

"REPLANCA error" is when some physical result of the series of calculations has returned something way outside of normal.
e.g. the atmosphere has heated to a million degrees, or the oceans have all disappeared.

There are value constraints for numerous parameters which will stop the program if exceeded.

So just normal for some climate models.

(I haven't had one of these for a few years now.)

wateroakley
Joined: 6 Aug 04
Posts: 185
Credit: 27,083,655
RAC: 6,161
Message 65137 - Posted: 10 Feb 2022, 17:54:15 UTC - in response to Message 65094.  
Last modified: 10 Feb 2022, 17:55:45 UTC

Unless the Virtual Machine/CPDN is shut down cleanly, the loss rate is quite high. We found an automatic Windoze update would typically crash 75% of the running CPDN tasks. Precautions to take with a Windoze 10 host running an Oracle VirtualBox VM (Ubuntu) with BOINC/CPDN:

1. Pause Windows updates (settings, advanced options). The maximum delay in Win10 is 35 days.
1a. If you don't do this, Windoze updates will silently reboot your PC when it's least expected and WILL crash the VM with your long-running CPDN tasks.
1b. At least once a month, carefully close down CPDN/BOINC/VM and run the Windoze updates.

2. Close down CPDN cleanly.
2a. Suspend activity in BOINC manager.
2b. Check that your tasks have been 'suspended'.
2c. 'Power off' the Virtual Machine.
2d. Close down VirtualBox.

3. Then resume the Windoze update to check for updates and let it run.
3a. When Windoze update has done its thing, pause the updates again (step 1).
3b. We usually do a post-update power-off and reboot.

4. Restart VirtualBox and the VM.
4a. It's debatable if the VM and CPDN will start cleanly. You may need to power-off the VM and restart. We found that, sometimes, it takes a post-update PC power-off and re-boot (hence 3b).
4b. Cross your fingers. In the VM, open BOINC Manager, Activity, Restart and check the tasks are 'running'. Keep your fingers crossed, as that's the point when you'll find out if you have a full set of running tasks or three crashes.
4c. If you look in the Event Log, it may not report the activity restart, just the suspend.

Good luck. Hope this helps.

ian Dell white
Joined: 6 Dec 15
Posts: 3
Credit: 4,487,311
RAC: 0
Message 65152 - Posted: 13 Feb 2022, 17:45:24 UTC - in response to Message 65094.  

Thanks for the clues
Ian Dell white

ian Dell white
Joined: 6 Dec 15
Posts: 3
Credit: 4,487,311
RAC: 0
Message 65153 - Posted: 13 Feb 2022, 17:45:57 UTC - in response to Message 65137.  

Thanks for the clues
Regards
Ian Dell White

©2024 climateprediction.net