climateprediction.net home page
My first four tasks failed on new machine.

Jean-David Beyer
Joined: 5 Aug 04
Posts: 449
Credit: 6,779,629
RAC: 12,934
Message 63064 - Posted: 2 Dec 2020, 11:45:04 UTC

I just completed my first four tasks on my new machine running Red Hat Enterprise Linux 8. Each completed 5 trickles and did 5 uploads. They seem to have the right libraries:

[root@localhost climateprediction.net]# ldd hadam4_um_8.52_i686-pc-linux-gnu
linux-gate.so.1 (0xf7ef4000)
libdl.so.2 => /lib/libdl.so.2 (0xf7ed4000)
libm.so.6 => /lib/libm.so.6 (0xf7e02000)
libpthread.so.0 => /lib/libpthread.so.0 (0xf7de1000)
libc.so.6 => /lib/libc.so.6 (0xf7c3a000)
/lib/ld-linux.so.2 (0xf7ef6000)
[root@localhost climateprediction.net]# ldd hadam4_se_8.52_i686-pc-linux-gnu.so
linux-gate.so.1 (0xf7f4a000)
libnsl.so.1 => /lib/libnsl.so.1 (0xf7df5000)
libstdc++.so.6 => /lib/libstdc++.so.6 (0xf7c62000)
libm.so.6 => /lib/libm.so.6 (0xf7b90000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xf7b73000)
libc.so.6 => /lib/libc.so.6 (0xf79cc000)
/lib/ld-linux.so.2 (0xf7f4c000)

tasks are 219230, 21967787, 21973109, and 21973090.

They report a REPLANCS IO error for the first one and zipfile not found for the last three.

Is it you, or is it me? I assume it is me, but how should I fix this? I have 12 more work units and four of them are just beginning.
ID: 63064
Jean-David Beyer
Joined: 5 Aug 04
Posts: 449
Credit: 6,779,629
RAC: 12,934
Message 63065 - Posted: 2 Dec 2020, 11:55:50 UTC - in response to Message 63064.  

tasks are 219230, 21967787, 21973109, and 21973090.


OOPS: first task is 21923056.
ID: 63065
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 3084
Credit: 7,580,281
RAC: 7,555
Message 63066 - Posted: 2 Dec 2020, 12:09:02 UTC

I just looked at the task pages. This is a new one on me, so I'm afraid I can't offer any help. It doesn't look like a missing-library problem to me, though.
Please do not private message myself or other moderators for help. This limits the number of people who are able to help and deprives others who may benefit from the answer.
ID: 63066
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2014
Credit: 53,532,025
RAC: 17,093
Message 63068 - Posted: 2 Dec 2020, 16:15:20 UTC - in response to Message 63064.  

It looks like some glitch occurred when the three failed models tried to access the .so file for creating the first zip file. Only the first monthly file failed to upload; the other monthly zips for those three models did upload. I can't tell what caused it for only the first month of the three tasks, but we can hope it was a one-time deal. These are likely the pertinent lines from stderr on the failed models' task pages:

Unable to load library hadam4_se_8.52_i686-pc-linux-gnu.so
dlopen error: /var/lib/boinc/projects/climateprediction.net/hadam4_se_8.52_i686-pc-linux-gnu.so: cannot restore segment prot after reloc: Permission denied
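
For anyone who wants to check their own task pages for the same signature, here is a minimal Python sketch. It assumes the stderr text from a task page has been copied into a string; the function name `find_dlopen_denials` is made up for illustration and is not part of BOINC or the CPDN applications. The sample text is just the two lines quoted above.

```python
import re

# Pattern for the dlopen failure quoted in this thread. The wording is taken
# from the stderr lines above; other kinds of failure will look different.
DLOPEN_DENIED = re.compile(
    r"dlopen error: (?P<path>\S+): cannot restore segment prot after reloc: "
    r"Permission denied"
)


def find_dlopen_denials(stderr_text):
    """Return the library paths named in 'cannot restore segment prot' errors."""
    return [m.group("path") for m in DLOPEN_DENIED.finditer(stderr_text)]


# The two stderr lines quoted in the post above.
sample = (
    "Unable to load library hadam4_se_8.52_i686-pc-linux-gnu.so\n"
    "dlopen error: /var/lib/boinc/projects/climateprediction.net/"
    "hadam4_se_8.52_i686-pc-linux-gnu.so: "
    "cannot restore segment prot after reloc: Permission denied\n"
)

for path in find_dlopen_denials(sample):
    print("Library load was denied for:", path)
```

An empty result only means this particular message was not present; it says nothing about other causes of failure.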
ID: 63068
Jean-David Beyer
Joined: 5 Aug 04
Posts: 449
Credit: 6,779,629
RAC: 12,934
Message 63070 - Posted: 2 Dec 2020, 19:06:30 UTC - in response to Message 63068.  

It looks like some glitch occurred when the three failed models tried to access the .so file for creating the first zip file. Only the first monthly file failed to upload; the other monthly zips for those three models did upload. I can't tell what caused it for only the first month of the three tasks, but we can hope it was a one-time deal. These are likely the pertinent lines from stderr on the failed models' task pages:

Unable to load library hadam4_se_8.52_i686-pc-linux-gnu.so
dlopen error: /var/lib/boinc/projects/climateprediction.net/hadam4_se_8.52_i686-pc-linux-gnu.so: cannot restore segment prot after reloc: Permission denied


1.) Red Hat Enterprise Linux 8 (RHEL8), which I am running, uses an SELinux-enabled kernel that is very fussy about who can do what. If a program tries to violate one of the restrictions, the operation is refused. On the other hand, so did RHEL6, which I used to use.

2.) RHEL8 uses a very different approach (systemd) to starting daemon processes such as the BOINC client. In the past, I made a separate disk partition and mounted it as just another user (/home/boinc). But with RHEL8, it goes in /var/lib/boinc.

3.) It seems that the SELinux permissions are a little different, and the client may have violated one of the rules. It is a shame that it took about two weeks for the problems to show up. In any case, the system logs indicated a possible problem, so I changed a rule. I will find out in time whether that fixed it.
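
The post does not say which log entry or rule was involved, so the following is only a sketch of one way to look for SELinux denials, not the poster's actual procedure. It assumes the standard `getenforce` and `ausearch` tools are installed and that the script runs with enough privilege to read the audit log; the helper names and the `boinc` keyword are illustrative choices.

```python
import shutil
import subprocess


def selinux_mode():
    """Return getenforce's answer (e.g. 'Enforcing'), or None if unavailable."""
    if shutil.which("getenforce") is None:
        return None  # SELinux tools are not installed on this machine
    result = subprocess.run(["getenforce"], capture_output=True, text=True)
    return result.stdout.strip() or None


def denials_mentioning(audit_text, keyword):
    """Keep only the 'denied' lines of audit output that mention keyword."""
    return [
        line
        for line in audit_text.splitlines()
        if "denied" in line and keyword in line
    ]


def recent_denials(keyword="boinc"):
    """Ask ausearch for recent AVC records (this normally needs root)."""
    if shutil.which("ausearch") is None:
        return []
    result = subprocess.run(
        ["ausearch", "-m", "AVC", "-ts", "recent"],
        capture_output=True,
        text=True,
    )
    return denials_mentioning(result.stdout, keyword)


if __name__ == "__main__":
    print("SELinux mode:", selinux_mode())
    for line in recent_denials():
        print(line)
```

If nothing is printed, that may simply mean no denial was logged recently or that the audit log could not be read, so treat an empty result with caution.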
ID: 63070
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 3084
Credit: 7,580,281
RAC: 7,555
Message 63071 - Posted: 2 Dec 2020, 19:13:50 UTC - in response to Message 63070.  

Do post the answer if this does turn out to be the issue and I will put something in the BOINC fora as well.
ID: 63071
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7436
Credit: 23,446,854
RAC: 0
Message 63072 - Posted: 2 Dec 2020, 19:17:21 UTC

It may have run into memory problems, if the models all tried to do the same thing at once.
You may need double what you have.

Look at the Properties of one that's running, and see how much memory it's using.
ID: 63072
Jean-David Beyer
Joined: 5 Aug 04
Posts: 449
Credit: 6,779,629
RAC: 12,934
Message 63073 - Posted: 2 Dec 2020, 22:27:53 UTC - in response to Message 63072.  

It may have run into memory problems, if the models all tried to do the same thing at once.
You may need double what you have.

Look at the Properties of one that's running, and see how much memory it's using.


I currently have 64 gigabytes of RAM, and the four instances running are taking 1.3 GB each. Once they get going, this goes up to 1.4 GB each.

top - 17:25:01 up 1 day, 14 min, 1 user, load average: 13.71, 13.52, 13.46
Tasks: 482 total, 14 running, 467 sleeping, 0 stopped, 1 zombie
%Cpu(s): 3.0 us, 0.5 sy, 80.6 ni, 15.4 id, 0.1 wa, 0.3 hi, 0.0 si, 0.0 st
MiB Mem : 63944.0 total, 12570.7 free, 11707.1 used, 39666.2 buff/cache
MiB Swap: 15992.0 total, 15992.0 free, 0.0 used. 51386.3 avail Mem

PID PPID USER PR NI S RES SHR %MEM %CPU P TIME+ COMMAND
25133 25130 boinc 39 19 R 1.3g 19972 2.1 6.0 4 753:19.10 /var/lib/boinc/projects/climateprediction.net/hadam4_um_8.52_i68+
22591 22588 boinc 39 19 R 1.3g 19872 2.1 6.2 0 829:50.82 /var/lib/boinc/projects/climateprediction.net/hadam4_um_8.52_i68+
25164 25162 boinc 39 19 R 1.3g 19984 2.1 6.2 15 751:17.98 /var/lib/boinc/projects/climateprediction.net/hadam4_um_8.52_i68+
5527 5508 boinc 39 19 R 1.3g 19932 2.1 6.2 14 1438:55 /var/lib/boinc/projects/climateprediction.net/hadam4_um_8.52_i68+
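
As a rough check of the memory question, the figures above can be put into a few lines of Python. This is a back-of-the-envelope sketch: it treats the 1.4 figure as GiB (top's `g` suffix) and counts only the four CPDN instances, not the other projects running on the machine. The function name `peak_demand_mib` is made up for illustration.

```python
MIB_PER_GIB = 1024


def peak_demand_mib(tasks, peak_gib_per_task):
    """Combined peak resident memory of the tasks, in MiB."""
    return tasks * peak_gib_per_task * MIB_PER_GIB


total_mib = 63944.0  # "MiB Mem : 63944.0 total" in the top output
avail_mib = 51386.3  # "51386.3 avail Mem" in the top output

# Four CPDN instances at the quoted peak of 1.4 each.
needed_mib = peak_demand_mib(tasks=4, peak_gib_per_task=1.4)

print(f"CPDN peak demand: {needed_mib:.1f} MiB")
print(f"Share of total RAM: {needed_mib / total_mib:.1%}")
print(f"Fits in available memory: {needed_mib < avail_mib}")
```

On these numbers the four models together need well under a tenth of the installed RAM, which is consistent with swap showing 0.0 used.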
ID: 63073
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 3084
Credit: 7,580,281
RAC: 7,555
Message 63074 - Posted: 2 Dec 2020, 23:02:26 UTC

I currently have 64 gigabytes of RAM, and the four instances running are taking 1.3 GB each. Once they get going, this goes up to 1.4 GB each.


How much level 3 cache does that CPU have? These tasks take a fraction over 4 MB each at peak, I think. That is another good reason for staggering tasks a bit so they don't peak at the same time, though my main reason is my upload bandwidth!
ID: 63074
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7436
Credit: 23,446,854
RAC: 0
Message 63079 - Posted: 3 Dec 2020, 3:35:37 UTC - in response to Message 63073.  

Your task list shows 12 running, so I got that wrong.
It looks like you're right about an OS setting.
Touchy version of Linux. :(
ID: 63079
Jean-David Beyer
Joined: 5 Aug 04
Posts: 449
Credit: 6,779,629
RAC: 12,934
Message 63080 - Posted: 3 Dec 2020, 4:34:09 UTC - in response to Message 63074.  

How much level 3 cache does that CPU have? These tasks take a fraction over 4 MB each at peak, I think. That is another good reason for staggering tasks a bit so they don't peak at the same time, though my main reason is my upload bandwidth!


It has more L3 cache than my first computer, an IBM 704, had main memory. That is why I am limiting my work to only four CPDN processes, four WCG processes, two Rosetta processes, and three Universe@home processes at a time. This is a little too much, and I propose to lower some of the others a bit after the initial starting transient dies down. This may take several weeks.

Operating System: Linux Red Hat Enterprise Linux
Red Hat Enterprise Linux 8.2 (Ootpa) [4.18.0-193.28.1.el8_2.x86_64|libc 2.28 (GNU libc)]
BOINC version: 7.16.6
Memory: 62.45 GB
Cache: 16896 KB <---<<<
Swap space: 15.62 GB
Total disk space: 117.21 GB
Free Disk Space: 68.38 GB
Measured floating point speed: 5.84 billion ops/sec
Measured integer speed: 22.79 billion ops/sec
Average upload rate: 107.97 KB/sec
Average download rate: 10660.89 KB/sec
Average turnaround time: 11.87 days
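
The arithmetic behind the four-task limit can be sketched in Python. It rests on two assumptions: that the Cache figure reported by BOINC is the L3 cache shared by all the tasks, and that the "fraction over 4 MB" estimate from the earlier post can be rounded to 4 MB per task.

```python
cache_kb = 16896        # Cache figure (16896 KB) from the host details above
per_task_kb = 4 * 1024  # "a fraction over 4 MB" per task, rounded to 4 MB

cache_mb = cache_kb / 1024
tasks_that_fit = cache_kb // per_task_kb

print(f"Cache size: {cache_mb} MB")
print(f"Tasks whose peak footprint fits at once: {tasks_that_fit}")
```

Under those assumptions the cache holds about four tasks' worth at peak, and that leaves nothing for the other projects sharing the CPU, so it is only a rough guide.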
ID: 63080

©2021 climateprediction.net