Work units failing

Knorr

Joined: 18 Feb 06
Posts: 21
Credit: 128,450
RAC: 0
Message 55762 - Posted: 21 Feb 2017, 16:37:12 UTC

My host recently received some work units, and all of them failed within seconds with a segmentation violation. Looking at the other hosts that were sent the same work units, most tasks fail there as well, although those hosts seem to fail immediately with various other errors.

Is there a problem with the work units recently released?

https://www.cpdn.org/cpdnboinc/results.php?hostid=1417684


<core_client_version>7.6.33</core_client_version>
<![CDATA[
<stderr_txt>
SIGSEGV: segmentation violation
Stack trace (13 frames):
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu(boinc_catch_signal+0x6f)[0x83b4d8f]
[0x2aa03cb0]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x8163769]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x81696b4]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x81606cd]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x816add4]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x815f531]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x8084b03]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x809404a]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x8314d85]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x8316ddf]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x8341332]
/lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf6)[0x2a7bb276]

Exiting...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=5818, selfPID=5738, iMonCtr=1
Model crash detected, will try to restart...
Leaving CPDN_Main::Monitor...
16:20:34 (5738): called boinc_finish(0)

</stderr_txt>

ID: 55762 · Report as offensive     Reply Quote
WB8ILI

Joined: 1 Sep 04
Posts: 161
Credit: 81,421,805
RAC: 1,225
Message 55763 - Posted: 21 Feb 2017, 19:26:08 UTC

Knorr -

I don't guarantee this, but I think you are missing the 32-bit version of libc.so.6.

Check to see if you have the package libc6-i386 installed. If not, do a

sudo apt-get install libc6-i386

or use some other method (Synaptic) to install it.
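
A minimal sketch of that package check, assuming a Debian or Ubuntu system with dpkg-query available. The helper name is_installed is made up for illustration, and libc6:i386 is included only as an assumption, since on multiarch systems the files under /lib/i386-linux-gnu usually come from that package rather than from libc6-i386.

```shell
#!/bin/sh
# Report whether a Debian/Ubuntu package is fully installed.
# dpkg-query prints "install ok installed" as the status of an installed package.
is_installed() {
    status=$(dpkg-query -W -f='${Status}' "$1" 2>/dev/null) || return 1
    [ "$status" = "install ok installed" ]
}

# Only run the check where dpkg-query exists (Debian-family systems).
if command -v dpkg-query >/dev/null 2>&1; then
    for pkg in libc6-i386 libc6:i386; do
        if is_installed "$pkg"; then
            echo "$pkg: installed"
        else
            echo "$pkg: not installed"
        fi
    done
fi
```

If neither package shows as installed, the apt-get command above is the next thing to try.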

Let us know if this solves your issue.
ID: 55763 · Report as offensive     Reply Quote
Knorr

Joined: 18 Feb 06
Posts: 21
Credit: 128,450
RAC: 0
Message 55767 - Posted: 21 Feb 2017, 20:49:30 UTC - in response to Message 55763.  

The 32-bit libraries are installed. I checked all the climateprediction.net binaries with ldd, and none of them are missing libraries.
ID: 55767 · Report as offensive     Reply Quote
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 55768 - Posted: 21 Feb 2017, 21:13:38 UTC
Last modified: 21 Feb 2017, 21:20:21 UTC

Yes, that's a different fault.
This one has a counterpart in Windows, but with different wording.

I'm not sure of the cause; perhaps a corrupt file, or something locking/using a resource at the very moment that it's needed.

edit
I've just had a look at your Tasks list; ALL of them have failed, so something is wrong at your end.

1st step
Reset the project to get rid of all of the climate binaries/files, then start again with downloading new copies.

2nd step
If the above doesn't work, then look into permissions for the 2 locations of BOINC.
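
The two steps could be sketched roughly as follows. This assumes the Debian/Ubuntu package layout (/var/lib/boinc-client for data, /usr/bin/boinc for the client), GNU stat, and that http://climateprediction.net/ is the URL your client is attached with; check the URL in your own client before using it. show_perms is a hypothetical helper.

```shell
#!/bin/sh
# Assumptions: Debian/Ubuntu package layout and this project URL; adjust both.
PROJECT_URL="http://climateprediction.net/"
DATA_DIR="/var/lib/boinc-client"

# Print owner, group, octal mode and name for a path (GNU stat syntax).
show_perms() {
    stat -c '%U:%G %a %n' "$1"
}

# Step 1: reset the project through the running client.
# Left commented out because it discards all tasks in progress for the project.
# Running from the data directory lets boinccmd find gui_rpc_auth.cfg there.
# (cd "$DATA_DIR" && boinccmd --project "$PROJECT_URL" reset)

# Step 2: inspect ownership of the data directory and the client binary.
for path in "$DATA_DIR" "$DATA_DIR/projects" /usr/bin/boinc; do
    if [ -e "$path" ]; then
        show_perms "$path"
    fi
done
```

If other projects run fine from the same directories, the output here is unlikely to show anything wrong, but it is a quick thing to rule out.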
ID: 55768 · Report as offensive     Reply Quote
WB8ILI

Joined: 1 Sep 04
Posts: 161
Credit: 81,421,805
RAC: 1,225
Message 55770 - Posted: 21 Feb 2017, 22:34:05 UTC

Knorr -

Do what Les wrote. He knows way more about this stuff than I do.

Doing an ldd is not the same as what I wrote in my post for you to do. An ldd shows that, yes, libc.so.6 is installed. But it doesn't show that the correct VERSION (32 bit) is installed.

I have been down this path myself. I can't find the solution in my notes so I am going from memory. That is why I didn't guarantee my suggestion.
ID: 55770 · Report as offensive     Reply Quote
Knorr

Joined: 18 Feb 06
Posts: 21
Credit: 128,450
RAC: 0
Message 55772 - Posted: 21 Feb 2017, 22:59:05 UTC - in response to Message 55768.  


1st step
Reset the project to get rid of all of the climate binaries/files, then start again with downloading new copies.


I tried resetting the project and verified that all files in /var/lib/boinc-client/projects/climateprediction.net were deleted. I then requested a new batch of work, with the same result.


2nd step
If the above doesn't work, then look into permissions for the 2 locations of BOINC.


I'm not sure which second location you are referring to. The permissions of the files in /var/lib/boinc-client/projects/climateprediction.net look fine. I have einstein and rosetta running as well, and their files have the same owner.
ID: 55772 · Report as offensive     Reply Quote
Knorr

Joined: 18 Feb 06
Posts: 21
Credit: 128,450
RAC: 0
Message 55773 - Posted: 21 Feb 2017, 23:15:31 UTC - in response to Message 55770.  


Doing an ldd is not the same as what I wrote in my post for you to do. An ldd shows that, yes, libc.so.6 is installed. But it doesn't show that the correct VERSION (32 bit) is installed.


ldd shows me which file is actually being used by the linker, not only if it is installed. All libraries reported by ldd are in /lib/i386-linux-gnu:

 ldd wah2am3m2_um_8.12_i686-pc-linux-gnu
 linux-gate.so.1 =>  (0xf7743000)
 libdl.so.2 => /lib/i386-linux-gnu/libdl.so.2 (0xf7711000)
 libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xf76bb000)
 libpthread.so.0 => /lib/i386-linux-gnu/libpthread.so.0 (0xf769c000)
 libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xf74e2000)
 /lib/ld-linux.so.2 (0x565c1000)


Running "file" on the actual libc shows that it is indeed 32 bit:
/lib/i386-linux-gnu/libc-2.24.so: ELF 32-bit LSB shared object, Intel 80386, version 1 (GNU/Linux), dynamically linked, interpreter /lib/ld-linux.so.2, BuildID[sha1]=005209e623ca3b594b1c902c191b148ff2036623, for GNU/Linux 2.6.32, stripped
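
The same check can be run over every library that ldd resolves, not just libc. This is a sketch only: resolved_paths is a made-up helper that parses ldd-style output, and the usage line assumes you are in the project directory.

```shell
#!/bin/sh
# Print the resolved library paths from ldd-style output read on stdin.
# Matching lines look like:
#   libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xf74e2000)
# Entries without a path (linux-gate.so.1) and the loader line are skipped.
resolved_paths() {
    awk '$2 == "=>" && $3 ~ /^\// { print $3 }'
}

# Usage sketch (binary name taken from the listing above):
#   ldd wah2am3m2_um_8.12_i686-pc-linux-gnu | resolved_paths | xargs file -L
```

file -L follows the symlinks, so each line of the result should say "ELF 32-bit" if the right libraries are being picked up.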

ID: 55773 · Report as offensive     Reply Quote
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 55774 - Posted: 22 Feb 2017, 0:13:44 UTC

BOINC is in 2 parts; the second part is /usr/bin according to Installing BOINC on Ubuntu
(Under: What the installer does)

Since cpdn has become mostly Windows, I've stopped running a Linux version of BOINC, and use Windows running under Wine. So I don't know where the second lot of files are these days, but if you have other projects running OK, then they're most likely to be OK.

Which leaves the complete failure of cpdn tasks a mystery still to be solved.
Perhaps search the net for SIGSEGV: segmentation violation and look for clues.

************

Missing 32 bit lib will fail after about 6 seconds, giving this:

../../projects/climateprediction.net/hadam3prm3pm2t_eu_7.01_i686-pc-linux-gnu: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory
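
A missing library of this kind can be spotted before a task is even started, because ldd marks it as "not found". A small sketch, where missing_libs is a hypothetical helper and the binary name in the usage line is the one from the error above:

```shell
#!/bin/sh
# Print the names of libraries that ldd could not resolve.
# ldd reports them as:  libstdc++.so.6 => not found
missing_libs() {
    awk '$2 == "=>" && $3 == "not" && $4 == "found" { print $1 }'
}

# Usage sketch, run from the project directory:
#   ldd hadam3prm3pm2t_eu_7.01_i686-pc-linux-gnu | missing_libs
```

No output means every dependency was resolved, which is the situation Knorr describes.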
ID: 55774 · Report as offensive     Reply Quote
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 55775 - Posted: 22 Feb 2017, 1:24:08 UTC

I've just started a new batch of 4, one of which is here.
This is its last chance; the first attempt failed because of:

../../projects/climateprediction.net/wah2_8.12_i686-pc-linux-gnu: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory


The 2nd attempt failed because of:

../../projects/climateprediction.net/wah2_8.12_i686-pc-linux-gnu: /opt/McAfee/runtime/2.0/lib/libstdc++.so.6: version `GLIBCXX_3.4.9' not found (required by ../../projects/climateprediction.net/wah2_8.12_i686-pc-linux-gnu)
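
The second failure suggests that a different libstdc++.so.6 was found first on that host and that it is too old. One way to see which version tags a given libstdc++ file carries is sketched below. The helper names are invented for illustration, and the idea that a library search path setting is responsible on that host is only a guess.

```shell
#!/bin/sh
# List the GLIBCXX version tags embedded in a libstdc++ file.
# grep -a treats the binary as text; -o prints only the matching part.
glibcxx_versions() {
    grep -a -o 'GLIBCXX_[0-9][0-9.]*' "$1" | sort -u
}

# Succeed if the library carries exactly the given version tag.
has_glibcxx() {
    glibcxx_versions "$1" | grep -q -x -F "$2"
}

# Usage sketch (paths are the ones quoted in this thread):
#   has_glibcxx /usr/lib32/libstdc++.so.6 GLIBCXX_3.4.9 && echo "provided"
#   has_glibcxx /opt/McAfee/runtime/2.0/lib/libstdc++.so.6 GLIBCXX_3.4.9 || echo "too old"
#   echo "$LD_LIBRARY_PATH"   # a McAfee entry here would be one possible cause
```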
ID: 55775 · Report as offensive     Reply Quote
WB8ILI

Joined: 1 Sep 04
Posts: 161
Credit: 81,421,805
RAC: 1,225
Message 55776 - Posted: 22 Feb 2017, 1:33:30 UTC
Last modified: 22 Feb 2017, 1:37:05 UTC

My ldd looks like this. However, I have a different "gnu" so this may be meaningless.



bob@Tiger4:/var/lib/boinc-client/projects/climateprediction.net$ ldd wah2_8.12_i686-pc-linux-gnu
linux-gate.so.1 => (0xf77b0000)
libpthread.so.0 => /lib/i386-linux-gnu/libpthread.so.0 (0xf7769000)
libdl.so.2 => /lib/i386-linux-gnu/libdl.so.2 (0xf7761000)
libstdc++.so.6 => /usr/lib32/libstdc++.so.6 (0xf75e9000)
libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xf7591000)
libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xf7571000)
libc.so.6 => /lib/i386-linux-gnu/

Edit: I know I have had the issue you are having at some point in the past. I just can't remember how I fixed it. I will look into it some more tomorrow.
ID: 55776 · Report as offensive     Reply Quote
Knorr

Joined: 18 Feb 06
Posts: 21
Credit: 128,450
RAC: 0
Message 55779 - Posted: 22 Feb 2017, 7:59:13 UTC - in response to Message 55774.  

BOINC is in 2 parts; the second part is /usr/bin according to Installing BOINC on Ubuntu
(Under: What the installer does)

Since cpdn has become mostly Windows, I've stopped running a Linux version of BOINC, and use Windows running under Wine. So I don't know where the second lot of files are these days, but if you have other projects running OK, then they're most likely to be OK.


Ah, of course. The BOINC binaries are fine. As you mention, I'm able to run other projects. I also got a CPDN unit about a month ago, and it produced several trickles.

https://www.cpdn.org/cpdnboinc/result.php?resultid=20144359


Which leaves the complete failure of cpdn tasks a mystery still to be solved.
Perhaps search the net for SIGSEGV: segmentation violation and look for clues.


A SEGV is a very generic message. Basically, the application is trying to access an invalid memory address, often because some code reads or writes through a NULL pointer.
A NULL pointer could appear because something is not available on my system, but the stacktrace does not really give any clues as to what this could be.
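
To illustrate how little the signal itself says, here is a small sketch that makes a throwaway shell kill itself with SIGSEGV. It has nothing to do with the CPDN code; it only shows that the parent sees a bare signal number and no reason. Signal number 11 is the Linux value.

```shell
#!/bin/sh
# Start a child shell that sends SIGSEGV (signal 11 on Linux) to itself.
# The "|| status=$?" form records the failure without aborting this script.
status=0
sh -c 'kill -SEGV $$' 2>/dev/null || status=$?

# A shell reports death by signal N as exit status 128 + N.
echo "exit status: $status"
echo "signal number: $((status - 128))"
```

In the CPDN case the signal is caught by boinc_catch_signal, which is why a stack trace is printed instead of the process simply dying.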

Is there anywhere I can see which arguments are passed to the CPDN binary for a certain work unit? I might be able to pick up some more forensics if I can run the application outside BOINC.
ID: 55779 · Report as offensive     Reply Quote
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 55781 - Posted: 22 Feb 2017, 8:43:47 UTC - in response to Message 55779.  

Is there anywhere I can see which arguments are passed to the CPDN binary for a certain work unit?

It's not as simple as that.

I think that there may be several binaries, and there are also lots of data files (lists) that are accessed by them. Part of the testing is just making sure that the contents of lists match up with what is expected in several places.

In Windows, there's ProcMon, (Process Monitor), which can be run to see what happens when.
But it generates LOTS of data.

Another possibility is to look at your lists of tasks, and then at the Workunit column, to select ones that have failed before, and see what happened to them.
ID: 55781 · Report as offensive     Reply Quote
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 55782 - Posted: 22 Feb 2017, 9:20:01 UTC

According to the list in your first post, the binary in question is the um.
And the last line of the stack trace is
/lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf6)[0x2a7bb276]

which is that 32 bit lib that causes problems when missing.

I'm wondering if the problem is something to do with stack memory, although I don't see how.
There's an article here, but it seems to say that stack size allocation is automatic.
ID: 55782 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 55783 - Posted: 22 Feb 2017, 10:14:38 UTC - in response to Message 55782.  

The only box I have running BOINC natively under nix at present is an old 32 bit machine. Two other boxes are set to no new tasks so I can try and get some Linux work when WINE tasks finish. I have normally run BOINC on nix by extracting the tarball rather than using the packaged version so it may be worth trying that as an alternative.

Trouble is by the time my tasks running under WINE are finished there may not be any Linux work available :(
ID: 55783 · Report as offensive     Reply Quote
Knorr

Joined: 18 Feb 06
Posts: 21
Credit: 128,450
RAC: 0
Message 55785 - Posted: 22 Feb 2017, 11:59:35 UTC - in response to Message 55782.  

According to the list in your first post, the binary in question is the um.
And the last line of the stack trace is
/lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf6)[0x2a7bb276]

which is that 32 bit lib that causes problems when missing.


The stack trace shows you how we reached the place where we get the SEGV. Basically, there are 13 nested function calls. We can see that we start in __libc_start_main+0xf6, which does not really sound surprising. I think it might tell us that the SEGV happened in the main thread/process of the application.

Unfortunately, the debug symbols of the cpdn part seem to be missing, so we get no clues about what the process is doing, other than that it ends up in the function boinc_catch_signal.

If I manage to get new units I will try to stop them before they fail. I might be able to find the command line arguments given, and maybe find out what kind of violation I get.
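
On Linux the arguments of a running process can be read from /proc, so catching a task in its first few seconds might look something like this sketch. show_cmdline is a hypothetical helper, and the pattern "wah2" is simply taken from the binary names in this thread.

```shell
#!/bin/sh
# Print the command line of a process, one argument per line.
# /proc/PID/cmdline stores the arguments separated by NUL bytes.
show_cmdline() {
    tr '\0' '\n' < "/proc/$1/cmdline"
}

# Usage sketch: inspect running processes whose command line mentions the model.
#   for pid in $(pgrep -f wah2); do
#       echo "== PID $pid, working directory: $(readlink "/proc/$pid/cwd")"
#       show_cmdline "$pid"
#   done
```

The working directory matters as much as the arguments, because the model expects to find its data files relative to the slot directory it is started in.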
ID: 55785 · Report as offensive     Reply Quote
WB8ILI

Joined: 1 Sep 04
Posts: 161
Credit: 81,421,805
RAC: 1,225
Message 55786 - Posted: 22 Feb 2017, 14:34:41 UTC

Knorr -

I found the reference to my having the same error as you are having. So, my brain isn't completely dead.

There is a thread with the title:

Lots of tasks end with "Error While Computing". Is there a problem at my end?

which has a last post from me on 06 Aug 2014. If you scroll down to message 49556 (13 July 2014) you will see a post from me which I think is exactly the error you are getting.

The "last" (newest) post (49556) states that I re-installed libc6 and the "problem" went away.
ID: 55786 · Report as offensive     Reply Quote
Knorr

Joined: 18 Feb 06
Posts: 21
Credit: 128,450
RAC: 0
Message 55787 - Posted: 22 Feb 2017, 16:36:13 UTC - in response to Message 55786.  

Knorr -

I found the reference to my having the same error as you are having. So, my brain isn't completely dead.

There is a thread with the title:

Lots of tasks end with "Error While Computing". Is there a problem at my end?

which has a last post from me on 06 Aug 2014. If you scroll down to message 49556 (13 July 2014) you will see a post from me which I think is exactly the error you are getting.

The "last" (newest) post (49556) states that I re-installed libc6 and the "problem" went away.


It does look like the same problem (or at least one of the same nature).

I have done an apt-get install --reinstall libc6-i386

Although, I'm not too optimistic. md5sum of the libc.so.6 file shows the same before and after the reinstall.
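
The before and after comparison can be scripted along these lines. same_md5 is a made-up helper and the copy location under /tmp is arbitrary.

```shell
#!/bin/sh
# Succeed if two files have identical MD5 checksums.
# Reading from stdin keeps the file name out of md5sum's output.
same_md5() {
    a=$(md5sum < "$1") || return 2
    b=$(md5sum < "$2") || return 2
    [ "$a" = "$b" ]
}

# Usage sketch around a reinstall (library path from this thread):
#   cp -L /lib/i386-linux-gnu/libc.so.6 /tmp/libc.before
#   sudo apt-get install --reinstall libc6-i386
#   same_md5 /tmp/libc.before /lib/i386-linux-gnu/libc.so.6 && echo "unchanged"
```

An unchanged checksum is what you would expect if the file was never corrupt. It may also be worth running dpkg -S /lib/i386-linux-gnu/libc.so.6 to see which package actually owns that file, since on a multiarch system it could be libc6:i386 rather than libc6-i386, and dpkg --verify on that package compares the installed files against the checksums dpkg has recorded.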

Perhaps a batch of "bad" units were flushed while you were debugging, and the symptom went away.

Anyway, there are no more jobs to get right now, so we will see whenever they pick up again. Thanks for your help so far.
ID: 55787 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 55820 - Posted: 28 Feb 2017, 11:57:20 UTC

Just to say some more work for nix has been poured into the hopper.
ID: 55820 · Report as offensive     Reply Quote
Knorr

Joined: 18 Feb 06
Posts: 21
Credit: 128,450
RAC: 0
Message 55821 - Posted: 28 Feb 2017, 11:57:38 UTC - in response to Message 55787.  

I got a new batch of work units, which fail in the same way. Although I'm able to determine which arguments are given to the binary, there are a lot of interdependencies going on. I have not had any luck running the application stand-alone.

I have disabled the application in my settings for now.
ID: 55821 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 55822 - Posted: 28 Feb 2017, 13:08:16 UTC - in response to Message 55821.  

I have not had any luck running the application stand-alone.


The term "sucking eggs" probably applies here, but if you are attempting to run the standalone setup, it is worth making sure boinc-client isn't running from the packaged version when you try to start the standalone one, as in my experience that seems to stop things working.
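
A quick way to check for a packaged client that is still running, as a sketch only. It assumes the client process is called boinc and the service boinc-client, as in the Debian/Ubuntu package; is_running is a made-up helper.

```shell
#!/bin/sh
# Succeed if any process with exactly this name is running.
is_running() {
    pgrep -x "$1" >/dev/null 2>&1
}

# "boinc" is the client process name used by the Debian/Ubuntu package
# (an assumption for other distributions).
if is_running boinc; then
    echo "A BOINC client is already running; stop it before starting the standalone one."
    echo "On systemd systems: sudo systemctl stop boinc-client"
else
    echo "No BOINC client process found."
fi
```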

What happens when you do try the standalone version?
ID: 55822 · Report as offensive     Reply Quote