Posts by Knorr

1) Questions and Answers : Unix/Linux : Work units failing (Message 55855)
Posted 4 Mar 2017 by Knorr
Post:

Max stack size 18446744071931166720 unlimited bytes


Identical to what I see on my system. I'm not sure what happens when you set such a large stack size limit.
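For anyone comparing their own output, a row of /proc/<pid>/limits can be split into its fields with a few lines of Python. This is only a sketch: the helper name parse_limit_line is made up for illustration, and it assumes the row ends with a soft limit, a hard limit and a units column, as the "Max stack size" row does.

```python
def parse_limit_line(line):
    """Split one /proc/<pid>/limits row into (name, soft, hard, units).

    Assumes the row has a units column, as "Max stack size" does.
    """
    parts = line.split()
    # The last three fields are soft limit, hard limit and units;
    # everything before them is the multi-word limit name.
    soft, hard, units = parts[-3:]
    name = " ".join(parts[:-3])
    return name, soft, hard, units


row = "Max stack size 18446744071931166720 unlimited bytes"
name, soft, hard, units = parse_limit_line(row)
print(name)                # Max stack size
print(int(soft) > 2**32)   # True: the soft limit does not fit in 32 bits
```

The soft limit is the third column from the right, so here it is the huge number and the hard limit is "unlimited".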
2) Questions and Answers : Unix/Linux : Work units failing (Message 55851)
Posted 4 Mar 2017 by Knorr
Post:

I don't know if you can find stdout for the application anywhere, but if you can, do you also get these messages?

Setting pthread size (-1778384896 bytes)

EDIT:
Alternatively, you can use top/ps to find the PID of the wah2am3m2_um_8.12_i686-pc-linux-gnu application, and then do

cat /proc/<pid>/limits


Dave, were you able to get the output from above? It will list "Max stack size". For me it has a soft limit that looks like an integer overflow. It corresponds to -1778384896 interpreted as an unsigned 64-bit integer.
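The relationship between the two numbers can be checked with simple arithmetic. A minimal sketch in Python follows; the helper name as_unsigned is made up for illustration. The 32-bit reading at the end suggests that 2400 MiB was requested and overflowed a signed 32-bit integer, but that is only my guess and nothing in the output confirms it.

```python
def as_unsigned(value, bits):
    """Reinterpret a signed integer as an unsigned integer of the given width."""
    return value & ((1 << bits) - 1)


pthread_size = -1778384896  # value printed by "Setting pthread size"

# Sign-extended to 64 bits, this is the soft limit shown in /proc/<pid>/limits.
print(as_unsigned(pthread_size, 64))           # 18446744071931166720

# Read as a 32-bit unsigned value it is 2516582400 bytes, exactly 2400 MiB.
print(as_unsigned(pthread_size, 32))           # 2516582400
print(as_unsigned(pthread_size, 32) // 2**20)  # 2400
```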
3) Questions and Answers : Unix/Linux : Work units failing (Message 55850)
Posted 4 Mar 2017 by Knorr
Post:
Knorr

The latest that I've been able to get is Linux 3.16.0-38-generic, and BOINC 7.2.42 from Ubuntu. 7.2.42 is also the latest stable release from the BOINC site.

Where did you get your OS and BOINC?


As Dave mentioned, this is standard on *ubuntu 16.10. BOINC is in the universe repository: http://packages.ubuntu.com/yakkety/boinc

Technically, I'm running Ubuntu Gnome (https://ubuntugnome.org/), but that should not affect any of this. It is just Ubuntu using Gnome3 instead of Unity for the desktop environment. Packages come from the same repositories as the "official" Canonical Ubuntu.

BOINC was shipped at version 7.2.42 in Ubuntu 14.04.
4) Questions and Answers : Unix/Linux : Work units failing (Message 55838)
Posted 1 Mar 2017 by Knorr
Post:
have stopped WINE and am going to see if it still works for me in 16.10 on my desktop. I know it works on my old 32bit netbook.

EDIT:three minutes in and still running. So it isn't a problem with Ubuntu16.10

(packaged version)


I don't know if you can find stdout for the application anywhere, but if you can, do you also get these messages?

Setting pthread size (-1778384896 bytes)

EDIT:
Alternatively, you can use top/ps to find the PID of the wah2am3m2_um_8.12_i686-pc-linux-gnu application, and then do

cat /proc/<pid>/limits
5) Questions and Answers : Unix/Linux : Work units failing (Message 55834)
Posted 1 Mar 2017 by Knorr
Post:
I notice that even running the standalone version your boinc-client seems to be from the packaged version

Executing program /var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu


I would run, "Top" and then use kill PIDno before starting BOINC-Manage. I have run into this problem before when packaged version is installed.

Where did you install the standalone version? Starting it from within the directory where it is installed should ensure that the one to go with your boinc-manager runs instead.


By standalone I meant the CPDN binary itself, outside any BOINC context. That seemed like the easiest way to be able to run e.g. GDB/Valgrind on the application.
6) Questions and Answers : Unix/Linux : Work units failing (Message 55829)
Posted 28 Feb 2017 by Knorr
Post:
If I run the application in GDB with
set follow-fork-mode child
I get a SEGV at


Thread 2.1 "wah2am3m2_um_8." received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x2a821700 (LWP 13991)]
0x08163769 in gw_satn ()

Disassembling wah2am3m2_um_8.12_i686-pc-linux-gnu, I get the following at that location (which is indeed inside the gw_satn symbol):


8163754: 0f af 8d fc fe ff ff imul -0x104(%ebp),%ecx
816375b: 8b b5 60 fe ff ff mov -0x1a0(%ebp),%esi
8163761: 8b bd 94 fe ff ff mov -0x16c(%ebp),%edi
8163767: 03 f1 add %ecx,%esi
8163769: f3 0f 10 1c 96 movss (%esi,%edx,4),%xmm3
816376e: 0f 28 e3 movaps %xmm3,%xmm4
8163771: f3 0f 59 e6 mulss %xmm6,%xmm4
8163775: 8d 34 0f lea (%edi,%ecx,1),%esi


The register contents are listed below:

(gdb) info registers
eax            0xfdb37e30	-38568400
ecx            0xac000000	-1409286144
edx            0x7d	125
ebx            0xfdb37d9c	-38568548
esp            0xfdad3d70	0xfdad3d70
ebp            0xfdb37d88	0xfdb37d88
esi            0xa9f45c80	-1443603328
edi            0xfdf3cb30	-34354384
eip            0x8163769	0x8163769 <gw_satn+825>
eflags         0x10283	[ CF SF IF RF ]
cs             0x23	35
ss             0x2b	43
ds             0x2b	43
es             0x2b	43
fs             0x0	0
gs             0x63	99
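The address that the faulting instruction tried to read can be worked out from the register dump. The sketch below just redoes the 32-bit address arithmetic in Python, using the values from the dump above. The remark that the recovered base lies near ebp and edi is my own reading of the numbers, not something GDB reported.

```python
MASK32 = 0xFFFFFFFF  # registers on i686 are 32 bits wide

# Register values copied from the "info registers" output.
ecx = 0xAC000000
edx = 0x7D
esi = 0xA9F45C80

# movss (%esi,%edx,4),%xmm3 loads a float from esi + edx*4.
fault_addr = (esi + edx * 4) & MASK32
print(hex(fault_addr))  # 0xa9f45e74

# The preceding "add %ecx,%esi" means esi = base + ecx (mod 2**32),
# so the value loaded from -0x1a0(%ebp) can be recovered.
base = (esi - ecx) & MASK32
print(hex(base))  # 0xfdf45c80

# ecx is the result of the imul; taken as an unsigned offset it is huge.
print(ecx / 2**30)  # 2.6875 (GiB)
```

The recovered base, 0xfdf45c80, is close to the values in ebp and edi, so the bad address seems to come from the very large offset in ecx rather than from a bad base pointer.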
7) Questions and Answers : Unix/Linux : Work units failing (Message 55828)
Posted 28 Feb 2017 by Knorr
Post:
Is libc-2.24 from libc.so.6 or from libc.so.8, which is a later version and won't work?


ls -la /lib/i386-linux-gnu/libc.so.6
lrwxrwxrwx 1 root root 12 Nov 16 21:53 /lib/i386-linux-gnu/libc.so.6 -> libc-2.24.so


I just noticed this bit in a strace. It is from the process that runs the wah2am3m2_um_8.12_i686-pc-linux-gnu application (which is the binary where the SEGV happens).

Notice that we have a system call that sets the stack limit to a very large value, which is larger than 32 bits. I don't know if this is a problem, but it does look a bit suspicious.

set_robust_list(0xf73841f0, 12)         = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
prlimit64(0, RLIMIT_STACK, {rlim_cur=18014398507745280*1024, rlim_max=RLIM64_INFINITY}, NULL) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=18014398507745280*1024, rlim_max=RLIM64_INFINITY}) = 0
write(1, "Getting pthread attributes - retval=0\nSetting pthread size (-1778384896 bytes) - retval=0\nExecuting program /var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu\n", 197) = 197
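To make the size of that limit concrete, the product strace printed can be evaluated directly. This is only arithmetic on the numbers in the trace above, written as a small Python sketch.

```python
# rlim_cur exactly as strace printed it in the second prlimit64 call.
rlim_cur = 18014398507745280 * 1024

print(rlim_cur)          # 18446744071931166720
print(rlim_cur > 2**32)  # True: far beyond a 32-bit address space

# The distance to 2**64 is the magnitude of the value in the
# "Setting pthread size" message written right after the call.
print(2**64 - rlim_cur)  # 1778384896

# The soft limit reported by the first call is 8 MiB.
print(8192 * 1024 // 2**20)  # 8
```

So the process starts with an 8 MiB soft stack limit and then replaces it with a value that is just under 2**64 bytes.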
8) Questions and Answers : Unix/Linux : Work units failing (Message 55826)
Posted 28 Feb 2017 by Knorr
Post:
Knorr -

When I look at /lib/i386-linux-gnu/libc.so.6 in the File Manager I see:

1.8 MB and (Type) Link to Unknown

When I look at the Properties of libc.so.6 I see:

Type Link to shared library (application/x-sharedlib)
Link Target libc-2.23.so
Size 1.8 MB (1,786,484 bytes)


There is a file libc-2.23.so in this directory. It shows:

1.8 MB and (Type) Unknown

The Properties are:

Type: shared library (application/x-sharedlib)
Size: 1.8 MB (1,786,484 bytes)


Is there anything here that is different from yours that may be significant?


My system is running libc-2.24.so instead. Otherwise, the locations and links are the same.
9) Questions and Answers : Unix/Linux : Work units failing (Message 55824)
Posted 28 Feb 2017 by Knorr
Post:
Seems I'm able to at least start the model standalone now.
I let a workunit start in boinc, suspending it before it failed. I then closed the boinc client, and started the application from the slots directory

../../projects/climateprediction.net/wah2_8.12_i686-pc-linux-gnu wah2_sas50_mnp0_201312_13_535_010978111 restart_atmos_s049_1998-1201_rd0001 restart_sas50_s049_1998-1201_rd0001 ic00000000_10_N96 GHGclim_ancil_14months_OSTIA_sst_v2_IPSL-CM5A-LR_2013-12-01_2015-01-30 ALLclim_ancil_14months_OSTIA_ice_2013-12-01_2015-01-30 so2dms_prei_N96_1855_0000P oxi.addfa ozone_preind_N96_1879_0000Pv5
Starting work on result wah2_sas50_mnp0_201312_13_535_010978111_0_r321574378...
Starting model in /var/lib/boinc-client/projects/climateprediction.net...
Unzipping global model data file...
Unzipping workunit control files...
Trying to kill old process # 28144
Trying to kill old process # 28168
Trying to kill old process # 28169
Created shared memory region key = 228135 of size 73297376 bytes (version 608)
Run for 1 Years and 1 Months
Uploading restart files after month 12
pShMem->PRECIS_LATITUDE 146
pShMem->PRECIS_LONGITUDE 209
pShMem->EWSPACEA 0.440000
pShMem->NSSPACEA 0.440000
pShMem->FRSTLATA 38.720001
pShMem->FRSTLONA 324.359985
pShMem->POLELATA 79.949997
pShMem->POLELONA 236.679993
Copying files for startup...
In pre_initialise_phase (part 1 of 3)
In initialise_phase (part 2 of 3)
In startup_phase (part 3 of 3)
Copying model control files...
Starting model ID wah2_sas50_mnp0_201312_13_535_010978111   Phase 1
Program launched with process id # 28345
Program launched with process id # 28346
Climate model starting - use graphics to monitor progress.
Or visit the website to see the graphs for this run.
Getting pthread attributes - retval=0
Setting pthread size (-1778384896 bytes) - retval=0
Executing program /var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu
Getting pthread attributes - retval=0
Setting pthread size (-1778384896 bytes) - retval=0
Executing program /var/lib/boinc-client/projects/climateprediction.net/wah2rm3m2t_um_8.12_i686-pc-linux-gnu
wah2_sas50_mnp0_201312_13_535_010978111 - PH 1 TS 0000001 A - 01/12/2013 00:15 - H:M:S=0000:00:00 AVG= 0.25 DLT= 0.25
Detaching shared memory...
Done


I can rerun the application, and it gives slightly varying results each time. Sometimes it will say that the model crashed; other times it will print two progress lines before saying done.
10) Questions and Answers : Unix/Linux : Work units failing (Message 55823)
Posted 28 Feb 2017 by Knorr
Post:
What happens when you do try the standalone version?


It stops running instantly

./wah2_8.12_i686-pc-linux-gnu wah2_sas50_mg69_198712_13_535_010968364 restart_atmos_s035_1995-1201_rd0001 restart_sas50_s035_1995-1201_rd0001 ic00000000_10_N96 GHGclim_ancil_14months_OSTIA_sst_v2_MIROC-ESM_1987-12-01_1989-01-30 ALLclim_ancil_14months_OSTIA_ice_1987-12-01_1989-01-30 so2dms_prei_N96_1855_0000P oxi.addfa ozone_preind_N96_1879_0000Pv5
15:17:43 (9957): Can't open init data file - running in standalone mode
Starting work on result cpdnou...
Starting model in ...
Cleaning up from the run...


Application name and command line arguments retrieved from client_state.xml

Running strace reveals that the init file it is talking about is an init_data.xml BOINC file.
11) Questions and Answers : Unix/Linux : Work units failing (Message 55821)
Posted 28 Feb 2017 by Knorr
Post:
Got a new batch of work units, which fail in the same way. Although I'm able to determine which arguments are given to the binary, there are a lot of interdependencies going on. I have not had any luck running the application standalone.

I have disabled the application in my settings for now.
12) Questions and Answers : Unix/Linux : Work units failing (Message 55787)
Posted 22 Feb 2017 by Knorr
Post:
Knorr -

I found the reference to my having the same error as you are having. So, my brain isn't completely dead.

There is a thread with the title:

Lots of tasks end with "Error While Computing". Is there a problem at my end?

which has a last post from me on 06 Aug 2014. If you scroll down to message 49556 (13 July 2014) you will see a post from me which I think is exactly the error you are getting.

The "last" (newest) post (49556) states that I re-installed libc6 and the "problem" went away.


It does look like the same problem (or at least one of the same nature).

I have done an apt-get install --reinstall libc6-i386

Although, I'm not too optimistic. The md5sum of the libc.so.6 file is the same before and after the reinstall.

Perhaps a batch of "bad" units were flushed while you were debugging, and the symptom went away.

Anyway, there are no more jobs to get right now, so we will see whenever they pick up again. Thanks for your help so far.
13) Questions and Answers : Unix/Linux : Work units failing (Message 55785)
Posted 22 Feb 2017 by Knorr
Post:
According to the list in your first post, the binary in question is the um.
And the last line of the stack trace is
/lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf6)[0x2a7bb276]

which is that 32 bit lib that causes problems when missing.


The stack trace shows you how we reached the place where we get the SEGV. Basically, there are 13 nested function calls. We can see that we start in __libc_start_main+0xf6, which does not really sound surprising. I think it might tell us that the SEGV happened in the main thread/process of the application.

Unfortunately, the debug symbols of the CPDN part seem to be missing, so we get no clues about what the process is doing, other than that it ends up in the function boinc_catch_signal.

If I manage to get new units I will try to stop them before they fail. I might be able to find the command line arguments given, and maybe find out what kind of violation I get.
14) Questions and Answers : Unix/Linux : Work units failing (Message 55779)
Posted 22 Feb 2017 by Knorr
Post:
BOINC is in 2 parts; the second part is /usr/bin according to Installing BOINC on Ubuntu
(Under: What the installer does)

Since cpdn has become mostly Windows, I've stopped running a Linux version of BOINC, and use Windows running under Wine. So I don't know where the second lot of files are these days, but if you have other projects running OK, then they're most likely to be OK.


Ah, of course. The BOINC binaries are fine. As you mention, I'm able to run other projects. I got a CPDN unit about a month ago. It produced several trickles.

https://www.cpdn.org/cpdnboinc/result.php?resultid=20144359


Which leaves the complete failure of cpdn tasks a mystery still to be solved.
Perhaps search the net for SIGSEGV: segmentation violation and look for clues.


A SEGV is a very generic message. Basically, the application is trying to access an invalid memory address. Often it is caused by code trying to read or write through a NULL pointer.
A NULL pointer could appear because something is not available on my system, but the stack trace does not really give any clues as to what this could be.

Is there anywhere I can see which arguments are passed to the CPDN binary for a certain work unit? I might be able to pick up some more forensics if I can run the application outside BOINC.
15) Questions and Answers : Unix/Linux : Work units failing (Message 55773)
Posted 21 Feb 2017 by Knorr
Post:

Doing an ldd is not the same as what I wrote in my post for you to do. An ldd shows that, yes, libc.so.6 is installed. But, it doesn't show the correct VERSION (32 bit) is installed.


ldd shows me which file is actually being used by the linker, not only whether it is installed. All libraries reported by ldd are in /lib/i386-linux-gnu:

 ldd wah2am3m2_um_8.12_i686-pc-linux-gnu
 linux-gate.so.1 =>  (0xf7743000)
 libdl.so.2 => /lib/i386-linux-gnu/libdl.so.2 (0xf7711000)
 libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xf76bb000)
 libpthread.so.0 => /lib/i386-linux-gnu/libpthread.so.0 (0xf769c000)
 libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xf74e2000)
 /lib/ld-linux.so.2 (0x565c1000)


Running "file" on the actual libc shows that it is indeed 32-bit:
/lib/i386-linux-gnu/libc-2.24.so: ELF 32-bit LSB shared object, Intel 80386, version 1 (GNU/Linux), dynamically linked, interpreter /lib/ld-linux.so.2, BuildID[sha1]=005209e623ca3b594b1c902c191b148ff2036623, for GNU/Linux 2.6.32, stripped
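The same check can be done without the file utility by reading the ELF header directly. The fifth byte of an ELF file (EI_CLASS) is 1 for 32-bit objects and 2 for 64-bit objects. A minimal sketch in Python; the function name elf_class is made up for illustration.

```python
ELF_MAGIC = b"\x7fELF"
ELF_CLASSES = {1: "32-bit", 2: "64-bit"}  # EI_CLASS values from the ELF format


def elf_class(path):
    """Return '32-bit' or '64-bit' based on the EI_CLASS byte of an ELF file."""
    with open(path, "rb") as f:
        ident = f.read(5)
    if len(ident) < 5 or ident[:4] != ELF_MAGIC:
        raise ValueError("%s is not an ELF file" % path)
    return ELF_CLASSES.get(ident[4], "unknown")
```

Calling elf_class("/lib/i386-linux-gnu/libc-2.24.so") on a system that has that file should agree with what "file" reports for it.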
16) Questions and Answers : Unix/Linux : Work units failing (Message 55772)
Posted 21 Feb 2017 by Knorr
Post:

1st step
Reset the project to get rid of all of the climate binaries/files, then start again with downloading new copies.


Tried resetting the project. Verified that all files in /var/lib/boinc-client/projects/climateprediction.net were deleted. Requested a new batch of work, same problem.


2nd step
If the above doesn't work, then look into permissions for the 2 locations of BOINC.


I'm not sure which second location you are referring to. The permissions of the /var/lib/boinc-client/projects/climateprediction.net files look fine. I have einstein and rosetta running as well, and they have the same owner.
17) Questions and Answers : Unix/Linux : Work units failing (Message 55767)
Posted 21 Feb 2017 by Knorr
Post:
32-bit libraries are installed. I checked all the climate@home binaries with ldd, and none of them are missing libraries.
18) Questions and Answers : Unix/Linux : Work units failing (Message 55762)
Posted 21 Feb 2017 by Knorr
Post:
My host recently got some work units, but all of them failed within seconds with a segmentation violation. Looking at the other hosts for the same work units shows that most tasks are failing there as well, although the other hosts seem to fail immediately with various other errors.

Is there a problem with the work units recently released?

https://www.cpdn.org/cpdnboinc/results.php?hostid=1417684


<core_client_version>7.6.33</core_client_version>
<![CDATA[
<stderr_txt>
SIGSEGV: segmentation violation
Stack trace (13 frames):
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu(boinc_catch_signal+0x6f)[0x83b4d8f]
[0x2aa03cb0]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x8163769]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x81696b4]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x81606cd]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x816add4]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x815f531]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x8084b03]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x809404a]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x8314d85]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x8316ddf]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x8341332]
/lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf6)[0x2a7bb276]

Exiting...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=5818, selfPID=5738, iMonCtr=1
Model crash detected, will try to restart...
Leaving CPDN_Main::Monitor...
16:20:34 (5738): called boinc_finish(0)

</stderr_txt>
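Since the binary carries no debug symbols, the addresses are the only useful part of this trace. The sketch below pulls them out in Python so they can be compared against a disassembly or a debugger session. The function name frame_addresses is made up for illustration, and the sample is the first three frames copied from the trace above.

```python
import re

# Matches the "[0x...]" that ends each frame line in the stack trace.
FRAME_ADDR = re.compile(r"\[(0x[0-9a-fA-F]+)\]\s*$")


def frame_addresses(trace):
    """Return the address at the end of every stack-trace line, in order."""
    addrs = []
    for line in trace.splitlines():
        match = FRAME_ADDR.search(line)
        if match:
            addrs.append(int(match.group(1), 16))
    return addrs


sample = """\
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu(boinc_catch_signal+0x6f)[0x83b4d8f]
[0x2aa03cb0]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x8163769]
"""

print([hex(a) for a in frame_addresses(sample)])
# ['0x83b4d8f', '0x2aa03cb0', '0x8163769']
```

The third address, 0x8163769, is the first frame below the signal handler, and it is the same address at which GDB reports the SIGSEGV in gw_satn.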
19) Message boards : Number crunching : Could not find computer (Message 43390)
Posted 7 Nov 2011 by Knorr
Post:
The computer and the task are now showing.
20) Message boards : Number crunching : Could not find computer (Message 43386)
Posted 7 Nov 2011 by Knorr
Post:
Okay, great.

Are there problems with sending trickles etc. that necessitate suspending the task until the issue is corrected?

