1)
Questions and Answers :
Unix/Linux :
Work units failing
(Message 55855)
Posted 4 Mar 2017 by Knorr Post:
Identical to what I see on my system. Not sure what happens when you set such a large stack size limit.
2)
Questions and Answers :
Unix/Linux :
Work units failing
(Message 55851)
Posted 4 Mar 2017 by Knorr Post:
Dave, were you able to get the output from above? It will list "Max stack size". For me it has a soft limit that looks like an integer overflow. It corresponds to the -1778384896 bytes interpreted as an unsigned integer.
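The arithmetic behind that reading can be checked directly. This is only a minimal sketch; the variable names are made up for illustration, and the single input is the -1778384896 value from the log message. It reinterprets the same bit pattern as an unsigned 32-bit value and, after sign extension, as an unsigned 64-bit value.

```python
# The value printed in "Setting pthread size (-1778384896 bytes)".
logged = -1778384896

# Same bit pattern read as an unsigned 32-bit integer.
as_u32 = logged & 0xFFFFFFFF

# Sign-extended to 64 bits and read as an unsigned integer.
as_u64 = logged & 0xFFFFFFFFFFFFFFFF

print(as_u32)           # 2516582400
print(as_u32 // 2**20)  # 2400 (MiB)
print(as_u64)           # 18446744071931166720
```

The 32-bit reading is exactly 2400 MiB, which suggests (as a guess, not something the log confirms) that a size of that magnitude was stored in a signed 32-bit variable somewhere along the way.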
3)
Questions and Answers :
Unix/Linux :
Work units failing
(Message 55850)
Posted 4 Mar 2017 by Knorr Post: Knorr
As Dave mentioned, this is standard on *ubuntu 16.10. Boinc is in the universe repository: http://packages.ubuntu.com/yakkety/boinc
Technically, I'm running Ubuntu Gnome (https://ubuntugnome.org/), but that should not affect any of this. It is just Ubuntu using Gnome3 instead of Unity for the desktop environment. Packages come from the same repositories as the "official" Canonical Ubuntu.
Boinc was shipped at version 7.2.42 in Ubuntu 14.04.
4)
Questions and Answers :
Unix/Linux :
Work units failing
(Message 55838)
Posted 1 Mar 2017 by Knorr Post: have stopped WINE and am going to see if it still works for me in 16.10 on my desktop. I know it works on my old 32bit netbook.
I don't know if you can find stdout for the application anywhere, but if you can, do you also get these messages?
Setting pthread size (-1778384896 bytes)
EDIT: Alternatively, you can use top/ps to find the PID of the wah2am3m2_um_8.12_i686-pc-linux-gnu application, and then do
cat /proc/<pid>/limits
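For anyone who prefers to pull the relevant row out programmatically, here is a small sketch that extracts the "Max stack size" entry from the text of a /proc/<pid>/limits file. The function names and the SAMPLE text are illustrative assumptions (the sample is a typical layout, not output taken from this thread).

```python
def max_stack_size(limits_text):
    """Return (soft, hard) from the "Max stack size" row, or None if absent."""
    label = "Max stack size"
    for line in limits_text.splitlines():
        if line.startswith(label):
            # Remaining columns are: soft limit, hard limit, units.
            fields = line[len(label):].split()
            return fields[0], fields[1]
    return None


def read_limits(pid):
    """Read the limits table for a process (Linux only)."""
    with open(f"/proc/{pid}/limits") as handle:
        return handle.read()


# Illustrative sample in the usual column layout.
SAMPLE = """\
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max stack size            8388608              unlimited            bytes
"""

print(max_stack_size(SAMPLE))  # ('8388608', 'unlimited')
```

On a live system, `max_stack_size(read_limits(pid))` with the PID found via top/ps would give the same pair for the running model process.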
5)
Questions and Answers :
Unix/Linux :
Work units failing
(Message 55834)
Posted 1 Mar 2017 by Knorr Post: I notice that even running the standalone version your boinc-client seems to be from the packaged version
By standalone I meant the CPDN binary itself, outside any BOINC context. It seemed like the easiest way to be able to run e.g. GDB/Valgrind on the application.
6)
Questions and Answers :
Unix/Linux :
Work units failing
(Message 55829)
Posted 28 Feb 2017 by Knorr Post: If I run the application in GDB with
set follow-fork-mode child
I get a SEGV at
Thread 2.1 "wah2am3m2_um_8." received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x2a821700 (LWP 13991)]
0x08163769 in gw_satn ()
Disassembling wah2am3m2_um_8.12_i686-pc-linux-gnu, I get the following at that location (which indeed is in the gw_satn symbol):
8163754:  0f af 8d fc fe ff ff  imul -0x104(%ebp),%ecx
816375b:  8b b5 60 fe ff ff  mov -0x1a0(%ebp),%esi
8163761:  8b bd 94 fe ff ff  mov -0x16c(%ebp),%edi
8163767:  03 f1  add %ecx,%esi
8163769:  f3 0f 10 1c 96  movss (%esi,%edx,4),%xmm3
816376e:  0f 28 e3  movaps %xmm3,%xmm4
8163771:  f3 0f 59 e6  mulss %xmm6,%xmm4
8163775:  8d 34 0f  lea (%edi,%ecx,1),%esi
The register content is listed below:
(gdb) info registers
eax  0xfdb37e30  -38568400
ecx  0xac000000  -1409286144
edx  0x7d  125
ebx  0xfdb37d9c  -38568548
esp  0xfdad3d70  0xfdad3d70
ebp  0xfdb37d88  0xfdb37d88
esi  0xa9f45c80  -1443603328
edi  0xfdf3cb30  -34354384
eip  0x8163769  0x8163769 <gw_satn+825>
eflags  0x10283  [ CF SF IF RF ]
cs  0x23  35
ss  0x2b  43
ds  0x2b  43
es  0x2b  43
fs  0x0  0
gs  0x63  99
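The faulting instruction, movss (%esi,%edx,4),%xmm3, loads four bytes from esi + edx*4. The following sketch works that address out from the register dump, and also undoes the preceding add %ecx,%esi to recover the base pointer that had been loaded from -0x1a0(%ebp). The variable names are illustrative; the register values are the ones reported by GDB.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide on i686

# Register values copied from the "info registers" output.
esi = 0xA9F45C80
edx = 0x7D
ecx = 0xAC000000

# Effective address of movss (%esi,%edx,4),%xmm3.
fault_address = (esi + edx * 4) & MASK32

# "add %ecx,%esi" ran just before the fault, so subtracting ecx again
# gives the value esi held right after "mov -0x1a0(%ebp),%esi".
base_before_add = (esi - ecx) & MASK32

print(hex(fault_address))    # 0xa9f45e74
print(hex(base_before_add))  # 0xfdf45c80
```

The recovered base, 0xfdf45c80, sits in the same address range as esp, ebp and edi, while the offset in ecx (0xac000000) moves the access far away from it. Whether that offset is a legitimate index or the result of an earlier miscalculation cannot be told from the registers alone.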
7)
Questions and Answers :
Unix/Linux :
Work units failing
(Message 55828)
Posted 28 Feb 2017 by Knorr Post: Is libc-2.24 from libc.so.6 or libc.so.8, which is a later version and won't work.
ls -la /lib/i386-linux-gnu/libc.so.6
lrwxrwxrwx 1 root root 12 Nov 16 21:53 /lib/i386-linux-gnu/libc.so.6 -> libc-2.24.so
I just noticed this bit in an strace. It is from the process that runs the wah2am3m2_um_8.12_i686-pc-linux-gnu application (which is the binary where the SEGV happens). Notice that we have a system call that sets the stack limit to a very large value, one that does not fit in 32 bits. I don't know if this is a problem, but it does look a bit suspicious.
set_robust_list(0xf73841f0, 12) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
prlimit64(0, RLIMIT_STACK, {rlim_cur=18014398507745280*1024, rlim_max=RLIM64_INFINITY}, NULL) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=18014398507745280*1024, rlim_max=RLIM64_INFINITY}) = 0
write(1, "Getting pthread attributes - retval=0\nSetting pthread size (-1778384896 bytes) - retval=0\nExecuting program /var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu\n", 197) = 197
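The two rlim_cur values in the trace can be compared with a few lines of arithmetic. This is a hedged sketch: the helper name fits_in_32_bits is made up for illustration, and the final getrlimit call merely queries the limit of whatever process runs the script (via Python's standard resource module, Unix only), not the CPDN process.

```python
import resource

U32_MAX = 0xFFFFFFFF


def fits_in_32_bits(limit):
    """True if a byte count is representable as an unsigned 32-bit value."""
    return 0 <= limit <= U32_MAX


# rlim_cur before and after the second prlimit64 call, as shown by strace.
before = 8192 * 1024
after = 18014398507745280 * 1024

print(fits_in_32_bits(before))  # True
print(fits_in_32_bits(after))   # False

# Read as a signed 64-bit number, the new limit is the logged pthread size.
print(after - 2**64)            # -1778384896

# Soft and hard stack limits of the current process, for comparison.
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print(soft, hard)
```

So the value passed to prlimit64 is the same negative number that appears in the "Setting pthread size" message, just widened to 64 bits.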
8)
Questions and Answers :
Unix/Linux :
Work units failing
(Message 55826)
Posted 28 Feb 2017 by Knorr Post: Knorr - My system is running libc-2.24.so instead. Otherwise, the locations and links are the same.
9)
Questions and Answers :
Unix/Linux :
Work units failing
(Message 55824)
Posted 28 Feb 2017 by Knorr Post: Seems I'm able to at least start the model standalone now. I let a workunit start in boinc, suspending it before it failed. I then closed the boinc client, and started the application from the slots directory:
../../projects/climateprediction.net/wah2_8.12_i686-pc-linux-gnu wah2_sas50_mnp0_201312_13_535_010978111 restart_atmos_s049_1998-1201_rd0001 restart_sas50_s049_1998-1201_rd0001 ic00000000_10_N96 GHGclim_ancil_14months_OSTIA_sst_v2_IPSL-CM5A-LR_2013-12-01_2015-01-30 ALLclim_ancil_14months_OSTIA_ice_2013-12-01_2015-01-30 so2dms_prei_N96_1855_0000P oxi.addfa ozone_preind_N96_1879_0000Pv5
Starting work on result wah2_sas50_mnp0_201312_13_535_010978111_0_r321574378...
Starting model in /var/lib/boinc-client/projects/climateprediction.net...
Unzipping global model data file...
Unzipping workunit control files...
Trying to kill old process # 28144
Trying to kill old process # 28168
Trying to kill old process # 28169
Created shared memory region key = 228135 of size 73297376 bytes (version 608)
Run for 1 Years and 1 Months
Uploading restart files after month 12
pShMem->PRECIS_LATITUDE 146
pShMem->PRECIS_LONGITUDE 209
pShMem->EWSPACEA 0.440000
pShMem->NSSPACEA 0.440000
pShMem->FRSTLATA 38.720001
pShMem->FRSTLONA 324.359985
pShMem->POLELATA 79.949997
pShMem->POLELONA 236.679993
Copying files for startup...
In pre_initialise_phase (part 1 of 3)
In initialise_phase (part 2 of 3)
In startup_phase (part 3 of 3)
Copying model control files...
Starting model ID wah2_sas50_mnp0_201312_13_535_010978111 Phase 1
Program launched with process id # 28345
Program launched with process id # 28346
Climate model starting - use graphics to monitor progress.
Or visit the website to see the graphs for this run.
Getting pthread attributes - retval=0
Setting pthread size (-1778384896 bytes) - retval=0
Executing program /var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu
Getting pthread attributes - retval=0
Setting pthread size (-1778384896 bytes) - retval=0
Executing program /var/lib/boinc-client/projects/climateprediction.net/wah2rm3m2t_um_8.12_i686-pc-linux-gnu
wah2_sas50_mnp0_201312_13_535_010978111 - PH 1 TS 0000001 A - 01/12/2013 00:15 - H:M:S=0000:00:00 AVG= 0.25 DLT= 0.25
Detaching shared memory...
Done
I can rerun the application, and it gives slightly varying results each time. Sometimes it will say that the model crashed, other times it will print two progress lines before saying done.
10)
Questions and Answers :
Unix/Linux :
Work units failing
(Message 55823)
Posted 28 Feb 2017 by Knorr Post: What happens when you do try the standalone version?
It stops running instantly:
./wah2_8.12_i686-pc-linux-gnu wah2_sas50_mg69_198712_13_535_010968364 restart_atmos_s035_1995-1201_rd0001 restart_sas50_s035_1995-1201_rd0001 ic00000000_10_N96 GHGclim_ancil_14months_OSTIA_sst_v2_MIROC-ESM_1987-12-01_1989-01-30 ALLclim_ancil_14months_OSTIA_ice_1987-12-01_1989-01-30 so2dms_prei_N96_1855_0000P oxi.addfa ozone_preind_N96_1879_0000Pv5
15:17:43 (9957): Can't open init data file - running in standalone mode
Starting work on result cpdnou...
Starting model in ...
Cleaning up from the run...
Application name and command line arguments retrieved from client_state.xml
Running strace reveals that the init file it is talking about is the BOINC init_data.xml file.
11)
Questions and Answers :
Unix/Linux :
Work units failing
(Message 55821)
Posted 28 Feb 2017 by Knorr Post: Got a new batch of work units, which fail in the same way. Although I'm able to determine which arguments are given to the binary, there are a lot of interdependencies going on. I have not had any luck running the application stand-alone. I have disabled the application in my settings for now.
12)
Questions and Answers :
Unix/Linux :
Work units failing
(Message 55787)
Posted 22 Feb 2017 by Knorr Post: Knorr - Does look like the same problem (at least the same nature).
I have done an
apt-get install --reinstall libc6-i386
although I'm not too optimistic. The md5sum of the libc.so.6 file is the same before and after the reinstall. Perhaps a batch of "bad" units was flushed while you were debugging, and the symptom went away. Anyway, there are no more jobs to get right now, so we will see whenever they pick up again. Thanks for your help so far.
13)
Questions and Answers :
Unix/Linux :
Work units failing
(Message 55785)
Posted 22 Feb 2017 by Knorr Post: According to the list in your first post, the binary in question is the um.
The stack trace shows you how we reached the place where we get the SEGV. Basically, there are 13 nested function calls. We can see that we start in __libc_start_main+0xf6, which does not really sound surprising. I think it might tell us that the SEGV happened in the main thread/process of the application. Unfortunately, the debug symbols of the cpdn part seem to be missing, so we get no clues about what the process is doing, other than that it ends up in the function boinc_catch_signal. If I manage to get new units I will try to stop them before they fail. I might be able to find the command line arguments given, and maybe find out what kind of violation I get.
14)
Questions and Answers :
Unix/Linux :
Work units failing
(Message 55779)
Posted 22 Feb 2017 by Knorr Post: BOINC is in 2 parts; the second part is /usr/bin according to Installing BOINC on Ubuntu
Ah, of course. The BOINC binaries are fine. As you mention, I'm able to run other projects. I got a CPDN unit about a month ago. It produced several trickles. https://www.cpdn.org/cpdnboinc/result.php?resultid=20144359
A SEGV is a very generic message. Basically, the application is trying to access an invalid memory address. Often it is code trying to read or write through a NULL pointer. A NULL pointer could appear because something is not available on my system, but the stack trace does not really give any clues as to what this could be. Is there anywhere I can see which arguments are passed to the CPDN binary for a certain work unit? I might be able to pick up some more forensics if I can run the application outside BOINC.
15)
Questions and Answers :
Unix/Linux :
Work units failing
(Message 55773)
Posted 21 Feb 2017 by Knorr Post:
ldd shows me which file is actually being used by the linker, not only if it is installed. All libraries reported by ldd are in /lib/i386-linux-gnu:
ldd wah2am3m2_um_8.12_i686-pc-linux-gnu
linux-gate.so.1 => (0xf7743000)
libdl.so.2 => /lib/i386-linux-gnu/libdl.so.2 (0xf7711000)
libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xf76bb000)
libpthread.so.0 => /lib/i386-linux-gnu/libpthread.so.0 (0xf769c000)
libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xf74e2000)
/lib/ld-linux.so.2 (0x565c1000)
A "file" on the actual libc shows that it is indeed 32 bit:
/lib/i386-linux-gnu/libc-2.24.so: ELF 32-bit LSB shared object, Intel 80386, version 1 (GNU/Linux), dynamically linked, interpreter /lib/ld-linux.so.2, BuildID[sha1]=005209e623ca3b594b1c902c191b148ff2036623, for GNU/Linux 2.6.32, stripped
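The 32-bit/64-bit distinction that "file" reports comes from a single byte in the ELF header: after the four magic bytes 0x7f 'E' 'L' 'F', the byte at offset 4 (EI_CLASS) is 1 for 32-bit objects and 2 for 64-bit ones. A minimal sketch of the same check, with illustrative function names that are not part of any existing tool:

```python
ELF_MAGIC = b"\x7fELF"
ELF_CLASSES = {1: "32-bit", 2: "64-bit"}


def elf_class_from_header(header):
    """Classify an ELF object from the first five bytes of the file."""
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    # Byte 4 is EI_CLASS: 1 = ELFCLASS32, 2 = ELFCLASS64.
    return ELF_CLASSES.get(header[4], "unknown")


def elf_class(path):
    """Classify the ELF file at the given path."""
    with open(path, "rb") as handle:
        return elf_class_from_header(handle.read(5))


print(elf_class_from_header(b"\x7fELF\x01"))  # 32-bit
print(elf_class_from_header(b"\x7fELF\x02"))  # 64-bit
```

Calling `elf_class("/lib/i386-linux-gnu/libc-2.24.so")` on the system described above would be expected to agree with the "ELF 32-bit" line from file.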
16)
Questions and Answers :
Unix/Linux :
Work units failing
(Message 55772)
Posted 21 Feb 2017 by Knorr Post:
Tried resetting the project. Verified that all files in /var/lib/boinc-client/projects/climateprediction.net were deleted. Requested a new batch of work, same problem.
I'm not sure what the second location you are referring to is. The permissions of the /var/lib/boinc-client/projects/climateprediction.net files look fine. I have einstein and rosetta running as well, and they have the same owner.
17)
Questions and Answers :
Unix/Linux :
Work units failing
(Message 55767)
Posted 21 Feb 2017 by Knorr Post: 32-bit libraries are installed. I checked all the climate@home binaries with ldd, and none of them are missing libraries.
18)
Questions and Answers :
Unix/Linux :
Work units failing
(Message 55762)
Posted 21 Feb 2017 by Knorr Post: My host recently got some work units. All of them failed within seconds with a segmentation violation, though. Looking at the other hosts for the same work unit shows that most tasks are failing on different hosts as well, although the other hosts seem to fail immediately with various other errors. Is there a problem with the work units recently released?
https://www.cpdn.org/cpdnboinc/results.php?hostid=1417684
<core_client_version>7.6.33</core_client_version>
<![CDATA[
<stderr_txt>
SIGSEGV: segmentation violation
Stack trace (13 frames):
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu(boinc_catch_signal+0x6f)[0x83b4d8f]
[0x2aa03cb0]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x8163769]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x81696b4]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x81606cd]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x816add4]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x815f531]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x8084b03]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x809404a]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x8314d85]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x8316ddf]
/var/lib/boinc-client/projects/climateprediction.net/wah2am3m2_um_8.12_i686-pc-linux-gnu[0x8341332]
/lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf6)[0x2a7bb276]
Exiting...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=5818, selfPID=5738, iMonCtr=1
Model crash detected, will try to restart...
Leaving CPDN_Main::Monitor...
16:20:34 (5738): called boinc_finish(0)
</stderr_txt>
19)
Message boards :
Number crunching :
Could not find computer
(Message 43390)
Posted 7 Nov 2011 by Knorr Post: The computer and the task are now showing.
20)
Message boards :
Number crunching :
Could not find computer
(Message 43386)
Posted 7 Nov 2011 by Knorr Post: Okay, great. Are there problems with sending trickles etc. that necessitate suspending the task until the issue is corrected?