hadam3p eu WU segfault

Author	Message
pvh Joined: 9 Apr 14 Posts: 14 Credit: 1,962,018 RAC: 0	Message 48914 - Posted: 27 Apr 2014, 0:04:37 UTC
I just had a hadam3p eu WU fail on a segfault:
<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>
SIGSEGV: segmentation violation
Stack trace (13 frames):
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu(boinc_catch_signal+0x6f)[0x8340f8f]
linux-gate.so.1(__kernel_sigreturn+0x0)[0x55578400]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x813cc30]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x81426bb]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x813a3ce]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x8143cea]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x813924a]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x8090a3e]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x8055900]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x8069392]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x806a310]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x82ccda2]
/lib/libc.so.6(__libc_start_main+0xf3)[0x555fc9d3]
... etc ...
The WU is http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=16599536
Is this a known problem? I could not find any mention of it on the forum...
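For anyone wanting to compare traces from several failed WUs, here is a minimal Python sketch that splits a trace like the one above into frames. It is not part of the original post. It assumes each frame follows the `module(symbol+0xOFFSET)[0xADDRESS]` or `module[0xADDRESS]` layout seen in this trace, and the names `parse_frames` and `frames_per_module` are made up for illustration.

```python
import re
from collections import Counter

# One frame looks like  module(symbol+0xOFFSET)[0xADDRESS]
# or, when no symbol name is available,  module[0xADDRESS].
FRAME_RE = re.compile(
    r"(?P<module>[^\s()\[\]]+)"
    r"(?:\((?P<symbol>[^()+\s]+)\+0x(?P<offset>[0-9a-fA-F]+)\))?"
    r"\[0x(?P<address>[0-9a-fA-F]+)\]"
)


def parse_frames(stderr_text):
    """Return one dict per frame found in the text, in order of appearance."""
    frames = []
    for match in FRAME_RE.finditer(stderr_text):
        frames.append({
            "module": match.group("module"),
            "symbol": match.group("symbol"),  # None when the frame has no symbol
            "address": int(match.group("address"), 16),
        })
    return frames


def frames_per_module(frames):
    """Count frames by the file name of the module they belong to."""
    return Counter(frame["module"].rsplit("/", 1)[-1] for frame in frames)


# A shortened copy of the trace above (4 of the 13 frames).
APP = ("/data1/pvh/BOINC/projects/climateprediction.net/"
       "hadrm3p_eu_um_6.09_i686-pc-linux-gnu")
SAMPLE = (
    "SIGSEGV: segmentation violation Stack trace (13 frames): "
    + APP + "(boinc_catch_signal+0x6f)[0x8340f8f] "
    + "linux-gate.so.1(__kernel_sigreturn+0x0)[0x55578400] "
    + APP + "[0x813cc30] "
    + "/lib/libc.so.6(__libc_start_main+0xf3)[0x555fc9d3]"
)

if __name__ == "__main__":
    for name, count in frames_per_module(parse_frames(SAMPLE)).items():
        print(f"{count:2d} frame(s) in {name}")
```

Frames without a symbol name can only be compared by address. If two crashes of the same application version show the same addresses, that hints at the same code path, but it does not by itself say whether the cause is software or hardware.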

Greg van Paassen Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0	Message 48916 - Posted: 27 Apr 2014, 10:20:38 UTC - in response to Message 48914.
Segfaults happen occasionally (rarely), with all model types.
Why? Who knows? A misbehaving driver for one of the computer's peripherals. Cosmic rays flipping a bit in the computer's RAM. Electrical 'noise' on the power supply from (for example) the refrigerator turning off. Static electricity building up and then discharging. Oxide films building up on connectors in the computer. A failing component or solder joint.
Most likely: a very obscure bug in the model code that can never be reproduced by the developers, because it only shows up after a certain pattern of disk accesses, with model data in a specific place in RAM. Or something like that.
If segfaults happen frequently on one computer, then it might be worth investigating further, for example by checking that all the connectors to the motherboard and disk drives are properly seated, and running Memtest86+ for 72 hours or so. But a one-off segfault is nothing to worry about.

pvh Joined: 9 Apr 14 Posts: 14 Credit: 1,962,018 RAC: 0	Message 48917 - Posted: 27 Apr 2014, 15:06:51 UTC
Thanks! All I know is that the segfault happened almost simultaneously with another WU on the same machine finishing. Could be a coincidence of course. But it would be consistent with the non-reproducibility...

Ananas Volunteer moderator Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0	Message 48918 - Posted: 27 Apr 2014, 17:34:35 UTC Last modified: 27 Apr 2014, 17:34:53 UTC
Hmmm, I just lately had the impression that two models influenced each other in this thread. That was a Windows machine with less detailed error output, but still with a somewhat similar effect. I guess it might be useful to have a look at the activity of models running concurrently on the same machine when such a thing happens.

Bonsai911 Joined: 9 Sep 04 Posts: 228 Credit: 30,270,715 RAC: 1,398	Message 48925 - Posted: 27 Apr 2014, 20:17:31 UTC
Hi Mr. Greg van Paassen, that sounds right and cool. Nice to read.

Greg van Paassen Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0	Message 48926 - Posted: 27 Apr 2014, 23:36:55 UTC - in response to Message 48925.
Thanks, Bonsai911! :-)

pvh Joined: 9 Apr 14 Posts: 14 Credit: 1,962,018 RAC: 0	Message 48945 - Posted: 28 Apr 2014, 14:18:28 UTC
Just had a 2nd eu model crash on a segfault on the same machine:
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=16599512
The other types of climateprediction WUs are still OK on this machine, but I guess it is still early days... I will run memtest to be on the safe side, but I have not seen flaky behavior with other projects.
I have had 3 eu WUs on this machine so far, and all 3 failed. One crashed immediately due to a different cause which was reproduced on other machines, so that must be inherent to the WU. But the other two both failed on segfaults.

pvh Joined: 9 Apr 14 Posts: 14 Credit: 1,962,018 RAC: 0	Message 48952 - Posted: 28 Apr 2014, 22:22:28 UTC
Just checked the memory. It's fine. No clue what is going on here...

Iain Inglis Volunteer moderator Joined: 16 Jan 10 Posts: 1081 Credit: 7,000,243 RAC: 4,190	Message 48954 - Posted: 29 Apr 2014, 0:09:50 UTC
pvh: Note that at least one of your crashed models has "Model crashed: INITTIME: Atmosphere basis time mismatch" (as, indeed, has one recent one of mine). This is a project configuration error and nothing to do with your machine. So at least you don't have to worry about that one.

pvh Joined: 9 Apr 14 Posts: 14 Credit: 1,962,018 RAC: 0	Message 48971 - Posted: 29 Apr 2014, 20:32:07 UTC
That one crash is clearly reproducible. But in the meantime two more WUs crashed on a segfault. This time both were hadam3p anz models. This is clearly not a rare event. I am seriously contemplating abandoning this project...

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 48973 - Posted: 29 Apr 2014, 21:47:45 UTC - in response to Message 48971.
Over the years there have been a number of reports about the SIGSEGV problem. The vague ideas that I've formed about them, without knowing anything about the cause, are:
Always Linux. They usually get posted about in the Linux thread.
They, like the "Visual Fortran run-time error" and the "exited with zero status but no 'finished' file" problems, only happen to some computers.
Also, they start suddenly, and stop happening just as suddenly. Or, at least, stop getting posted about.
I haven't had it happen to my computers in the few months that I've been running Linux, so, basically, it's your computer that has a problem, and not the models.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4345 Credit: 16,523,697 RAC: 5,963	Message 48980 - Posted: 30 Apr 2014, 6:43:42 UTC
Interesting, Les. I have had these in the past on this box but not for a long time, certainly not since I replaced the PSU when it stopped booting up. The only other significant changes have been doubling the RAM to 4GB and periodically changing the Linux version: I started off with Mandriva, then had problems with one of the incarnations of it and moved over to Kubuntu.
The trouble in working out the cause is, as you say, how erratic it is. I would get a couple in a month and then none for six months or more, which makes me wonder: while it is, as you say, almost certainly a computer problem, are some models more prone to it than others?

tullio Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0	Message 48981 - Posted: 30 Apr 2014, 7:21:41 UTC Last modified: 30 Apr 2014, 7:23:02 UTC
I have a hadam3p_anz unit at 50% after 366 hours on my SuSE Linux box. Oddly enough, the estimated completion time is given as only 107 hours. How come?
Tullio

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 48983 - Posted: 30 Apr 2014, 8:23:28 UTC - in response to Message 48981.
Probably in a loop. They don't last THAT long.

tullio Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0	Message 48989 - Posted: 30 Apr 2014, 12:51:06 UTC - in response to Message 48983. Last modified: 30 Apr 2014, 12:51:19 UTC
"Probably in a loop. They don't last THAT long."
Should I abort it?

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4345 Credit: 16,523,697 RAC: 5,963	Message 48990 - Posted: 30 Apr 2014, 18:40:01 UTC - in response to Message 48989.
"Should I abort it?"
Yes, models stuck in a loop are like the computer programmer found dead in the shower. Instructions on bottle:
Wet hair
Apply shampoo
Rinse
Repeat

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2167 Credit: 64,526,915 RAC: 6,447	Message 48991 - Posted: 30 Apr 2014, 20:08:13 UTC - in response to Message 48990.
If it actually only had 100 hours to go, a little over 450 hours total might not be a bad estimate for his computer, which is averaging 12 sec/TS for the ANZ model. But at 50%, and 366 hours, there has to be a problem.
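The mismatch in these figures is easy to check with a little arithmetic. A minimal Python sketch, assuming progress is roughly linear in run time (a simplification, and not necessarily how the BOINC client computes its own estimate); the function name `linear_projection` is made up for illustration:

```python
def linear_projection(elapsed_hours, fraction_done):
    """Project total and remaining run time assuming a constant rate of progress."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    total = elapsed_hours / fraction_done
    return total, total - elapsed_hours


if __name__ == "__main__":
    # Figures from tullio's post: 50% done after 366 hours.
    total, remaining = linear_projection(366, 0.5)
    print(f"linear projection: {total:.0f} h total, {remaining:.0f} h remaining")

    # The client reported only 107 hours remaining.
    reported_remaining = 107
    print(f"client-implied total: {366 + reported_remaining} h")
```

A steady run that is half done after 366 hours would need about 366 more, for 732 in total. The client's figure implies about 473 hours in total, which is close to the "little over 450" that would be plausible for this machine, so the two numbers cannot both describe a healthy run.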

pvh Joined: 9 Apr 14 Posts: 14 Credit: 1,962,018 RAC: 0	Message 48997 - Posted: 1 May 2014, 11:02:27 UTC
@Les Bayliss. Let me assure you that there are no hardware problems on my computer. I have checked. The machine is healthy. You state that the segfaults only occur on Linux machines. This proves beyond a doubt that software problems must play a role in the problem, as it can be ruled out that hardware problems would exclusively occur on Linux machines.
My guess is that it is the boinc library. I have seen that library produce spurious segfaults on occasion on all my machines, e.g. when there is a problem with the ethernet switch. A heavy load on the disk or OS can also do this (i.e., when the kernel starts doing high-priority tasks). The problem (or problems, there could be multiple bugs in the boinc library...) clearly needs a combination of factors to pop up and is in part driven by external factors. I have read an interesting posting on another message board that stated that this was due to naive design of the boinc library, pretending it had control over things it simply cannot control. Unfortunately I cannot find that posting again. But it mentioned arbitrary timeouts on system commands to "check" if the system is still healthy...
If so, there must be something in the way climateprediction uses the boinc library that makes this problem more likely to pop up, as the segfaults are clearly far more frequent here than on other boinc projects that I ran on the same box. Pointers could be that boinc resides on a RAID5 array on this machine, and that it is my regular desktop machine (so it is more loaded with external tasks).

Eirik Redd Joined: 31 Aug 04 Posts: 391 Credit: 219,888,554 RAC: 1,481,373	Message 48998 - Posted: 1 May 2014, 11:49:26 UTC - in response to Message 48997. Last modified: 1 May 2014, 11:56:13 UTC
The reason that sigsegv (signal 11) only happens on Linux and Mac machines is -- wait for it -- that only on Linux and other *ix machines is this sigsegv defined. Sigsegv, signal 11, is defined on all *ix machines; on any such machine there is a definition -- basically, access outlaw memory, get sig 11 sigsegv. Other OSes, other definitions. How, say, MS Windows reports this type of error, I don't know.
But on any Unix-based machine, sigsegv means either a programming error or a hardware error, and you need real skills to figure out which. Refer to the hardware docs at Intel or AMD. Almost always when you get a sigsegv on any U*x machine, it's either incompetent C programming or, more likely, a "hardware problem".
pvh states: "You state that the segfaults only occur on Linux machines. This proves beyond a doubt that software problems must play a role in the problem as it can be ruled out that hardware problems would exclusively occur on Linux machines."
Naah. Hardware problems happen on all machines -- Linux reports segfaults, Windows reports -- ?? Non sequitur.
"Let me assure you that there are no hardware problems on my computer. I have checked. The machine is healthy."
This is a "famous last words" candidate. Believe me, or don't -- how many times have I thought "my machine has no problems, neither hardware nor OS nor DLLs". Don't posture to be so sure about what none of us knows about.
Oh, and let me add that I've been running mostly CPDN for almost a decade, and the very, very few sigsegvs in that time were almost always hardware problems.
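To make the "access outlaw memory, get sig 11" point concrete, here is a small Python sketch for a Unix-like machine. It starts a throwaway child process that reads from address 0 and then reports how that child died. It only illustrates how the OS reports the fault and has nothing to do with the CPDN models or the BOINC library; the function names are made up for illustration.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable description."""
    if returncode < 0:
        # On POSIX, subprocess reports "killed by signal N" as return code -N.
        sig = signal.Signals(-returncode)
        return f"killed by {sig.name} (signal {sig.value})"
    return f"exited with status {returncode}"


def run_child_that_reads_address_zero():
    """Run a separate Python process that reads from address 0; return its exit code."""
    child_code = "import ctypes; ctypes.string_at(0)"
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


if __name__ == "__main__":
    print(describe_exit(run_child_that_reads_address_zero()))
```

On Linux this should report that the child was killed by SIGSEGV. The signal only says that the process touched memory it was not allowed to touch. It carries no information about whether a bug or faulty hardware produced the bad address, which is why the two are so hard to tell apart from a trace alone.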

tullio Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0	Message 49010 - Posted: 1 May 2014, 14:08:06 UTC - in response to Message 48991.
"If it actually only had 100 hours to go, a little over 450 hours total might not be a bad estimate for his computer, which is averaging 12 sec/TS for the ANZ model. But at 50%, and 366 hours, there has to be a problem."
Anyway I aborted it and am waiting for a new unit, while running SETI@home Astropulse and Test4Theory@home.
Tullio