hadam3p eu WU segfault
Author | Message |
---|---|
Joined: 9 Apr 14 Posts: 14 Credit: 1,962,018 RAC: 0 |
I just had a hadam3p eu WU fail on a segfault <core_client_version>7.2.33</core_client_version> <![CDATA[ <stderr_txt> SIGSEGV: segmentation violation Stack trace (13 frames): /data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu(boinc_catch_signal+0x6f)[0x8340f8f] linux-gate.so.1(__kernel_sigreturn+0x0)[0x55578400] /data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x813cc30] /data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x81426bb] /data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x813a3ce] /data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x8143cea] /data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x813924a] /data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x8090a3e] /data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x8055900] /data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x8069392] /data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x806a310] /data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x82ccda2] /lib/libc.so.6(__libc_start_main+0xf3)[0x555fc9d3] ... etc ... The WU is http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=16599536 Is this a known problem? I could not find any mention of it on the forum... |
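For anyone wanting to dig into a trace like the one above, here is a minimal sketch that pulls the module, symbol and address out of each frame. It assumes the frames keep the `module(symbol+offset)[0xADDRESS]` shape seen in this stderr; the names `FRAME_RE`, `parse_frames`, `BINARY` and `SAMPLE` are made up for the illustration and are not part of BOINC or the model code.

```python
import re

# One frame as printed in the stderr above, for example:
#   /path/to/binary(symbol+0x6f)[0x8340f8f]
#   /path/to/binary[0x813cc30]
FRAME_RE = re.compile(
    r"(?P<module>\S+?)"                 # path of the binary or shared library
    r"(?:\((?P<symbol>[^)]*)\))?"       # optional "(symbol+offset)"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]"  # address in square brackets
)


def parse_frames(stderr_text):
    """Return a list of (module, symbol_or_None, address) tuples, one per frame."""
    frames = []
    for match in FRAME_RE.finditer(stderr_text):
        frames.append((
            match.group("module"),
            match.group("symbol") or None,
            int(match.group("address"), 16),
        ))
    return frames


# First three frames copied from the trace in the post above.
BINARY = "/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu"
SAMPLE = (
    BINARY + "(boinc_catch_signal+0x6f)[0x8340f8f] "
    "linux-gate.so.1(__kernel_sigreturn+0x0)[0x55578400] "
    + BINARY + "[0x813cc30]"
)

for module, symbol, address in parse_frames(SAMPLE):
    print(hex(address), symbol, module.rsplit("/", 1)[-1])
```

The addresses could then be handed to a tool such as `addr2line -e <binary> <address>`, although that only gives file and line information if the binary carries debug symbols, which is not known here.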
Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0 |
Segfaults happen occasionally (rarely), with all model types. Why? Who knows? A misbehaving driver for one of the computer's peripherals. Cosmic rays flipping a bit in the computer's RAM. Electrical 'noise' on the power supply from (for example) the refrigerator turning off. Static electricity building up and then discharging. Oxide films building up on connectors in the computer. A failing component or solder joint. Most likely: a very obscure bug in the model code that can never be reproduced by the developers, because it only shows up after a certain pattern of disk accesses, with model data in a specific place in RAM. Or something like that. If segfaults happen frequently with one computer, then it might be worth investigating further, for example checking that all the connectors to the motherboard and disk drives are properly seated, and running Memtest86+ for 72 hours or so. But a one-off segfault is nothing to worry about. |
Joined: 9 Apr 14 Posts: 14 Credit: 1,962,018 RAC: 0 |
Thanks! All I know is that the segfault happened almost simultaneously with another WU on the same machine finishing. Could be a coincidence of course. But it would be consistent with the non-reproducibility... |
Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0 |
Hmmm, I recently had the impression that two models influenced each other in this thread. That was a Windows machine with less detailed error output, but with a somewhat similar effect. I guess it might be useful to look at the activity of models running concurrently on the same machine when such a thing happens. |
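One way to check for that kind of coincidence is to compare the crash time with the times at which other tasks on the same host finished. The sketch below is only an illustration: the function name `tasks_ending_near`, the two-minute window and all the timestamps are hypothetical, and in practice the times would have to be read off the result pages or the client's event log by hand.

```python
from datetime import datetime, timedelta


def tasks_ending_near(crash_time, other_end_times, window=timedelta(minutes=2)):
    """Return the end times of other tasks that fall within `window` of the crash."""
    return [t for t in other_end_times if abs(t - crash_time) <= window]


# Hypothetical timestamps, for illustration only (not taken from any real result).
crash = datetime(2014, 5, 1, 12, 0, 30)
other_tasks_finished = [
    datetime(2014, 5, 1, 12, 0, 5),   # 25 seconds before the crash
    datetime(2014, 5, 1, 9, 15, 0),   # hours earlier, unrelated
]

near = tasks_ending_near(crash, other_tasks_finished)
print(len(near), "task(s) finished close to the crash")
```

A single match proves nothing, of course; the check only becomes interesting if the same pattern shows up across several crashes.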
Joined: 9 Sep 04 Posts: 228 Credit: 30,269,058 RAC: 2,247 |
Hi Mr. Greg van Paassen, that sounds right and cool. Nice to read. |
Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0 |
Thanks, Bonsai911! :-) |
Joined: 9 Apr 14 Posts: 14 Credit: 1,962,018 RAC: 0 |
Just had a 2nd eu model crash on a segfault on the same machine http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=16599512 The other types of climateprediction WUs are still OK on this machine, but I guess it is still early days... I will run memtest to be on the safe side, but I have not seen flaky behavior with other projects. I have had 3 eu WUs on this machine so far, and all 3 failed. One immediately crashed due to a different cause which was reproduced on other machines, so that must be inherent to the WU. But the other two both failed on segfaults. |
Joined: 9 Apr 14 Posts: 14 Credit: 1,962,018 RAC: 0 |
Just checked the memory. It's fine. No clue what is going on here... |
Joined: 16 Jan 10 Posts: 1081 Credit: 6,980,320 RAC: 3,893 |
pvh: Note that at least one of your crashed models has "Model crashed: INITTIME: Atmosphere basis time mismatch" (as, indeed, has one recent one of mine). This is a project configuration error and nothing to do with your machine. So at least you don't have to worry about that one. |
Joined: 9 Apr 14 Posts: 14 Credit: 1,962,018 RAC: 0 |
That one crash is clearly reproducible. But in the meantime two more WUs crashed on a segfault. This time both were hadam3p anz models. This is clearly not a rare event. I am seriously contemplating abandoning this project... |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Over the years there have been a number of reports about the SIGSEGV problem. The vague ideas that I've formed about them, without knowing anything about the cause, are: Always Linux. They usually get posted about in the Linux thread. They, like the "Visual Fortran run-time error" and the "exited with zero status but no 'finished' file" problems, only happen to some computers. Also, they start suddenly, and stop happening just as suddenly. Or, at least, stop getting posted about. I haven't had it happen to my computers in the few months that I've been running Linux, so, basically, it's your computer that has a problem, and not the models. |
Joined: 15 May 09 Posts: 4342 Credit: 16,497,933 RAC: 6,477 |
Interesting, Les. I have had these in the past on this box but not for a long time, certainly not since I replaced the PSU when it stopped booting up. The only other significant changes have been doubling the RAM to 4GB and periodically changing the Linux version: I started off with Mandriva, then had problems with one of the incarnations of it and moved over to Kubuntu. The trouble in working out the cause is, as you say, how erratic it is. I would get a couple in a month and then none for six months or more, which makes me wonder: while it is, as you say, almost certainly a computer problem, are some models more prone to it than others? |
Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
I have a hadam3p_anz unit at 50% after 366 hours on my SuSE Linux box. Oddly enough, the estimated completion time is given as only 107 hours. How come? Tullio |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Probably in a loop. They don't last THAT long. |
Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
"Probably in a loop. They don't last THAT long." Should I abort it? |
Joined: 15 May 09 Posts: 4342 Credit: 16,497,933 RAC: 6,477 |
"Should I abort it?" Yes. Models stuck in a loop are like the computer programmer found dead in the shower. Instructions on the bottle: wet hair, apply shampoo, rinse, repeat. |
Joined: 7 Aug 04 Posts: 2167 Credit: 64,477,979 RAC: 4,009 |
If it actually only had 100 hours to go, a little over 450 hours total might not be a bad estimate for his computer, which is averaging 12 sec/TS for the ANZ model. But at 50%, and 366 hours, there has to be a problem. |
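The mismatch is easy to see with a quick back-of-the-envelope check. This is a minimal sketch that assumes the model progresses at a roughly constant rate, which a healthy run should approximately do; the function name `projected_total_hours` is made up for the illustration, and the figures are the ones quoted in the posts above.

```python
def projected_total_hours(elapsed_hours, fraction_done):
    """Linear extrapolation: total run time if progress continues at the same rate."""
    return elapsed_hours / fraction_done


elapsed = 366.0        # hours run so far, as reported
fraction_done = 0.50   # 50% complete, as reported
boinc_remaining = 107  # hours remaining according to the client's estimate

total = projected_total_hours(elapsed, fraction_done)
remaining = total - elapsed

print(f"projected total: {total:.0f} h, projected remaining: {remaining:.0f} h")
print(f"client estimate of remaining time: {boinc_remaining} h")
```

At a steady rate, 366 hours for the first half implies about 366 hours for the second half and 732 hours in total, well above the 450 or so hours that 12 sec/TS would suggest, so either the progress figure or the elapsed time is not reflecting real work.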
Joined: 9 Apr 14 Posts: 14 Credit: 1,962,018 RAC: 0 |
@Les Bayliss. Let me assure you that there are no hardware problems on my computer. I have checked. The machine is healthy. You state that the segfaults only occur on Linux machines. This proves beyond a doubt that software problems must play a role in the problem as it can be ruled out that hardware problems would exclusively occur on Linux machines. My guess is that it is the boinc library. I have seen that library produce spurious segfaults on occasion on all my machines, e.g. when there is a problem with the Ethernet switch. Also a heavy load on the disk or OS can do this (i.e., when the kernel starts doing high-priority tasks). The problem (or problems, there could be multiple bugs in the boinc library...) clearly needs a combination of factors to pop up and is in part driven by external factors. I have read an interesting posting on another message board that stated that this was due to naive design of the boinc library, pretending it had control over things it simply cannot control. Unfortunately I cannot find that posting again. But it mentioned arbitrary timeouts on system commands to "check" if the system is still healthy... If so, there must be something in the way climateprediction uses the boinc library that makes this problem more likely to pop up, as the segfaults are clearly far more frequent here than on other boinc projects that I ran on the same box. Pointers could be that boinc resides on a RAID5 array on this machine, and that it is my regular desktop machine (so it is more loaded with external tasks). |
Joined: 31 Aug 04 Posts: 391 Credit: 219,888,554 RAC: 1,481,373 |
The reason that SIGSEGV (signal 11) only happens on Linux and Mac machines is (wait for it) that only on Linux and other *ix machines is this SIGSEGV defined. Signal 11 is defined on all *ix machines; basically, access memory you are not allowed to touch and you get signal 11, SIGSEGV. Other OSes, other definitions. How, say, MS Windows reports this type of error, I don't know. But on any Unix-based machine SIGSEGV means either a programming error or a hardware error, and you need real skills to figure out which. Refer to the hardware docs at Intel or AMD. Almost always when you get a SIGSEGV on any *ix machine, it's either incompetent C programming or, more likely, a hardware problem. pvh states: "You state that the segfaults only occur on Linux machines. This proves beyond a doubt that software problems must play a role in the problem as it can be ruled out that hardware problems would exclusively occur on Linux machines." Naah, hardware problems happen on all machines. Linux reports segfaults, Windows reports it some other way. Non sequitur. "Let me assure you that there are no hardware problems on my computer. I have checked. The machine is healthy." This is a "famous last words" candidate. Believe me or don't, but how many times have I thought "my machine has no problems, neither hardware nor OS nor DLLs"? Don't be so sure about what none of us knows. Oh, and let me add that I've been running mostly CPDN for almost a decade, and the very, very few SIGSEGVs were almost always hardware problems. |
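To make the "access memory you shouldn't, get signal 11" point concrete, here is a minimal sketch that crashes a throwaway child process on purpose and reports which signal killed it. It assumes a Linux (or similar POSIX) host with a standard CPython; the names `CHILD_CODE` and `run_and_report` are made up for the illustration. The crash is a deliberate programming error, so it says nothing about how to tell software faults from hardware faults.

```python
import signal
import subprocess
import sys

# Child program: read a string from address 0, which no process may access.
CHILD_CODE = "import ctypes; ctypes.string_at(0)"


def run_and_report():
    """Run the crashing child; return the number of the signal that killed it, or None."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        stderr=subprocess.DEVNULL,  # hide any crash output from the child
    )
    # On POSIX, a negative return code means "terminated by signal -returncode".
    if proc.returncode < 0:
        return -proc.returncode
    return None


killed_by = run_and_report()
if killed_by is not None:
    print("child killed by", signal.Signals(killed_by).name, f"(signal {killed_by})")
else:
    print("child was not killed by a signal on this platform")
```

On Windows the same kind of fault is reported as an access violation exception rather than a signal, which is why the Windows result pages show a different-looking error for what may be the same underlying problem.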
Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
"If it actually only had 100 hours to go, a little over 450 hours total might not be a bad estimate for his computer, which is averaging 12 sec/TS for the ANZ model." Anyway, I aborted it and am waiting for a new unit, while running SETI@home Astropulse and Test4Theory@home. Tullio |
©2024 climateprediction.net