Posts by pvh

1) Message boards : Number crunching : AVX and AVX2; Is it used at CPDN? (Message 56362)
Posted 11 Jun 2017 by pvh
Post:
Substantial time (~5%) is spent in libm's powf(), which uses legacy x87 instructions. Is there some other way the model could do exponentiation? Ditto for log10. (Both are at FP32 precision afaict).


There are well-known speed issues with the standard Linux math library, especially with the single-precision math functions. The stance of the library developers has been clearly stated: they only care about accuracy, not speed. It looks like they made no effort to optimize the single-precision math functions for speed, and these are often substantially slower than their double-precision counterparts. It looks like they only cared about ticking off the box "added support for single-precision math functions"... That being said, there has been some improvement in more recent math library versions, in part due to complaints from users, and also due to third parties submitting their own (better optimized) versions. My hunch is that CPDN is using a very old math library version... If so, switching to a newer version may already help. Avoiding single-precision math functions will likely also help...
2) Message boards : Number crunching : Replanca Error/Sigseg fault. (Message 56360)
Posted 11 Jun 2017 by pvh
Post:
I see the same issue with the segfaults: a 100% failure rate on WUs from the 583 batch, all after about the same amount of CPU time (so this doesn't look random at all). Looking at my wingmen, I could not find a single one where the task finished OK (including a few Windows computers). These statistics are based on 14 failed WUs. Several of those are already counted out with 3 failures.

With quite a few of my wingmen I also saw this error:

../../projects/climateprediction.net/wah2_8.25_i686-pc-linux-gnu: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory
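
The binary name ends in i686-pc-linux-gnu, so the application is a 32-bit executable. On a 64-bit host this loader error often means that the 32-bit version of libstdc++ is not installed, although that is an assumption about those machines and not something confirmed here. As a rough sketch, the word size of an executable can be read from its ELF header before looking for the matching library. The function name `elf_class` is made up for this illustration.

```python
import sys


def elf_class(path):
    """Return 32 or 64 for an ELF file, or None if it is not a valid ELF file."""
    with open(path, "rb") as f:
        header = f.read(5)
    # Every ELF file starts with the magic bytes 0x7f 'E' 'L' 'F'.
    if len(header) < 5 or header[:4] != b"\x7fELF":
        return None
    # Byte 4 of the identification block (EI_CLASS): 1 = 32-bit, 2 = 64-bit.
    return {1: 32, 2: 64}.get(header[4])


if __name__ == "__main__":
    # Hypothetical usage: pass the path(s) of the application binary.
    for path in sys.argv[1:]:
        print(path, elf_class(path))
```

If this reports 32 for the application on a 64-bit system, the 32-bit libstdc++ package of the distribution is the first thing to check.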
3) Message boards : Number crunching : Extremely high work units done. (Message 50092)
Posted 10 Sep 2014 by pvh
Post:
And then they told me I would lose all those credits... They have gone up again! But it is still only a laughable 664 billion. Nothing compared to some other guys here...
4) Message boards : Number crunching : Extremely high work units done. (Message 49974)
Posted 2 Sep 2014 by pvh
Post:
This time I got 5.6 billion credits. That will teach those Bitcoin crunchers a lesson of humility! Sadly it won't last though...
5) Message boards : Number crunching : Credit updates? (Message 49864)
Posted 25 Aug 2014 by pvh
Post:
Is it just me, or are the credits a lot lower now than before the glitch? Before the glitch a successful HadAM3P PNW WU would get me around 3000 credits, now that is only 1000-1250...

I also had a HadAM3P with MOSES II WU that recently crashed after 1.3 million sec and got me less than 1100 credits. A similar unit that crashed earlier after a similar amount of CPU time got me 8900 credits...
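
To put the two crashed units on the same scale, the figures can be converted to credits per CPU day. This is only a back-of-the-envelope sketch: it assumes both units ran for exactly 1.3 million seconds (the post only says "a similar amount") and it uses 1100 credits as an upper bound for the recent unit. The helper name `credits_per_cpu_day` is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def credits_per_cpu_day(credits, cpu_seconds):
    """Convert a credit award into credits per day of CPU time."""
    return credits * SECONDS_PER_DAY / cpu_seconds


# Assumption: both units used the same CPU time of 1.3 million seconds.
cpu_seconds = 1.3e6
recent = credits_per_cpu_day(1100, cpu_seconds)   # upper bound, "less than 1100"
earlier = credits_per_cpu_day(8900, cpu_seconds)

print(f"recent unit:  {recent:.1f} credits per CPU day")
print(f"earlier unit: {earlier:.1f} credits per CPU day")
print(f"ratio:        {earlier / recent:.1f}")
```

Under these assumptions the earlier unit was credited at roughly eight times the rate of the recent one.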
6) Message boards : Number crunching : Extremely high work units done. (Message 49811)
Posted 21 Aug 2014 by pvh
Post:
Hmmm, I had to settle for a mere 40 million... Ah well, you can't win them all!
7) Message boards : Number crunching : hadam3p eu WU segfault (Message 48997)
Posted 1 May 2014 by pvh
Post:
@Les Bayliss. Let me assure you that there are no hardware problems on my computer. I have checked. The machine is healthy.

You state that the segfaults only occur on Linux machines. This proves beyond a doubt that software problems must play a role in the problem as it can be ruled out that hardware problems would exclusively occur on Linux machines.

My guess is that it is the boinc library. I have seen that library produce spurious segfaults on occasion on all my machines, e.g. when there is a problem with the ethernet switch. Also a heavy load on the disk or OS can do this (i.e., when the kernel starts doing high-priority tasks). The problem (or problems, there could be multiple bugs in the boinc library...) clearly needs a combination of factors to pop up and is in part driven by external factors.

I have read an interesting posting on another message board that stated that this was due to the naive design of the boinc library, which pretends it has control over things it simply cannot control. Unfortunately I cannot find that posting again. But it mentioned arbitrary timeouts on system commands to "check" if the system is still healthy...

If so, there must be something in the way climateprediction uses the boinc library that makes this problem more likely to pop up, as the segfaults are clearly far more frequent here than on other boinc projects that I ran on the same box. Pointers could be that boinc resides on a RAID5 array on this machine, and that it is my regular desktop machine (so it is more loaded with external tasks).

8) Message boards : Number crunching : hadam3p eu WU segfault (Message 48971)
Posted 29 Apr 2014 by pvh
Post:
That one crash is clearly reproducible. But in the meantime two more WUs crashed on a segfault. This time both were hadam3p anz models. This is clearly not a rare event. I am seriously contemplating abandoning this project...
9) Message boards : Number crunching : hadam3p eu WU segfault (Message 48952)
Posted 28 Apr 2014 by pvh
Post:
Just checked the memory. It's fine. No clue what is going on here...
10) Message boards : Number crunching : hadam3p eu WU segfault (Message 48945)
Posted 28 Apr 2014 by pvh
Post:
Just had a 2nd eu model crash on a segfault on the same machine:

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=16599512

The other types of climateprediction WUs are still OK on this machine, but I guess it is still early days... I will run memtest to be on the safe side, but I have not seen flaky behavior with other projects.

I have had 3 eu WUs on this machine so far, and all 3 failed. One immediately crashed due to a different cause which was reproduced on other machines, so that must be inherent to the WU. But the other two both failed on segfaults.
11) Message boards : Number crunching : hadam3p eu WU segfault (Message 48917)
Posted 27 Apr 2014 by pvh
Post:
Thanks! All I know is that the segfault happened almost simultaneously with another WU on the same machine finishing. Could be a coincidence of course. But it would be consistent with the non-reproducibility...
12) Message boards : Number crunching : hadam3p eu WU segfault (Message 48914)
Posted 27 Apr 2014 by pvh
Post:
I just had a hadam3p eu WU fail on a segfault:

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>
SIGSEGV: segmentation violation
Stack trace (13 frames):
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu(boinc_catch_signal+0x6f)[0x8340f8f]
linux-gate.so.1(__kernel_sigreturn+0x0)[0x55578400]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x813cc30]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x81426bb]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x813a3ce]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x8143cea]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x813924a]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x8090a3e]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x8055900]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x8069392]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x806a310]
/data1/pvh/BOINC/projects/climateprediction.net/hadrm3p_eu_um_6.09_i686-pc-linux-gnu[0x82ccda2]
/lib/libc.so.6(__libc_start_main+0xf3)[0x555fc9d3]
... etc ...


The WU is http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=16599536

Is this a known problem? I could not find any mention of it on the forum...
13) Message boards : Number crunching : UK Met Office HADAM3P (global only) with MOSES II landsurface scheme v7.03 (Message 48874)
Posted 23 Apr 2014 by pvh
Post:
I have just finished 6 WUs, but BOINC is refusing to download any new work because it thinks that it doesn't need any work. This is undoubtedly a result of the excessive time remaining estimate for the hadam3pm2 WU. Is there any way of tricking BOINC into downloading new work despite this?
14) Message boards : Number crunching : UK Met Office HADAM3P (global only) with MOSES II landsurface scheme v7.03 (Message 48840)
Posted 18 Apr 2014 by pvh
Post:
I am running one of the new WUs (hadam3pm2_b8q0_1967_10_008669491_1) under openSUSE 13.1 and it has been running fine for 24 hours now. The projected total run time seems extreme though. After 24 hours only 0.5% of the WU has completed. I hope that speeds up later on, since a run time of 200 days really ties up your computer for a long time...
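
The 200-day figure follows from a simple linear extrapolation of the progress so far. A minimal sketch of that arithmetic is given below, assuming the WU keeps progressing at a constant rate (which a real model run need not do). The function name `projected_total_days` is made up for illustration.

```python
def projected_total_days(elapsed_hours, percent_done):
    """Linearly extrapolate the total run time in days from progress so far."""
    if percent_done <= 0:
        raise ValueError("percent_done must be positive")
    total_hours = elapsed_hours * 100 / percent_done
    return total_hours / 24


# Figures from the post: 0.5% completed after 24 hours.
total_days = projected_total_days(24, 0.5)
print(f"projected total run time: {total_days:.0f} days")
```

Any speed-up later in the run would shorten this estimate, so it is best read as an upper bound under the constant-rate assumption.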



