climateprediction.net home page
new app. 4.23 resolves signal 11 bug


Questions and Answers : Unix/Linux : new app. 4.23 resolves signal 11 bug



old_user16868

Joined: 12 Sep 04
Posts: 7
Credit: 515,736
RAC: 0
Message 19042 - Posted: 5 Jan 2006, 12:55:50 UTC

I have just tried v4.23 and still get a sulphur_um zombie process, and the model still crashes. The machine IDs are 335895, 29961 and 29959.

I haven't stopped and restarted BOINC since the update to 4.23, or rebooted Linux.
Do I need to do either of these to get a functional sulphur model after the update?
I posted lengthy details a few days ago, but maybe in the wrong forum.

I have the v2.3.2 libc and libm libraries. Does this new version of sulphur resolve the problems with the 2.3.2 libs? I am still getting the same error.

I hope someone can help, as I have been puzzling over this for a few weeks now.

cheers

Steve R

Below is my post to "unexpected behaviour in your model? / Sulphur model premature ends":


I have been having crashes with the sulphur cycle V4.22.
It seems to crash straight away, leaving the sulphur_um process as a zombie. I have set the BOINC client "keep in memory" option to on, and tried detach/reattach, to no avail. The Linux boxes are:
Thread model: posix
gcc version 3.2 20020903 (Red Hat Linux 8.0 3.2-7)
kernel 2.4.29

I have run math- and memory-intensive cosmology and relativity modelling apps using parallel libraries (LAM-MPI, LAPACK, BLAS, CACTUS, PETSC, etc.) with 6 of the nodes in cluster mode. These cosmology models ran for months at a time over gigabit networking, with no crashes or issues. The machines are OK, with no CPU or memory issues. I know this for a fact.
These machines are now operating standalone and have run all the HADSM slabs from CPDN up until the sulphur cycle models. Other projects have also run without issues.

BOINC is the optimised version 5.2.5, with setiathome, LHC and predictor projects running as well. I have even tried running with only CPDN, so that there were no context switches to other projects. 3 other Linux boxes (same config) that I have running are still being fed hadsm3 4.13 models.

Does anyone know what the real issue is with the sulphur_um..... executable? It never gets any memory allocation or shared memory, according to the top process monitor. I am assuming this is what causes the climate model to crash and then time out on the number of crashes. I have checked that the shared library installed by sulphur 4.22 is locatable (ldconfig) in the slots and climateprediction directories. Permissions are correct.
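
A minimal sketch of one way to check which shared libraries an executable actually resolves, assuming the standard `ldd` tool is installed. The function names are illustrative and not part of BOINC or CPDN. Since `ldd` may execute code from the file it inspects, it should only be run on executables you trust.

```python
import subprocess


def parse_ldd_output(text):
    """Map each library name in `ldd` output to its resolved path (None if not found)."""
    resolved = {}
    for line in text.splitlines():
        line = line.strip()
        if "=>" not in line:
            continue  # e.g. the dynamic loader line, which has no "=>"
        name, _, target = line.partition("=>")
        name = name.strip()
        target = target.strip()
        if target.startswith("not found"):
            resolved[name] = None
            continue
        # Drop the trailing load address, e.g. "(0x40021000)".
        path = target.split("(")[0].strip()
        if path:  # skip virtual objects that have no file on disk
            resolved[name] = path
    return resolved


def missing_libraries(executable):
    """Run `ldd` on an executable and list the libraries it could not resolve."""
    result = subprocess.run(
        ["ldd", executable], capture_output=True, text=True, check=False
    )
    libraries = parse_ldd_output(result.stdout)
    return [name for name, path in libraries.items() if path is None]
```

Calling `missing_libraries()` on the model executable in the slot directory would list any library the loader cannot find. An empty list only shows that the libraries are found, not that their versions are compatible.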

Is anyone else out there running a similar config on Linux?
Some info would be appreciated, or a way to stop being fed sulphur cycle models until the issue is sorted out.

cheers


ID: 19042
old_user3

Joined: 5 Aug 04
Posts: 173
Credit: 1,843,046
RAC: 0
Message 19045 - Posted: 5 Jan 2006, 13:33:24 UTC - in response to Message 19030.  
Last modified: 5 Jan 2006, 13:41:29 UTC

Mathe: thanks for the diagnosis. I'm looking into this now.
ID: 19045
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2169
Credit: 64,555,907
RAC: 5,858
Message 19046 - Posted: 5 Jan 2006, 14:09:42 UTC

Mathe,

For someone who is not a Linux geek, you did pretty well at figuring this out. Hope this is it and can be fixed in the next version. Good job.
ID: 19046
old_user21637

Joined: 28 Sep 04
Posts: 36
Credit: 268,150
RAC: 0
Message 19049 - Posted: 5 Jan 2006, 16:01:37 UTC - in response to Message 19046.  


Final confirmation that library versions were the problem:

SHAPE is now crunching on sulphur ;)

What I did was make a patched version of the sulphur_um executable that does not use the libraries from /lib, but rather a renamed copy of the old versions of the libraries (libm 2.5.2, libc 2.5.2 and pthread 0.10), which I took from SHADING as a stem cell transplant. This way, sulphur_um no longer relies on the libraries installed on the current machine, but on a local copy.

I did this by rudely patching the sulphur_um_22 ELF executable. I'm sure there are more elegant ways to do this in Linux, and this is rather in the style of cracking Windows computer games, but it's the best I could come up with. Also, I don't have root rights on any of these stations, so I could not install or uninstall any libraries.
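
The post does not say exactly how the executable was modified. One common way to do this kind of patch, shown here purely as an assumption, is to overwrite a library name stored in the file with another name of exactly the same length, so that no offsets inside the ELF file shift. The functions and the replacement name `libM.so.6` below are hypothetical.

```python
def patch_bytes(data, old, new):
    """Replace every occurrence of `old` with `new`, refusing to change the length."""
    if len(old) != len(new):
        raise ValueError("replacement must be exactly as long as the original")
    count = data.count(old)
    if count == 0:
        raise ValueError("original byte string not found")
    return data.replace(old, new), count


def patch_file(source_path, patched_path, old, new):
    """Write a patched copy of a binary file and return the number of replacements."""
    with open(source_path, "rb") as handle:
        data = handle.read()
    patched, count = patch_bytes(data, old, new)
    with open(patched_path, "wb") as handle:
        handle.write(patched)
    return count
```

A call such as `patch_file("sulphur_um_22", "sulphur_um_22.patched", b"libm.so.6", b"libM.so.6")` would write a modified copy and leave the original untouched. The copy still needs its executable permission set, and the dynamic loader must be able to find a library under the new name, which this sketch does not arrange.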

Actually, Geophi, I am a computer science student, but all my programming experience has been under Windows. I am only now discovering the marvelous world of Linux. Windows programmers dream of easy-to-use tools like strace for debugging their programs. Linux is really a wonderful thing!

Cheers,
Stefan.


ID: 19049
old_user21637

Joined: 28 Sep 04
Posts: 36
Credit: 268,150
RAC: 0
Message 19050 - Posted: 5 Jan 2006, 16:03:49 UTC - in response to Message 19049.  

Sorry, small typo in my previous message: it is pthread version 0.9 that I took from SHADING (the old one), obviously.

ID: 19050
old_user21637

Joined: 28 Sep 04
Posts: 36
Credit: 268,150
RAC: 0
Message 19052 - Posted: 5 Jan 2006, 17:04:06 UTC - in response to Message 19050.  


If anyone is interested in applying the patch (so that he/she doesn't have to "downgrade" his/her libraries in order to run CPDN), I have made it public. Follow the link below:

http://www.freemail.atlastelecom.ro/~msutcn/

Basically, this patch makes sulphur_um independent of the libraries you have installed on your system, so it should work the same on every Linux machine.

I tested it up to this point on two machines which were having the problem, and both of them are working now.

It is kind of clumsy, but seems to work. I hope it works for others who have the same problem. Please see readme.txt inside the archive for info on how to install it.

Warm regards,
Stefan.

ID: 19052
Helmer Bryd

Joined: 16 Aug 04
Posts: 148
Credit: 8,195,612
RAC: 17,013
Message 19054 - Posted: 5 Jan 2006, 17:26:04 UTC
Last modified: 5 Jan 2006, 18:09:53 UTC

I have Fedora Core 4 with glibc 2.3.5, which includes libc-2.3.5 (after weekly updating).
It works fine and shows 2.70 s/TS on this AMD XP, compared to around 3.50 before.

A 33% speedup !

A big beer to you Tolu :-)



ID: 19054
old_user21637

Joined: 28 Sep 04
Posts: 36
Credit: 268,150
RAC: 0
Message 19056 - Posted: 5 Jan 2006, 18:20:35 UTC - in response to Message 19054.  


Since your system is working fine, there are two possibilities:

* The incompatibility exists only with GLIBC 2.3.2 and was fixed again by GLIBC 2.3.5
* The bug only manifests itself in certain distributions

I think the ones at school are Red Hat, but I have to check, since I was doing all this remotely over SSH and I don't know which commands to use to identify the distribution exactly.

Unfortunately, I was unable to find any computer at school with a newer version to check this, and so far I have no root access to upgrade the libraries on any of these computers.
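
Telling the two possibilities apart means recording the glibc version on each machine. A minimal sketch of how that could be done without root access, assuming Python is available on the machine; the helper names are illustrative:

```python
import os
import platform
import re


def parse_version(text):
    """Extract a dotted version such as '2.3.2' and return it as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(0).split("."))


def glibc_version():
    """Return the running glibc version as a tuple, or None if it cannot be found."""
    try:
        # On glibc systems this returns a string such as "glibc 2.3.5".
        value = os.confstr("CS_GNU_LIBC_VERSION")
        if value:
            return parse_version(value)
    except (AttributeError, ValueError, OSError):
        pass
    # Fallback: platform.libc_ver() returns a (name, version) pair, possibly empty.
    _name, version = platform.libc_ver()
    return parse_version(version) if version else None
```

Because the versions are returned as tuples, they compare in the expected order, so results from several machines can be sorted or checked against a version of interest.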

Regards,
Stefan.

P.S. Do you guys know a Linux command to query the EXACT distribution of a Linux machine?
ID: 19056
old_user21637

Joined: 28 Sep 04
Posts: 36
Credit: 268,150
RAC: 0
Message 19057 - Posted: 5 Jan 2006, 18:58:12 UTC - in response to Message 19056.  

Found out how to identify the Linux distribution (look in the /etc directory).

So, it seems that all our machines here at the university are:

Red Hat Linux release 7.3 (Valhalla)

Thus, we know the bug affects Red Hat.
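
Many distributions keep a small text file in /etc, such as /etc/redhat-release, /etc/fedora-release or /etc/debian_version, whose first line names the release. A minimal sketch that collects these files, assuming those naming patterns; the function name is illustrative:

```python
import glob
import os


def read_release_files(etc_dir="/etc"):
    """Return {filename: first line} for each release or version file in etc_dir."""
    found = {}
    for pattern in ("*-release", "*_version"):
        for path in sorted(glob.glob(os.path.join(etc_dir, pattern))):
            if not os.path.isfile(path):
                continue
            try:
                with open(path, "r", errors="replace") as handle:
                    first_line = handle.readline().strip()
            except OSError:
                continue  # unreadable file: skip it rather than fail
            found[os.path.basename(path)] = first_line
    return found
```

Calling `read_release_files()` over an SSH session returns a dictionary keyed by file name, so it works without knowing in advance which distribution the machine runs.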

Since both Geophi and cwhyl are using Fedora Core without any problems at all, I believe it is highly probable that the library version bug does not affect FC and perhaps only affects Red Hat (Steve also said earlier that he was using Red Hat and had the same problem as me).

Well, we are slowly closing in on this nasty bug :) I hope our efforts will help Tolu reproduce the error and thus correct it.

Warm regards,
Stefan.


ID: 19057
old_user16868

Joined: 12 Sep 04
Posts: 7
Credit: 515,736
RAC: 0
Message 19070 - Posted: 6 Jan 2006, 5:23:52 UTC - in response to Message 19052.  


If anyone is interested in applying the patch (so that he/she doesn't have to "downgrade" his/her libraries in order to run CPDN), I have made it public. Follow the link below:

http://www.freemail.atlastelecom.ro/~msutcn/

Basically, this patch makes sulphur_um independent of the libraries you have installed on your system, so it should work the same on every Linux machine.

I tested it up to this point on two machines which were having the problem, and both of them are working now.

It is kind of clumsy, but seems to work. I hope it works for others who have the same problem. Please see readme.txt inside the archive for info on how to install it.

Warm regards,
Stefan.


I have installed it as described at http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_reply.php?thread=3822&post=19052&helpdesk=1#19042 and now have it running on all my Red Hat boxes.

It works... Yahoo!!! Well done, Stefan.

Where can I get the source that you used? I would like to compile it for testing and optimizing.

cheers

Steve R

ID: 19070
Helmer Bryd

Joined: 16 Aug 04
Posts: 148
Credit: 8,195,612
RAC: 17,013
Message 19078 - Posted: 6 Jan 2006, 11:29:50 UTC

Got a Fedora C2 running, it has libc 2.3.3
ID: 19078
old_user21637

Joined: 28 Sep 04
Posts: 36
Credit: 268,150
RAC: 0
Message 19147 - Posted: 10 Jan 2006, 17:36:31 UTC - in response to Message 19078.  


On some systems there seems to be a daemon that deletes files from the temporary directory. As you may have seen, the patch I posted used the temporary directory to store the older library files. If you have had any problems (model errors), it was likely because, as soon as BOINC stopped sulphur for the first time (say, to run some benchmarks), the daemon deleted its libraries from the temporary directory.

I have updated the patch now so that it no longer uses the temporary directory, but neatly stores the libraries inside the project directory. This is also great for backups, since a backup of the project directory now contains everything needed to run it (as it was before the patch).
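
A minimal sketch of keeping the library copies inside the project directory rather than in a temporary directory. The actual layout used by the patch is described in its readme, not here, so the subdirectory name `oldlibs` and the function name are assumptions for illustration only.

```python
import os
import shutil


def store_libraries(library_paths, project_dir, subdir="oldlibs"):
    """Copy each library file into project_dir/subdir and return the new paths."""
    target_dir = os.path.join(project_dir, subdir)
    os.makedirs(target_dir, exist_ok=True)
    copied = []
    for source in library_paths:
        destination = os.path.join(target_dir, os.path.basename(source))
        # copy2 also preserves permission bits and timestamps.
        shutil.copy2(source, destination)
        copied.append(destination)
    return copied
```

Because the copies live under the project directory, a cleanup daemon that only sweeps the temporary directory leaves them alone, and a backup of the project directory includes them.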

So, if you encountered any of the problems above, please update from the address:

http://www.freemail.atlastelecom.ro/~msutcn/

Note: only the patch for version 4.23 has been updated so far.

Cheers,
Stefan.

ID: 19147
old_user1132

Joined: 25 Aug 04
Posts: 28
Credit: 6,522,252
RAC: 0
Message 19160 - Posted: 10 Jan 2006, 22:44:29 UTC
Last modified: 10 Jan 2006, 22:45:16 UTC

Linux V4.23 for Sulphur Cycle also gives 20-25% speed up on AMD X2 processors.

Andrew

CPDNforum: http://cpdnforum.info
ID: 19160

