help debugging

charles
Joined: 8 Jan 07
Posts: 6
Credit: 1,946,132
RAC: 0
Message 63935 - Posted: 5 May 2021, 4:44:20 UTC

Hi,

I was pleasantly surprised to see that your website platform has been updated and is finally current, which was not the case in 2018. So I wanted to run some tasks on my Fedora 34 machine, but they all seem to end in an error at some point, generally after several days of computation:

Signal 15 received: Software termination signal from kill 
Signal 15 received: Abnormal termination triggered by abort call
Signal 15 received, exiting...
SIGSEGV: segmentation violation
Stack trace (21 frames):
../../projects/climateprediction.net/hadam4_8.52_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x80d4cf7]
linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7eee570]
/lib/libc.so.6(getenv+0x9a)[0xf7a09e5a]
/lib/libc.so.6(+0xab5d9)[0xf7a7e5d9]
/lib/libc.so.6(+0xab97e)[0xf7a7e97e]
/lib/libc.so.6(localtime_r+0x1b)[0xf7a7c62b]
../../projects/climateprediction.net/hadam4_8.52_i686-pc-linux-gnu[0x80d01b2]
../../projects/climateprediction.net/hadam4_8.52_i686-pc-linux-gnu[0x80d0900]
../../projects/climateprediction.net/hadam4_8.52_i686-pc-linux-gnu[0x80d09f1]
linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7eee570]
linux-gate.so.1(__kernel_vsyscall+0x9)[0xf7eee559]
/lib/libc.so.6(+0xb5e4d)[0xf7a88e4d]
/lib/libc.so.6(nanosleep+0x5d)[0xf7a8f4fd]
/lib/libc.so.6(usleep+0x45)[0xf7ac47a5]
../../projects/climateprediction.net/hadam4_8.52_i686-pc-linux-gnu[0x80e78a5]
../../projects/climateprediction.net/hadam4_8.52_i686-pc-linux-gnu[0x804fca4]
../../projects/climateprediction.net/hadam4_8.52_i686-pc-linux-gnu[0x80503d7]
../../projects/climateprediction.net/hadam4_8.52_i686-pc-linux-gnu[0x8051b13]
../../projects/climateprediction.net/hadam4_8.52_i686-pc-linux-gnu[0x8051d8b]
/lib/libc.so.6(__libc_start_main+0xed)[0xf79f19dd]
../../projects/climateprediction.net/hadam4_8.52_i686-pc-linux-gnu[0x804cd21]
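For readers who want to work with a trace like the one above, here is a minimal Python sketch that splits each frame into module, symbol, and address. The regular expression and the helper names (`FRAME_RE`, `parse_frame`, `frames_per_module`) are assumptions made for illustration. They are not part of BOINC or the CPDN application, and the pattern is based only on the line shapes visible in this post.

```python
import re
from collections import Counter

# Hypothetical pattern, derived only from the frame lines shown in this thread:
#   /lib/libc.so.6(getenv+0x9a)[0xf7a09e5a]
#   ../../projects/.../hadam4_8.52_i686-pc-linux-gnu[0x80d01b2]
FRAME_RE = re.compile(
    r"^(?P<module>[^(\[]+)"            # path of the binary or shared object
    r"(?:\((?P<symbol>[^)]*)\))?"      # optional "(symbol+offset)" part
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"  # return address in hex
)


def parse_frame(line):
    """Return a dict for one frame line, or None if the line does not match."""
    match = FRAME_RE.match(line.strip())
    if match is None:
        return None
    return {
        "module": match["module"],
        "symbol": match["symbol"] or None,  # None when no symbol was printed
        "address": int(match["address"], 16),
    }


def frames_per_module(lines):
    """Count how many parsed frames belong to each module."""
    frames = (parse_frame(line) for line in lines)
    return Counter(frame["module"] for frame in frames if frame is not None)


# Three frames copied from the trace above.
SAMPLE = [
    "/lib/libc.so.6(getenv+0x9a)[0xf7a09e5a]",
    "/lib/libc.so.6(localtime_r+0x1b)[0xf7a7c62b]",
    "../../projects/climateprediction.net/hadam4_8.52_i686-pc-linux-gnu[0x80d01b2]",
]

if __name__ == "__main__":
    for module, count in frames_per_module(SAMPLE).items():
        print(count, module)
```

Frames without a symbol name, such as the bare hadam4 addresses, can only be mapped back to functions if a build of the binary with debug symbols is available.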


I've installed the libraries recommended for Fedora, of course. So please tell me what I should look for.
Best regards
ID: 63935

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,378,503
RAC: 3,632
Message 63936 - Posted: 5 May 2021, 6:04:43 UTC

Hi Charles, the first thing I would do is go to Options > Computing preferences > Disk and memory and ensure that "Leave non-GPU tasks in memory while suspended" is ticked. It is possible you are just having an unlucky run of tasks, but as the copies that have already failed on other machines did so because of missing libraries, it is impossible to tell. The other thing to check is whether a particular program was running, starting up, or doing something intensive at the times when the tasks failed.

This may not yield anything but would be useful to eliminate from the equation.
ID: 63936

charles
Joined: 8 Jan 07
Posts: 6
Credit: 1,946,132
RAC: 0
Message 63938 - Posted: 5 May 2021, 12:29:49 UTC - in response to Message 63936.  

Thanks for the tip.
Your question is going to be a tough one to answer because I always have virtual machines running, or some browsers, or maybe other things, so it really depends. From the little I know, a segmentation fault is often due to a pointer that does not point to memory owned by the program itself, so I was going to ask more about a compatibility issue with other software or something similar. Also, I always have 3 BOINC tasks running on this machine.
ID: 63938

Bryn Mawr
Joined: 28 Jul 19
Posts: 147
Credit: 12,814,088
RAC: 261,385
Message 63939 - Posted: 5 May 2021, 12:38:48 UTC - in response to Message 63938.  

> Thanks for the tip.
> Your question is going to be a tough one to answer because I always have virtual machines running, or some browsers, or maybe other things, so it really depends. From the little I know, a segmentation fault is often due to a pointer that does not point to memory owned by the program itself, so I was going to ask more about a compatibility issue with other software or something similar. Also, I always have 3 BOINC tasks running on this machine.


The invalid pointer can be due to a software fault but it can also be a sign of a hardware error.

Given that we do not have a mass of users reporting these errors I would check the hardware first.
ID: 63939

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,378,503
RAC: 3,632
Message 63940 - Posted: 5 May 2021, 16:17:33 UTC - in response to Message 63939.  

> Given that we do not have a mass of users reporting these errors I would check the hardware first.


However, when looking at failed tasks I often find that a number of users get this error with the same task. The trouble with these particular ones is that the other failures on these tasks have all been due to missing libraries, making it impossible to check.
ID: 63940

klepel
Joined: 9 Oct 04
Posts: 76
Credit: 67,812,914
RAC: 5,809
Message 63942 - Posted: 5 May 2021, 19:16:07 UTC - in response to Message 63940.  

> SIGSEGV: segmentation violation

Isn't it an indication of overclocking the CPU and/or RAM too far? I often see SIGSEGV errors when a computer is pushed too hard. If you are not at stock settings, I would try dialing back a little.
ID: 63942

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,378,503
RAC: 3,632
Message 63944 - Posted: 5 May 2021, 19:38:47 UTC - in response to Message 63942.  

> SIGSEGV: segmentation violation
>
> Isn't it an indication of overclocking the CPU and/or RAM too far? I often see SIGSEGV errors when a computer is pushed too hard. If you are not at stock settings, I would try dialing back a little.


Sometimes it is, but at other times all attempts at a task from a work unit fail with the same error at the same point. If it were always the same cause, we would have it easy!
ID: 63944

charles
Joined: 8 Jan 07
Posts: 6
Credit: 1,946,132
RAC: 0
Message 63953 - Posted: 7 May 2021, 16:01:13 UTC - in response to Message 63944.  

Well, this is the only thing that fails. I run a lot of other tasks that also use a lot of CPU, and I have no complaints from those.
If I'm not mistaken, these tasks have never used the GPU, so that can't be it.
If I had RAM errors, I would already have a lot of problems with other projects such as World Community Grid, LHC, or Einstein.
The same goes for the CPU.
And the task that has now been running for 4 days would not be able to continue under the same conditions: hadam4h_21f0_209905_5_903_012081295
So if it were hardware, I am confident I would have other symptoms.
So even if the other hosts lacked the libraries, I am more inclined to believe that there is a problem with the task itself.
To be certain, the team should do a run of the same tasks.
ID: 63953

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 63954 - Posted: 7 May 2021, 16:47:12 UTC - in response to Message 63953.  

1. If it was libs, then they'd fail at about 6 seconds.

2. You have far too many interruptions to each model.
In the computing prefs on your account, set "Suspend when non-BOINC CPU usage is above" to 100%.

3. The n216 models like LOTS of L3 cache. This was discussed early last year somewhere.
Minimum seems to be 4 Megs per model.
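As a rough illustration of point 3, the following Python sketch estimates how many models a given L3 cache could hold under the 4 MB per model rule of thumb quoted above. The function name `max_models_for_l3` and the cache sizes in the usage note are assumptions for illustration only. The figure is an observed minimum reported on this forum, not a documented requirement.

```python
def max_models_for_l3(l3_cache_mb, mb_per_model=4.0):
    """Estimate how many models fit in the L3 cache.

    mb_per_model defaults to the 4 MB rule of thumb from the post above;
    treat the result as a rough upper bound, not a guarantee.
    """
    if l3_cache_mb < 0:
        raise ValueError("l3_cache_mb must not be negative")
    if mb_per_model <= 0:
        raise ValueError("mb_per_model must be positive")
    # Floor division: a partially fitting model does not count.
    return int(l3_cache_mb // mb_per_model)
```

For example, under these assumptions a hypothetical CPU with 8 MB of L3 cache would be limited to two such models at once, and one with 16 MB to four.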
ID: 63954

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,378,503
RAC: 3,632
Message 63955 - Posted: 7 May 2021, 16:49:06 UTC - in response to Message 63953.  

> To be certain the team should do a run of the same tasks


The tasks get up to five attempts. If it were not for the crashes due to missing libraries (which always make the task crash within the first five seconds), we would have more information on this. With luck, some of the tasks will get another chance on a machine that has the 32-bit libraries, so we can see whether they fail on all machines or not. For whatever reason, in the past we have had some tasks that complete on AMD but not on Intel, and vice versa.

If the answer was simple we would have solved it a long time ago!
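The distinction made in this thread between very early failures and late ones can be sketched in Python as below. Everything here is an assumption for illustration: the names `classify_failure` and `attempts_remaining`, and the 10-second cutoff, which is simply a margin above the "five or six seconds" mentioned in the posts. Run time alone can only hint at a cause, it cannot confirm one.

```python
MAX_ATTEMPTS = 5  # "The tasks get up to five attempts" (from the post above)


def classify_failure(elapsed_seconds, early_cutoff_seconds=10.0):
    """Label a failed attempt as 'early' or 'late' based on run time alone.

    'early' is merely consistent with the missing-library crashes described
    in the thread; 'late' says nothing about the actual cause.
    """
    if elapsed_seconds < 0:
        raise ValueError("elapsed_seconds must not be negative")
    return "early" if elapsed_seconds <= early_cutoff_seconds else "late"


def attempts_remaining(attempts_so_far, max_attempts=MAX_ATTEMPTS):
    """Return how many more attempts a work unit could still receive."""
    return max(0, max_attempts - attempts_so_far)
```

Under these assumptions, a failure after 4 seconds would be labelled "early", while one after several days of computation, as in the original report, would be labelled "late".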
ID: 63955

charles
Joined: 8 Jan 07
Posts: 6
Credit: 1,946,132
RAC: 0
Message 63956 - Posted: 7 May 2021, 18:16:33 UTC - in response to Message 63955.  

I'm sure you would. But as you've seen and as you've stated, the tasks didn't just crash in the first 5 seconds; they ran far longer than that and then suddenly ended in a computation error.
What I meant by saying the team should do a run is that they should run the task under an industrial-grade debugger and let it run, so that you would have the information we need.
From my experience, we usually don't have enough information in the logs of the tasks themselves or in the BOINC logs. It's the same problem with LHC@home and other projects that have been running into computation errors.
https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/ shows an industrial-grade debugger used in this way. Then we would know exactly which function the task was running before dying and which variables it touched. That's what I meant.
ID: 63956

charles
Joined: 8 Jan 07
Posts: 6
Credit: 1,946,132
RAC: 0
Message 63957 - Posted: 7 May 2021, 18:20:59 UTC - in response to Message 63954.  
Last modified: 7 May 2021, 18:21:35 UTC

> 1. If it was libs, then they'd fail at about 6 seconds.
>
> 2. You have far too many interruptions to each model.
> In the computing prefs on your account, set "Suspend when non-BOINC CPU usage is above" to 100%.
>
> 3. The n216 models like LOTS of L3 cache. This was discussed early last year somewhere.
> Minimum seems to be 4 Megs per model.


I can't do this, unfortunately. I'm not using my hosts for amusement, and debugging BOINC models like this, as I've discussed a zillion times in the past on other projects, does not solve anything, because it just shows a lack of programming skill in the model.
ID: 63957

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 63958 - Posted: 7 May 2021, 20:15:46 UTC - in response to Message 63957.  

There's nothing wrong with the models.
It's the way that you're trying to run them.

So you'll just have to keep letting them fail.
ID: 63958

charles
Joined: 8 Jan 07
Posts: 6
Credit: 1,946,132
RAC: 0
Message 63982 - Posted: 23 May 2021, 17:08:57 UTC - in response to Message 63958.  

> There's nothing wrong with the models.
> It's the way that you're trying to run them.
>
> So you'll just have to keep letting them fail.

Well, let's agree to disagree. That's always the same response from devs who can't debug their app correctly. And as a dev myself I know that pretty well, since I need the devs on my team to turn their tongues seven times in their mouths before saying something that stupid. I'm sorry to put this undiplomatically, but this is just the reality of the field, and there are plenty of articles about it from people who have actually done some thinking about it, including some research papers, such as those from Edwin Zaccai.

Anyway, the following work units
https://www.cpdn.org/workunit.php?wuid=11937549
https://www.cpdn.org/workunit.php?wuid=11938554
https://www.cpdn.org/workunit.php?wuid=12081295

have all completed successfully. As they each ran for more than 10 days in the same environment, with the same workload, the same CPU pausing behaviour, and various upgrades along the way, this is actual proof that there was at least something wrong with the earlier tasks, and so maybe something wrong with the model if tasks are all created the same way. This means there are some inconsistencies that the model doesn't handle correctly and that end in an error.
ID: 63982

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 63983 - Posted: 24 May 2021, 5:48:13 UTC - in response to Message 63954.  

> 3. The n216 models like LOTS of L3 cache. This was discussed early last year somewhere.
> Minimum seems to be 4 Megs per model.


It seems to me that if you run "too many" simultaneous N216 models for your L3 cache, the only problem would be that the tasks run slower due to L3 cache misses. It would not crash anything, just slow you down.

BTW, my N216 models want about 1.3 GBytes of RAM until very near the end, when they want about 1.4 GBytes.
ID: 63983

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,378,503
RAC: 3,632
Message 63984 - Posted: 24 May 2021, 6:54:23 UTC
Last modified: 24 May 2021, 6:57:16 UTC

> BTW, my N216 models want about 1.3 GBytes of RAM until very near the end, when they want about 1.4 GBytes.


With 47 GB of RAM, that isn't an issue in this case unless there is a lot of very memory-intensive stuff going on alongside BOINC.
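The arithmetic behind that statement can be sketched in Python as follows. The function name `max_models_for_ram` and the `reserved_gb` parameter are assumptions for illustration. The 1.4 GB peak per model and the 47 GB total are the figures quoted in this thread.

```python
import math


def max_models_for_ram(total_ram_gb, peak_gb_per_model=1.4, reserved_gb=0.0):
    """Estimate how many models fit in RAM at their reported peak usage.

    reserved_gb is a hypothetical allowance for everything else on the
    machine (virtual machines, browsers, other BOINC projects).
    """
    if peak_gb_per_model <= 0:
        raise ValueError("peak_gb_per_model must be positive")
    usable_gb = total_ram_gb - reserved_gb
    if usable_gb <= 0:
        return 0
    return math.floor(usable_gb / peak_gb_per_model)
```

With 47 GB and nothing reserved, this gives room for 33 models at peak usage, far more than the 3 tasks running on the host in question, so memory capacity alone looks unlikely to be the limit here.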

As to whether there is anything wrong with the code, I would say, "Yes, there is." In an ideal world, if the machine were switched off during a disk write, the task would simply roll back to the previous checkpoint. The code would also cope better with issues such as a file being locked while an antivirus program is looking at it (an issue mainly affecting Windows work, when there is any) and with other programs suddenly grabbing the resources used by the climate model in question. (The code was originally written for the Met Office supercomputers, where resources being grabbed by a game or video rendering is not a concern.)
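One common way to get the "roll back to the previous checkpoint" behaviour described above is to write the new checkpoint to a temporary file and only then rename it over the old one. The Python sketch below shows that general pattern. It is not how the Met Office or CPDN code works, and the function name `write_checkpoint_atomically` and the `.ckpt-` prefix are hypothetical.

```python
import os
import tempfile


def write_checkpoint_atomically(path, data):
    """Replace the checkpoint at `path` with `data` (bytes) in one step.

    The old checkpoint stays untouched until the new file is fully written
    and flushed, so an interruption mid-write leaves the previous
    checkpoint usable.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (same filesystem)
    # as the target so that the final rename can be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # swap in the new checkpoint
    except BaseException:
        # Clean up the partial temporary file; the old checkpoint remains.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

The design choice here is that a reader only ever sees either the complete old checkpoint or the complete new one, never a half-written file.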

The trouble is that most of the code (over a million lines of Fortran) is not open source and is licensed to CPDN by the Met Office. If the issue is in that code, as opposed to the code that interfaces between it and BOINC, there is virtually nothing those in Oxford can do about it. When Open IFS finally makes it to the main site, that code will be open source, and in theory at least some of those issues may get resolved, at least for that type of task. However, I am not holding my breath and don't advise anyone else to do so either!

The bottom line is that to get a decent return rate, we have to follow the various tips suggested in these forums. My current success rate is just over 95%. I could probably get that higher still if I always waited till all tasks were completed before rebooting to apply kernel updates.
ID: 63984