Exit status 193 compute error

old_user701677

Joined: 17 Aug 13
Posts: 6
Credit: 2,378
RAC: 0
Message 48155 - Posted: 12 Feb 2014, 16:12:34 UTC

Hello,

I've been meaning to address this issue for quite some time, but I've only just found the time for it.

I'm running hadcm3n workunits on Linux Mint, and every one of them crashes unexpectedly somewhere between 2% and 10%, giving exit status 193.

Google hasn't been very helpful in diagnosing this. The only helpful thing I've been able to find about this exit code is in a list of Windows system error codes where it says that the application is not a valid Win32 application (but I don't need it to be, I'm running on linux!).

stderr looks normal, I guess, apart from exiting with signal 3, which is SIGQUIT, but again, that doesn't tell me much.
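For anyone wanting to check a signal number from stderr, Python's standard `signal` module can translate it into a name. This is only a minimal sketch: the helper name `describe_signal` is made up for illustration, and the mapping shown is the one Linux uses.

```python
import signal


def describe_signal(number: int) -> str:
    """Return the symbolic name of a signal number on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # Not a signal number this platform knows about.
        return f"unknown signal {number}"


print(describe_signal(3))  # SIGQUIT on Linux
```

The name alone says how the process was stopped, not why, so it narrows things down rather than diagnosing the crash.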

Could you please look into this and tell me what's wrong? Is this standard behaviour for these wus? 98% of all tasks from other projects finish without errors, but tasks from climateprediction always crash. I'd like them to finish cleanly.
ID: 48155
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 48156 - Posted: 12 Feb 2014, 17:00:28 UTC - in response to Message 48155.  
Last modified: 12 Feb 2014, 17:01:03 UTC

Clicking on the + symbol next to stderr shows lots of
Suspended CPDN Monitor - Suspend request from BOINC... messages

This makes me wonder about your BOINC settings. Ensure that, under the Tools menu, "while computer is in use" is ticked. The CPDN models don't like lots of starting and stopping. It may also be worth suspending computation before doing anything very processor-intensive, as that can throw a spanner in the works.

I seem to remember reading that the exit codes have different meanings under Linux.
ID: 48156
old_user701677

Joined: 17 Aug 13
Posts: 6
Credit: 2,378
RAC: 0
Message 48159 - Posted: 13 Feb 2014, 17:15:43 UTC - in response to Message 48156.  
Last modified: 13 Feb 2014, 17:27:32 UTC

Thanks for replying. I've checked, and computing is allowed while the computer is in use. In case the other settings matter: computing is allowed while processor usage is below 70%, and BOINC is set to use at most 60% of CPU time (an old 90 nm CPU heats up fast if BOINC is left at the default 100% CPU usage).

I've set it that way because I don't have a lot of cpu intensive apps, and I suspend computations before using flash or java sockets or anything that is CPU intensive, so no worries here.

I always thought that what we're seeing in stderr comes not from unusually many starts and stops, but from the fact that the running time of CPDN tasks is very long. Usually the deadline is 3 months away, and each of those tasks ran for a minimum of a week before it crashed, so I don't think there are that many stops; rather, they are spread over a long timespan.

About that exit code, yes, I said that it should have a different meaning under linux, but I can't find it...
ID: 48159
old_user701677

Joined: 17 Aug 13
Posts: 6
Credit: 2,378
RAC: 0
Message 48160 - Posted: 13 Feb 2014, 17:24:19 UTC - in response to Message 48156.  
Last modified: 13 Feb 2014, 18:04:50 UTC

Just found this problem has been discussed here before:
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=7306#43193

Dagorath defines the error there. It seems that the task is writing to non-permitted memory zones, or there's a problem with the RAM or hard drive. It's been a while since I ran a memtest, but I have no reason to suspect that either of them is faulty right now.

It's important to note that most people there were talking about failure-to-success ratios, but none of my CPDN tasks have gotten past 10%.

I've just upgraded both the boinc version and the distro. If the problems persist, I'll try Greg's suggestions.
ID: 48160
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 48161 - Posted: 13 Feb 2014, 19:32:52 UTC

There are just way too many suspend requests for only 1 to 3 trickles being returned. It could be that given the speed of the computer, the CPU usage setting, and the number of projects it is attached to, there are just too many opportunities for something bad to happen at key times.
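To put a number on "too many", the suspend messages in a task's stderr can simply be counted. A minimal sketch, assuming the stderr text has been saved or pasted into a string; the function name and the sample excerpt are made up for illustration, and only the marker text comes from the messages quoted earlier in this thread.

```python
def count_suspend_requests(stderr_text: str) -> int:
    """Count stderr lines that mention a suspend request from BOINC."""
    marker = "Suspend request from BOINC"
    return sum(1 for line in stderr_text.splitlines() if marker in line)


# Made-up excerpt, for illustration only.
sample = """\
Suspended CPDN Monitor - Suspend request from BOINC...
(some other stderr line)
Suspended CPDN Monitor - Suspend request from BOINC...
"""

print(count_suspend_requests(sample))  # 2
```

Comparing that count with the number of trickles returned gives a rough idea of how often the model was interrupted per unit of progress.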
ID: 48161
old_user701677

Joined: 17 Aug 13
Posts: 6
Credit: 2,378
RAC: 0
Message 48162 - Posted: 13 Feb 2014, 20:10:20 UTC - in response to Message 48161.  
Last modified: 13 Feb 2014, 20:26:02 UTC

How does the CPU usage setting affect it? Sure, it's 60%, so BOINC must pause the calculations sometimes to maintain that average usage, but the pause is likely done through the usual pre-emption: it probably saves the context, then restores it at a later time. As long as it has enough memory (and it does), that shouldn't be a problem. The OS does that all the time when a program's time on the CPU expires and another program needs to run.

Besides, some of those failures happened when the system was idle and BOINC had 100% of CPU time to itself, so that doesn't hold, imho.
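A rough back-of-the-envelope for how many pauses a CPU limit implies can be sketched as below. The 10-second throttling cycle is purely an assumption for illustration (the thread does not say what interval BOINC uses), and `throttle_cycles` is a made-up helper, not a BOINC function.

```python
def throttle_cycles(cpu_fraction: float, cycle_seconds: float, wall_seconds: float):
    """Return (run seconds per cycle, pause seconds per cycle, number of pauses)."""
    if not 0 < cpu_fraction <= 1:
        raise ValueError("cpu_fraction must be in (0, 1]")
    run = cpu_fraction * cycle_seconds
    pause = cycle_seconds - run
    # One pause per cycle, unless the task is allowed to run all the time.
    pauses = 0 if pause == 0 else int(wall_seconds // cycle_seconds)
    return run, pause, pauses


WEEK = 7 * 24 * 3600  # seconds of wall-clock time
run, pause, pauses = throttle_cycles(0.60, 10.0, WEEK)
print(run, pause, pauses)
```

With these assumed numbers, a week of wall-clock time works out to 60,480 pauses, which is why even a modest limit can produce a long list of suspend requests.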

Given that other Linux (and some Windows) users have experienced the same error, has any developer looked into the code so we can definitely rule out a bug in the application itself? As I said before, 98% of all other tasks from all other projects don't segfault.
ID: 48162
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 48163 - Posted: 13 Feb 2014, 21:11:48 UTC - in response to Message 48162.  

If the pausing happens while the model is checkpointing, then part of the saved values will be current, and some will be from the previous checkpoint.
So the checkpoint will be corrupted.
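As a generic illustration of how software can guard against a half-written checkpoint (this is not the climate model's code, just a common write-then-rename technique), the new state can be written to a temporary file and swapped into place in one step. The function names and the JSON format are assumptions made for the sketch.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write state to path so a reader sees either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before the rename
        os.replace(tmp_path, path)  # atomic swap on POSIX filesystems
    except BaseException:
        # Remove the partial temp file; the previous checkpoint is untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_checkpoint(path: str) -> dict:
    """Read back a checkpoint written by save_checkpoint."""
    with open(path) as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "model.ckpt")
    save_checkpoint(ckpt, {"timestep": 1})
    save_checkpoint(ckpt, {"timestep": 2})
    print(load_checkpoint(ckpt))
```

If the process is interrupted before `os.replace`, the previous checkpoint file is still complete; a program that overwrites its checkpoint in place has no such guarantee.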

************

has any developer looked into the code

The code for the climate models belongs to the UK Met Office. It's been developed over many years by many climatologists and software engineers, to run on their supercomputers. It was never intended to be frequently stopped and started while running.

It's also said to be close to a million lines of source code. And the project people don't have the source code for the main program, just the auxiliary programs.

The people whose work is being run here are external to Oxford. They work in climate centres in various places around the world.
ID: 48163
old_user701677

Joined: 17 Aug 13
Posts: 6
Credit: 2,378
RAC: 0
Message 48164 - Posted: 13 Feb 2014, 21:51:15 UTC - in response to Message 48163.  
Last modified: 13 Feb 2014, 22:06:08 UTC

As for the tasks not being designed to be frequently started and stopped: establishing which hardware and operating systems your application can be adapted to run on is probably part of the initial design.

Frequent starting and stopping comes with the design of the BOINC system, and all projects are adapted to that. Since the 8th of August, I've had only 4 or 5 tasks unrelated to climate prediction crash. Moreover, I've set the upper limit of 70% CPU usage by non-BOINC software specifically so that the tasks won't get suspended frequently. By the way, most of those pauses aren't reported by the GUI, because I don't see "CPU busy" 10 times per hour.

I know this problem doesn't happen to a lot of users, but you can analyse whom it happens to. If you could filter the results that gave computation errors by cause, you might notice some commonalities, either in hardware or in settings, among the clients whose tasks have crashed. Then you could give some general guidelines (have CPU usage at x%, and so on). I can't experiment enough with that myself, because hadcm3n estimates it needs 997 hours to complete, so I don't have enough data points to draw any statistical conclusion, but you do.
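The kind of filtering meant here could be sketched roughly as follows. Everything in it is hypothetical: the record fields (`cpu_limit`, `outcome`), the function name, and the sample records are invented for illustration and do not describe the project's actual database or real results.

```python
from collections import Counter


def failure_rate_by(records, key):
    """Return {value of key: fraction of tasks that errored} for each group."""
    totals, errors = Counter(), Counter()
    for rec in records:
        totals[rec[key]] += 1
        if rec["outcome"] == "error":
            errors[rec[key]] += 1
    return {value: errors[value] / totals[value] for value in totals}


# Entirely made-up records, for illustration only.
records = [
    {"cpu_limit": 100, "outcome": "success"},
    {"cpu_limit": 100, "outcome": "success"},
    {"cpu_limit": 60, "outcome": "error"},
    {"cpu_limit": 60, "outcome": "success"},
]

print(failure_rate_by(records, "cpu_limit"))
```

Grouping by other fields (operating system, CPU model, number of attached projects) would work the same way, given enough results per group to be meaningful.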
ID: 48164
old_user701677

Joined: 17 Aug 13
Posts: 6
Credit: 2,378
RAC: 0
Message 48165 - Posted: 13 Feb 2014, 21:55:40 UTC - in response to Message 48163.  

And I imagine that the best climatologists and software engineers have been working on that code for years, but a pointer that points where it shouldn't in a very, very rare use case is enough to ruin otherwise brilliant work. I'm not saying that's definitely the case, but it can happen.
ID: 48165
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 48166 - Posted: 13 Feb 2014, 22:33:45 UTC - in response to Message 48164.  

... then you could give some general guidelines ...
We have.

As for a survey, it's been noted many times that those having problems are the ones NOT allowing BOINC to run without restriction.

There are far more computers than are really needed for this project, so a few computers not being able to complete models doesn't matter. They will just be re-issued, and sooner or later will be run by a computer that doesn't have a high crash record.

The job of the two project people is to keep the servers running, and to produce results for the researchers. This is happening, with lots of computers to spare.
ID: 48166
