Posts by old_user701677

1) Questions and Answers : Unix/Linux : Exit status 193 compute error (Message 48165)
Posted 13 Feb 2014 by old_user701677
Post:
And I imagine that the best climatologists and software engineers have been working on that code for years, but a pointer that points where it shouldn't in a very, very rare use case is enough to ruin otherwise brilliant work. I'm not saying that's definitely the case, but it can happen.
2) Questions and Answers : Unix/Linux : Exit status 193 compute error (Message 48164)
Posted 13 Feb 2014 by old_user701677
Post:
Regarding the tasks not being designed to be frequently started and stopped: establishing which hardware and operating systems your application can be adapted to run on is probably part of the initial design.

Frequent starting and stopping comes with the design of the boinc system, and all projects are adapted to that. Since the 8th of August, I've only had 4 or 5 tasks unrelated to climate prediction crash. Moreover, I've set the upper limit of 70% cpu usage by non-boinc software specifically so the tasks won't get suspended frequently. By the way, most of those pauses aren't reported by the GUI, because I don't see "CPU busy" 10 times per hour.

I know this problem doesn't happen to a lot of users, but you can analyse whom it happens to. If you could filter the results that gave computation errors by cause, you might notice some commonalities, either in hardware or in settings, among the clients whose tasks have crashed. Then you could give some general guidelines (have cpu usage at x%, and so on). I can't experiment enough with that myself because hadcm3n estimates it needs 997 hours to complete, so I don't have enough data points to draw any statistical conclusion, but you do.
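
To illustrate the kind of filtering I mean, here is a minimal Python sketch. The record fields (`os`, `cpu_limit`, `outcome`) and the sample rows are made up for illustration; I don't know what the project's results database actually stores.

```python
from collections import Counter


def failure_rate_by(results, key):
    """Return {value of key: fraction of results with that value that errored}."""
    totals = Counter()
    failures = Counter()
    for record in results:
        totals[record[key]] += 1
        if record["outcome"] == "error":
            failures[record[key]] += 1
    return {value: failures[value] / totals[value] for value in totals}


# Hypothetical sample rows, not real project data.
sample = [
    {"os": "linux", "cpu_limit": 60, "outcome": "error"},
    {"os": "linux", "cpu_limit": 100, "outcome": "success"},
    {"os": "windows", "cpu_limit": 60, "outcome": "error"},
    {"os": "windows", "cpu_limit": 100, "outcome": "success"},
]

print(failure_rate_by(sample, "os"))
print(failure_rate_by(sample, "cpu_limit"))
```

In this invented sample the failure rate is the same for both operating systems but differs sharply by CPU limit, which is the sort of pattern that would point at a setting rather than a platform.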
3) Questions and Answers : Unix/Linux : Exit status 193 compute error (Message 48162)
Posted 13 Feb 2014 by old_user701677
Post:
How does the CPU usage setting affect it? Sure, it's 60%, so it must pause the calculations sometimes to maintain this average usage, but the pause is likely done through the usual pre-empting: it probably saves the context, then restores it at a later time. As long as it has enough memory (and it has), that shouldn't be a problem. The OS does that all the time when a program's time on the cpu expires and another program needs to run.
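
As a rough sketch of what a 60% limit implies, assume the client enforces it by alternating run and pause intervals within a fixed period. The 10-second period below is a made-up value for illustration, not something I've confirmed from the boinc source.

```python
def throttle_schedule(cpu_limit_percent, period_seconds=10.0):
    """Split one period into (run_seconds, pause_seconds) for a given CPU limit.

    period_seconds is a hypothetical value chosen for illustration.
    """
    run = period_seconds * cpu_limit_percent / 100.0
    return run, period_seconds - run


def pauses_per_hour(cpu_limit_percent, period_seconds=10.0):
    """Number of pause intervals in one hour under the same assumption."""
    if cpu_limit_percent >= 100:
        return 0  # no throttling, so no pauses
    return int(3600 // period_seconds)


print(throttle_schedule(60))   # run/pause split for a 60% limit
print(pauses_per_hour(60))     # pauses in an hour at that period
```

Under these assumptions the task would be paused hundreds of times per hour, each pause being an ordinary suspend and resume rather than a restart from a checkpoint.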

Besides, some of those failures happened when the system was idle and boinc had 100% of cpu time for itself, so that doesn't hold, imho.

Given that other linux (and some windows) users have experienced the same error, has any developer looked into the code so we can definitely rule out a bug in the application itself? As I said before, 98% of all other tasks from all other projects don't segfault.
4) Questions and Answers : Unix/Linux : Exit status 193 compute error (Message 48160)
Posted 13 Feb 2014 by old_user701677
Post:
Just found this problem has been discussed here before:
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=7306#43193

Dagorath defines the error there. It seems that the task is writing to non-permitted memory zones, or there's a problem with the RAM or hard drive. It's been a while since I ran a memtest, but I have no reason to suspect that either of them is faulty right now.

It's important to note that most people there were talking about failure-to-success ratios, but none of my CPDN tasks have gotten past 10%.

I've just upgraded both the boinc version and the distro. If the problems persist, I'll try Greg's suggestions.
5) Questions and Answers : Unix/Linux : Exit status 193 compute error (Message 48159)
Posted 13 Feb 2014 by old_user701677
Post:
Thanks for replying. I've checked, and computing is allowed while the computer is in use. Also, if you'd like to know other settings: computing is allowed while processor usage is less than 70%, and it is set to use at most 60% of CPU time (an old 90 nm cpu heats up fast if boinc is left with the default 100% cpu usage).

I've set it that way because I don't have a lot of cpu intensive apps, and I suspend computations before using flash or java sockets or anything that is CPU intensive, so no worries here.

I always thought that what we're seeing in stderr comes not from unusually many starts and stops, but from the fact that the running time of CPDN is very long. Usually the deadline is 3 months away, and each of those tasks ran for a minimum of a week before it crashed, so I don't think there are that many stops; rather, they are spread over a long timespan.

About that exit code, yes, I said that it should have a different meaning under linux, but I can't find it...
6) Questions and Answers : Unix/Linux : Exit status 193 compute error (Message 48155)
Posted 12 Feb 2014 by old_user701677
Post:
Hello,

I've been meaning to address this issue for quite some time, but I've only just now found the time for it.

I'm running hadcm3n workunits on my Linux Mint distro, and every one of them crashes unexpectedly somewhere between 2 and 10%, giving exit status 193.

Google hasn't been very helpful in diagnosing this. The only helpful thing I've been able to find about this exit code is in a list of Windows system error codes where it says that the application is not a valid Win32 application (but I don't need it to be, I'm running on linux!).

stderr looks normal, I guess, apart from exiting with signal 3, which is SIGQUIT, but again, doesn't tell me much.
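
For reference, a quick way to check which name goes with a signal number is Python's standard `signal` module. This is only a lookup sketch; the numbering depends on the platform, and the names in the comments are what I'd expect on linux.

```python
import signal


def signal_name(number):
    """Return the symbolic name for a signal number, or 'UNKNOWN' if unassigned."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return "UNKNOWN"


# On linux, 3 is SIGQUIT and 11 is SIGSEGV (the usual segfault signal).
for number in (3, 11):
    print(number, signal_name(number))
```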

Could you please look into this and tell me what's wrong? Is this standard behaviour for these wus? 98% of all tasks from other projects finish without errors, but tasks from climateprediction always crash. I'd like them to finish cleanly.



