Posts by old_user451764

1) Questions and Answers : Unix/Linux : Model exception handling; crashes, infinite loops, etc. (Message 28982)
Posted 26 May 2007 by old_user451764
Post:
Thanks for your update.

The 1GHz AMD is on 24/7 and is only used as an X-Term to another system about once every two weeks by my bookkeeper (she likes the keyboard and flat panel display, which have low power modes when not being used). From your description, it sounds like this machine is still "good enough" to produce useful results, considering it isn't doing much else. This machine was rock solid when it was a Slackware-based server, so I like to keep it around and useful, and now it sounds like it can still be productive!
2) Questions and Answers : Unix/Linux : Model exception handling; crashes, infinite loops, etc. (Message 28935)
Posted 24 May 2007 by old_user451764
Post:
Well, I'm certainly satisfied with what you've told me and will mark this thread as answered after I finish entering this message. I also appreciate the obvious effort, care, and time put into both your responses.

My only disappointment is not being able to put my bigger machine on your task for reasons we can't easily identify. I'll leave the other two processing and hope they'll be able to contribute something of value to this project.

I'll wait for another BOINC release or two before trying the larger machine again, or until such time as I bring the Intel hyper-threading machine back online. I hope your assessment that BOINC reliability may be contributing to my issue is the answer we need.

I also realized some time after my second response that "backup strategy" wasn't referring to my on-site system backups. It was instead referring to project directory backup as described in your many FAQs. This was my confusion.

My slowest machine, the ~20 credit/day machine, is a 1GHz AMD Duron processor with 512M RAM. I have the impression from this exchange that this one isn't considered up to the task. I'll leave it running unless you want to send me an email telling me it isn't worth continuing.

Thank you,
Craig Arno
3) Questions and Answers : Unix/Linux : Model exception handling; crashes, infinite loops, etc. (Message 28900)
Posted 22 May 2007 by old_user451764
Post:
The SuSE 10.1 machine is running a stable installed set of apps... actually a large set of apps. The machine will stay up for months with the current set of apps without BOINC/Climate.

After starting BOINC/Climate Prediction.net, the machine appears to be sensitive to the work unit it receives. With some work units the machine will run for weeks and return results. Then there are other days, like yesterday, when the machine will crash three times, one time destroying a virtual machine which I now need to recover. The crashes appear to be work unit related, which leads me to conclude it is a software problem, not hardware. The only graphics application I run is the BOINC client on a Windows laptop. My intent is only to loan out CPU to your project and let it return the results you need. I don't monitor processing, except once a day, though I did once on the Windows machine shortly after installing the climate modeling software. It was pretty cool, actually.

As you pointed out in your post, you are dealing with a huge set of code and, from your FAQs, a substantial chunk of code that isn't yours and isn't always well behaved. An example from your FAQ pointed to "glib" on some Linux/GNU distributions. Since I'm not using BOINC display graphics under KDE, I would hope the only graphics calls are being made to set up a remote network connection. I start BOINC from the command line under a fairly normal user account with the command:

cd "/home/usr/craig/seti/BOINC" && exec ./boinc -allow_remote_gui_rpc $@ &

This command line allows me to periodically (once a day during a break over tea) check work progress from a remote machine.

Since I can stop the crashing behavior by not running BOINC/Climate, this is what I've decided to do. Unfortunately, I can't give you much more information about the nature of the crash. When I walk into the server room and the crash is recent enough that the display hasn't automatically blanked, the KDE/X display image is corrupt and the machine will only respond to a hardware reset or power cycle. It could be BOINC/Climate code, it could be libraries used by BOINC/Climate, or it could be some sort of system interaction, which stops when I don't run the code. I'm also running a RAID-1 setup with power backed and filtered by a UPS, if that makes any difference. This server is intended to be as reliable as I can make it.

As I said in my profile, I feel what you are doing is important work. As I also said in my post, I don't have the additional bandwidth to deal with machine problems introduced by software (or hardware). I'm a single dad with three kids and a professional career, and life kind of gets in the way of non-critical activities. My post was intended to bring attention and awareness to the problem of acquiring and -keeping- volunteer machines working for ClimatePredictions.net. With 30 years of experience in high reliability medical software design, hardware design, and customer reaction to design problems, I thought I'd state what I felt was obvious. For success it has to be plug-and-play easy and forget-its-running reliable, even if it takes a watchdog app running in a separate thread to monitor, control, and restart the large application as necessary.
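
The watchdog idea above can be sketched roughly as follows. This is a minimal illustration in Python, not something BOINC or the climate model is known to do: the function name `run_with_watchdog`, its parameters, and the use of a wall-clock timeout as a stand-in for a real progress check are all assumptions made for the example. A supervisor like this can restart a hung or failed process, but it cannot help when the whole machine locks up.

```python
import subprocess
import sys


def run_with_watchdog(cmd, timeout_s, max_restarts):
    """Run cmd; kill and restart it if it hangs or exits with an error.

    Hypothetical helper for illustration only. Returns the number of
    attempts used on success, or None if every attempt failed.
    """
    for attempt in range(1, max_restarts + 2):
        proc = subprocess.Popen(cmd)
        try:
            code = proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # Assumption: a run that exceeds its time budget is hung.
            proc.kill()
            proc.wait()  # reap the killed child
            continue
        if code == 0:
            return attempt
        # Non-zero exit status: loop around and restart.
    return None


if __name__ == "__main__":
    # Supervise a trivial child process that exits cleanly.
    child = [sys.executable, "-c", "print('work unit step finished')"]
    print("attempts used:", run_with_watchdog(child, timeout_s=30, max_restarts=2))
```

A real watchdog for a long-running model would more likely watch a heartbeat or checkpoint file than a fixed timeout, since a healthy work unit can run for weeks.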

Your comment about my backup strategy is only tangentially (as in not really) relevant to this issue.
4) Questions and Answers : Unix/Linux : Model exception handling; crashes, infinite loops, etc. (Message 28890)
Posted 22 May 2007 by old_user451764
Post:
Guys,

I thought about not writing this to you, but I believe it is something you need to hear if you want your project to be a larger success instead of a hobby/hacker effort.

I really want to participate in your project objectives, but can't afford to babysit your code while it crashes my machines and goes into infinite loops. Yes, I read your FAQs and am disappointed in the implicit assumptions I see. If you want good public participation, more consideration must be given to producing more robust code for your models. This may mean involving a mathematician to look for equations and conditions which will cause infinite loops prior to coding and putting models into "production". It also may mean testing boundary conditions on every parameter passed into and returned from a routine or method. You will have to decide what it will take to clean up your proprietary code.
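
As a rough illustration of what checking boundary conditions and guarding against endless iteration could look like, here is a minimal sketch in Python. It has nothing to do with the actual model code: `bounded_sqrt`, its tolerance, and its iteration cap are hypothetical, and Newton's method for a square root merely stands in for any iterative routine.

```python
import math


def bounded_sqrt(x, tol=1e-12, max_iter=100):
    """Newton iteration for sqrt(x) with input, output, and loop checks.

    Hypothetical example routine; the limits are illustrative.
    """
    # Boundary check on the parameter passed in.
    if not math.isfinite(x) or x < 0.0:
        raise ValueError(f"input out of range: {x!r}")
    if x == 0.0:
        return 0.0

    guess = x if x >= 1.0 else 1.0
    # Iteration cap: the loop cannot run forever, even on bad input.
    for _ in range(max_iter):
        nxt = 0.5 * (guess + x / guess)
        if abs(nxt - guess) <= tol * nxt:
            result = nxt
            break
        guess = nxt
    else:
        raise RuntimeError(f"no convergence within {max_iter} iterations")

    # Boundary check on the value returned.
    if not math.isfinite(result) or result < 0.0:
        raise RuntimeError(f"result out of range: {result!r}")
    return result
```

The point of the design is that a bad input or a non-converging iteration surfaces as an error the caller can log and handle, instead of a loop that never returns.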

There are three machines (Linux and Windows) at my site contributing to climate.net objectives. The biggest one (SuSE), averaging 143.48 credits/day, is being pulled off the climate.net effort because your code keeps crashing this machine. The other two, averaging 40.46 and 20.33, will continue to contribute unless these also start requiring too much babysitting due to idiosyncratic behavior of the climate.net code.

It is not realistic to expect the general public, even an educated general public, to spend time debugging and babysitting their machines due to the behavior of your code. Please take the time to clean it up before dropping "proprietary" code on your public volunteers, whose numbers will dwindle if running your code becomes too problematic for their installations.

I admire what you are trying to do, and feel strongly it needs to be handled better than what I currently see.



