Model exception handling; crashes, infinite loops, etc.

Author	Message
old_user451764 Joined: 16 May 07 Posts: 4 Credit: 9,145 RAC: 0	Message 28890 - Posted: 22 May 2007, 16:29:47 UTC

Guys, I thought about not writing this to you, but I believe it is something you need to hear if you want your project to be a larger success instead of a hobby/hacker effort. I really want to participate in your project objectives, but can't afford to babysit your code while it crashes my machines and goes into infinite loops. Yes, I read your FAQs and am disappointed in the implicit assumptions I see.

If you want good public participation, more consideration must be given to producing more robust code for your models. This may mean involving a mathematician to look for equations and conditions which will cause infinite loops prior to coding and putting models into "production". It also may mean testing boundary conditions on every parameter passed into and returned from a routine or method. You will have to decide what it will take to clean up your proprietary code.

There are three machines (Linux and Windows) at my site contributing to climate.net objectives. The biggest one (SuSE), averaging 143.48 credits/day, is being pulled off the climate.net effort because your code keeps crashing this machine. The other two, averaging 40.46 and 20.33, will continue to contribute unless these also start requiring too much babysitting due to idiosyncratic behavior of the climate.net code.

It is not realistic to expect the general public, even an educated general public, to spend time debugging and babysitting their machines due to the behavior of your code. Please take the time to clean it up before dropping "proprietary" code on your public volunteers, who will dwindle in numbers if running your code becomes too problematic for their installations.

I admire what you are trying to do, and feel strongly it needs to be handled better than what I currently see.

ID: 28890

astroWX Volunteer moderator Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0	Message 28896 - Posted: 22 May 2007, 18:40:35 UTC Last modified: 22 May 2007, 18:47:03 UTC

Actually, we could give it a number... Your account lists three machines. Each has one Model associated. One Model indicated a crash, on the Win. laptop. The Linux boxes show no errors. I was unable to look at the failed Run result (timeout on the database), so I can't determine what happened.

How did you determine that CPDN code crashed your Model?

Believe it or not, some of these 41+K-credit Runs actually finish despite the misbehaving code. (41,472 credits per Run; I'll leave it to you to do the arithmetic for how many days it will take for your machine to finish at 20.33 credits per day.)

Typically, stable machines have little or no problem completing the Models. (Nearly everyone thinks their boxes are stable and is quick to defend them. This high-intensity floating-point job puts the machines to the test. Some are found wanting.)

Your machines are attempting to run a Model developed by UK Met. Office/Hadley Centre scientists over two decades to run on Cray supercomputers: a million-plus lines of Fortran. It took a pair of techs a couple of years to port the thing to run on PCs. No small accomplishment, that.

When something goes awry with one of my Models, I look first at my own environment. The last thing I'd consider is "the code" (except when running Alpha or Beta tests). The Sticky set contains information for doing that. These are only machines, after all, and I'm willing to cut a bit of slack. Nor do I consider making regular backups "too much babysitting". Do you?

Again, I'm interested in how you determined that your crash(es) was/were caused by CPDN code. Such knowledge could help us.

Edit: Made another attempt to access the record for your laptop's failed Run. It succeeded:

<core_client_version>5.8.16</core_client_version> <![CDATA[ <message> aborted by user </message> <stderr_txt>

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.

ID: 28896
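The arithmetic astroWX leaves to the reader can be sketched as follows. This is a rough estimate only: it assumes the 41,472 credits per Run quoted above, treats the per-day averages from the first post as steady throughput, and uses function names that are purely illustrative. A machine that has only just joined may not yet show a representative daily average.

```python
# Credits awarded for one complete Run, as quoted in the post above.
CREDITS_PER_RUN = 41_472


def days_to_finish(credits_per_day: float) -> float:
    """Estimate days to complete one Run at a steady credit rate (an assumption)."""
    if credits_per_day <= 0:
        raise ValueError("credits_per_day must be positive")
    return CREDITS_PER_RUN / credits_per_day


def years_to_finish(credits_per_day: float) -> float:
    """Same estimate expressed in 365-day years."""
    return days_to_finish(credits_per_day) / 365


if __name__ == "__main__":
    # Daily averages reported for the three machines in the first post.
    for rate in (143.48, 40.46, 20.33):
        print(f"{rate:7.2f} credits/day: {days_to_finish(rate):7.1f} days "
              f"({years_to_finish(rate):.1f} years)")
```

At the quoted rates this works out to roughly 289, 1,025 and 2,040 days respectively, which is why the sustained speed of a machine matters so much for Runs of this size.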

old_user451764 Joined: 16 May 07 Posts: 4 Credit: 9,145 RAC: 0	Message 28900 - Posted: 22 May 2007, 20:04:10 UTC

The SuSE 10.1 machine is running a stable installed set of apps... actually a large set of apps. The machine will stay up for months with the current set of apps without BOINC/Climate. After starting BOINC/Climate Prediction.net, the machine appears to be sensitive to the work unit it receives. With some work units the machine will run for weeks and return results. Then there are other days like yesterday when the machine will crash three times, one time destroying one virtual machine which I now need to recover. The crashes appear to be work unit related, which led me to conclude it is a software problem, not hardware.

The only graphics application I run is the BOINC client on a Windows laptop. My intent is only to loan out CPU to your project and let it return the results you need. I don't monitor processing, except once a day, though I did once on the Windows machine shortly after installing the climate modeling software. It was pretty cool, actually.

As you pointed out in your post, you are dealing with a huge set of code, and from your FAQs, a substantial chunk of code that isn't yours, which isn't always well behaved. An example from your FAQ pointed to "glib" on some Linux/GNU distributions. Since I'm not using BOINC display graphics under KDE, I would hope the only graphics calls are being made to set up a remote network connection. I start BOINC from the command line under a fairly normal user account with the command line:

    cd "/home/usr/craig/seti/BOINC" && exec ./boinc -allow_remote_gui_rpc $@ &

This command line allows me to periodically (once a day during a break over tea) check work progress from a remote machine.

Since I can stop the crashing behavior by not running BOINC/Climate, this is what I've decided to do. Unfortunately I can't give you much more information about the nature of the crash, other than this: when I walk into the server room and the crash was recent enough that the display isn't automatically blanked, the KDE/X display image is corrupt and the machine will only respond to a hardware reset or power cycle. It could be BOINC/Climate code, it could be libraries used by BOINC/Climate, or it could be some sort of system interaction, which stops when I don't run the code. I'm also running a RAID-1 setup with power backed and filtered by a UPS, if that makes any difference. This server is intended to be as reliable as I can make it.

As I said in my profile, I feel what you are doing is important work. As I also said in my post, I don't have the additional bandwidth to deal with machine problems introduced by software (or hardware). I'm a single dad with three kids and a professional career, and life kind of gets in the way of non-critical activities.

My post was intended to bring attention and awareness to the problem of acquiring and -keeping- volunteer machines working for ClimatePredictions.net. With 30 years of experience in high reliability medical software design, hardware design, and customer reaction to design problems, I thought I'd state what I felt was obvious. For success it has to be plug-and-play easy and forget-its-running reliable, even if it takes a watchdog app running in a separate thread to monitor, control, and restart large application behavior as necessary.

Your comment about my backup strategy is only tangentially (as in not really) relevant to this issue.

ID: 28900
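The watchdog idea raised in the post above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how BOINC or the CPDN application actually supervises models: the `supervise` function, its parameters, and the use of a separate supervising process (rather than a thread) are all hypothetical choices made for the example.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, timeout_s=None, backoff_s=0.0):
    """Run cmd, restarting it if it exits with an error or exceeds timeout_s.

    Returns (succeeded, attempts). A run that overruns timeout_s is treated
    as hung: it is killed and counted as a failed attempt.
    """
    attempts = 0
    while attempts <= max_restarts:
        attempts += 1
        proc = subprocess.Popen(cmd)
        try:
            exit_code = proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            proc.kill()   # presumed hang or infinite loop
            proc.wait()   # reap the killed process
            exit_code = None
        if exit_code == 0:
            return True, attempts
        if attempts <= max_restarts:
            time.sleep(backoff_s)  # pause before the next restart
    return False, attempts


if __name__ == "__main__":
    # Stand-in for a long-running worker: a child Python that exits cleanly.
    ok, tries = supervise([sys.executable, "-c", "print('work done')"])
    print(ok, tries)
```

A supervisor like this can only catch failures of the process it launched. It would not recover a machine that locks up entirely and needs a hardware reset, which is the symptom described above.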

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 28901 - Posted: 22 May 2007, 21:33:32 UTC Last modified: 22 May 2007, 21:40:36 UTC

Just a few points to add to what you've both said.

Because the cpdn workunits are so long, even if the downloaded model software leads to perfect processing, the likelihood that the user or an unforeseen event will do something to cause the model to crash is quite high. This is why we recommend regular backups and why an entire project README is devoted to this topic and offers a selection of methods. (They're linked to in my sig.) Most such crashes are caused by the way boinc works. The boinc platform itself can cause the model to crash if the user does not know about or take certain essential precautions.

Models can very occasionally crash because of boinc bugs. I can assure you that the boinc programmers in Berkeley are continually improving the software and have a very active bug and problem reporting system. The pre-release beta testing system for new boinc versions appears to me to be sound.

Some model crashes are caused by the initial values attributed to the parameters of individual models (each model is unique). A fundamental criticism that can be directed at ensembles of climate models (or any other numerical modelling ensemble) is that the researchers get the results they want to see because they predetermine the initial parameter values within a narrow range. I have never seen this criticism directed at cpdn. The initial values for cpdn model parameters appear to be as varied as possible. This means, however, that some sets of parameters turn out to be unviable and, later in the models' development, produce values impossible in the real world. This causes such models to crash. But the data produced is still useful to the researchers in determining the starting values for future sets of models. This point is very important. It means that some crunchers are disappointed because they get a model that turns out to be impossible to complete. But these failed models help to ensure the credibility of the whole project.

Some model crashes are caused by the processing getting into a loop so that progress becomes impossible. The programmers in Oxford have already greatly reduced the incidence of looping. We know, however, that some such loops are caused by a calculation error within the computer. If a backup of a looping model is transferred from an AMD machine to an Intel, or vice versa, in some cases the model then continues normally to completion. So floating-point calculation errors can be caused by the computer itself. How commonly this occurs we don't know, because most members are not in a position to try a backup of a looper on the other type of machine.

I hope that these points answer a few of your worries and show that the researchers and programmers do not neglect the concerns you have raised.

Cpdn news

ID: 28901

old_user451764 Joined: 16 May 07 Posts: 4 Credit: 9,145 RAC: 0	Message 28935 - Posted: 24 May 2007, 3:37:32 UTC Last modified: 24 May 2007, 3:47:02 UTC

Well, I'm certainly satisfied with what you've told me and will mark this thread as answered after I finish entering this message. I also appreciate the obvious effort, care, and time put into both your responses.

My only disappointment is not being able to put my bigger machine on your task for reasons we can't easily identify. I'll leave the other two processing and hope they'll be able to contribute something of value to this project. I'll wait for another BOINC release or two before trying the larger machine again, or until such time as I bring the Intel hyper-threading machine back online. I hope your assessment of possible BOINC reliability contributing to my issue is the answer we need.

I also realized some time after my second response that "backup strategy" wasn't referring to my on-site system backups. It is instead referring to project directory backup as described in your many FAQs. This was my confusion.

My slowest machine, the ~20 credit/day machine, is a 1GHz AMD Duron processor with 512M RAM. I have the impression from this exchange that this one isn't considered up to the task. I'll leave it running unless you want to send me an email telling me it isn't worth continuing.

Thank you,
Craig Arno

ID: 28935

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 28943 - Posted: 24 May 2007, 16:54:22 UTC

Hi Craig,

Your slow 1 GHz AMD can be used for a model if it's left on most of the time, but not if the computer's turned off a lot. It runs at about the same speed as my 1.33GHz machine, which originally managed 6 sec/ts, did after it had to be underclocked to 1GHz because the CPU is probably fractured(!!). Left on 24/7 except for model stoppages to allow computer housekeeping, this means an annual trickle about every 2 days plus a few hours. Rather more than a year for a full model. You'll have to decide whether this scenario appeals.

The deadline isn't a problem because the cpdn server very sensibly ignores it, unlike other project servers. But if a model is going to take many months beyond the deadline, there's a risk that it may not then be of much use to the researchers.

I ran a model at 6 sec/ts for ages, then at 8 sec for a while. I then moved it to the new computer and let it race to the finish; it completed last night. If I'd left it on the old computer I think it would still have finished eventually.

Backups of the complete contents of the boinc folder only work if one exits from boinc before the backup. There's a selection of backup methods in the dedicated project README in my sig. The longer a model takes to process, the more important it is to make regular backups, because more time = higher likelihood that something could go wrong and crash the model.

Cpdn news

ID: 28943
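The backup rule stated in the post above (copy the whole boinc folder, and only after exiting boinc) can be sketched as below. This is not one of the methods from the project README, which is not reproduced in this thread. The function name, the timestamped folder naming, and the `client_is_running` callback are all hypothetical; how to detect a running client is left to the caller.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_dir(boinc_dir, backup_root, client_is_running):
    """Copy the whole BOINC folder into a new timestamped folder.

    client_is_running is a caller-supplied (hypothetical) zero-argument
    function that returns True while the BOINC client is still running.
    """
    if client_is_running():
        # Per the advice above, a copy taken while the client runs is not reliable.
        raise RuntimeError("Exit BOINC before taking a backup.")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"boinc-backup-{stamp}"
    # copytree creates the target (and any missing parents) and fails if it exists.
    shutil.copytree(boinc_dir, target)
    return target
```

Keeping each backup in its own timestamped folder means an older, known-good copy survives if a later backup captures a model that has already gone wrong.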

old_user451764 Joined: 16 May 07 Posts: 4 Credit: 9,145 RAC: 0	Message 28982 - Posted: 26 May 2007, 2:22:39 UTC

Thanks for your update. The 1GHz AMD is on 24/7 and is only used as an X-Term to another system about once every 2 weeks by my bookkeeper (she likes the keyboard and flat panel display, which have low power modes when not being used). From your description, it sounds like this machine is still "good enough" to produce useful results, considering it isn't doing much else. This machine was rock solid when it was a Slackware-based server, so I like to keep it around and useful, and now it sounds like it can still be productive!

ID: 28982