Generic solutions to models crashing (error codes -161, -22, and -1073741819)

MikeMarsUK (Volunteer moderator)
Message 21066 - Posted: 5 Mar 2006, 11:16:23 UTC - Last modified: 2 Oct 2007, 8:18:21 UTC

There are a number of common errors which cause many people problems. The first is the Windows Stop message (it appears as a Microsoft Send / Don't Send dialogue, and as -1073741819 in the log), the second is a -161 error in the log, and the third is error code 22. Unfortunately the '-161' and 22 errors mask the underlying error (-161 simply means that the model ended without results to upload, and error code 22 seems to be something to do with how the work unit deals with other errors).

When you get one of these 'generic' errors, the first thing to do is to look at the model's web page on the server. To find this, click 'your account', then 'results', and then select the result which crashed. The reason for the crash sometimes appears near the end of the 'stderr out' section, before any -161 errors. For example, NEGATIVE PRESSURE VALUE CREATED indicates that the model reached an impossible climate and shut itself down. This can be caused by overclocking, bad memory, or in many cases simply because the initial starting parameters for the model will never lead to a viable climate.

In the absence of a clear reason for the model failing, we can only make the following general suggestions:

* Firstly, you need to realise that a crash is not a disaster, even if you have no backup. The coupled model (HadCM3) uploads climate data at intervals:
  - A summary every year
  - A more detailed summary every 10 years
  - A 'restart dump' every 40 years (1960, 2000, and 2040).
  The scientists will have the data so far, and if a 'restart dump' was uploaded, then someone else may be able to continue running the model from that point. The Slab model (HadSM3) uploads climate data at the end of each of its three phases.
* However, since it is far more satisfying to complete your own model, we advise everyone to back up their climate models. The HadCM models take upwards of 4 months to complete (over a year on some computers), running 24/7. It is fairly likely that something will go wrong on the computer during such a long period. Make backups at least once per week; it only takes a few moments. See the following for information about backups. In some cases restoring from backup will work (where the crash was caused by transient problems on the PC, i.e., code -107... errors, error code 0, and error code -1), but in others the restored model is doomed to fail. If you're not sure whether restoring the model is a good idea, then ask on the forum.
* If you see a Microsoft Send/Don't Send dialogue, don't select anything until you have gone into BOINC and selected 'exit' from the menu. Hopefully the model will restart from the previous checkpoint rather than giving up and crashing.
* If you use Norton or Sophos antivirus, exclude the BOINC project directory from the automated scan. Norton is the cause of many models crashing because it locks files aggressively, whereas Sophos incorrectly identifies one of the key files as a worm (known as a 'false positive' in the trade). More about Boinc and antivirus systems.
* Before carrying out antivirus scans or defragmenting your system, exit from BOINC. Do not just suspend the model. Exit by right-clicking on the system tray icon (lower right of screen) and selecting Exit. You may need to disable automatic scheduled AV scans if these would run without BOINC having been exited first.
* Windows updates have occasionally caused problems. It is wise to run the update manually rather than automatically, and to take a backup before installing updates, or at least to exit from BOINC first.
* Never end the model process, the model globe process, or BOINC Manager using the End Task or End Process buttons in Task Manager.
  If you have a frozen screen, first exit from BOINC via the BOINC icon in the system tray. Then deal with the frozen screen.
* Before shutting down or restarting the computer, first suspend the model and then exit from BOINC by right-clicking on the system tray icon and selecting Exit. Wait until the icon disappears before going into the Start menu.
* If you are running your model 24/7, reboot the computer at least weekly.
* Before playing games or running other heavy-duty applications (high CPU or memory usage), set 'no more work' against the project and 'suspend' the model. That way they won't tread on each other's toes. Sometimes simultaneous use of graphics drivers by two different programs seems to cause problems.
* Similarly, turning off the screensaver will reduce the chance of crashes, and will also save a lot of CPU time. On a computer with integrated motherboard graphics, displaying the screensaver can take up to 50% of CPU time. To disable the screensaver:
  - Right-click on the desktop
  - Click Properties
  - Select Screensaver
  - Select None
  Anyone finding that the model interferes with normal use of the computer should first disable the screensaver; it's easy to do and often helps. View your globe instead using the View graphics button in BOINC Manager. If you have previously suffered a -107 error, avoid maximising this globe graphics window.
* If you have suffered a -107 or -1 error code, you should update your graphics card driver. This is a free update from the card manufacturer. Even a new computer may need this. For further details and instructions see how to update your graphics drivers.
* Overheating can cause instability and shorten the life of your computer. Cleaning out dust from the motherboard and fans often helps if this is the case. Make sure all fans are working OK. Machines are often supplied with noisy, unreliable fans without ball bearings. If you need to replace one, it's quite an easy job, but make sure you buy an 'ultraquiet ballbearing' fan.
  There is a program called 'Everest' which can tell you your CPU and motherboard temperatures on a lot of systems. 50°C is the recommended maximum for AMDs, and 60°C for Intel. For more information, see keeping your hardware healthy.
* Run a stability test on your machine; I recommend Prime95's torture test. Run it for about 24 hours, one copy per CPU core (there is a Win32 variant which automatically tests multiple cores). If this runs without error, then it indicates that your PC's hardware is very stable, and any problems are more likely to do with software or the model. For more see stability testing.
  Regarding overclocking, some do it and have stable machines, others do it with disastrous results (literally, by cooking something). Whenever overclocking, or changing timings on memory, a torture test should be considered mandatory. Note that Seti and so forth don't stress the machine enough to be a useful stability check. People who are getting errors after overclocking their machines usually find that relaxing memory timings, reducing CPU MHz or improving system cooling will help.
* Watch out for firewall messages appearing at the same time as the model crashed. If your network connection or firewall is unreliable, it may be best to select 'suspend network' from BOINC Manager, and then manually allow network access about once per week, to let trickles upload to the servers. This is also a good approach for people using dialup connections rather than broadband.
* Windows 'time sync' messages have been mentioned recently as causing 'process exited with zero status' crashes. Although these are relatively benign, it may be worth trying to reduce their frequency.
* The benchmark that BOINC runs every 5 days can cause the model to fail (see this post).
* The memory requirement for XP machines is now 256MB per HadSM model, 512MB per HadCM and 1GB per HadAM (Seasonal) model, or 1.5GB for two Seasonal models running in tandem. Vista needs an extra 512MB.
  (See memory requirements). Running with insufficient memory may cause slowness and excessive work for the hard disk. Some people have managed to run a coupled model (HadCM3) with only 256MB RAM, but it's pushing things to the very limit. In this situation I'd advise frequent backups. Alternatively, try running a different type of model instead - for example, the Slab model (HadSM3) uses a lot less memory. This is set via Your Account / CPDN preferences / View / Edit / tick HadSM3 and untick HadCM3.

Brief technical descriptions of all the currently available models and their requirements can be found here. It's worthwhile scanning through the various README files for more information.

If any of these suggestions succeeded in helping, or you have any other comments/queries, please add a note to the comments thread so that people know which are the best suggestions.
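
For anyone who wants to check the memory figures quoted above against their own machine, the arithmetic can be sketched in a few lines of Python. This is only an illustration, not a project tool: the function name `required_memory_mb` is made up, and it assumes that requirements for different model types simply add together and that the extra 512MB for Vista applies once per machine. Neither assumption is stated in the post.

```python
# Figures quoted in the post, in MB, for a Windows XP machine.
PER_MODEL_MB = {"HadSM": 256, "HadCM": 512, "HadAM": 1024}
TWO_SEASONAL_MB = 1536   # two HadAM (Seasonal) models running in tandem
VISTA_EXTRA_MB = 512     # assumption: applied once per machine


def required_memory_mb(models, vista=False):
    """Estimate the RAM (in MB) suggested for a list of model names.

    Assumption: requirements for different model types are summed.
    """
    for name in models:
        if name not in PER_MODEL_MB:
            raise ValueError(f"unknown model type: {name}")

    seasonal = models.count("HadAM")
    if seasonal > 2:
        # The post gives no figure for more than two Seasonal models.
        raise ValueError("no figure quoted for more than two HadAM models")

    # Sum everything except the Seasonal models, which have a tandem figure.
    total = sum(PER_MODEL_MB[m] for m in models if m != "HadAM")
    if seasonal == 1:
        total += PER_MODEL_MB["HadAM"]
    elif seasonal == 2:
        total += TWO_SEASONAL_MB

    if vista:
        total += VISTA_EXTRA_MB
    return total


if __name__ == "__main__":
    print(required_memory_mb(["HadCM"]))              # one coupled model on XP
    print(required_memory_mb(["HadAM", "HadAM"]))     # two Seasonal models
    print(required_memory_mb(["HadSM"], vista=True))  # one Slab model on Vista
```

If the number that comes out is more than the RAM installed, that is a hint to pick a smaller model type in the CPDN preferences as described above.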
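
The advice near the top of this post, to look in 'stderr out' for a reason printed before any -161 error, can also be done on text copied from the result page. The sketch below is hypothetical: the helper `reason_candidates` is not part of BOINC, and it assumes only that the generic error shows up as a line containing the text "-161". The real layout of the stderr output may differ.

```python
def reason_candidates(stderr_text, marker="-161", context=5):
    """Return up to `context` non-empty lines just before the first line
    that contains `marker`.

    If no line contains the marker, return the last `context` non-empty
    lines instead, since the reason tends to appear near the end.
    """
    if context < 1:
        raise ValueError("context must be at least 1")

    # Drop blank lines and surrounding whitespace.
    lines = [ln.strip() for ln in stderr_text.splitlines() if ln.strip()]

    for i, ln in enumerate(lines):
        if marker in ln:
            return lines[max(0, i - context):i]
    return lines[-context:]


if __name__ == "__main__":
    # Made-up sample text for illustration only, not a real log.
    sample = (
        "model starting\n"
        "NEGATIVE PRESSURE VALUE CREATED\n"
        "exit code -161\n"
    )
    for line in reason_candidates(sample):
        print(line)
```

Anything printed this way is only a candidate; if it does not make the cause clear, the general suggestions above still apply.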