Reporting - Errors while computing -

Author	Message
3rkko Send message Joined: 12 Feb 08 Posts: 66 Credit: 4,877,652 RAC: 0	Message 46001 - Posted: 20 Apr 2013, 22:34:47 UTC Three crashes hadcm3n_3af0_1980_40_008349704_2 hadcm3n_49r8_1980_40_008350067_1 hadcm3n_3jf5_1980_40_008352170_0 with the same "(C++ Exception) (0xe06d7363) at address 0x7732C41F". ID: 46001 · Reply Quote

Matthias Lehmkuhl Send message Joined: 24 Sep 05 Posts: 7 Credit: 2,796,633 RAC: 2,305	Message 46004 - Posted: 21 Apr 2013, 11:49:17 UTC I also got one: Unhandled Exception Detected... - Unhandled Exception Record - Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7637C41F Engaging BOINC Windows Runtime Debugger... hadcm3n_4gu4_2020_40_008351404 4 results have crashed with the error above; 1 result (the short one) is on Darwin 12.3.0 with the error "process exited with code 22 (0x16, -234)". Matthias ID: 46004 · Reply Quote

Byron Leigh Hatch @ team Carl ... Send message Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0	Message 46005 - Posted: 21 Apr 2013, 12:16:22 UTC one more of my Models Crashed approx two hours ago. hadcm3n_4h2i_1980_40_008350145_3 Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7560812F ID: 46005 · Reply Quote

Byron Leigh Hatch @ team Carl ... Send message Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0	Message 46012 - Posted: 22 Apr 2013, 3:49:14 UTC I'm the 5th computer to crash this model. hadcm3n_4m8m_1980_40_008349532_4 Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7560812F However, I did have success on this model after 532 hours of 24/7 non-stop crunching: hadcm3n_zmjk_1920_40_008340870_4 Three (3) other computers crashed this same model with various "Error while computing" results. ID: 46012 · Reply Quote

mo.v Volunteer moderator Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 46015 - Posted: 22 Apr 2013, 9:19:39 UTC Byron, in the case of the first WU you link to, all the computers have crashed the model with error code -529697949. If you have the same OS as a computer that's already suffered this the chances of you succeeding must be almost zero. Something wrong with the model. Fortunately these models didn't spend much time crunching. In the case of the second model you linked to the other computers are serial crashers. Look at the other computers belonging to your wingmen. Serial crashers usually kill models after no seconds of computing time at all or just a few seconds. They have very little credit for the number of models they've had. If you look at a page or two of their recent models you may see an unmitigated disaster. That Linux machine probably hasn't d/l the 32-bit libraries it needs so every model crashes. After private messaging was introduced to the forums it was possible for a short time to send a PM to the people crashing models and they'd receive an email notification. But the email notification of BOINC forum PMs was turned off by default to protect members' privacy. How many members notice this detail in their accounts and turn email notification on? I think the current default situation is a mistake but my pleas to Berkeley were rejected. Cpdn news ID: 46015 · Reply Quote
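
The error code -529697949 mentioned above and the 0xe06d7363 quoted in the earlier crash reports appear to be the same 32-bit value, printed once as a signed integer and once as unsigned hex. 0xe06d7363 is the code the Microsoft Visual C++ runtime uses when it raises a C++ exception, so on its own it says little about the underlying cause. A minimal Python sketch of the conversion (the helper name to_signed32 is made up for illustration and is not part of BOINC):

```python
def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    # If the top bit is set, the signed reading is value - 2**32.
    return value - 0x100000000 if value & 0x80000000 else value


code = 0xE06D7363
print(hex(code), "as signed 32-bit:", to_signed32(code))

# The low three bytes of the code are ASCII letters.
print("low three bytes:", code.to_bytes(4, "big")[1:].decode("ascii"))
```

So a task page showing the negative decimal number and a stderr log showing the hex form are most likely reporting the same crash type.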

WB8ILI Send message Joined: 1 Sep 04 Posts: 161 Credit: 81,421,805 RAC: 1,225	Message 46021 - Posted: 22 Apr 2013, 12:04:56 UTC This morning I had a "Run Time Error" message on my screen. I checked my memory allocation and the model had about 1.5 GB of real memory allocated. When I clicked OK the model aborted with a Computational Error. The error in the error file was: The system cannot find the path specified. (0x3) - exit code 3 (0x3) Maybe these out-of-memory errors have something to do with a program loop that keeps allocating memory when a file is missing. ID: 46021 · Reply Quote

nenym Send message Joined: 13 Jan 09 Posts: 2 Credit: 3,197,689 RAC: 23,688	Message 46269 - Posted: 23 May 2013, 4:30:51 UTC The workunit http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8503012 seems to have a memory bug. After 31 hrs CPU time / 35 hrs run time / 8% on the progress bar, I noticed that no trickle had been received on the server side. The task had allocated 1.5 GB of memory and 3.5 GB of virtual memory. I deleted the task, because the three crunchers before me got "Error while computing" after a long time. Another weird issue: zero CPU time in the database. ID: 46269 · Reply Quote

nenym Send message Joined: 13 Jan 09 Posts: 2 Credit: 3,197,689 RAC: 23,688	Message 46280 - Posted: 23 May 2013, 22:10:51 UTC - in response to Message 46269. The same once more http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8503830 ID: 46280 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 46283 - Posted: 23 May 2013, 22:46:23 UTC - in response to Message 46280. All of those models were created about the time of the "malloc error". I posted about it in this thread on the 19 April. I don't remember the outcome of the thinking/testing. Backups: Here ID: 46283 · Reply Quote

Ba Send message Joined: 27 Jan 11 Posts: 7 Credit: 67,224,533 RAC: 296	Message 46344 - Posted: 2 Jun 2013, 11:51:38 UTC Last modified: 2 Jun 2013, 11:52:28 UTC I seem to be getting a large number of errors across my rigs. http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1179592 http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1283022 http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1283587 http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1282401 Is this normal or have I got problems? ID: 46344 · Reply Quote

ojum-le Send message Joined: 5 May 07 Posts: 27 Credit: 6,369,307 RAC: 0	Message 46346 - Posted: 2 Jun 2013, 17:49:50 UTC Try to clean up your data directory. C:\Boinc\data\projects\climateprediction.net\ Delete all files. I had the same issues as you. ID: 46346 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 46347 - Posted: 2 Jun 2013, 22:49:40 UTC - in response to Message 46344. Ba Do NOT delete all files! You'll lose the many running models as well if you do. There are some 'model problems' with a lot of what you have, but I haven't checked all of them because you have so many. Those I looked at are not your fault. The errors mentioned have come up a few times over the past year, and have been talked about 'somewhere'. I'm not sure about the last one on your list. But that machine is running an "old" version of BOINC, which won't help. I'd suggest upgrading to the next version (.28), which is a release version and will have fewer bugs. Backups: Here ID: 46347 · Reply Quote

Ba Send message Joined: 27 Jan 11 Posts: 7 Credit: 67,224,533 RAC: 296	Message 46362 - Posted: 3 Jun 2013, 17:25:45 UTC Thanks. I haven't run this many models for a while and just didn't like the look of that many errors. I've just checked through, and most also error on the other machines running them, so I will just keep an eye on them for now. ID: 46362 · Reply Quote

mo.v Volunteer moderator Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 46363 - Posted: 3 Jun 2013, 20:32:00 UTC Hi Ba You have listed 4 computers. When models have crashed on the first 3 computers in your list the reason is usually a defect in the model. Very often model defects are listed in uppercase eg NAMELIST, REPLANCA, INITTIME. These problems are not the fault of your computers and I expect that other computers in the workunits also crashed them. That's just bad luck. Computer #4 in the list is different: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1282401 This computer has crashed a lot of models with 'No heartbeat' messages which I think could be a problem with the computer. It's an AMD with 48 cores and lots of RAM. Is it overclocked? If so, I think you should test for stability because CPDN models are rather temperamental and any instability can push them over the edge. Cpdn news ID: 46363 · Reply Quote

Greg van Paassen Send message Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0	Message 46364 - Posted: 4 Jun 2013, 5:18:51 UTC Hi Ba, In addition to what Mo said, you have several crashes on 1179592 that look as though they are disk-related. HadCM3Ns are "disk-write-heavy" and seem to be sensitive to sluggish disk response, much more so than the regional models (HadAM3). HadCM3Ns seem to like neither having their code and static data swapped out, nor having the disks take too long when they're creating zip files at the 25%, 50%, 75%, and 100% marks. Probably I'm teaching my grandmother to suck eggs here, but you might want to check the sysctls vm.swappiness, vm.dirty_background_ratio, and vm.dirty_ratio. Avoid swapping if possible (low swappiness, say 20), and avoid big "surges" in disk activity. With 64 GB of memory, letting pending disk writes accumulate to 5% of memory (IIRC, that's the default value for vm.dirty_background_ratio) before writing them out would produce noticeable delays when the writing does take place. HadCM3Ns won't like that. Try vm.dirty_background_ratio=1 and vm.dirty_ratio=3 (both are percent of memory), and see if that reduces the number of crashes at the 25% mark, 3110.40 credits. Reducing vm.vfs_cache_pressure may also help, since CPDN models are continually writing to the same files. Alternatively (or as well), try the 'deadline' scheduler, if you're using CFQ. ID: 46364 · Reply Quote
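
To put rough numbers on the ratios suggested above, here is a small Python sketch of the arithmetic. It assumes 64 GiB of RAM and treats each ratio as a plain percentage of total memory; the kernel actually applies these ratios to available (free plus reclaimable) memory, so the figures are upper-bound estimates. The function name dirty_threshold_bytes is made up for illustration.

```python
GIB = 1024 ** 3


def dirty_threshold_bytes(total_ram_bytes: int, ratio_percent: int) -> int:
    """Approximate amount of dirty page cache allowed at a given ratio.

    Simplification: uses total RAM, whereas the kernel uses available memory.
    """
    return total_ram_bytes * ratio_percent // 100


ram = 64 * GIB  # assumed machine size, taken from the post above

settings = [
    ("vm.dirty_background_ratio=5 (the default recalled in the post)", 5),
    ("vm.dirty_background_ratio=1 (suggested)", 1),
    ("vm.dirty_ratio=3 (suggested)", 3),
]

for label, percent in settings:
    gib = dirty_threshold_bytes(ram, percent) / GIB
    print(f"{label}: about {gib:.2f} GiB of pending writes")
```

Under these assumptions a 5% background ratio lets roughly 3.2 GiB of writes pile up before flushing starts, against about 0.64 GiB at 1%, which is the "surge" the post is trying to avoid. On Linux the current values can be read from /proc/sys/vm/ (for example /proc/sys/vm/dirty_background_ratio).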

Ba Send message Joined: 27 Jan 11 Posts: 7 Credit: 67,224,533 RAC: 296	Message 46365 - Posted: 5 Jun 2013, 0:17:49 UTC Thanks guys. None of the server rigs I am running is overclocked, so that should not be a problem. One of them has an older install (1179592); I think I will let that one run its models and then reinstall. I really don't know that much about Linux, I just use it on the big rigs as it's free, so thanks for the suggestions. I will give them a try over the weekend. ID: 46365 · Reply Quote

MikeMarsUK Volunteer moderator Send message Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0	Message 46367 - Posted: 5 Jun 2013, 10:31:41 UTC Last modified: 5 Jun 2013, 10:36:21 UTC Try running both a memory stress-test and a CPU stress-test on the computer which Mo identified, to see if there are any underlying issues (perhaps a bad memory stick). The best stress tests are the USB-/CD-bootable ones which run on the bare hardware. It might also be worth taking a look at the CPU temperatures; a dislodged heatsink can cause problems too. If you have lots of memory, it might be worth setting the 'stay in memory' flag so that tasks are not constantly stopped & restarted. I'm a volunteer and my views are my own. News and Announcements and FAQ ID: 46367 · Reply Quote

MartinNZ Send message Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0	Message 46370 - Posted: 5 Jun 2013, 22:19:23 UTC - in response to Message 46367. Last modified: 5 Jun 2013, 22:19:52 UTC Mike, which test do you suggest? Prime95 is often mentioned in the forums, but I don't think that is a bootable test. ID: 46370 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4345 Credit: 16,518,727 RAC: 5,698	Message 46371 - Posted: 6 Jun 2013, 7:45:53 UTC - in response to Message 46370. I use memtest http://www.memtest86.com/download.htm which can be booted from cd or usb. And prime95 under whichever OS you use as a stress tester. ID: 46371 · Reply Quote

Byron Leigh Hatch @ team Carl ... Send message Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0	Message 46711 - Posted: 27 Jul 2013, 2:26:43 UTC I wanted to open a new thread for this subject but when I try I get: Internal Server Error: The server encountered an internal error or misconfiguration and was unable to complete your request. Please contact the server administrator, cpdn-sysadmin@oerc.ox.ac.uk and inform them of the time the error occurred, and anything you might have done that may have caused the error. More information about this error may be available in the server error log. Apache Server at climateapps2.oerc.ox.ac.uk Port 80 I'm using - W 7 - IE 10 - BOINC 7.0.64 (x86) - running as a single installation - (not as a service) ... and my problem: I haven't seen this one before; it was highlighted in red. Does anyone know what it means? 26/07/2013 4:16:15 PM | climateprediction.net | Requesting new tasks for CPU 26/07/2013 4:16:19 PM | climateprediction.net | Scheduler request completed: got 0 new tasks 26/07/2013 4:16:19 PM | climateprediction.net | Server can't open log file (../log_climateapps2/scheduler.log) ID: 46711 · Reply Quote