Iceworlds & Slowdowns hadsm3/mh - Closed

Author	Message
mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 34461 - Posted: 1 Aug 2008, 13:49:41 UTC Last modified: 1 Aug 2008, 14:09:14 UTC
If that's the case it sounds like a massive overclock (not that I know anything about overclocking, except that if it isn't completely stable it can trash CPDN models). It could be that part of the extra speed comes from the computer running Linux rather than Windows.
I was wondering whether an unstable overclock on that computer had altered the processing and somehow got the model past Adrian's iceworld point. I also wondered whether Adrian's model from this WU was truly defective. But looking at other models from the same WU today (I'd been waiting) I see that
* This task running on Windows (like Adrian's) seems from the sec/TS to have met the iceworld problem without its owner noticing yet.
* This task running on Linux at a very standard speed seems to have got past the crisis point without problems.
I don't know whether any previous iceworlds have occurred on one OS but not the other(s). I think I'll send a private message to the member who appears to be stuck but not contact anyone else at the moment - wait to see what happens to their models.
Cpdn news

adrianxw Joined: 31 Aug 04 Posts: 145 Credit: 2,021,020 RAC: 816	Message 34462 - Posted: 1 Aug 2008, 15:57:47 UTC
I had 3 models on that quad. It has a 25% resource share, so theoretically only 1 was running, but as LHC and Cels have no work at the moment, it gets more. The one that had been running pretty much all the time, and the one that was furthest down the tree, were the iceballs. The middle WU is running and trickling normally.
The machine is an Intel Q6600, the advanced G0 stepping model. It is rated for 2.4GHz and clocked at 3.0GHz. It is in an open chassis with a huge Zalman CNPS9700 heatsink on it running at full speed. Speedfan shows the cores at 38-39C, but Speedfan under-reports these chips by 15C, so they are running at less than 55C, which for a G0 is nothing. I've torture tested the thing a couple of times without issue. It has crunched a good number of CPDN models. I don't think it exhibits any instability but you know what CPDN is like!
As always, if there is anything else you need, post or PM me.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 34464 - Posted: 1 Aug 2008, 17:29:51 UTC
Yes, your quad has a superb completion record, Adrian. And with lots of slabs, which at the moment are the model type most likely to develop abnormalities. But in the case of iceworlds we expect every model in the workunit to be affected. So if this doesn't happen it's worth looking for other possible causes.
I think that in this workunit, Windows machines make the models into iceworlds but Linux machines don't. We already know that Intel and AMD process data slightly differently, but I don't think we've noticed differences between operating systems before. These iceworlds are such a bizarre phenomenon that I think almost anything's possible.
As far as I know, nobody's discovered the real cause of iceworlds. There may be more than one cause because most start processing slowly whereas a few suddenly race ahead. Lots of credits but no usable data.
A Beta slab developed into an iceworld on my C2D and I was very glad indeed when a slab from the same WU also became an iceworld at the same point on one of Thyme Lawn's computers. No need for me to start stability testing. Adrian, I don't think you need to retest your quad.

adrianxw Joined: 31 Aug 04 Posts: 145 Credit: 2,021,020 RAC: 816	Message 34466 - Posted: 1 Aug 2008, 19:41:28 UTC
I asked this once before and don't think I ever got an answer. Can I somehow (obviously not through the regular BOINC mechanism) get EXACTLY the same WU again to run on another machine?
There is a second Q6600 machine (B4 stepping and not OC'd) sitting 1m from the G0 Q6600. It normally does not run CPDN but there is no reason why I shouldn't stick it on, and simply reduce the quota on the other quad to even things out.
I am happy to run it again as, although it may fail, I get the trickles and so am not really losing anything, and it might just help. PM me if you like.

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 34471 - Posted: 2 Aug 2008, 15:13:01 UTC Last modified: 3 Aug 2008, 10:06:01 UTC
Hi Adrian,
I don't think the programmers would be too happy about sending you a task from the exact same WU again, because Milo's said to the moderators that he doesn't think the researchers want failed WUs rerun. They just accept that some WUs fail. And in the case of this WU, they'll probably get results from one or possibly two tasks running on Linux.
Realistically, the only way we can rerun our own exact same task is if we have a backup from before the failure point or before the abnormality developed. I wouldn't worry about it if I were you.
Administrator is still running the same model on Windows here, so if his model doesn't crash for some other reason, he's effectively doing the experiment for you. Unfortunately he can't be contacted by PM because his computers are hidden.
Edit: his name is in fact Anonymous.

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 34491 - Posted: 4 Aug 2008, 18:28:19 UTC Last modified: 4 Aug 2008, 18:37:27 UTC
Adrian, Anonymous, who has Windows, now appears to have developed the iceball problem, judging from the sec/timestep of his last trickle. (Link in my above post.)
The other person with Windows can't have noticed my private message warning him; his task has been stuck in the iceworld for 3 days.
If anyone has a theory about why 3 computers with Windows have produced an iceworld with this workunit while 3 computers with Linux haven't and appear to be processing it normally, I'd be interested.
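The sec/timestep check used in the posts above (spotting a model that slows sharply or suddenly races ahead) can be sketched in a few lines. This is an illustrative Python sketch only, not a CPDN or BOINC tool: the function name `looks_abnormal`, the factor of 2 and the sample values in the note below are all assumptions made for the example.

```python
from statistics import median

def looks_abnormal(sec_per_ts, factor=2.0, min_history=3):
    """Return True if the latest sec/TS value is far from the earlier median.

    sec_per_ts: seconds-per-timestep values, oldest first (hypothetical data).
    factor: how many times slower or faster counts as suspicious (an assumed
            threshold, not a project rule).
    """
    if len(sec_per_ts) <= min_history:
        return False  # too little history to judge
    baseline = median(sec_per_ts[:-1])
    latest = sec_per_ts[-1]
    return latest > baseline * factor or latest < baseline / factor
```

With made-up trickle values such as `[1.4, 1.5, 1.4, 1.5, 4.2]` the last reading is well over twice the earlier median, so the function flags it. A flag only means the task is worth a closer look; as the thread shows, an ordinary crash can also produce odd figures.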

adrianxw Joined: 31 Aug 04 Posts: 145 Credit: 2,021,020 RAC: 816	Message 34539 - Posted: 6 Aug 2008, 18:34:53 UTC
Could be almost anything. Without a detailed knowledge of the codebase, I wouldn't know where to hazard a guess. I don't even know what compilers are used by the project, let alone options or libraries. I know (Carl told me) that there is a lot of Fortran in there.
I also recall weirdness when, ~20 years back, I was porting a big Fortran-77 fluid dynamics application from a Gould 32/77 running MPX-32 to a VAX-11/750 under VMS. Theoretically, they were both standards-compliant 32-bit systems with standards-compliant compilers, but the application results differed from the same source data - we traced it to the runtime libraries and had to write our own versions of some of the mathematics functions.
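Adrian's porting story is consistent with the general fact that floating-point results are not exactly reproducible across compilers and maths libraries. The sketch below is only an illustration in Python, not anything taken from the CPDN code: it shows that changing the grouping of additions changes the last bits of a result, and that a chaotic iteration (the logistic map, chosen here purely as a toy stand-in for a sensitive model) can turn a difference of about 1e-15 into a completely different trajectory.

```python
def logistic_orbit(x0, r=3.9, steps=200):
    """Iterate x -> r*x*(1-x), a simple chaotic map used as a toy example."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

# Same three numbers, different grouping: the rounded results differ.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # False

# Two starting values that differ by roughly 1e-15.
orbit1 = logistic_orbit(0.4)
orbit2 = logistic_orbit(0.4 + 1e-15)
largest_gap = max(abs(p - q) for p, q in zip(orbit1, orbit2))
print(largest_gap)
```

Whether anything like this explains the Windows/Linux split in this workunit is not established in the thread; the example only shows that tiny library-level differences are enough, in principle, to send a sensitive calculation down a different path.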

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 34543 - Posted: 6 Aug 2008, 19:16:20 UTC
The models are all based on the UK Met Office Unified Model, which I believe consists of a million lines of Fortran.

adrianxw Joined: 31 Aug 04 Posts: 145 Credit: 2,021,020 RAC: 816	Message 34545 - Posted: 6 Aug 2008, 19:28:10 UTC Last modified: 6 Aug 2008, 19:28:39 UTC
I recall well the older Met Office systems at Bracknell. The Cosmos, Cray II, the ETA-10 debacle, Cray Y-MP etc. Not terribly productive, but a lot of memories stirred!
I really quite liked programming in Fortran. I'm a C++ jock now though, haven't used Fortran professionally for more than 10 years. (Carl wanted to know if I wanted a job - that is how I know a little about the code base!)

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 34552 - Posted: 7 Aug 2008, 2:05:38 UTC
The Met Office is mostly in Exeter now.

wateroakley Joined: 6 Aug 04 Posts: 186 Credit: 27,123,458 RAC: 3,218	Message 34561 - Posted: 8 Aug 2008, 12:49:33 UTC - in response to Message 34552.
[mo.v wrote:] The Met Office is mostly in Exeter now.
The Bracknell Met Office site is resplendent with numerous new apartments, but would you want to live overlooking the roundabout? The Hadley Centre building has also been demolished, replaced with an Express hotel.

Kevin Rigotti Joined: 31 Aug 08 Posts: 4 Credit: 1,910,065 RAC: 0	Message 34911 - Posted: 5 Sep 2008, 18:06:36 UTC
Possible iceworld, but may be more broken than that. Had a Windows Defender crash yesterday evening and a subsequent boot problem until full power off and repairs by chkdsk, so the data files may have been compromised. Model time and date shown as 00/00/0000 00:00 ?!
Result ID: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7605217
Timestep: 154081
s/TS: 1.43
Colour: Blue
CPU: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz [x86 Family 6 Model 15 Stepping 11]
No overclocking.
I'm happy to let it continue over the weekend to see what happens. If it still looks stuck on Monday I'll abort it.

Iain Inglis Joined: 9 Jan 07 Posts: 467 Credit: 14,549,176 RAC: 317	Message 34912 - Posted: 5 Sep 2008, 18:19:45 UTC - in response to Message 34911. Last modified: 5 Sep 2008, 18:20:50 UTC
[Kevin Rigotti wrote:] Possible iceworld, but may be more broken than that. Had a Windows Defender crash yesterday evening and a subsequent boot problem until full power off and repairs by chkdsk, so the data files may have been compromised. ...
Another computer with the same characteristics as yours (i.e. Intel/Windows) has gone further in that work unit: 7605224, so it probably isn't an ice world. That isn't, of course, to say that the graphics aren't icy blue! A normal crash will do that as well - the Defender crash may have done for the model. If you have a backup, it would be worth giving that a go.

Kevin Rigotti Joined: 31 Aug 08 Posts: 4 Credit: 1,910,065 RAC: 0	Message 34913 - Posted: 5 Sep 2008, 18:44:51 UTC - in response to Message 34912.
[Iain Inglis wrote:] Another computer with the same characteristics as yours (i.e. Intel/Windows) has gone further in that work unit: 7605224, so it probably isn't an ice world. That isn't, of course, to say that the graphics aren't icy blue! A normal crash will do that as well - the Defender crash may have done for the model. If you have a backup, it would be worth giving that a go.
Sadly, I have only ever done a weekly backup at home - now that I'm doing this it might be a good point at which to change to daily - so as it's probably broken I'll abort the stuck task now and free up a chunk of CPU for a new one.

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 34914 - Posted: 5 Sep 2008, 19:07:07 UTC
If you make big automated backups of everything while BOINC + model are running, there's no guarantee that the BOINC folder contents (or BOINC6 Data folder contents) will be restorable after a crash. I think that ghosted backups made with BOINC running are more likely to be restorable.
To be certain of restorability it's worth stopping the model and exiting from BOINC every couple of days, then backing up the contents of the BOINC (or BOINC Data) folder. The READMEs linked in my signature contain a collection about backups.
I apologise if you already know all this!

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 34917 - Posted: 6 Sep 2008, 2:28:09 UTC
Regarding one of Adrianxw's iceworlds, more than a month ago I sent private messages to three crunchers sharing the same workunit to warn them about the potential problem. A model belonging to one of these crunchers has now been stuck in an iceworld for more than a month. The owner hasn't noticed my private message. (The model will eventually abort with the error 'Maximum CPU time exceeded', but that could be in October...)
In your project preferences please enable email notification of private messages! Even if you don't often visit the forum you will then know if a PM is waiting for you!

Kevin Rigotti Joined: 31 Aug 08 Posts: 4 Credit: 1,910,065 RAC: 0	Message 34921 - Posted: 6 Sep 2008, 6:48:45 UTC - in response to Message 34914.
[mo.v wrote:] If you make big automated backups of everything while BOINC + model are running there's no guarantee that the BOINC folder contents (or BOINC6 Data folder contents) will be restorable after a crash. I think that ghosted backups made with BOINC running are more likely to be restorable. To be certain of restorability it's worth stopping the model and exiting from BOINC every couple of days, then backing up the contents of the BOINC (or BOINC Data) folder. The READMEs linked in my signature contain a collection about backups. I apologise if you already know all this!
Vista's built-in backup does a shadow copy so should be OK (?) I admit I hadn't read the READMEs (thanks for the reminder). It was only 3 days in, so I wasn't too attached to it.

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 34923 - Posted: 6 Sep 2008, 10:44:10 UTC Last modified: 6 Sep 2008, 10:57:06 UTC
I assume that Vista's shadow copy is made with BOINC running. There's no guarantee that these backups will be restorable.
I have to say that nobody as far as I know has systematically tested background backups for restorability. For each program that does this sort of thing you'd have to make 20 or 30 backups at different times and test whether every backup restored successfully and ran. So we judge from reports on the forums. One member reported that his backups using this sort of built-in feature all restored properly and ran. But then he lost a model because one backup wouldn't restore. It's just luck whether or not the backup made with the model running catches the model at particular moments in its processing (probably when it's writing to disk).
I'd say keep using your shadow copy program, but in addition every two or three days stop the model, exit from BOINC and use one of the tried and tested methods described in the README collection. I just use Les's easy manual method, which is the first method described in the README collection. It only takes a few minutes and his restore method is just as quick and easy. I don't think we've ever had a forum report of a failure using this method.
Making regular restorable backups really is worthwhile. Doing this almost guarantees that the cruncher will be able to complete every healthy model (a few models turn out to be defective). Anyway, that's what I'd do.
On the other hand, if you don't want to spend time on manual backups you could test a couple of your shadow backups to see whether they restore and run. If they do run you'd have to accept that an occasional shadow backup might be no good. If that happened you could try to restore the previous shadow backup instead.
If you test this, make a couple of backups using Les's method (after exiting from BOINC) before you start. If you do test Vista's shadow backups for restorability it would be useful if you could report back to let us know what happens.
There's also a really useful list of ways to avoid model crashes in the README collection about crashes and problems. It's item #6 by MikeMars.
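The advice above (exit BOINC first, then copy the whole BOINC data folder) could be scripted roughly as follows. This is a hedged sketch, not Les's method from the README collection and not an official tool: the function name `backup_boinc_folder` and the timestamped folder naming are assumptions, any paths passed in are placeholders, and the script does not check that BOINC has actually been shut down, so that step remains manual.

```python
import shutil
from datetime import datetime
from pathlib import Path

def backup_boinc_folder(boinc_dir, backup_root):
    """Copy boinc_dir into a new timestamped folder under backup_root.

    Run this only after stopping the model and exiting BOINC, so that no
    files are being written during the copy.
    """
    source = Path(boinc_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"BOINC folder not found: {source}")
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"boinc-backup-{stamp}"
    shutil.copytree(source, target)  # raises an error if target already exists
    return target
```

Keeping each copy in its own timestamped folder means an older backup is still available if the latest one turns out not to restore. Restoring would be the reverse copy, again with BOINC shut down.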

wedgef5 Joined: 19 Jun 08 Posts: 2 Credit: 739,082 RAC: 0	Message 35062 - Posted: 20 Sep 2008, 17:46:43 UTC
This looks like an iceball to me. Progress has been stuck at ~97% for a few weeks.
Current timestep: 25/07/2064 02:30
s/TS = 2.99
Temperature Color = solid blue
Machine: Pentium(R) 4 CPU 3.00GHz [x86 Family 15 Model 4 Stepping 3], No overclocking.

astroWX Volunteer moderator Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0	Message 35063 - Posted: 20 Sep 2008, 20:21:42 UTC
If it were mine, I'd pull the plug. (They can sometimes be saved by transferring a backup to another machine type, Intel to AMD or vice versa, but there's no guarantee it will work.)
Irritating to get so close and then see it fail, but still, the work done will be of use to the researchers.
Welcome to the Boards.
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
