Model stops, cpu goes idle

Thunder
Joined: 1 Sep 04
Posts: 42
Credit: 6,475,117
RAC: 0
Message 29882 - Posted: 7 Aug 2007, 16:28:44 UTC

This model:

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6718407

Has twice now stopped (it may be that when it resumes, it doesn't actually resume) and BOINC says it's running, but CPU time and % done never change.

The first time it happened, I figured it was a fluke and stopped/restarted BOINC. The model process did not stop and I had to kill its process by hand.

Anyone know what may be causing this or if this is just a bad model?

Lockleys
Joined: 13 Jan 07
Posts: 195
Credit: 10,581,566
RAC: 0
Message 29883 - Posted: 7 Aug 2007, 16:51:42 UTC - in response to Message 29882.  

This model:

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6718407

Has twice now stopped (it may be that when it resumes, it doesn't actually resume) and BOINC says it's running, but CPU time and % done never change.

The first time it happened, I figured it was a fluke and stopped/restarted BOINC. The model process did not stop and I had to kill its process by hand.

Anyone know what may be causing this or if this is just a bad model?


I've had this twice. The first time was on the BBC models and a reboot of the PC seemed to fix it. The second time was on CPDN proper and it went on to crash (quickly, i.e. within 2 or 3 minutes) so I had to restore a backup and resume processing from there. Both seemed to be an unsatisfactory consequence of Resume. I concluded that I had chosen a bad moment to Suspend, possibly too close to an incomplete Checkpoint, so now I always wait until the model is several timesteps beyond the Checkpoint before I Suspend.

Hope this helps. Someone with more knowledge will probably tell us it's something else entirely :)

Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 29886 - Posted: 7 Aug 2007, 19:40:38 UTC

Another thing to watch out for is BOINC Manager deciding to run its periodic benchmark test. This delays the model starting for a few minutes at most, but if I'm in an impatient mood, I've sometimes mistaken that for a freeze of some sort. It's easy to see if that really is the cause, because it's listed in the Messages tab of BOINC Manager - and it will happen only once every week or so.

Thunder
Joined: 1 Sep 04
Posts: 42
Credit: 6,475,117
RAC: 0
Message 29887 - Posted: 7 Aug 2007, 20:40:28 UTC - in response to Message 29886.  

I appreciate the suggestion, but it's definitely not benchmarking at the time this happens. After a second time of shutting down BOINC, killing the CPDN process and restarting, I've now watched it process for a few hours, suspend once and successfully restart. I guess I'll see if it keeps behaving correctly... :)

Another thing to watch out for is BOINC Manager deciding to run its periodic benchmark test. This delays the model starting for a few minutes at most, but if I'm in an impatient mood, I've sometimes mistaken that for a freeze of some sort. It's easy to see if that really is the cause, because it's listed in the Messages tab of BOINC Manager - and it will happen only once every week or so.


mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 29888 - Posted: 7 Aug 2007, 21:17:40 UTC
Last modified: 7 Aug 2007, 21:34:35 UTC

Hi Thunder

This is the problematic model:

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6718407

It produced 4 trickles at 6-hourly intervals, which is/was good.

1) In your first post, when you said the model had stopped twice but boinc said it was running although the CPU time and % done didn't change, did the model graphics display? (If the globe did display it means the model actually was running.) I suspect that as you said the CPU was idle, the model had stopped running. I presume you mean the CPU % graph display in Task Manager?

2) Same first post. When you stopped and restarted boinc, if you hadn't suspended the model before exiting boinc, the model would start up again automatically. How exactly did you 'kill its process by hand'?

3) I see that the model hasn't shown any trickles for 12 hours, but I'm not sure what the delay is for them to be displayed. Is this model running at the moment? If it's running, could you look at its graphics frequently and jot down the model dates on paper. I'm wondering whether this model is a looper. There's an item about loopers in the project READMEs linked to in my sig. Loopers get stuck - sometimes apparently a flaw in the model, sometimes a calculation glitch on the computer - and repeat a day, then a month, then a model year. If they still can't get through the sticking point they're supposed to abort themselves. I don't think your model has had enough time to get through this whole process yet.

If it does turn out to be a looper, the only method we know of that sometimes rescues them and gets them through the sticking point is to transfer them from an AMD machine to Intel or vice-versa. Only worthwhile if you're certain it's looping, and you'd have to decide whether you want to spend the time on a model that's only crunched 4 years.

4) When a few members have had a problem with models not restarting after benchmarks, I don't think that boinc manager showed these models as running. The problem had to be solved by exiting and restarting boinc, after which these models ran normally until the next benchmarks.
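
The looper check in point 3 (jot down the model dates and see whether they repeat) could be done from a list of noted dates instead of by eye. What follows is only a minimal sketch, assuming the dates are written down in the order they were observed; the function name and the sample dates are made up for illustration and are not part of BOINC or the CPDN software.

```python
from datetime import date


def find_rewinds(observed_dates):
    """Return (earlier_note, later_note) pairs where the model date went backwards.

    A healthy model only moves forward in model time, so a rewind between
    two consecutive notes suggests it has gone back to repeat a period.
    """
    rewinds = []
    for previous, current in zip(observed_dates, observed_dates[1:]):
        if current < previous:
            rewinds.append((previous, current))
    return rewinds


# Hypothetical notes: model dates jotted down at successive looks.
notes = [date(1814, 3, 1), date(1814, 3, 20), date(1814, 3, 2), date(1814, 3, 21)]
print(find_rewinds(notes))
```

A single rewind would not prove looping on its own, since restoring a backup also moves the model date back; repeated rewinds over the same stretch are the pattern to look for.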


Cpdn news

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 29889 - Posted: 7 Aug 2007, 21:26:38 UTC
Last modified: 7 Aug 2007, 21:28:47 UTC

Hi Lockleys. You said

'I concluded that I had chosen a bad moment to Suspend, possibly too close to an incomplete Checkpoint, so now I always wait until the model is several timesteps beyond the Checkpoint before I Suspend.'

I don't think we've had any other reports of suspending creating a problem when it's done while the model has stopped to carry out its calculations. But we do recommend suspending when the model is a few timesteps past its checkpoint, just to make sure. We'll look out for any further reports of this possible problem.

Cpdn news

Thunder
Joined: 1 Sep 04
Posts: 42
Credit: 6,475,117
RAC: 0
Message 29890 - Posted: 7 Aug 2007, 22:11:29 UTC - in response to Message 29888.  

1) In your first post, when you said the model had stopped twice but boinc said it was running although the CPU time and % done didn't change, did the model graphics display? (If the globe did display it means the model actually was running.) I suspect that as you said the CPU was idle, the model had stopped running. I presume you mean the CPU % graph display in Task Manager?


I knew I should have explained more... It's a service installation, so I don't use the screensaver. I monitor it through BOINCview and rarely actually visit the machine itself. Hence, no graphics ever display. In fact, I don't *think* the graphics display process even runs on it (I can't remember ever looking for it specifically). In any case, I used Windows Task Manager to see that the model process was still resident, still using some memory, but 0% CPU.

2) Same first post. When you stopped and restarted boinc, if you hadn't suspended the model before exiting boinc, the model would start up again automatically. How exactly did you 'kill its process by hand'?


I shut down the BOINC manager, then stopped the BOINC service. From countless other service installs, I know that this normally causes all the client processes to exit as well. The CPDN process did not. I had to use 'End Process' from Task Manager to kill the ornery sucker.

3) I see that the model hasn't shown any trickles for 12 hours, but I'm not sure what the delay is for them to be displayed. Is this model running at the moment? If it's running, could you look at its graphics frequently and jot down the model dates on paper. I'm wondering whether this model is a looper. There's an item about loopers in the project READMEs linked to in my sig. Loopers get stuck - sometimes apparently a flaw in the model, sometimes a calculation glitch on the computer - and repeat a day, then a month, then a model year. If they still can't get through the sticking point they're supposed to abort themselves. I don't think your model has had enough time to get through this whole process yet.

If it does turn out to be a looper, the only method we know of that sometimes rescues them and gets them through the sticking point is to transfer them from an AMD machine to Intel or vice-versa. Only worthwhile if you're certain it's looping, and you'd have to decide whether you want to spend the time on a model that's only crunched 4 years.


The lack of trickles was due to the fact that it was 'stuck' again when I came to the office this morning. Upon restarting, it dedicated 1hr to E@H then resumed (apparently correctly) the CPDN model. It has since trickled again. I read the bit on loopers, but without tweaking the service so it can interact with the desktop, I can only see the % done, not dates. I can only verify for sure that when the CPU use shows the model is running, the % done advances and has not apparently decreased at any point.

4) When a few members have had a problem with models not restarting after benchmarks, I don't think that boinc manager showed these models as running. The problem had to be solved by exiting and restarting boinc, after which these models ran normally until the next benchmarks.


The first time it stopped, it may have been when the machine came out of a benchmark (last benchmark was ~ 2 days ago), but I know with 100% certainty that the second time, it either stopped 'mid-stream' while running, or it stopped at the beginning of a suspend/resume time slice. Unfortunately, since there's no problem indicated in the message log, I'd pretty much have to stare at the BOINC manager for a few hours solid to be positive which is the case.
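
Rather than staring at the manager, the symptom described in this thread (task reported as running while CPU time and % done never change) could be checked from periodic readings. The snippet below is only a sketch under that assumption: the function name, the sample values and the choice of three unchanged readings are hypothetical, and how the readings are collected (BOINCview, the manager, a script) is left open.

```python
def is_stalled(samples, min_samples=3):
    """Report whether the most recent readings show no progress at all.

    samples: (cpu_seconds, percent_done) tuples in time order, each taken
    while BOINC reported the task as running.
    """
    if len(samples) < min_samples:
        return False  # not enough readings to judge
    recent = samples[-min_samples:]
    return all(reading == recent[0] for reading in recent)


# Hypothetical readings taken a few minutes apart.
healthy = [(3600.0, 12.40), (3890.5, 12.41), (4180.2, 12.43)]
stuck = [(3600.0, 12.40), (3600.0, 12.40), (3600.0, 12.40)]
print(is_stalled(healthy), is_stalled(stuck))
```

The readings need to be spaced far enough apart that a working model would have moved; a short benchmark run or a time slice given to another project would otherwise look like a stall.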

Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 29891 - Posted: 7 Aug 2007, 22:28:20 UTC

The model is a slab model: do they loop? Migrating from the BBC side of the project, where there were only coupled models, I've never come across a slab model looping.

Lockleys
Joined: 13 Jan 07
Posts: 195
Credit: 10,581,566
RAC: 0
Message 29892 - Posted: 8 Aug 2007, 6:26:35 UTC - in response to Message 29889.  

Hi Lockleys. You said

'I concluded that I had chosen a bad moment to Suspend, possibly too close to an incomplete Checkpoint, so now I always wait until the model is several timesteps beyond the Checkpoint before I Suspend.'

I don't think we've had any other reports of suspending creating a problem when it's done while the model has stopped to carry out its calculations. But we do recommend suspending when the model is a few timesteps past its checkpoint, just to make sure. We'll look out for any further reports of this possible problem.


Yeah. I had a habit of letting it get to 432 and then clicking Suspend immediately. I ended up believing that this may have been a few seconds too keen to allow the disc write to complete. Could be wrong.

MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 29893 - Posted: 8 Aug 2007, 7:22:25 UTC - in response to Message 29890.  

...
I shut down the BOINC manager, then stopped the BOINC service. From countless other service installs, I know that this normally causes all the client processes to exit as well. The CPDN process did not. I had to use 'End Process' from Task Manager to kill the ornery sucker.
...



This is dangerous, and often terminates the model. If this happens to me I do the following first:

boinccmd --quit

and if that doesn't work, then reboot
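
The order above (ask BOINC to quit, give the model time to exit, and only then escalate) could be sketched roughly as follows. This is an illustration rather than a tested tool: `stop_gracefully` and the two callables passed to it are hypothetical names, and the 60-second wait is an arbitrary assumption. Only `boinccmd --quit` comes from the advice itself.

```python
import time


def stop_gracefully(request_quit, model_running, timeout_s=60.0, poll_s=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Ask BOINC to quit, then wait for the model process to exit.

    request_quit:  callable that sends the quit request (hypothetical hook).
    model_running: callable returning True while the model process exists.

    Returns True if the model exited by itself, False if the caller should
    move on to the next step (a reboot, in the advice above).
    """
    request_quit()
    deadline = clock() + timeout_s
    while clock() < deadline:
        if not model_running():
            return True
        sleep(poll_s)
    # One last check after the deadline has passed.
    return not model_running()
```

For `request_quit` one might pass something like `lambda: subprocess.run(["boinccmd", "--quit"])`; the `model_running` check depends on the install and is left to the reader.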
I'm a volunteer and my views are my own.
News and Announcements and FAQ

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 29894 - Posted: 8 Aug 2007, 11:28:46 UTC
Last modified: 8 Aug 2007, 11:30:30 UTC

Do slab models loop? I don't know whether these new ones can. The first time I ever encountered the looping problem was when the HadCM coupled models were introduced. When we used to run the old Classic slab models, looping was never mentioned. The Seasonal HadAM models can loop.

Thunder's model seems to have lost about 7 or 8 hours of forward crunching/progress yesterday (7 Aug) but has since then trickled twice at about the expected interval. What the problem really consisted of is still a mystery to me.

Two other crunchers have had this same model:

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6076772

Brian appears to have already crashed the model, but if Markus makes progress with it, it could be interesting to see whether he encounters a hitch at roughly the same point, i.e. after the same number of trickles.

Thunder, I'd recommend backing up the contents of your boinc folders that contain cpdn models at least once a week. Don't bother doing this for computers not running cpdn. You need to exit from boinc before backing up. Through my sig you can reach the project READMEs where there's a selection of backup methods. If a model does crash, restoring a backup is the only way to get it back and continue crunching it.
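
One way to script such a backup is to zip the whole BOINC folder into a dated archive. This is a minimal sketch and not one of the methods from the project READMEs; the function name and the archive naming are assumptions, and the folder paths have to be supplied for the machine in question.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_folder(boinc_dir, backup_root):
    """Zip the whole BOINC data folder into a timestamped archive.

    Exit BOINC first, so that no model files change during the copy.
    backup_root should sit outside boinc_dir, otherwise the archive
    would end up inside the folder it is archiving.
    """
    boinc_dir = Path(boinc_dir)
    backup_root = Path(backup_root)
    backup_root.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_base = backup_root / f"boinc-backup-{stamp}"
    # make_archive appends ".zip" and returns the full archive path.
    return Path(shutil.make_archive(str(archive_base), "zip", root_dir=str(boinc_dir)))
```

Restoring would mean exiting BOINC again and unpacking the archive back over the same folder.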

Cpdn news

Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 29895 - Posted: 8 Aug 2007, 15:01:57 UTC - in response to Message 29892.  

Hi Lockleys. You said

'I concluded that I had chosen a bad moment to Suspend, possibly too close to an incomplete Checkpoint, so now I always wait until the model is several timesteps beyond the Checkpoint before I Suspend.'

I don't think we've had any other reports of suspending creating a problem when it's done while the model has stopped to carry out its calculations. But we do recommend suspending when the model is a few timesteps past its checkpoint, just to make sure. We'll look out for any further reports of this possible problem.


Yeah. I had a habit of letting it get to 432 and then clicking Suspend immediately. I ended up believing that this may have been a few seconds too keen to allow the disc write to complete. Could be wrong.


I wait until 424: I'm convinced that 425 is extra-long.

Thunder
Joined: 1 Sep 04
Posts: 42
Credit: 6,475,117
RAC: 0
Message 29896 - Posted: 8 Aug 2007, 16:45:03 UTC - in response to Message 29893.  

boinccmd --quit

and if that doesn't work, then reboot


Thanks for the tip Mike, it should have occurred to me to try the command line. I'll use that in the future if any other oddity occurs.

Thunder's model seems to have lost about 7 or 8 hours of forward crunching/progress yesterday (7 Aug) but has since then trickled twice at about the expected interval. What the problem really consisted of is still a mystery to me.


Agreed, mo.v... I don't think I'm anywhere closer to understanding what caused this, but at least I've learned a few things. The GOOD news (despite the lack of a 'resolution') is that the model has run for nearly 24 hours and appears quite normal now. It's suspended/resumed 4 times without so much as a hiccup.

If it has any other issues, I'll be sure to post here, but I'm going to assume this was just due to some randomness (cosmic ray hit flipped a bit somewhere?) and press on. :) Thanks for the good info that everyone has provided so far! :)

Thunder
Joined: 1 Sep 04
Posts: 42
Credit: 6,475,117
RAC: 0
Message 29914 - Posted: 9 Aug 2007, 15:34:49 UTC

Well, unfortunately, when I checked on the model this morning, the exact same thing had happened.

Using boinccmd --quit did not work. The CPDN process remained loaded so I had to terminate it again. In fact, the boinc service continued running, so I don't think using the command line is necessarily the right thing for a service install, but I'm not positive about that.

MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 29935 - Posted: 10 Aug 2007, 6:50:25 UTC


There's also 'net stop boinc', but I've never used it so I don't know whether it does what you want.

I'm a volunteer and my views are my own.
News and Announcements and FAQ

Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 30104 - Posted: 20 Aug 2007, 20:48:12 UTC - in response to Message 29894.  
Last modified: 20 Aug 2007, 20:55:55 UTC

[mo.v wrote:] Do slab models loop? I don't know whether these new ones can. The first time I ever encountered the looping problem was when the HadCM coupled models were introduced. When we used to run the old Classic slab models, looping was never mentioned. The Seasonal HadAM models can loop. ...

The new models do loop: I've just aborted this slab model, which had run for several days with no progress, despite also being restored from a week-old back-up (which is quite a distance for a slab model).

There are some error messages in the listing that are new to me:

Can't set up shared mem: -1
Will run in standalone mode.


Oddly, I also had the next work unit in the sequence and that finished fine.

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 30105 - Posted: 21 Aug 2007, 0:46:13 UTC

Iain, I see that your computers are all Intels, but it would be interesting if someone who has a looping slab model and both an Intel and an AMD machine could restore a pre-loop backup to the other computer, to see whether the slightly different calculation method gets the model through the loop.
Cpdn news