hadcm3n affecting other projects, computer crash if running a long time

Joe's Climate
Joined: 10 Dec 11
Posts: 11
Credit: 251,696
RAC: 0
Message 45407 - Posted: 1 Jan 2013, 21:28:33 UTC

I've definitely noticed that the current hadcm3n appears to make other projects return computation errors. I'm getting computation errors for milkyway, cosmology, einstein, and seti.

I haven't yet established whether this is due to hadcm3n, but if I leave the computer running for more than about 3 days, it eventually locks up.

The lockup caused one climate prediction hadcm3n task to fail, while the second one went from 400 hr done with 300 hr to go, to its current 512 hr done with 1354 hr to go.

I'll probably try a boinc shutdown and restart again to see if it is actually boinc/climate prediction causing the issue.
ID: 45407
Belfry
Joined: 19 Apr 08
Posts: 179
Credit: 4,306,992
RAC: 0
Message 45408 - Posted: 1 Jan 2013, 22:54:17 UTC

Hi Joe, my hadcm3n tasks run fine alongside WCG and Rosetta on Ubuntu. Since your machine runs two tasks at a time, you should check the "leave applications in memory while suspended" box in the advanced memory settings of BOINC Manager. Another thing that might be happening is that you're downloading too many of the other projects' tasks, so they end up running at high priority and competing with file managers and other applications for disk access. I turn off work fetching for CPDN and set my work buffer to two days when I fetch work for WCG and Rosetta.
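The memory and buffer settings described above can also be set outside BOINC Manager. The following is only a minimal sketch, assuming a client that reads global_prefs_override.xml from its data directory. The helper name write_prefs_override and its directory argument are hypothetical, and the two-day buffer is simply the value mentioned in the post.

```shell
#!/bin/sh
# Sketch: write a BOINC preferences override file.
# Assumptions: the client reads global_prefs_override.xml from its data
# directory; write_prefs_override is an illustrative helper, not part of BOINC.

write_prefs_override() {
    data_dir="$1"
    target="$data_dir/global_prefs_override.xml"

    # Refuse to clobber an override file that already exists.
    if [ -e "$target" ]; then
        echo "refusing to overwrite $target" >&2
        return 1
    fi

    # Keep suspended tasks in memory and hold a two-day work buffer.
    cat > "$target" <<'EOF'
<global_preferences>
    <leave_apps_in_memory>1</leave_apps_in_memory>
    <work_buf_min_days>2</work_buf_min_days>
</global_preferences>
EOF
}
```

Once the file is in the real data directory (its path varies by install), "boinccmd --read_global_prefs_override" asks a running client to reload it. Work fetch for a single project can be turned off with "boinccmd --project <project-URL> nomorework".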
ID: 45408
Joe's Climate
Joined: 10 Dec 11
Posts: 11
Credit: 251,696
RAC: 0
Message 45411 - Posted: 6 Jan 2013, 21:53:45 UTC

Hi Belfry,
I believe you are right about running two projects at once; that could be a cause of this problem.
I have had 2 different projects running at once in the past (concerning climate change), but this particular run seems to be throwing errors at the other projects. Maybe it happened in the past too, and I'm just paying more attention now.

I recall that in the past some projects, particularly ones that were just starting out, appeared to have difficulty sharing resources, but they did figure out a way around that, so I do think solutions are available. This is also going to become more of a problem as more computers become multi-CPU, so programs do need to be more aware of sharing resources.

I have some interest in programming, but I know I wouldn't have time to help here right now even if I wanted to. Based on past knowledge and experience, though, I think it may be worth looking at the main math routines used by other projects, which might give some ideas on getting some co-operation happening. I'd suggest looking at the key math routines used in einstein, milkyway, and seti; they seem to have things figured out in these key areas so that things don't clash so much.

Only noticed your reply now - I'll have to check to see if there is a tick-box or something to notify me of reply messages. Thanks for the prompt detailed reply.
Joe
ID: 45411
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45412 - Posted: 6 Jan 2013, 22:16:39 UTC - in response to Message 45411.  

These models DO co-exist nicely with work from other projects.
It's just a matter of understanding how BOINC reacts when it has very short running work from other projects, as well as the very long work from this project.

Also, always suspend BOINC before exiting from it, to give the many files used by the project's work time to close down.
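The suspend-then-exit routine can be scripted so it is a single step. This is a rough sketch, assuming boinccmd is installed and allowed to talk to the running client. The function name graceful_boinc_stop, the 30-second pause, and the 300-second suspension are illustrative guesses, not documented requirements.

```shell
#!/bin/sh
# Sketch: suspend computation, pause, then ask the client to exit.
# BOINCCMD can be overridden; the wait and suspension lengths are guesses.
BOINCCMD="${BOINCCMD:-boinccmd}"

graceful_boinc_stop() {
    wait_seconds="${1:-30}"    # pause between suspending and quitting
    suspend_seconds=300        # a duration makes the suspension temporary

    # Suspend computation so running tasks stop and can close their files.
    "$BOINCCMD" --set_run_mode never "$suspend_seconds" || return 1

    # Give the tasks a moment to finish closing files.
    sleep "$wait_seconds"

    # Ask the client itself to exit.
    "$BOINCCMD" --quit
}
```

Calling graceful_boinc_stop from a shutdown script or a desktop shortcut would keep the routine close to run-and-forget. The right pause length depends on the machine and is worth checking by hand first.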

Programming for this project is only done by people employed by Oxford e-Research Centre, because, as has been mentioned many times, the programs used are owned by the UK Met Office. And the core code is close to a million lines of Fortran, so not a trivial matter to change.


ID: 45412
Belfry
Joined: 19 Apr 08
Posts: 179
Credit: 4,306,992
RAC: 0
Message 45418 - Posted: 7 Jan 2013, 20:54:30 UTC

I don't mean that two projects running at the same time would cause the problem; rather, look at the BOINC memory settings and/or work fetch queuing.

Math in one project won't interfere with that of another--unless said math is overheating your processor. I looked at some of your other projects' tasks, and they seem to error on file accesses. You should see if the user running BOINC has full read/write file permissions throughout the BOINC data directory.
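One way to run that permissions check is sketched below. It assumes a POSIX-style find; the function name check_boinc_perms is hypothetical, and the data directory and account name have to be supplied from your own install.

```shell
#!/bin/sh
# Sketch: list entries under a BOINC data directory that look wrong for the
# account that runs BOINC. Both arguments come from the caller; nothing here
# is specific to any one distribution.

check_boinc_perms() {
    data_dir="$1"   # BOINC data directory
    owner="$2"      # user name or numeric uid that runs the BOINC client

    # Print anything not owned by that user, or lacking owner read+write bits.
    find "$data_dir" \( ! -user "$owner" -o ! -perm -600 \) -print
}
```

An empty result means every entry is owned by the given account and has owner read and write bits set. Any path that is printed is a candidate for the file-access errors seen in the other projects' tasks.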
ID: 45418
Joe's Climate
Joined: 10 Dec 11
Posts: 11
Credit: 251,696
RAC: 0
Message 45419 - Posted: 8 Jan 2013, 0:02:55 UTC - in response to Message 45418.  

Hi Les,
Thanks for the suggestion about suspending before shutting down, but doing tasks like that is a bit of extra work, which realistically shouldn't be needed if we are supposed to run BOINC as run-n-forget. For now, I'll just leave the computer running 24/7 while this WU is running, but my preference has always been to shut off the computer when I'm done.

I also took a look at your setup and both your machines are XP, which are pretty well single-user computers, which to me most likely means you've got the boinc manager handy on your desktop so that you can suspend right away before you shut down. With linux, it's not that much more difficult to create a second or Nth user and let boinc run isolated in that other user account, so even if it messed up that account, it wouldn't affect your own stuff (it's just a different way of thinking about security, and if you want to read up, I've got it more or less set up like this: http://www.joescat.com/boinc/ ).

The other difference between XP and linux is that in XP, it seems safer to actually suspend boinc when you become busy with something (move your mouse), but with linux, you use the "nice" command to give other programs priority. Climate prediction may slow down to give priority to other tasks, but it never stops, so if you look at where the CPU is allotting time, it's running climate prediction at almost 100% even between mouse movements and keystrokes. It's just a little bit of a different concept.

I'm guessing the best way to describe this would be running boinc at 100% with no suspend while also running some graphical first-person-shooter-type game on an older version of directX. If I recall right, the older versions of directX had problems sharing the math coprocessor with other programs, so you would have conflicts between boinc and directX (I think you can find some old bugs in the boinc buglist related to directX coexistence). I'm guessing this may be a little similar here, without really looking at the code itself to see where the problem lies.


Hi Belfry,
Good point about cpu overheating. It is a tower/desktop machine with fairly good ventilation, but overheating could be a recent development.
I normally have 2 projects running at a time, but these climate tasks (or at least the recent one whose hours ballooned) were causing all secondary projects to fail within 1 minute of starting, so I'm more inclined to think that the climate calculations aren't sharing the math registers nicely (similar to the boinc and directX coexistence issue mentioned above). I do recall looking at milkyway a while ago and seeing some interesting code just to deal with math going through the math coprocessor, so if climate prediction is doing something that assumes it has 100% of the coprocessor's attention, then we're going to have issues as tasks flip back and forth between time slices. The reason for mentioning milkyway is that I think it had some coexistence issues in some of its earliest versions. If not milkyway, it might have been cosmology or einstein, but I don't recall 100% now.
If it is Fortran code, I can understand the difficulty you mention in getting similar fixes inserted.

Oh well, let's see where this goes for now... just running 1 project instead of 2. Thanks for all your suggestions.
Joe
ID: 45419
Belfry
Joined: 19 Apr 08
Posts: 179
Credit: 4,306,992
RAC: 0
Message 45438 - Posted: 11 Jan 2013, 19:39:47 UTC - in response to Message 45419.  
Last modified: 11 Jan 2013, 19:40:41 UTC

Hi Joe, from the links to your other projects it looks like you're still crashing tasks, even though no CPDN tasks are on your machine. Is the partition containing the BOINC data directory mounted with restrictive options or ACLs? Can you run 2 threads of mprime for 24 hours?
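A quick way to look for restrictive mount options is sketched here. The helper restrictive_mount_opts is hypothetical, and treating only "ro" and "noexec" as restrictive is an assumption made for this example, not a complete list.

```shell
#!/bin/sh
# Sketch: pick out mount options that commonly break BOINC tasks.
# Treating only "ro" and "noexec" as restrictive is an assumption made for
# this example, not an exhaustive list.

restrictive_mount_opts() {
    # $1 is a comma-separated option string such as "rw,noexec,relatime".
    # Prints each matching option on its own line; prints nothing (and
    # returns non-zero) if none match.
    printf '%s\n' "$1" | tr ',' '\n' | grep -E -x 'ro|noexec'
}
```

On a system with util-linux, the option string for the filesystem holding the data directory can be obtained with "findmnt -n -o OPTIONS -T path-to-boinc-data-directory" and passed to the function. ACLs are a separate check, for example with getfacl where it is installed.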

You can try these commands (first suspend BOINC network activity, shut down BOINC, and back up the data directory; run them as root, replacing path-to-boinc-data-directory and boinc-user with the particulars from your install):

cd path-to-boinc-data-directory
chown -R boinc-user:boinc-user ./
chmod 755 ./
find . -type d -exec chmod 0771 {} \;


If this doesn't help then you might try detaching and reattaching to each project.

Good luck.
ID: 45438
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45441 - Posted: 11 Jan 2013, 22:16:44 UTC

Crashes on other projects seem to be about not being able to open files, which may indicate a permissions problem. That's common with Linux systems.

Or else some other program is locking/using a critical file.

Either way, it's Joe's computer that's the problem.

I'd suggest setting all but one project to No new tasks, and concentrating on finding out why the work from that one is failing.
That will save on wasted time, electricity, and WUs.
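Setting every project but one to No new tasks can be done from the command line as well as from BOINC Manager. This is a sketch only, assuming boinccmd can reach the running client. The function name isolate_project is hypothetical, and the project URLs must be the ones your client is actually attached to.

```shell
#!/bin/sh
# Sketch: allow new work for one project and stop work fetch for the rest.
# BOINCCMD can be overridden; isolate_project is an illustrative helper.
BOINCCMD="${BOINCCMD:-boinccmd}"

isolate_project() {
    keep_url="$1"   # the one project that may still fetch work
    shift           # remaining arguments: every attached project URL

    for url in "$@"; do
        if [ "$url" = "$keep_url" ]; then
            "$BOINCCMD" --project "$url" allowmorework
        else
            "$BOINCCMD" --project "$url" nomorework
        fi
    done
}
```

The first argument is the project to keep, followed by the full list of attached project URLs. Tasks already downloaded are not removed, so the other projects' existing work may still need to finish or be aborted.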


ID: 45441
Belfry
Joined: 19 Apr 08
Posts: 179
Credit: 4,306,992
RAC: 0
Message 45448 - Posted: 12 Jan 2013, 19:08:18 UTC

Definitely a permissions problem. Those commands I wrote might help.

Joe, just how are you starting BOINC? Installations through package managers generally require root to start up via the /etc/init.d/boinc-client script. In Ubuntu, "sudo /etc/init.d/boinc-client start" works. I'm not sure about Mageia (although it should be Red Hat-like: "su - ; systemctl start boinc-client").
ID: 45448
