Persistent crashes with all three models

batan
Joined: 25 Aug 07
Posts: 17
Credit: 80,535
RAC: 0
Message 33322 - Posted: 13 Apr 2008, 17:37:01 UTC

Hi. On my computer all three model types persistently crash. I have tried two other computers, which run fine, but they are too slow.

It was suggested that I test the memory of my PC (Dell Core 2 Duo). I did that, and I also ran a stress test extensively (I don't remember the name; PI calculations or something like that) without any errors. Also, no other applications on my PC ever crash, and I use memory- and CPU-intensive apps.

I stopped BOINC when running intensive work. I set BOINC to run on only one core to avoid possible multithreading problems. Nothing helped. Everything else runs fine (including other BOINC apps), but cpdn keeps crashing.
The hadsm models are the worst: they crash before reaching 1%. After the hadam models came out, I hoped that these would work (no memory shortage), since I had successfully completed one or two (?) hadam models from the sister project (seasonal attribution) with no problems. But now that I get the hadam models from cpdn, they crash too.

My PC has 4 GB of RAM installed, but since 32-bit WinXP is installed, only 3 GB are available. Could the extra 1 GB that Windows does not recognize be at fault?

My question: is there any point in letting my computer keep requesting models from the project in the hope that one or another runs well?
ID: 33322
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 33330 - Posted: 13 Apr 2008, 20:24:40 UTC


Sorry to hear about your bad luck with all of the models.
I'm afraid that there's been a recent change to the server code, which has removed all of the important error messages from crash listings, so until this gets fixed, the best that we can do is to refer you to the READMEs, which contain links to help files.
Something in there may help you.


Backups: Here
ID: 33330
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 33333 - Posted: 13 Apr 2008, 22:01:15 UTC


Do these crashes seem to happen at any particular point in time, for example, when you are shutting down the system (including hibernation, standby), or perhaps when you run some particular piece of software?

What sort of software is installed + run on your PC? Anything unusual? Which A/V and firewall do you use?

I don't think the 4GB will be a problem, as long as you've tested as much as you can of the 4GB memory range and the tests passed.

I suspect your stress testing tool was SuperPI. Personally I like 'Prime95', which I run for 24 hours or so (one copy per core). But if you need to specifically test memory, you can use a tool such as MemTest86+. I had a problem with unstable models (about 25% crashed/looped after a few days) which turned out to be due to a dodgy stick of OCZ RAM.

Do you run with graphics on or off? If on, then try disabling the screensaver (set to 'blank'). Updating the graphics drivers sometimes also helps.


I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 33333
batan
Joined: 25 Aug 07
Posts: 17
Credit: 80,535
RAC: 0
Message 33343 - Posted: 14 Apr 2008, 17:16:08 UTC - in response to Message 33333.  
Last modified: 14 Apr 2008, 17:16:55 UTC


Do these crashes seem to happen at any particular point in time, for example, when you are shutting down the system (including hibernation, standby), or perhaps when you run some particular piece of software?

I could not determine any pattern. Never problems with hibernation/standby, not some specific software running.


What sort of software is installed + run on your PC? Anything unusual? Which A/V and firewall do you use?

Avira AntiVirus PE Classic, Windows XP software firewall. Standard graphics and multimedia software which sometimes pretty much uses up the PC's resources, but the crashes don't seem to be related to that. Also, I stop or even close BOINC when running intensive apps.


I suspect your stress testing tool was SuperPI. Personally I like 'Prime95', which I run for 24 hours or so (one copy per core). But if you need to specifically test memory, you can use a tool such as MemTest86+. I had a problem with unstable models (about 25% crashed/looped after a few days) which turned out to be due to a dodgy stick of OCZ RAM.

Do you run with graphics on or off? If on, then try disabling the screensaver (set to 'blank'). Updating the graphics drivers sometimes also helps.

Now that you've mentioned it, I remember: it was 'Prime95', I think. I ran it in test mode on both cores, no errors. I tested the memory with 'MemTest' (?) and it too gave no errors.
My Windows screensaver is always set to "Black Screen". I used to watch the model graphics from BOINC occasionally, though. That never caused crashes.

Not recently, but some time ago I checked the boards and stickies here, but nothing seemed to fit. I also stopped making backups, since every backup that I restored after a crash then crashed again.

The thing is, I really would like to participate in cpdn, and I have already spent energy trying to find the problem. But now I have reached a point where I simply don't think I can do anything more.
I will give it another try with hadcm and hadam models. If those fail too, I will pause until one day I buy a new computer.
ID: 33343
batan
Joined: 25 Aug 07
Posts: 17
Credit: 80,535
RAC: 0
Message 33344 - Posted: 14 Apr 2008, 17:27:39 UTC

I am wondering why the hadam models worked fine when I got them from the seasonal attribution project, while they crash now that they come from climateprediction. Are they different?
If they were modified, there lies the key to my problem.
ID: 33344
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 33345 - Posted: 14 Apr 2008, 20:12:52 UTC


They're the same. They are even coming from the SAP server, which is why none are available at the moment, while the SAP server is down.


Backups: Here
ID: 33345
batan
Joined: 25 Aug 07
Posts: 17
Credit: 80,535
RAC: 0
Message 33348 - Posted: 14 Apr 2008, 22:36:46 UTC - in response to Message 33345.  


They're the same. They are even coming from the SAP server, which is why none are available at the moment, while the SAP server is down.

Thanks for the info! That is strange indeed - like I said, no problems with hadam3 before.
In the meantime I am giving it another try with a hadcm3 model - I got a 160-year model (which is great!). I will make backups and be persistent.

I thought of another possible cause of the error, because I remember that the error log sometimes showed something like "...could not read from drive...". My PC uses RAID level 0. Are there any known RAID-related problems?
ID: 33348
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 33352 - Posted: 15 Apr 2008, 7:19:59 UTC
Last modified: 15 Apr 2008, 7:20:22 UTC


There is one difference, the version - SAP units issued via the main CPDN site are supplied with the V5 wrapper, while the SAP units issued via the SAP site come with V4.07, although the workunit details are identical.

I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 33352
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 33353 - Posted: 15 Apr 2008, 7:59:50 UTC

Ah. It's been a long time since I ran a model on SAP.

However, my CPU is almost the same as yours, batan (you may have a B1 stepping, compared to my C0). I have 3 GB of RAM composed of 2 x 1 GB and 2 x 0.5 GB, compared to your 4 GB.
And I've been running HADAM models here for some time without a problem.
(15 completed, and 4 more in about 6 days.)

I don't know about RAID; I have a single 160 GB HD, split into 3: 15 for the OS, 10 for BOINC, and the rest for test programs, backups, etc.


Backups: Here
ID: 33353
batan
Joined: 25 Aug 07
Posts: 17
Credit: 80,535
RAC: 0
Message 33392 - Posted: 17 Apr 2008, 14:13:53 UTC
Last modified: 17 Apr 2008, 14:16:33 UTC

Hi. I'm not sure, but I think I found the problem (and the solution). The current model has been running for a day now, while all the others crashed after a few hours.
Not faulty RAM, not the graphics driver (which I updated), not the RAID. I figured it was a write error though - Windows rights management. BOINC had been installed in C:\Program Files\BOINC - and that's where Windows apparently thought it had to disallow some write operations.
I reinstalled BOINC in a folder on a second drive (D:\BOINC) and it seems things are now going well. I had mentioned that two other computers had worked without errors - well, in both those cases I had installed BOINC on drives other than C:.

I suspect all the errors I've had (Visual Fortran Error, exit code 22, etc.) were due to this.
I do not remember seeing the advice NOT to install BOINC in C:\Program Files\ (or C:\Programme\ for Djermans) in any README here - so, dear admins, maybe that would be a good idea to add. It might save a lot of users trouble. Install BOINC in C:\ directly or anywhere on any other drive.

I'll report here whether rights management was indeed the cause of the problems; the model has yet to prove that (by not crashing). What I still do not understand is why the hadam models did not crash before.

There is one difference, the version - SAP units issued via the main CPDN site are supplied with the V5 wrapper, while the SAP units issued via the SAP site come with V4.07, although the workunit details are identical.

What is the "wrapper"? Is it like a driver - mediator between OS and model?
ID: 33392
Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 33393 - Posted: 17 Apr 2008, 14:41:01 UTC - in response to Message 33392.  

...I do not remember seeing the advice NOT to install BOINC in C:\Program Files\ (or C:\Programme\ for Djermans) in any README here - so, dear admins, maybe that would be a good idea to add. It might save a lot of users trouble. Install BOINC in C:\ directly or anywhere on any other drive. ...

Well done: that looks to be a good bit of detective work.

There is advice in the README posts for Vista, where the install location makes a big difference; however, the applications folder isn't normally subject to special permissions in XP. Permission issues can arise if BOINC is installed as administrator, but then run as a normal user.

Mike is following this thread, so he may amend one of the READMEs. Thanks for the suggestion.
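
For anyone wanting to test the permissions theory directly rather than by reinstalling, a minimal sketch in Python follows. It simply tries to create and delete a temporary file in a given folder under the account that runs it. The function name and the example path are assumptions for illustration; run it as the same user that BOINC runs as, since that is where an administrator/normal-user mismatch would show up.

```python
import tempfile


def can_write_to(directory: str) -> bool:
    """Return True if the current user can create and remove a file in `directory`."""
    try:
        # NamedTemporaryFile creates a real file and deletes it when closed.
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            handle.write(b"write test")
            handle.flush()
        return True
    except OSError:
        # Covers permission errors as well as a missing folder.
        return False


if __name__ == "__main__":
    # Hypothetical path: replace with the actual BOINC folder on the machine.
    boinc_dir = r"C:\Program Files\BOINC"
    print(boinc_dir, "writable:", can_write_to(boinc_dir))
```

A result of False only says that this particular write failed; it does not show which program or policy blocked it.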
ID: 33393
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 33394 - Posted: 17 Apr 2008, 14:55:06 UTC
Last modified: 17 Apr 2008, 14:58:03 UTC

What is the "wrapper"? Is it like a driver - mediator between OS and model?


Basically, yes, but between the model and the BOINC core client. Several BOINC projects use them. The following page describes the wrapper application for people setting up BOINC projects:

http://boinc.berkeley.edu/trac/wiki/WrapperApp

I don't think wrappers usually cause tasks to fail. On some projects where tasks have no checkpoints, the worst thing the wrapper does is make the task go back to the beginning if you exit from BOINC before completing the task.
Cpdn news
ID: 33394
batan
Joined: 25 Aug 07
Posts: 17
Credit: 80,535
RAC: 0
Message 33395 - Posted: 17 Apr 2008, 17:01:48 UTC - in response to Message 33394.  

I don't think wrappers usually cause tasks to fail. On some projects where tasks have no checkpoints, the worst thing the wrapper does is make the task go back to the beginning if you exit from BOINC before completing the task.

Thanks for the wrapper hint. The changed wrapper version (see Mike's posting about that) might explain why I didn't experience any crashes with hadam models from SAP, while they crash when they come from CPDN - maybe the (different) wrapper handles file access in the BOINC folder differently.
Anyhow, the errors only appeared with the cpdn applications, maybe because the other apps I am running don't have savepoints or trickle save states.
ID: 33395
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 33398 - Posted: 17 Apr 2008, 20:36:50 UTC - in response to Message 33392.  
Last modified: 17 Apr 2008, 20:44:06 UTC

...
I figured it was a write error though - Windows rights management. BOINC had been installed in C:\Program Files\BOINC - and that's where Windows apparently thought it had to disallow some write operations.
I reinstalled BOINC in a folder on a second drive (D:\BOINC) and it seems things are now going well. I had mentioned that two other computers had worked without errors - well, in both those cases I had installed BOINC on drives other than C:.

I suspect all the errors I've had (Visual Fortran Error, exit code 22, etc.) were due to this.
I do not remember seeing the advice NOT to install BOINC in C:\Program Files\ (or C:\Programme\ for Djermans) in any README here - so, dear admins, maybe that would be a good idea to add. It might save a lot of users trouble. Install BOINC in C:\ directly or anywhere on any other drive.
...


Hi Batan,

That's a very interesting discovery - could I ask whether the PC is running Vista or XP?

As Iain mentions, the install location is known to sometimes cause problems in Vista, but there are rumours that Microsoft may be migrating some of the less popular aspects of Vista into XP SP3. If you are using XP, are you using any additional security software from Microsoft, such as Windows Defender, or OneCare? If not, which firewall / antivirus / antispyware software do you use? (Just wondering if any third party products might be doing the same thing).

I do know of one potential crash situation which is present in the V5 wrapper but not in the V4.07 wrapper - if the CPU goes to 100% on all cores for over 15 minutes (from a normal-priority task such as a game), the software decides that the climate model must have crashed, and starts another copy. The two copies then collide and crash.
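
To make the failure mode above concrete, here is a small illustrative sketch in Python of a "no progress for 15 minutes means crashed" rule. This is not the wrapper's actual code; the function name and the idea of tracking a last-progress timestamp are assumptions. It only shows why a model that is alive but starved of CPU time would be misjudged as crashed.

```python
STALL_LIMIT_SECONDS = 15 * 60  # the 15-minute threshold mentioned above


def looks_stalled(last_progress_time: float, now: float,
                  limit: float = STALL_LIMIT_SECONDS) -> bool:
    """Return True if no progress has been seen for longer than `limit` seconds."""
    return (now - last_progress_time) > limit


if __name__ == "__main__":
    # A model that is alive but has had no CPU time for 16 minutes
    # is indistinguishable, under this rule, from one that has crashed.
    starved_for = 16 * 60
    print("judged as crashed:", looks_stalled(0.0, float(starved_for)))
```

A rule like this cannot tell "starved" from "dead", which is why a second copy could be started while the first is still running.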

I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 33398
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 33399 - Posted: 17 Apr 2008, 21:56:28 UTC

Just for the record, putting data files in the C:\Program Files\ area has been known to be a problem for some time.

The BOINC people have been doing a complete rewrite of BOINC since late last year, and will release version 6.* before too much longer.
This will split up the program and data files, with the latter going into "My Documents". Permissions/Users will also get split into several parts.
And the graphics will get split off into a separate package.

So, big changes on the way.

ID: 33399
batan
Joined: 25 Aug 07
Posts: 17
Credit: 80,535
RAC: 0
Message 33400 - Posted: 17 Apr 2008, 22:13:07 UTC - in response to Message 33398.  

That's a very interesting discovery - could I ask whether the PC is running Vista or XP?

WinXP Prof. Version 5.1.2600 Service Pack 2 Build 2600

If you are using XP, are you using any additional security software from Microsoft, such as Windows Defender, or OneCare? If not, which firewall / antivirus / antispyware software do you use? (Just wondering if any third party products might be doing the same thing).

No additional Microsoft security software, just the native XP Firewall. And Avira AntiVir PersonalEdition Classic.

I do know of one potential crash situation which is present in the V5 wrapper but not in the V4.07 wrapper - if the CPU goes to 100% on all cores for over 15 minutes (from a normal-priority task such as a game), the software decides that the climate model must have crashed, and starts another copy. The two copies then collide and crash.

This wasn't the case; also, I throttle BOINC to 75% CPU use (in preferences).
ID: 33400
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 33410 - Posted: 18 Apr 2008, 7:17:18 UTC


The situation I'm thinking of is not when Boinc is using 100% of the CPU time, it's when something else is using 100% of the CPU time and Boinc is getting nothing (which happens since Boinc runs at a lower priority level).

I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 33410
batan
Joined: 25 Aug 07
Posts: 17
Credit: 80,535
RAC: 0
Message 33414 - Posted: 18 Apr 2008, 13:55:58 UTC - in response to Message 33410.  
Last modified: 18 Apr 2008, 13:59:38 UTC


The situation I'm thinking of is not when Boinc is using 100% of the CPU time, it's when something else is using 100% of the CPU time and Boinc is getting nothing (which happens since Boinc runs at a lower priority level).

Oh, ok. But that wasn't the case either.

Anyway, there is sad news (for me): the model crashed (output here). I am trying a hadam3 now - last try unless I stumble upon new ideas.

Das Gerät erkennt den Befehl nicht. (0x16) - exit code 22 (0x16)

is German for "The device does not recognize the instruction."

Therefore the location of the BOINC folder does NOT seem to be the problem. I also checked user rights, and everything is allowed for admins, users, power users and system. Sorry for the trouble.
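
As a side note on reading that log line: 22 and 0x16 are the same number in decimal and hexadecimal, and the text in front of it looks like the Windows system message for error code 22 (ERROR_BAD_COMMAND, "The device does not recognize the command." on an English system). The sketch below, in Python, shows the conversion and, on Windows only, asks the OS for the message text. The function names are made up for illustration, and treating the model's exit code as a Windows system error code is an assumption based on how the log line is formatted.

```python
import ctypes
import sys
from typing import Optional


def describe_exit_code(code: int) -> str:
    """Format an exit code in decimal and hex, in the style of the log line."""
    return f"{code} (0x{code:02x})"


def windows_error_text(code: int) -> Optional[str]:
    """Return the Windows system message for `code`, or None when not on Windows."""
    if sys.platform != "win32":
        return None
    # ctypes.FormatError is Windows-only; the text comes back in the OS display language.
    return ctypes.FormatError(code)


if __name__ == "__main__":
    print(describe_exit_code(22))
    print(windows_error_text(22))
```

The message only names the error class; it does not say which device or file operation triggered it.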
ID: 33414
batan
Joined: 25 Aug 07
Posts: 17
Credit: 80,535
RAC: 0
Message 33416 - Posted: 18 Apr 2008, 15:59:48 UTC

The hadam model just crashed too. I am giving up, frustrated.
ID: 33416
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 33417 - Posted: 18 Apr 2008, 17:37:12 UTC

You don't need to apologise - you've done everything you could.
Cpdn news
ID: 33417
©2024 climateprediction.net