New Model Type HadAM4

Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 59679 - Posted: 26 Feb 2019, 17:02:15 UTC - in response to Message 59677.  

I wonder if they ever figured out what was causing the original Linux problem, or whether it just worked with a new batch?


I can't remember. There are a lot of messages to trawl through to check that, and the crucial one may be one I have deleted!

That is, can we expect more Linux in the future? I find it easier to put a Linux machine on CPDN, as I already have them set up to run BOINC anyway.


Yes, we can certainly expect some more Linux tasks. I think I am right in saying that the Windows version of this particular task type didn't work and didn't make it to the testing stage. Be warned, however: the restart uploads for higher resolution HadAM4 tasks are likely to be around 190-200 MB, and discussions about how to handle that are still ongoing.
ID: 59679
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 59680 - Posted: 26 Feb 2019, 17:19:23 UTC - in response to Message 59679.  

Be warned however, the restart uploads for higher resolution HadAM4 tasks are likely to be around 190-200MB but discussions about how to handle that are still ongoing.

No problem. I can upload at 10 Mbps, and need something to justify my monthly fee.
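For a rough sense of scale, the transfer time for a restart upload of that size can be estimated as below. This is only a back-of-the-envelope sketch: it assumes decimal units (1 MB = 8 megabits), a sustained 10 Mbps uplink and no protocol overhead, and `upload_seconds` is an illustrative helper name, not anything from BOINC or CPDN.

```python
def upload_seconds(size_mb: float, link_mbps: float) -> float:
    """Estimate transfer time in seconds, ignoring protocol overhead.

    Assumes decimal units, so 1 MB is 8 megabits.
    """
    if link_mbps <= 0:
        raise ValueError("link speed must be positive")
    return size_mb * 8 / link_mbps


if __name__ == "__main__":
    # Sizes quoted in the thread: roughly 190-200 MB per restart upload.
    for size_mb in (190, 200):
        secs = upload_seconds(size_mb, 10)
        print(f"{size_mb} MB at 10 Mbps: about {secs:.0f} s ({secs / 60:.1f} min)")
```

Under those assumptions each upload takes a little under three minutes; a slower or shared uplink would scale the figure proportionally.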
ID: 59680
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59683 - Posted: 27 Feb 2019, 5:54:49 UTC - in response to Message 59677.  

The good news is Sarah has worked out what is causing the problem.

I wonder if they ever figured out what was causing the original Linux problem, or whether it just worked with a new batch?

That is, can we expect more Linux in the future? I find it easier to put a Linux machine on CPDN, as I already have them set up to run BOINC anyway.


It's a long way back now, and the details are getting a bit hazy, but I think it went something like:

The office computers/software are supplied by the uni/IT, and when our people tried to compile a 32-bit Linux model on a 64-bit computer, the exe didn't work correctly.
And I think there was urgent work to get out at around the same time. (Isn't there always, when you're trying to cram in a bit of "other stuff"?)

So it got dropped. And another opportunity didn't show up any time soon.
Until now.

And now it's the other way around - the Windows version won't co-operate.

But that's all with new programs; they still have the older programs for them to use.
ID: 59683
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 59686 - Posted: 27 Feb 2019, 9:12:47 UTC - in response to Message 59683.  

Thank you, that is quite sufficient. And I won't need to build a Windows machine, which would only crash the project anyway.
ID: 59686
squaregoldfish

Joined: 22 Aug 06
Posts: 6
Credit: 2,836,837
RAC: 0
Message 59704 - Posted: 3 Mar 2019, 13:03:40 UTC

Another data point:

I've had 8 models running on my 8-core machine. Five have just finished after 22 days of run time, and the other three look like they'll make it within the next 18 hours.

BOINC is configured to suspend computation if other processes are demanding CPU time, but I have "Leave in memory" checked. The models were suspended dozens of times but didn't seem to mind.

Sadly I can't continue with CPDN because my PC is unlikely to be on 24/7 beyond this week.
ID: 59704
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 59705 - Posted: 3 Mar 2019, 13:46:27 UTC - in response to Message 59704.  

Sadly I can't continue with CPDN because my PC is unlikely to be on 24/7 beyond this week.


I have found that tasks which crash if the machine is fully shut down survive if suspend or hibernate is used instead.
ID: 59705
bernard_ivo

Joined: 18 Jul 13
Posts: 438
Credit: 24,487,746
RAC: 3,014
Message 59706 - Posted: 3 Mar 2019, 17:30:36 UTC - in response to Message 59705.  

I'm leaving mine to finish, then I'll do some software updates, and after that I will test restart, suspend etc. with HadAM4 if I get any.
ID: 59706
klepel

Joined: 9 Oct 04
Posts: 76
Credit: 67,812,914
RAC: 5,809
Message 62389 - Posted: 3 May 2020, 18:12:54 UTC - in response to Message 59706.  
Last modified: 3 May 2020, 18:14:36 UTC

I have several errors: "Model crashed: READDUMP: BAD BUFFIN OF DATA". The WUs were quite far advanced.

Like this one: https://www.cpdn.org/result.php?resultid=21924162

I have limited climateprediction to two concurrent WUs on this computer, so I was wondering if there is a cure.
ID: 62389
Profile geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 62390 - Posted: 3 May 2020, 19:05:14 UTC - in response to Message 62389.  

I have several errors: "Model crashed: READDUMP: BAD BUFFIN OF DATA". The WUs were quite far advanced.

Like this one: https://www.cpdn.org/result.php?resultid=21924162

I have limited climateprediction to two concurrent WUs on this computer, so I was wondering if there is a cure.

It looks like the problems started in early to mid April. Did anything change on that PC or the environment it's in during that time frame?

Some of the crashes, and even some of the tasks reported as completed successfully, had negative theta errors in stderr.txt. While that is sometimes a problem with the initial conditions or parameters for a given task or set of tasks, it can also indicate hardware instability. A particularly dusty or warm environment could cause problems, so a thorough cleaning and a check for good airflow through the system might rule that out. CPU, memory and hard disk integrity checking software could also be run to see whether any obvious errors show up. This is just a shot in the dark, as I'm not certain it is a hardware/cooling issue, but checking those things would at least remove them as possibilities.
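If there are many results to go through, a small script can tally how often each symptom shows up in a saved copy of a task's stderr output. The sketch below is only illustrative: "BAD BUFFIN OF DATA" is the message quoted in this thread, but the exact wording of the negative theta message is an assumption, and `count_markers` is a made-up helper name.

```python
from collections import Counter

# First string is quoted in this thread; the second is an ASSUMED wording
# for the negative theta message and may need adjusting to the real log.
DEFAULT_PATTERNS = ("BAD BUFFIN OF DATA", "negative theta")


def count_markers(stderr_text: str, patterns=DEFAULT_PATTERNS) -> Counter:
    """Count case-insensitive occurrences of each pattern in the text."""
    lowered = stderr_text.lower()
    return Counter({p: lowered.count(p.lower()) for p in patterns})


if __name__ == "__main__":
    # Made-up sample text standing in for a real stderr.txt.
    sample = (
        "Model crashed: READDUMP: BAD BUFFIN OF DATA\n"
        "Negative theta detected\n"
        "negative theta detected\n"
    )
    for pattern, hits in count_markers(sample).items():
        print(f"{pattern}: {hits}")
```

A task that shows negative theta hits on one host but runs cleanly elsewhere would point more towards the hardware than towards the task's initial conditions.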
ID: 62390
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 62393 - Posted: 3 May 2020, 21:23:24 UTC - in response to Message 62389.  

Not enough memory for a computer that size, either.
ID: 62393
Profile geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 62394 - Posted: 3 May 2020, 21:45:12 UTC - in response to Message 62393.  

Not enough memory for a computer that size, either.

He said he's only running a couple at a time though. If he was trying to run 32, or even 16, that would be a whole different matter.
ID: 62394
klepel

Joined: 9 Oct 04
Posts: 76
Credit: 67,812,914
RAC: 5,809
Message 62395 - Posted: 3 May 2020, 22:17:15 UTC
Last modified: 3 May 2020, 22:19:53 UTC

Just to clarify:
a) the computer is limited to "use at most = 99%"
b) climateprediction is limited to 2 WUs with app_config
c) the rest of the tasks are CPU: tn-grid and GPU: gpugrid
d) the CPU runs at 4100 MHz. I tried to go higher (4150 MHz, around April), but then tn-grid had some random errors ("SIGSEGV: segmentation violation"), so I limited it to 4100 MHz again. I have had some rare shut-downs since then.*
e) with this set-up RAM does not seem to be an issue
f) I sometimes run dispersion models on the computer for work, but not during this shutdown.
g) I clean the computer periodically, so dirt should be no issue
* this explains some errors.
tn-grid works fine again, which is why I am wondering why climateprediction has some errors.
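For anyone wanting to reproduce the limit in b), the sketch below generates a minimal app_config.xml. The poster's actual file is not shown, so this is an assumption about what it might contain: it uses the BOINC client's `project_max_concurrent` option, and `build_app_config` is an illustrative helper name. As far as I know the file goes in the project's folder inside the BOINC data directory and is picked up after the client re-reads its config files.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_concurrent: int) -> str:
    """Return app_config.xml text capping a project's concurrent tasks."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    root = ET.Element("app_config")
    limit = ET.SubElement(root, "project_max_concurrent")
    limit.text = str(max_concurrent)
    ET.indent(root)  # pretty-print; requires Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Two concurrent WUs, as described in the post above.
    print(build_app_config(2))
```

A per-application limit (an `app` element with `name` and `max_concurrent`) is the alternative when only one model type should be capped; the project-wide form is used here because the post does not say which one was chosen.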
ID: 62395
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 62396 - Posted: 3 May 2020, 23:00:49 UTC

I thought it might be a "can't get the data fast enough" problem, caused by tasks from other projects running.

It's going to crash on the next computer as well, because of the missing library problem.
ID: 62396
bernard_ivo

Joined: 18 Jul 13
Posts: 438
Credit: 24,487,746
RAC: 3,014
Message 62397 - Posted: 4 May 2020, 13:58:57 UTC - in response to Message 62396.  

I also have a few WUs with this error; however, they all finished successfully:
https://www.cpdn.org/cpdnboinc/result.php?resultid=21927138
https://www.cpdn.org/cpdnboinc/result.php?resultid=21920061
This computer is under heavy use with other tasks, hence only 2 WUs, but I suspect it just can't handle all the load (i7-3520M, 8 GB RAM).
ID: 62397
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 62401 - Posted: 6 May 2020, 3:08:17 UTC - in response to Message 62396.  

The computer that got it next DID crash it. (hostid=1482854)
It has a perfect record: 803 tasks run, 803 tasks failed.
And it's now been blocked.

If they show up, could someone please point them to the Linux thread that has all the various updates that need to be run for the different types of Linux?
ID: 62401

©2024 climateprediction.net