New Model Type HadAM4

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,378,503
RAC: 3,632
Message 59554 - Posted: 7 Feb 2019, 13:09:27 UTC

A new Linux-only model type. Aside from the crashes due to missing 32-bit libraries, there have been a number of crashes with:

Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy

The project would like to know a bit more about what is causing this. I have managed to recreate it by:
1. Suspending computation.
2. Exiting BOINC.
3. Rebooting.
4. Restarting BOINC.
5. Resuming computation.

Three times the task survived exiting BOINC and restarting it without the reboot, including one occasion when I did not suspend computation first.

In my past experience, a task is more likely to crash if there has been a kernel upgrade before the reboot, as there had been in this case, though a hadcm3s task did survive.

If you experience tasks crashing with this error, please post and say what was happening at the time.

Thank you.
ID: 59554
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 59556 - Posted: 7 Feb 2019, 16:42:00 UTC - in response to Message 59554.  

I'm running Ubuntu 18.04 in VMs on Windows 10 PCs.

The only time I've interrupted one was when I shut down my virtual machine to allocate more logical cores to it.

I waited for a checkpoint,
stopped BOINC a few timesteps past that checkpoint through the File -> Exit Boinc Manager menu choice,
shut down Ubuntu in the VM using the normal Ubuntu shutdown method,
changed the number of logical cores that the VM uses from 2 to 4,
restarted Ubuntu in the VM, and
restarted BOINC Manager.

The task crashed immediately with the BAD BUFFIN error. There was no PC reboot or Ubuntu update that occurred during this time.
ID: 59556
Shadak

Joined: 15 Nov 18
Posts: 1
Credit: 15,365
RAC: 0
Message 59563 - Posted: 8 Feb 2019, 10:52:25 UTC

only linux :(
ID: 59563
mmonnin

Joined: 28 May 17
Posts: 49
Credit: 14,706,403
RAC: 9,709
Message 59566 - Posted: 8 Feb 2019, 22:55:59 UTC

My Linux PC grabbed 6 of these this morning. About 18.5 day ETA. Still running after 12 hours. ~630 MB of memory usage.
ID: 59566
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 59569 - Posted: 9 Feb 2019, 6:31:15 UTC

I just noticed it existed this morning. I checked to see if it would run on Linux, so I added it to my list. So far, I have received none. I get so few tasks from climateprediction that my client only tries about every two or three days.
ID: 59569
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,378,503
RAC: 3,632
Message 59571 - Posted: 9 Feb 2019, 9:37:20 UTC
Last modified: 9 Feb 2019, 9:51:23 UTC

I see that over one in five have fallen to the three strikes and out rule.

I think the advice for this batch is something those of us above a certain age will remember.

Das machine is nicht fur gefingerpoken und mittengrabben. Ist easy
schnappen der springenwerk, blowenfusen und corkenpoppen mit
spitzensparken. Ist nicht fur gewerken by das dummkopfen. Das
rubbernecken sightseeren keepen hands in das pockets. Relaxen und
vatch das blinkenlights!!!
ID: 59571
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 59574 - Posted: 9 Feb 2019, 15:58:06 UTC - in response to Message 59571.  

Back when the first computer I saw took up a room the size of a small aircraft hangar, comedians would put little messages on the machines.

The first one I saw said: "If you can remain calm in all this confusion, you obviously do not understand the problem."

But the best one I ever saw went something like this:

We have not answered all our questions.
Sometimes we think we have not answered any of them.
Those questions we have answered have served only to raise a host of new questions.
So now we are as confused as ever, but on a higher level, and about more important things.
ID: 59574
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,378,503
RAC: 3,632
Message 59579 - Posted: 10 Feb 2019, 19:27:39 UTC
Last modified: 10 Feb 2019, 19:29:21 UTC

Currently uploading an atmos_restart.day file for Sarah to look at, to see if the problem with these tasks can be identified. At 187.2 MB, it makes some of the recent uploads look small.

Edit: Unless someone else has a different experience, only the first trickle shows up on the task web pages, as with the hadcm3s tasks.
ID: 59579
mmonnin

Joined: 28 May 17
Posts: 49
Credit: 14,706,403
RAC: 9,709
Message 59580 - Posted: 11 Feb 2019, 0:54:18 UTC

I realized that, because of these CPDN tasks, tasks from other projects were taking many times as long, so I paused all but one. Today I received tasks from BURP, which paused that one task since they have a short deadline and are multithreaded (mt). Upon resuming, it had an error, along with every other one I had. Pretty much a waste of time.

Model crashed: READDUMP: BAD BUFFIN OF DATA
Sorry, too many model crashes! :-
ID: 59580
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59581 - Posted: 11 Feb 2019, 1:56:56 UTC

These new models REALLY do not like being interrupted, even more than climate models usually do.
ID: 59581
mmonnin

Joined: 28 May 17
Posts: 49
Credit: 14,706,403
RAC: 9,709
Message 59584 - Posted: 11 Feb 2019, 3:21:45 UTC

It's hard for them not to be interrupted when there isn't enough CPDN work to fill the queue, they have a runtime of 18 days, and they have a one-year deadline. Something will end up interrupting them.
ID: 59584
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 59585 - Posted: 11 Feb 2019, 3:24:12 UTC - in response to Message 59581.  

These new models REALLY do not like being interrupted, even more than climate models usually do.


Is it enough to check the "Leave applications in memory while suspended" box? I reboot only when Red Hat updates the kernel, which is usually a little less often than once a month. And even then, the shutdown procedure sends a shutdown signal to all running processes, then waits about 10 seconds before killing any that remain.
ID: 59585
Sergey Kovalchuk

Joined: 30 Aug 04
Posts: 4
Credit: 535,502
RAC: 0
Message 59587 - Posted: 11 Feb 2019, 12:41:20 UTC

additional messages - immediately after stopping the task
<stderr_txt>
Signal 15 received: Software termination signal from kill
Signal 15 received: Abnormal termination triggered by abort call
Signal 15 received, exiting...
07:16:01 (6392): called boinc_finish(193)
*** Error in `../../projects/climateprediction.net/hadam4_8.08_i686-pc-linux-gnu': free(): invalid pointer: 0xf7371008 ***
. . . . . .
*** Error in `../../projects/climateprediction.net/hadam4_8.08_i686-pc-linux-gnu': free(): invalid pointer: 0xf7371008 ***


then, after re-start
Model crashed: READDUMP: BAD BUFFIN OF DATA                                                                                                                                                                                                                                    tmp/xnnuj.pipe_dummy

Model crashed: READDUMP: BAD BUFFIN OF DATA                                                                                                                                                                                                                                    tmp/xnnuj.pipe_dummy
Sorry, too many model crashes! :-(
08:58:05 (2041): called boinc_finish(22)
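
For anyone comparing their own stderr output against the excerpt above, here is a minimal Python sketch that summarizes such a log. It is only an illustration, not a project tool: the function name `summarize` and the `SAMPLE` text are made up for this example, the sample lines are copied from the excerpt with the long runs of spaces collapsed, and the matching is plain substring and regex search on the messages shown in this thread.

```python
import re

# Lines copied from the stderr excerpt above (whitespace collapsed).
SAMPLE = """\
Signal 15 received: Software termination signal from kill
Signal 15 received, exiting...
07:16:01 (6392): called boinc_finish(193)
Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy
Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy
Sorry, too many model crashes! :-(
08:58:05 (2041): called boinc_finish(22)
"""


def summarize(stderr_text):
    """Return a small summary dict for a task's stderr text (hypothetical helper)."""
    lines = stderr_text.splitlines()
    return {
        # True if the task logged that it received signal 15 (SIGTERM).
        "sigterm_seen": any(line.startswith("Signal 15 received") for line in lines),
        # Number of lines reporting the READDUMP failure discussed in this thread.
        "bad_buffin_crashes": sum("READDUMP: BAD BUFFIN OF DATA" in line for line in lines),
        # True if the model reported giving up after repeated crashes.
        "gave_up": any("too many model crashes" in line for line in lines),
        # Exit codes passed to boinc_finish, in the order they appear.
        "exit_codes": [int(c) for c in re.findall(r"called boinc_finish\((\d+)\)", stderr_text)],
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Including a summary like this alongside a description of what the machine was doing at the time may make reports easier to compare.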
ID: 59587
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,378,503
RAC: 3,632
Message 59588 - Posted: 11 Feb 2019, 13:33:32 UTC
Last modified: 11 Feb 2019, 13:53:56 UTC

I am currently downloading three tasks to see if the issue with stopping and restarting is resolved. Sarah believes they have identified the issue.

Edit: It might be useful if any moderators with faster boxes than mine could pick up the remaining two of these.
ID: 59588
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 59589 - Posted: 11 Feb 2019, 15:08:50 UTC

I just got one. According to my client, it is expected to take about 44 days on my machine. The other two computers that tried this work unit bombed out, one very quickly.

Name hadam4_a04k_200811_12_785_011729940_2
Workunit 11729940

I do not expect to reboot my machine before it finishes.
ID: 59589
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,378,503
RAC: 3,632
Message 59590 - Posted: 11 Feb 2019, 17:17:30 UTC - in response to Message 59588.  

I am currently downloading three tasks to see if the issue with stopping and restarting is resolved. Sarah believes they have identified the issue.


Well, one of the three on the laptop has just crashed when I exited BOINC and restarted it, even without the reboot, along with the four main site tasks I had running. So, not fixed yet.
ID: 59590
bernard_ivo

Joined: 18 Jul 13
Posts: 438
Credit: 24,488,575
RAC: 2,962
Message 59591 - Posted: 11 Feb 2019, 18:58:29 UTC - in response to Message 59590.  

I got two on their 2nd attempt, so far at 13%, and it looks like 20-23 days for these WUs. I do not intend to interrupt them just to check how they behave. I will use the machine normally and, if they crash, I will report.
ID: 59591
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59592 - Posted: 11 Feb 2019, 19:12:51 UTC

8 at 58% after 5 days.
Six zips created and returned.

And all trickles showing on the tasks' pages.
ID: 59592
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 59597 - Posted: 12 Feb 2019, 6:30:44 UTC - in response to Message 59589.  

Name hadam4_a04k_200811_12_785_011729940_2
Workunit 11729940


This work unit now has 15 hours 32 minutes on it and has not crashed, so it does not seem to have a congenital weakness. It has not uploaded any .zip files yet. It claims to be a trifle over 4% done now.
ID: 59597
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59598 - Posted: 12 Feb 2019, 7:05:36 UTC

There are 12 months to these, so 12 zips, one at roughly every 8% of progress.
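
The arithmetic can be sketched in a few lines of Python. This assumes the zips are produced at evenly spaced progress points, one per model month, which the post implies but the thread does not confirm; the function names are hypothetical.

```python
MONTHS_PER_TASK = 12  # from the post: 12 model months, one zip per month


def zip_interval_percent(months=MONTHS_PER_TASK):
    """Progress (in percent) between consecutive zips, assuming even spacing."""
    return 100.0 / months


def zips_expected(progress_percent, months=MONTHS_PER_TASK):
    """Number of zips a task should have produced by the given progress."""
    # Floor division: a zip only counts once its month is fully complete.
    return int(progress_percent * months // 100)


if __name__ == "__main__":
    print(round(zip_interval_percent(), 2))  # interval between zips, in percent
    print(zips_expected(4))                  # a task a little over 4% done
    print(zips_expected(58))                 # a task at 58%
```

Under that assumption, a task at about 4% would not have uploaded a zip yet, and a task at 58% would have produced six, which is in line with the figures reported earlier in the thread.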
ID: 59598

©2024 climateprediction.net