New Model Type HadAM4


Message boards : Number crunching : New Model Type HadAM4


AuthorMessage
Profile geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 59599 - Posted: 12 Feb 2019, 7:24:10 UTC

Two tasks finished successfully on my PCs today. These two were never interrupted or removed from memory, which explains why they had a chance to finish without crashing.
ID: 59599
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 59600 - Posted: 12 Feb 2019, 7:37:28 UTC - in response to Message 59599.  

Two tasks finished successfully on my PCs today. These two were never interrupted or removed from memory, which explains why they had a chance to finish without crashing.


What has become clear is that these tasks have a very high probability of failing with the BAD BUFFIN error if BOINC is stopped, or the tasks are suspended, while "Leave non-GPU tasks in memory while suspended" (under Options > Computing preferences > Disk and memory) is not ticked.

I have determined that if suspend or hibernate (i.e. suspend to RAM or suspend to disk) is used, the computer can be stopped safely; but since a full reboot means restarting BOINC, a reboot will almost inevitably crash the tasks.

This may make running these tasks unsuitable for some people until the issue is resolved.
ID: 59600
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 59609 - Posted: 12 Feb 2019, 16:36:32 UTC - in response to Message 59599.  

Two tasks finished successfully on my PCs today. These two were never interrupted or removed from memory, which explains why they had a chance to finish without crashing.


I am not sure what this actually means, especially on machines running Linux. The memory manager will try to keep the working set of the program in RAM, but will page out inactive pages if RAM is needed for something else. And since BOINC processes run at the lowest priority, i.e. they run only if nothing with higher priority wants a processor, the entire process could get paged out. I always leave "Leave non-GPU tasks in memory while suspended" checked, but "in memory" just means it can be paged in very quickly, not that it is physically in RAM all the time.

Since my machine these days does little other than the occasional e-mail and some web browsing, BOINC usually gets over 90% of all processor time. And at night, when I am logged out of the machine, the BOINC tasks get a little over 99% of the processor time.
ID: 59609
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 59610 - Posted: 12 Feb 2019, 17:04:37 UTC

Well over one in ten of these tasks are crashing due to not having the requisite 32-bit libs. Tomorrow, Sarah will speak to Andy about setting the misbehaving computers' flags to -1 so they will not get any more tasks. The owners should also get emails telling them about this.

Once they have added the relevant libs to their Linux installation, they will have to report back to get their computers re-instated.

https://www.cpdn.org/cpdnboinc/forum_thread.php?id=7828 is the most recent thread explaining how to install the libs. There is a specific thread in the Linux section of the forums should anyone have problems installing the missing files.
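For anyone unsure whether their Linux box is affected, a quick check is whether the 32-bit loader exists. The install commands below are the usual ones for each family of distributions, but follow the linked thread for the exact packages:

```shell
# The HadAM4 binary is i686, so it needs the 32-bit runtime even on 64-bit Linux.
if [ -e /lib/ld-linux.so.2 ]; then
    echo "32-bit loader present"
else
    echo "32-bit loader missing"
fi

# Typical install commands (package names vary; see the thread above):
#   Debian/Ubuntu:  sudo dpkg --add-architecture i386 && sudo apt update
#                   sudo apt install libstdc++6:i386 zlib1g:i386
#   Fedora/RHEL:    sudo dnf install libstdc++.i686 glibc.i686
```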

There are about 15 of you out there judging by my perusing of the task pages!
ID: 59610
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 59611 - Posted: 12 Feb 2019, 17:11:40 UTC - in response to Message 59609.  

I am sure George can answer this better than I can, but I don't think the issue is whether the information is in RAM or in the swap file. What ticking "Leave non-GPU tasks in memory while suspended" does is stop the information being overwritten while the task is waiting to be resumed. Going to swap would slightly increase the risk of corruption happening, but probably not by very much.

Unticking it on some testing tasks didn't actually crash them when I suspended them; I suspect because I didn't leave it long enough for anything else to overwrite the information. The computer has enough memory that the data probably didn't get written out to swap anyway. Still a bit of learning for me in trying to do these diagnostic tests.
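What "removed from memory and resumed from the last checkpoint" amounts to can be sketched in a few lines (a toy checkpoint format, not CPDN's actual one):

```python
import json
import os
import tempfile

# Toy model loop that checkpoints every 4 "timesteps". If the process is
# killed while suspended, a restart can only pick up from the last
# checkpoint on disk; any work since then is repeated.
ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

def save_checkpoint(step, state):
    tmp = ckpt + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, ckpt)  # atomic rename: never leaves a half-written file

def load_checkpoint():
    with open(ckpt) as f:
        return json.load(f)

total = 0
for step in range(1, 11):
    total += step                      # stand-in for real model work
    if step % 4 == 0:
        save_checkpoint(step, total)

# Simulate "remove from memory" after step 10: the last checkpoint was at
# step 8, so a restarted task would redo steps 9 and 10.
resume = load_checkpoint()
print(resume["step"])   # 8
```

The atomic rename is the important detail: if the writer is killed mid-checkpoint, the previous checkpoint survives intact, which is the kind of guarantee a restartable task needs.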
ID: 59611
Profile geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 59613 - Posted: 12 Feb 2019, 17:39:24 UTC

Dave is correct. My wording was careless. One can interrupt the task as long as it remains in memory, real or virtual.
ID: 59613
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 59614 - Posted: 12 Feb 2019, 17:44:37 UTC - in response to Message 59610.  

Well over one in ten of these tasks are crashing due to not having the requisite 32-bit libs. Tomorrow, Sarah will speak to Andy about setting the misbehaving computers' flags to -1 so they will not get any more tasks. The owners should also get emails telling them about this.

Excellent idea. I would also think about a user-selectable checkbox to enable work units to go only to 24/7 machines.
The ones that are shut down/suspended will fail, and so those crunchers won't want them anyway.
ID: 59614
Dave Roberts

Joined: 15 Jan 11
Posts: 175
Credit: 6,242,691
RAC: 699
Message 59615 - Posted: 12 Feb 2019, 19:42:03 UTC - in response to Message 59609.  
Last modified: 12 Feb 2019, 19:56:49 UTC

RE - Leaving in memory - Jean-David Beyer

"I always leave Leave non-GPU tasks in memory while suspended checked, but "in memory" just means it can be paged in very quickly, not that it is physically in RAM all the time."

Are you sure about that? I might be wrong, but all the evidence I can find seems to point to a task's state physically staying in system memory when suspended with 'Leave' ticked. Obviously, if unchecked, suspended tasks are removed from memory and resume from their last checkpoint. Hence the problems.

Since paging is virtual-memory methodology, it would only occur if the system needed more RAM than was available for some process. I wouldn't have thought that leaving a task's state in memory would trigger paging of that state if another task needed more memory; rather, elements of the second task would be paged.

Obviously, when the machine is switched off everything in RAM is lost, but with hibernation the system copies the RAM state to disk to be read back in on startup.

Having said all that, I’m always willing to be corrected.

Out of interest, I have used 'Leave non-GPU tasks in memory while suspended' from shortly after I joined CPDN, having had a task crash after a suspend and restart of my Mac. Since then, I have never had a problem, even when running CPDN in a Virtual Machine.

I sometimes have to switch between dual-booted systems, so the 'Leave' option is invaluable. I still take care to ensure I don't do this if trickles are pending or uploads are occurring, although pending zips don't appear problematic.

I have Mac systems.
ID: 59615
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59617 - Posted: 12 Feb 2019, 20:40:15 UTC - in response to Message 59615.  

I think that you're right there, Dave.
I remember something from years ago, about people saying they don't use "Leave ..." because it uses up too much memory.

This is part of the reason why we also suggest a rough rule of thumb of 2 Gigs of memory per processor core.

I think what the "Leave ..." option does is tell the OS that the data in memory is important, so don't delete it just because we're not using it at the moment; and if the memory is REALLY needed for something else, swap all of the data out to the HD first, so that it's not lost.

The problem with this new model type is thought to be something similar: some data in one of the ancils isn't getting saved, which is OK if the model just chugs along, but fatal if it's stopped in some specific way.
ID: 59617
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 59620 - Posted: 12 Feb 2019, 22:32:47 UTC - in response to Message 59610.  

On my Red Hat Enterprise Linux Server release 6.10 (Santiago) system (an older release, but Red Hat support their releases for 10 years), the dependencies are:

$ ldd hadcm3s_8.34_i686-pc-linux-gnu
linux-gate.so.1 => (0x006b7000)
libpthread.so.0 => /lib/libpthread.so.0 (0x00cef000)
libdl.so.2 => /lib/libdl.so.2 (0x00d0c000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00347000)
libm.so.6 => /lib/libm.so.6 (0x00da1000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00fd5000)
libc.so.6 => /lib/libc.so.6 (0x00b4b000)
/lib/ld-linux.so.2 (0x565c9000)

$ ldd hadam4_8.08_i686-pc-linux-gnu
linux-gate.so.1 => (0x0038e000)
libpthread.so.0 => /lib/libpthread.so.0 (0x00cef000)
libdl.so.2 => /lib/libdl.so.2 (0x00d0c000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x006cc000)
libm.so.6 => /lib/libm.so.6 (0x00da1000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x002f4000)
libc.so.6 => /lib/libc.so.6 (0x00b4b000)
/lib/ld-linux.so.2 (0x56607000)

$ rpm -qf /usr/lib/libstdc++.so.6 /lib/libm.so.6 /lib/libgcc_s.so.1 /lib/libc.so.6
libstdc++-4.4.7-23.el6.i686
glibc-2.12-1.212.el6.i686
libgcc-4.4.7-23.el6.i686
glibc-2.12-1.212.el6.i686
ID: 59620
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 59621 - Posted: 12 Feb 2019, 23:02:29 UTC - in response to Message 59615.  

"I always leave Leave non-GPU tasks in memory while suspended checked, but "in memory" just means it can be paged in very quickly, not that it is physically in RAM all the time."

Are you sure about that? I might be wrong, but all the evidence I can find seems to point to a task's state physically staying in system memory when suspended with 'Leave' ticked. Obviously, if unchecked, suspended tasks are removed from memory and resume from their last checkpoint. Hence the problems.


I am very sure about what I say (still admitting that I have been wrong before).

I do not know what you mean by "a task state physically staying in system memory when suspended with a 'Leave'". In Linux, if the process scheduler decides to interrupt a process and give the processor to another process, it requires no cooperation from the to-be-interrupted process. The interrupted process does not know whether it lost the processor because of a hardware interrupt or anything else. And once it is interrupted, as far as the memory manager is concerned, the interrupted process does not need any RAM at all, so all of it is a candidate for being paged out. Now, if there is enough RAM in the machine, it may not be paged out at all. When RAM is required, it is grabbed from free RAM if there is any. If not, it is grabbed from the input cache. If that is not available, output cache must be written out, and can then be re-used. Lacking that, the least recently used pages of processes can be written out.

But even if there is enough physical RAM, when a process stops running its memory translation registers are remapped to the new process, and the old process's RAM becomes inaccessible. When the interrupted process again gets a processor, the memory translation registers are mapped back to where its data are.

I could put 512 GBytes of RAM in my 64-bit machine, so in one sense it would "never" run out of RAM. But even so, I do not know the maximum virtual address space a process can get. In the old days, it was quite easy to put more RAM in a 32-bit machine than could be addressed by 32 bits (if you had the PAE hardware in the chip set; I did). The OS kernel had to diddle the memory translation registers so a process could address all the memory it needed. Generally, this meant about 32 bits worth of address space per process. Sometimes (on PDP-11 machines, whose addresses were 16 bits), a process could have 16 bits worth of data space and another 16 bits of instruction space, but that required trickiness in the hardware.

The exact details may have changed since I worked on these things, but that is the general idea.

Since paging is virtual-memory methodology, it would only occur if the system needed more RAM than was available for some process. I wouldn't have thought that leaving a task's state in memory would trigger paging of that state if another task needed more memory; rather, elements of the second task would be paged.


Leaving a process suspended would not trigger paging in and of itself. The kernel tends to try to keep such stuff in physical memory if possible, so that when the process is resumed it can do so quickly. Remember, the memory manager does not usually know how long a process will be suspended, and it may be resumed much quicker than it takes to write out a page and read it back.
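On Linux the distinction is easy to see: suspending a task the way BOINC does (roughly, a SIGSTOP) leaves the whole address space mapped, whether its pages happen to be sitting in RAM or in swap. A Linux-only sketch using /proc:

```python
import signal
import subprocess
import time

# Start a stand-in "task" and suspend it without its cooperation.
proc = subprocess.Popen(["sleep", "30"])
proc.send_signal(signal.SIGSTOP)   # suspend: address space stays mapped
time.sleep(0.2)

# In /proc/<pid>/stat, the field after "(comm)" is the scheduler state;
# 'T' means stopped -- still "in memory" as far as the kernel is concerned,
# though its pages remain candidates for being paged out.
with open(f"/proc/{proc.pid}/stat") as f:
    state = f.read().rsplit(")", 1)[1].split()[0]
print(state)   # T

proc.send_signal(signal.SIGCONT)   # resume: pages fault back in on demand
proc.terminate()
proc.wait()
```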
ID: 59621
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 59622 - Posted: 13 Feb 2019, 0:18:59 UTC - in response to Message 59597.  

Name hadam4_a04k_200811_12_785_011729940_2
Workunit 11729940


12-Feb-2019 18:41:52 Started upload of hadam4_a04k_200811_12_785_011729940_2_r1285776232_1.zip
12-Feb-2019 18:47:17 Finished upload of hadam4_a04k_200811_12_785_011729940_2_r1285776232_1.zip

Has not crashed yet.
The other two workers have both crashed: one because of missing 32-bit libraries; the other, after about three days, with a computation error.
ID: 59622
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59623 - Posted: 13 Feb 2019, 2:09:36 UTC

Good for you. :)
Mine have uploaded zip 8, and another mod has completed 2, so that's 3 of us running OK.
ID: 59623
Sarah Sparrow
Volunteer moderator
Project administrator

Joined: 30 Sep 15
Posts: 11
Credit: 91,760
RAC: 0
Message 59625 - Posted: 13 Feb 2019, 9:43:54 UTC

Hello all,

Firstly, I just wanted to say a big thank you for all your useful posts on this model. They have been very gratefully received, and trying to resolve issues such as this would be much harder without your input.

To give you a quick update, we are still trying to resolve the BAD BUFFIN error and identify what is happening. My initial tests, which tried to ensure that the ancillary information is read when the model starts, do not appear to have fixed the problem, so this is proving to be a more complicated issue and I will need to think further about how to fix it.

I will talk to Andy about the machines with missing 32-bit libraries so we can clear those sorts of errors for this model.
ID: 59625
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 59626 - Posted: 13 Feb 2019, 16:06:49 UTC - in response to Message 59625.  

I will talk to Andy about the machines with missing 32-bit libraries so we can clear those sorts of errors for this model.


I suppose the information is there, since when I look at the stderr file in the results entries, it says why they failed.

One of my failed partners got this:

stderr out

<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process exited with code 127 (0x7f, -129)</message>
<stderr_txt>
../../projects/climateprediction.net/hadam4_8.08_i686-pc-linux-gnu: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory

</stderr_txt>
]]>

So all you have to do is look at the error returns from any failed work units, and if the stderr says "error while loading shared libraries", send that host no more work units until its owner intervenes. This test need be done only once.

Of course, this is easier for me to say than for you to implement. It does show that the needed information is there, but not how to act on it in a way that makes low demands on resources, both human and hardware.
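The check itself is a one-liner; the work is plumbing it into the scheduler. A sketch (the function name and the per-host flagging idea are hypothetical, not CPDN's actual code):

```python
# Hypothetical server-side helper: decide from a failed result's stderr
# whether the host is missing the 32-bit libraries.
def host_needs_32bit_libs(stderr_txt: str) -> bool:
    return "error while loading shared libraries" in stderr_txt

missing_libs = (
    "process exited with code 127 (0x7f, -129)\n"
    "hadam4_8.08_i686-pc-linux-gnu: error while loading shared libraries: "
    "libstdc++.so.6: cannot open shared object file: No such file or directory"
)
other_failure = "Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy"

print(host_needs_32bit_libs(missing_libs))    # True  -> flag host, stop sending work
print(host_needs_32bit_libs(other_failure))   # False -> an unrelated failure
```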
ID: 59626
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 59644 - Posted: 14 Feb 2019, 13:42:09 UTC - in response to Message 59622.  

Name hadam4_a04k_200811_12_785_011729940_2
Workunit 11729940


Not only did I get a second trickle done when my two partners did not seem to, but I got a second trickle actually acknowledged. So this runs better than the hadcm3s models, which also upload further trickles but do not appear to get them acknowledged.

12-Feb-2019 18:41:52 [climateprediction.net] Started upload of hadam4_a04k_200811_12_785_011729940_2_r1285776232_1.zip
12-Feb-2019 18:47:17 [climateprediction.net] Finished upload of hadam4_a04k_200811_12_785_011729940_2_r1285776232_1.zip
14-Feb-2019 02:41:10 [climateprediction.net] Started upload of hadam4_a04k_200811_12_785_011729940_2_r1285776232_2.zip
14-Feb-2019 02:42:18 [climateprediction.net] Finished upload of hadam4_a04k_200811_12_785_011729940_2_r1285776232_2.zip

Latest Trickles Received
Time Sent (UTC) Host ID Result ID Result Name Phase Timestep CPU Time (sec) Average (sec/TS)
14 Feb 2019 07:42:10 1256552 21490395 hadam4_a04k_200811_12_785_011729940_2 1 8,741 221,125 25.2974
12 Feb 2019 23:44:55 1256552 21490395 hadam4_a04k_200811_12_785_011729940_2 1 4,421 112,144 25.3662
ID: 59644
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 59651 - Posted: 14 Feb 2019, 19:01:58 UTC - in response to Message 59644.  

Well!

My machine just crashed for the first time since I got it a little over six years ago (1 Dec 2012, 10:10:08 UTC is when I registered it with BOINC and CPDN). The screen put up a curious pattern (not the one I get if there is no video), and nothing worked as far as I could tell. I tried to switch to a regular terminal, but if it worked, I could not tell. I powered off the monitor, waited a bit, and restarted it: no change. I pushed the reset button on the monitor: no change. I tried to shut down the windowing system (Control-Alt-Backspace), but again no change.

So I powered the whole thing off, waited a bunch of seconds for the hard drives to spin down, and rebooted. It came up as normal, and everything seems to be running, including my current CPDN task.

So we will see what we will see.

I wish it had not done that.
ID: 59651
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 59652 - Posted: 14 Feb 2019, 19:15:47 UTC - in response to Message 59651.  

Oh! Poo!

It died.

Because my machine crashed, and that was too much for the program. It did restart and ran for almost half an hour, but then it could not continue.


Name hadam4_a04k_200811_12_785_011729940_2
Workunit 11729940
Created 11 Feb 2019, 14:54:33 UTC
Sent 11 Feb 2019, 14:54:46 UTC
Report deadline 24 Jan 2020, 20:14:46 UTC
Received 14 Feb 2019, 19:11:16 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 22 (0x16) Unknown error number
Computer ID 1256552
Run time 3 days 2 hours 38 min 42 sec
CPU time 2 days 22 hours 48 min 23 sec
Validate state Invalid
Credit 389.03
Device peak FLOPS 1.28 GFLOPS
Application version UK Met Office HadAM4 at N144 resolution v8.08
i686-pc-linux-gnu
stderr out

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<message>
process exited with code 22 (0x16, -234)
</message>
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...

Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy

Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy

Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy

Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy

Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy

Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy
Sorry, too many model crashes! :-(
13:45:02 (3083): called boinc_finish(22)

</stderr_txt>
]]>
ID: 59652
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 59676 - Posted: 26 Feb 2019, 13:27:37 UTC

The good news is Sarah has worked out what is causing the problem.

The bad news is, so far she doesn't know how to fix it. :(
ID: 59676
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 59677 - Posted: 26 Feb 2019, 15:45:44 UTC - in response to Message 59676.  

The good news is Sarah has worked out what is causing the problem.

I wonder if they ever figured out what was causing the original Linux problem, or whether it just worked with a new batch?

That is, can we expect more Linux work in the future? I find it easier to put a Linux machine on CPDN, as I already have them set up to run BOINC anyway.
ID: 59677


©2024 climateprediction.net