Replanca Error/Sigseg fault.

bernard_ivo
Joined: 18 Jul 13
Posts: 380
Credit: 13,785,750
RAC: 20,400
Message 56439 - Posted: 22 Jun 2017, 21:00:07 UTC - in response to Message 56438.  

So I am letting mine run - have suspended work ahead of them in the queue to try and help resolve this issue as quickly as possible.


OK, then I will do the same and leave my two Linux machines to crunch 592s.

One question being posed is whether it is the Natural Greenhouse Gas or other forcing files that are the issue.


Can I check this and provide feedback?

Alan K
Joined: 22 Feb 06
Posts: 331
Credit: 16,667,448
RAC: 3,683
Message 56441 - Posted: 22 Jun 2017, 22:29:02 UTC - in response to Message 56438.  

I've suspended some of mine to get a couple of 592 tasks to the front of the queue on my Win machine.

Jean-David Beyer
Joined: 5 Aug 04
Posts: 381
Credit: 3,690,501
RAC: 647
Message 56443 - Posted: 23 Jun 2017, 2:20:52 UTC

On my 4-core 64-bit Xeon machine, I got my first segmentation fault in a long time.

wah2_sas50_namx_201612_8_592_011103518_0
Workunit 11103518
Created 21 Jun 2017, 17:06:08 UTC
Sent 22 Jun 2017, 9:04:36 UTC
Report deadline 4 Jun 2018, 14:24:36 UTC
Received 23 Jun 2017, 0:39:42 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 0 (0x0)
Computer ID 1256552
Run time 14 hours 1 min 12 sec
CPU time 12 hours 46 min 6 sec
Validate state Invalid
Credit 0.00
Device peak FLOPS 1.28 GFLOPS
Application version Weather At Home 2 (wah2) v8.25
i686-pc-linux-gnu
stderr out

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>
SIGSEGV: segmentation violation
Stack trace (13 frames):
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x839e357]
[0x55555400]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x81443f4]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x814b133]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8141220]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x813ff46]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8077583]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x831cd74]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8330985]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x833318a]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x8334c8d]
/lib/libc.so.6(__libc_start_main+0xe6)[0x30ed26]
/home/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.25_i686-pc-linux-gnu[0x804c7a1]

Exiting...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=15331, iMonCtr=2
Model crash detected, will try to restart...
Leaving CPDN_ain::Monitor...
Calling boinc_finish...19:46:08 (15331): called boinc_finish(0)
In boinc_exit called with status 0
Calloing set_signal_exit_code with status 0

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_1.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_2.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_3.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_4.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_5.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_6.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_7.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_8.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_sas50_namx_201612_8_592_011103518_0_r590766589_restart.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
]]>

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2722
Credit: 3,313,664
RAC: 960
Message 56444 - Posted: 23 Jun 2017, 6:51:56 UTC

Can I check this and provide feedback?

All of batch 592 and two other batches that have this problem have natural GHG forcing.

Hi all,

I think from what you have been saying that these failures seem to affect the “Natural” batches and not the “Actual” ones: batches 589 and 591 were OK, whereas batches 590, 592 and 583, which are all “Natural” forcing batches, were not. In a local run we tried swapping the SST and sea ice fields and got the same result, so we are wondering whether the GHG forcing (which is updated once a year) or the other natural forcing files are causing the issue.

As I say, we are actively running local tests at the moment to try to work out what is happening here. Interestingly, in one of the local runs where we left the working directories in place, the model failed and then continued to run to completion (so it would have restarted, running day 1 of the year in the global model again and going on to the regional model). Therefore you may find (if you happen to catch it) that if you suspend the job while it is running the first day of the new year in the global model and then restart it, it will then run to completion. This sort of error is making us think that it could be a memory issue somewhere…

As I say, any info on this is gratefully received!

Best wishes,
Sarah


I have asked whether there is an easy way to tell when a job is running the first day of the new year, now that we no longer have the graphics that used to show the model time as well as the timestep, which would have made this easy.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2722
Credit: 3,313,664
RAC: 960
Message 56445 - Posted: 23 Jun 2017, 8:08:22 UTC - in response to Message 56444.  

And with regards to the date the model is up to:

In the working directory there is a file, stdout_mon.txt. If you tail that file, it will show the date that the model is up to. Entries in it will look something like:

wah2_sas50_n50o_201612_1_d750_000005907 - PH 1 TS 0011611 A - 01/01/2017 22:45 - H:M:S=0007:18:06 AVG= 2.26 DLT= 1.87

The “A” before the date corresponds to running the global model and when it turns to “P” then it is running the regional model.
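
For anyone who would rather script this than read the file by eye, here is a minimal Python sketch that parses one line of stdout_mon.txt. The regular expression is inferred only from the sample entry above, so the field layout is an assumption, as is reading the date as day/month/year. The names parse_status, Status and in_global_first_day are purely illustrative and are not part of any CPDN or BOINC tool.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Pattern inferred from the single sample line quoted above; the real
# format may differ between application versions.
STATUS_RE = re.compile(
    r"^\s*(?P<task>\S+) - PH\s+(?P<phase>\d+) TS\s+(?P<timestep>\d+) "
    r"(?P<model>[AP]) - (?P<date>\d{2}/\d{2}/\d{4} \d{2}:\d{2}) - "
)


class Status(NamedTuple):
    task: str
    phase: int
    timestep: int
    model: str            # "A" = global model, "P" = regional model
    model_time: datetime  # assumed to be day/month/year


def parse_status(line: str) -> Optional[Status]:
    """Parse one status line; return None for any other kind of line."""
    m = STATUS_RE.match(line)
    if m is None:
        return None
    return Status(
        task=m["task"],
        phase=int(m["phase"]),
        timestep=int(m["timestep"]),
        model=m["model"],
        model_time=datetime.strptime(m["date"], "%d/%m/%Y %H:%M"),
    )


def in_global_first_day(status: Status) -> bool:
    """True while the global model ("A") is running 1 January."""
    t = status.model_time
    return status.model == "A" and t.month == 1 and t.day == 1
```

Feeding it the last line of the file tells you whether the task is currently in the global model on the first day of the year, which is the window in which suspending and restarting was suggested.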

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2722
Credit: 3,313,664
RAC: 960
Message 56446 - Posted: 23 Jun 2017, 9:41:25 UTC - in response to Message 56445.  

Though when I use the -F -s60 options I get an "Unable to follow end of this type of file" message.

I am sure there will be a way around this but I haven't looked deeply enough into the tail command yet to find it.

sudo tail filename works fine as a single shot to see where the task is up to.

bernard_ivo
Joined: 18 Jul 13
Posts: 380
Credit: 13,785,750
RAC: 20,400
Message 56447 - Posted: 23 Jun 2017, 11:07:45 UTC - in response to Message 56445.  
Last modified: 23 Jun 2017, 11:10:01 UTC

And with regards to the date the model is up to:

In the working directory there is a file, stdout_mon.txt. If you tail that file, it will show the date that the model is up to. Entries in it will look something like:

wah2_sas50_n50o_201612_1_d750_000005907 - PH 1 TS 0011611 A - 01/01/2017 22:45 - H:M:S=0007:18:06 AVG= 2.26 DLT= 1.87

The “A” before the date corresponds to running the global model and when it turns to “P” then it is running the regional model.


So if I understood correctly, I could monitor this file and once/if a WU fails I should post back here the last line, before the project clears up the directories.

Two more failed on Linux - this one and this one

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 1929
Credit: 41,067,747
RAC: 14,699
Message 56448 - Posted: 23 Jun 2017, 13:17:49 UTC - in response to Message 56446.  
Last modified: 23 Jun 2017, 13:19:08 UTC

Though when I use the -F -s60 options I get an "Unable to follow end of this type of file" message.

I am sure there will be a way around this but I haven't looked deeply enough into the tail command yet to find it.

sudo tail filename works fine as a single shot to see where the task is up to.

Just use the -f option and it will sit there and scroll in the terminal window with each timestep. All mine that have failed with this sigsegv error fail on the first timestep of the regional model on Jan 1st.
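
If tail will not follow the file on a particular setup, polling it from a script is one possible workaround. The Python sketch below is only an illustration under that assumption: read_new_lines and follow are made-up names, the 60-second default simply mirrors the -s60 interval mentioned above, and the file name is the stdout_mon.txt from the earlier post.

```python
import os
import time
from typing import Iterator, List, Tuple


def read_new_lines(path: str, offset: int = 0) -> Tuple[List[str], int]:
    """Return complete lines added to `path` after byte `offset`, plus the new offset."""
    if os.path.getsize(path) < offset:
        offset = 0  # file was truncated or replaced, so start again
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Hold back a partial last line until its newline has been written.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end


def follow(path: str, interval: float = 60.0) -> Iterator[str]:
    """Yield lines from `path` as they appear, checking every `interval` seconds."""
    offset = 0
    while True:
        lines, offset = read_new_lines(path, offset)
        for line in lines:
            yield line
        time.sleep(interval)
```

A loop such as `for line in follow("stdout_mon.txt"): print(line)` then prints the existing contents once and keeps printing new timesteps until it is interrupted.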

bernard_ivo
Joined: 18 Jul 13
Posts: 380
Credit: 13,785,750
RAC: 20,400
Message 56449 - Posted: 23 Jun 2017, 14:29:14 UTC
Last modified: 23 Jun 2017, 14:46:56 UTC

It looks like the third one of mine also crashed on the first timestep of the regional model on Jan 1st:
    wah2_sas50_n8f2_201612_8_592_011100643 - PH 1 TS 0011615 A - 01/01/2017 23:45 - H:M:S=0014:39:38 AVG= 4.54 DLT= 3.72
    wah2_sas50_n8f2_201612_8_592_011100643 - PH 1 TS 0011616 A - 02/01/2017 00:00 - H:M:S=0014:39:41 AVG= 4.54 DLT= 3.70
    wah2_sas50_n8f2_201612_8_592_011100643 - PH 1 TS 0011617 P - 01/01/2017 00:05 - H:M:S=0014:39:50 AVG= 4.54 DLT= 8.78
    Model crash detected, will try to restart...
    Leaving CPDN_Main::Monitor...
    Uploading out files...
    Queuing intermediate upload for CPDN/BOINC: cpdnout_out.zip



The 4th one I did not trace. I'm tracing 3 more.


Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2722
Credit: 3,313,664
RAC: 960
Message 56450 - Posted: 23 Jun 2017, 15:27:47 UTC - in response to Message 56449.  
Last modified: 23 Jun 2017, 15:58:24 UTC

Thanks George. Currently at 30-12-2016 12:15, so not long to go on the one I am monitoring.
Though that was the global model; it is now on the regional bit of the day and up to 18:20.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2722
Credit: 3,313,664
RAC: 960
Message 56452 - Posted: 23 Jun 2017, 19:27:34 UTC - in response to Message 56450.  

And all three failed during the first timestep of the regional bit of the first day of 2017.

Alan K
Joined: 22 Feb 06
Posts: 331
Credit: 16,667,448
RAC: 3,683
Message 56453 - Posted: 23 Jun 2017, 23:17:58 UTC - in response to Message 56452.  

wah2_sas50_nc8r_201612_8_592_011105600_0 failed after 1 trickle on Win. Have got 4 others running and a couple of others in the queue.

bernard_ivo
Joined: 18 Jul 13
Posts: 380
Credit: 13,785,750
RAC: 20,400
Message 56454 - Posted: 24 Jun 2017, 4:21:09 UTC - in response to Message 56449.  

Unfortunately I wasn't able to suspend them as suggested and all 3 failed

wah2_sas50_ncpy_201612_8_592_011106219 - PH 1 TS 0011617 P - 01/01/2017 00:05 - H:M:S=0014:10:02 AVG= 4.39 DLT= 8.74
Model crash detected, will try to restart...

wah2_sas50_n85n_201612_8_592_011100304 - PH 1 TS 0011617 P - 01/01/2017 00:05 - H:M:S=0012:13:27 AVG= 3.79 DLT= 7.10
Model crash detected, will try to restart...

wah2_sas50_ncmn_201612_8_592_011106100 - PH 1 TS 0011617 P - 01/01/2017 00:05 - H:M:S=0012:14:34 AVG= 3.79 DLT= 7.42
Model crash detected, will try to restart...

I have a few more but will not be around to monitor and suspend them, so they will most probably fail as well.
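
In case it helps others collect the same information, here is a rough Python sketch that pulls out the status line logged just before each "Model crash detected" message and reports the timestep, model flag and model date from it. It assumes the log looks like the excerpts above; crash_points and the regular expression are illustrative only and not something the project provides.

```python
import re
from typing import Iterable, List, Tuple

CRASH_MARKER = "Model crash detected"

# Timestep, model flag (A = global, P = regional) and model date, as they
# appear in the excerpts above; the layout is assumed from those lines.
POINT_RE = re.compile(r" TS\s+(\d+) ([AP]) - (\d{2}/\d{2}/\d{4} \d{2}:\d{2}) ")


def crash_points(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (timestep, model flag, model date) for the line before each crash."""
    points = []
    previous = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines between excerpts carry no information
        if line.startswith(CRASH_MARKER) and previous is not None:
            m = POINT_RE.search(previous)
            if m:
                points.append((int(m.group(1)), m.group(2), m.group(3)))
        previous = line
    return points
```

Run over the three excerpts in this post, every entry comes back with timestep 11617, flag P and the 01/01/2017 00:05 date, which is the pattern everyone has been describing.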

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2722
Credit: 3,313,664
RAC: 960
Message 56455 - Posted: 24 Jun 2017, 5:00:15 UTC - in response to Message 56454.  

Unfortunately I wasn't able to suspend them as suggested and all 3 failed


I was able to suspend two out of three. I am assuming, because the same percentage had completed before the crash, that the third fell over at exactly the same point.

Project people believe they are getting closer to identifying the problem but not there yet.

Alan K
Joined: 22 Feb 06
Posts: 331
Credit: 16,667,448
RAC: 3,683
Message 56458 - Posted: 24 Jun 2017, 9:21:17 UTC - in response to Message 56453.  

wah2_sas50_n8z3_201612_8_592_011101364_0 and wah2_sas50_n8kf_201612_8_592_011100836_0 both up to t/s 46,379 if this info helps.

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1019
Credit: 5,218,143
RAC: 5,544
Message 56469 - Posted: 26 Jun 2017, 11:29:37 UTC

Two SAS50/8 from batch #592 have failed on my Mac at the same point and before sending the first Zip. Two models from batch #592 have completed successfully on my Windows machine.

bernard_ivo
Joined: 18 Jul 13
Posts: 380
Credit: 13,785,750
RAC: 20,400
Message 56471 - Posted: 26 Jun 2017, 15:25:12 UTC - in response to Message 56454.  

All 592s under Linux failed at the same place Jan 1st 2017 when the regional model kicked in. Unfortunately I could not suspend them in time to test whether they will run to completion.

I have 4 running on a win machine and they seem fine.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2722
Credit: 3,313,664
RAC: 960
Message 56472 - Posted: 26 Jun 2017, 17:24:18 UTC - in response to Message 56471.  

All 592s under Linux failed at the same place Jan 1st 2017 when the regional model kicked in. Unfortunately I could not suspend them in time to test whether they will run to completion.


Seems pretty universal, even if suspended during first day. Project have been advised.

bernard_ivo
Joined: 18 Jul 13
Posts: 380
Credit: 13,785,750
RAC: 20,400
Message 56572 - Posted: 27 Jul 2017, 8:28:57 UTC
Last modified: 27 Jul 2017, 9:08:17 UTC

I have 3 WUs from batch 617 that failed on my Linux box. Two failed with SIGSEGV: segmentation violation after 14 h,
https://www.cpdn.org/cpdnboinc/result.php?resultid=20564889
https://www.cpdn.org/cpdnboinc/result.php?resultid=20566748

the third one crashed at the 8-minute mark with
Model crashed:
Leaving CPDN_ain::Monitor...
Calling boinc_finish...09:30:49 (16432): called boinc_finish(0)
In boinc_exit called with status 0
Calloing set_signal_exit_code with status 0

EDIT: The third one seems to be fine on Windows, as it has produced 3 trickles already.

I have a few of that batch on two Linux machines.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2722
Credit: 3,313,664
RAC: 960
Message 56573 - Posted: 27 Jul 2017, 9:56:47 UTC - in response to Message 56572.  
Last modified: 27 Jul 2017, 10:01:30 UTC

I have a retread on Linux that has already failed once on Darwin with a sigseg fault after about 8 hours. I have moved it to the top of the queue to see what happens.

I should say that it is looking likely that batches where a significant number of tasks fall over are not going to be uncommon. The restart files from these batches will often form the basis for a follow-up batch which, because the initial conditions have not forced it into an impossible climate etc., will have a much higher success rate.