New work Discussion

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4472
Credit: 18,448,326
RAC: 22,385
Message 59766 - Posted: 10 Mar 2019, 17:39:10 UTC - in response to Message 59764.  

My i7 grabbed 2 of the batch 797 tasks and they each failed with a Signal 11 error 2 minutes into the run.


TNC 2005(act799, nat**), 2010(act798, nat**), 2015 (act797, nat**) topup runs

Interesting. As these are top-up runs, any issues with the tasks should already have been sorted out. I haven't yet worked out which batch they are a top-up for.
ID: 59766
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4472
Credit: 18,448,326
RAC: 22,385
Message 59767 - Posted: 10 Mar 2019, 17:40:30 UTC - in response to Message 59760.  

Small batch #796 of 38 global models at 25 km resolution for 1 month (batch list).


This is a test batch, so if you see anything untoward on these, please let us know so the project can be informed.
ID: 59767
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2173
Credit: 64,760,426
RAC: 3,180
Message 59769 - Posted: 10 Mar 2019, 19:08:54 UTC - in response to Message 59764.  
Last modified: 10 Mar 2019, 19:18:40 UTC

My i7 grabbed 2 of the batch 797 tasks and they each failed with a Signal 11 error 2 minutes into the run.

It's now crashed all 4 of the models from the SAM25 batches that it's downloaded with signal 11, all at ~2 min 20 sec. This is when the regional part of the model starts. There are no attempted restarts; they just die after the last global timestep as the regional model starts.
Edit...on the other hand, I allowed my i5 to download one and it's gotten into the regional model without crashing. Very weird.
ID: 59769
Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 59770 - Posted: 10 Mar 2019, 19:25:13 UTC - in response to Message 59769.  

Edit...on the other hand, I allowed my i5 to download one and it's gotten into the regional model without crashing. Very weird.

Some years ago, I noted that the first work units of a group crashed often, and then the later ones ran OK. I don't know any reason for that, but maybe it is happening here. But if so, it is probably how they generate their models, with the more extreme initial conditions going first. On the other hand, no one has explained what "Signal 11" is, so it may be something else.
ID: 59770
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59771 - Posted: 10 Mar 2019, 20:04:38 UTC

A signal 11 error, commonly known as a segmentation fault, means that the program accessed a memory location that was not assigned to it. A signal 11 error may be due to a bug in one of the software programs that is installed, or faulty hardware.
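
For anyone who wants to check the number-to-name mapping themselves, here is a minimal sketch using Python's standard signal module. The helper name describe_signal is made up for illustration; it is not part of BOINC or the CPDN application, and it only looks the number up. It says nothing about why a given task crashed.

```python
import signal


def describe_signal(number: int) -> str:
    """Return the symbolic name for a signal number, or a fallback string.

    describe_signal is a hypothetical helper for illustration only.
    """
    try:
        # signal.Signals is the standard-library enum of signal numbers.
        return signal.Signals(number).name
    except ValueError:
        # Raised when the number is not a signal known on this platform.
        return f"unknown signal {number}"


if __name__ == "__main__":
    print(describe_signal(11))
```

On the usual Linux and Windows builds of Python, signal 11 should come back as SIGSEGV, the segmentation violation described above.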

*********************

This is happening so often now, that perhaps it needs looking into.
ID: 59771
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2173
Credit: 64,760,426
RAC: 3,180
Message 59772 - Posted: 10 Mar 2019, 20:28:17 UTC - in response to Message 59771.  

A signal 11 error, commonly known as a segmentation fault, means that the program accessed a memory location that was not assigned to it. A signal 11 error may be due to a bug in one of the software programs that is installed, or faulty hardware.

It seems like certain processors get a lot more of these than others. I don't know if it's a generation of processor thing or a Windows thing. It'd be interesting to see a breakdown of Signal 11 errors by CPU. The i7 I have running is a laptop, but high end. I'm running at most 2 models at a time on it, and it's run plenty of models before. It's prime95 stable for 12 hours on 4 cores, and cpdn doesn't tax the processor near as much as prime95. So I'm at a loss as to what's going on here.
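
A breakdown like that could be sketched roughly as follows, assuming someone had a list of (CPU, outcome) pairs pulled from the task pages. The function name, the "signal11" outcome label and the sample rows are all invented for illustration; no real project data is used here. The point is to divide failures by the number of tasks each CPU type actually ran, so a common CPU doesn't look worse just because there are more of them.

```python
from collections import Counter


def failure_rate_by_cpu(tasks):
    """Tally Signal 11 failures per CPU label.

    tasks: iterable of (cpu_label, outcome) pairs, where outcome is the
    hypothetical string "signal11" for a crash and anything else otherwise.
    Returns {cpu_label: (failures, total, failure_rate)}.
    """
    totals = Counter()
    failures = Counter()
    for cpu, outcome in tasks:
        totals[cpu] += 1
        if outcome == "signal11":
            failures[cpu] += 1
    return {
        cpu: (failures[cpu], totals[cpu], failures[cpu] / totals[cpu])
        for cpu in totals
    }


if __name__ == "__main__":
    # Made-up rows, purely to show the shape of the input.
    sample = [
        ("i7", "signal11"), ("i7", "signal11"), ("i7", "ok"), ("i7", "ok"),
        ("i5", "signal11"), ("i5", "ok"),
    ]
    for cpu, (failed, total, rate) in failure_rate_by_cpu(sample).items():
        print(f"{cpu}: {failed}/{total} failed ({rate:.0%})")
```

In this invented sample the i7 has twice as many failures as the i5 but the same failure rate, which is why raw counts alone wouldn't settle the question.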
ID: 59772
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 59773 - Posted: 10 Mar 2019, 21:10:07 UTC - in response to Message 59771.  

Another Signal 11 fish floats belly-up:

wah2_sam25_a0bi_201412_24_797_011771638_0

This one died on my oldest i5 desktop box (i5-3550) in Win10 after 2m50s CPU time and 3m14s wall time.

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 59773
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4472
Credit: 18,448,326
RAC: 22,385
Message 59774 - Posted: 10 Mar 2019, 21:26:24 UTC

This is happening so often now, that perhaps it needs looking into.


I agree. Below are the batch statistics for another batch that looks like it has major problems.

I have messaged the "owner" of batches 789, 790 and 791 and will update him, along with the batch statistics, in the morning. But it isn't just his batches; it seems to be across all those who submit batches to the system.

Batch: 795
Success: 0 (0%)
Fails: 13 (3%)
Hard Fail: 0 (0%)
Running: 20 (4%)
Unsent: 460 (96%)
ID: 59774
Alan K
Joined: 22 Feb 06
Posts: 487
Credit: 30,493,229
RAC: 6,415
Message 59802 - Posted: 13 Mar 2019, 8:09:09 UTC - in response to Message 59772.  

I've just had 1 from batch 797 and 2 from 798 fail with a segmentation error. One batch 797 task is still going, though.
ID: 59802
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4472
Credit: 18,448,326
RAC: 22,385
Message 59803 - Posted: 13 Mar 2019, 8:55:50 UTC - in response to Message 59802.  
Last modified: 13 Mar 2019, 10:23:35 UTC

I have 2 797's, both on their second attempt. The first, even allowing for my laptop being much slower than the machine that first tried it, is now well past the point at which it failed. The second one has just started, so I will wait and see. Unfortunately, I can't see the breakdown by processor type and OS to look for patterns. There is also the issue that my system is lying to CPDN and is running Windows tasks under WINE. I don't know how many of us are doing this, or whether it will skew the statistics.

Edit: The second one didn't fail till over 3 hours in, so I won't know whether it is failing at the same point till about fifteen hours in on my slower system.

Edit2: Looked at around 30 failures, and the running ones in between, on batch 797. All were showing Windows 10, but then so were all of those still running! Looks like M$ have convinced nearly everyone to change. The failures I looked at covered i5, i7, Xeon and AMD CPUs. My guess is that the failures, within the margin of error, simply reflect the proportion of each CPU type. So in the absence of a statistical analysis by the project, I don't see anything of value in that line of enquiry.
ID: 59803
Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 59804 - Posted: 13 Mar 2019, 12:17:49 UTC
Last modified: 13 Mar 2019, 12:20:25 UTC

All three of the 797s I have run have failed consistently at around 3 1/2 hours on a Ryzen 2600 (Win 10). It is the same for three 798's and 799's.

At least they are failing quickly. And the earlier ones (788 to 794) are running fine, at up to three days now.
ID: 59804
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4472
Credit: 18,448,326
RAC: 22,385
Message 59805 - Posted: 13 Mar 2019, 12:35:15 UTC - in response to Message 59804.  

All three of the 797s I have run have failed consistently at around 3 1/2 hours on a Ryzen 2600 (Win 10). It is the same for three 798's and 799's.


Of the two 797s I have, one failed at about four or five minutes on its first attempt. The second at about the 3 1/2 hour mark. The first is now at 6 1/2 hours so way past where it failed on first go round but on my much slower machine probably another six hours before the second hurdle. Most of those I looked at on the task pages failed at a few minutes in with a much lower percentage failing a few hours in. I only found 2 that had gotten as far as the first zip.
ID: 59805
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 59807 - Posted: 13 Mar 2019, 23:45:08 UTC

Frustrating day. Ten SAM25 tasks were downloaded across four Intel boxes in Windoze 10. Tasks were approximately equal in distribution across the boxes and across SAM25 797/798/799. (One or two tasks at a time were downloaded on each box.) ALL TEN DIED AFTER ~3 SECONDS ON i5/i7 desktops. (Adding to the fun, this was M$ bug-fix day - and, as a self-defense measure, I micro-manage Windoze 'updates'.)

Perhaps that SAMxx scientist needs a bit of retraining in configuring input file structure -- and/or more attention to detail, eh?

Too common in SAMxx tasks...
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 59807
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 59808 - Posted: 14 Mar 2019, 0:56:12 UTC

Commiserations Astro.

For the non-mods, the following is part of a reply to my email about the high failures:

Yes we noticed the high failure rate with this region and we think it is when the model is set up to do the vegetation as well as the climate that the failure rates increase.

It is on our list of things to investigate and we will be doing some analysis of the failures on this one!


************************

Welcome to the World of Advanced Climate Modelling.
And it'll get worse when they start using the new high res models.
ID: 59808
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2173
Credit: 64,760,426
RAC: 3,180
Message 59809 - Posted: 14 Mar 2019, 1:50:57 UTC - in response to Message 59807.  

Looks like the ones that are failing on Jim's/astroWX's PCs are doing the same thing they did on mine...going through the global first day, then failing when starting the regional model on that day.

I tried again on my i7 and it grabbed one 799 task and has gone a month with it now with no Signal 11. But it's only running one right now. Before it was running an ANZ and the SAM25. No problems with the ANZ models that I've seen.
ID: 59809
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4472
Credit: 18,448,326
RAC: 22,385
Message 59810 - Posted: 14 Mar 2019, 7:56:36 UTC
Last modified: 14 Mar 2019, 10:52:05 UTC

And hopefully away from discussions on failures,

Batch 800, with 3,300 EU25 13-month tasks, has been released.

Edit: And batch 801, another 8,658 tasks, as part of the same experiment.
ID: 59810
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4472
Credit: 18,448,326
RAC: 22,385
Message 59812 - Posted: 14 Mar 2019, 13:02:26 UTC

I may be tempting fate, but of the two 797's I have, one is well past the point where it failed on the first try. One failed a few minutes in and has now been running for almost 31 hours. The other failed at about the 3.5 hour point and is now just over 16 hours in, so I suppose it could still be about to fall over on my much slower machine.

It would be interesting to know how many are on Linux machines using WINE, and how easy it is or isn't to pick that up from the sched_request_climateprediction.net.xml that the server gets its information from. I need to change this machine to say it is using Win10 rather than XP to see if it shows up as an identical Win10 version to my laptop. I will then need to look at what Win10 machines show.

Reason for these musings is I don't know whether there are enough WINE machines to skew any OS based statistics that the project are looking at.
ID: 59812
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,280,735
RAC: 5,823
Message 59813 - Posted: 14 Mar 2019, 17:32:47 UTC

Is there something unusual about the wah2_sam25 models of batch 797? Are they unusually high resolution or something? I started one yesterday and it is progressing extremely slowly. At 24 hours it is only 0.31% complete. At this rate it will take more than 300 days to complete. The computer is an i3 2.7 GHz with 8 GB of RAM running Win10. It runs other WUs at normal speed.
ID: 59813
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4472
Credit: 18,448,326
RAC: 22,385
Message 59814 - Posted: 14 Mar 2019, 18:18:06 UTC - in response to Message 59813.  

At this rate it will take more than 300 days to complete. The computer is an i3 2.7 GHz with 8 GB of RAM running Win10. It runs other WUs at normal speed.


I have two of this batch running on my 2.16 GHz laptop. One is 3.274% complete after 35 hours; the other is 1.5% complete after a tad under 19 hours. So even on my slower box they should complete in around 50 days.

I suspect there is something wrong with that particular task. Have you tried suspending other tasks to see if it speeds up? (clutching at straws rather than expecting it to make a difference.)
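
The arithmetic behind these estimates is a straight-line extrapolation, sketched below. The function name projected_days is made up for illustration, and the big assumption is that progress stays linear for the whole run, which a climate model need not do.

```python
def projected_days(elapsed_hours: float, percent_complete: float) -> float:
    """Estimate total run time in days, assuming progress stays linear.

    projected_days is a hypothetical helper, not part of BOINC.
    """
    if percent_complete <= 0:
        raise ValueError("percent_complete must be greater than zero")
    # Hours for 100% at the observed rate, then convert to days.
    total_hours = elapsed_hours * 100.0 / percent_complete
    return total_hours / 24.0


if __name__ == "__main__":
    # Figures taken from the posts in this thread.
    observations = [
        ("i3, 0.31% after 24 h", 24, 0.31),
        ("laptop, 3.274% after 35 h", 35, 3.274),
        ("laptop, 1.5% after 19 h", 19, 1.5),
    ]
    for label, hours, percent in observations:
        print(f"{label}: about {projected_days(hours, percent):.0f} days")
```

On those numbers the i3 task comes out at well over 300 days, while the two laptop tasks land somewhere around 45 to 53 days, so the slow one really is an outlier.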
ID: 59814
Alan K
Joined: 22 Feb 06
Posts: 487
Credit: 30,493,229
RAC: 6,415
Message 59815 - Posted: 14 Mar 2019, 19:17:38 UTC - in response to Message 59813.  

I have one that is just over 6% after 1 day on my 3.5 GHz i5. One on my slower i5 failed after 4 minutes - seg violation!
ID: 59815