Iceworlds & Slowdowns hadsm3/mh - Closed - Discussion

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 35651 - Posted: 5 Dec 2008, 20:31:10 UTC

Hi Zdespi

Thanks again for reporting the problem. That's very bad luck to get 4 iceworlds in such a short time. I don't think it's a problem with your computer (occasionally an unstable computer can cause HADSMs to become iceworlds) because other people with models from the same workunits also have iceworlds.

I'll send private messages to the other members with models from the same workunits.
ID: 35651
m.mitch
Joined: 10 Jan 06
Posts: 55
Credit: 1,252,851
RAC: 8,613
Message 35731 - Posted: 22 Dec 2008, 2:29:12 UTC
Last modified: 22 Dec 2008, 2:29:58 UTC

What a pity. I normally leave these machines unattended but by chance checked this one (before going away for 3 weeks too). It looks like it has been returning trickles once every month or two.

With the planet below 42 from north to south I'm not surprised. :-)

Here 'tis: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7307095

I'll abort it now.


ID: 35731
old_user22652
Joined: 3 Oct 04
Posts: 39
Credit: 13,172,838
RAC: 0
Message 35800 - Posted: 3 Jan 2009, 17:50:40 UTC

Help, please. Before retirement, my professional life was spent in a galaxy far, far away, but now I am trying to get to grips with the technicalities of CPDN models; for me the learning curve is steep indeed.

On my computer designated E100, a mid-holo model resultid 8225658 turned "blue" and the backup failed at the same point, so I ditched it.

The pressure and temperature graphs look pretty much OK; can I find any more info as to why the model crashed?

Any help gratefully received.

John
GW3PRV
ID: 35800
Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 35801 - Posted: 3 Jan 2009, 18:23:50 UTC - in response to Message 35800.  
Last modified: 3 Jan 2009, 18:34:48 UTC

... On my computer designated E100, a mid-holo model resultid 8225658 turned "blue" and the backup failed at the same point, so I ditched it.

The pressure and temperature graphs look pretty much OK; can I find any more info as to why the model crashed? ...

John,

The work unit from which your model was taken is 6256583. Every Windows/Intel computer in that work unit has the same problem: they've all stalled after the first timestep in phase 3. This can be seen from the dramatic increase in the Average (sec/TS) figure, which is caused by the slowdown that accompanies an 'iceworld'.
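As a rough illustration of spotting that kind of jump, here is a minimal Python sketch. It assumes you have copied a list of sec/TS figures, one per trickle, from a result page by hand; the function name `find_slowdown`, the threshold factor of 5 and the sample numbers are all hypothetical and are not anything CPDN itself provides.

```python
from statistics import median


def find_slowdown(sec_per_ts, factor=5.0, baseline_len=3):
    """Return the index of the first value that exceeds `factor` times
    the median of the first `baseline_len` values, or None if no value does."""
    if len(sec_per_ts) <= baseline_len:
        return None  # not enough data to form a baseline and test anything
    baseline = median(sec_per_ts[:baseline_len])
    for i in range(baseline_len, len(sec_per_ts)):
        if sec_per_ts[i] > factor * baseline:
            return i
    return None


# Hypothetical sec/TS figures per trickle, not taken from any real result page.
example = [3.1, 3.0, 3.2, 3.1, 3.3, 41.7, 58.2]
print(find_slowdown(example))
```

The factor is a judgment call: a busy computer can easily double the sec/TS for a while, so the sketch only flags a much larger jump.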

The usual explanation for this is that the model's parameters have caused the state of the model to become in some way unrealistic at the point of the slowdown. This is understandable, as the methods for generating sets of parameters aren't necessarily based on what are thought to be reasonable values - the project is simply trying to exercise as many sets as possible and determine which regions of 'parameter space' are viable.

My one doubt about this explanation is that it seems that many more Windows/Intel computers are affected by this than AMD or Linux or Mac. But, of course, there are many more Windows/Intel computers participating in CPDN than AMD or Linux or Mac, so perhaps my worry is just a bias in the sample of people who pitch up at the message boards. Nonetheless, I would like to see a comparison of platform 'iceworld' rates. Look, for example, at the exemplary geophi: I'm a careful Windows/Intel-only participant and I can't match that completion rate.
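A comparison of that kind would divide iceworld counts by the number of models each platform actually ran, which is what removes the "more Windows/Intel machines" bias. The Python sketch below shows the arithmetic only; the function name `iceworld_rates` and every record in `sample` are made up for illustration and say nothing about real CPDN rates.

```python
from collections import defaultdict


def iceworld_rates(records):
    """records: iterable of (platform, went_iceworld) pairs.
    Returns {platform: (iceworlds, total, rate)}."""
    counts = defaultdict(lambda: [0, 0])  # platform -> [iceworlds, total]
    for platform, went_iceworld in records:
        counts[platform][1] += 1
        if went_iceworld:
            counts[platform][0] += 1
    return {p: (ice, total, ice / total) for p, (ice, total) in counts.items()}


# Made-up records, purely to show the calculation.
sample = [
    ("Windows/Intel", True), ("Windows/Intel", False),
    ("Windows/Intel", False), ("Windows/Intel", True),
    ("Windows/AMD", False), ("Windows/AMD", False),
    ("Linux/Intel", False),
]

for platform, (ice, total, rate) in sorted(iceworld_rates(sample).items()):
    print(f"{platform}: {ice}/{total} = {rate:.0%}")
```

With only a handful of models per platform the rates would be very noisy, so a real comparison would need the project's full result database rather than the cases reported on the boards.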

Iain
ID: 35801
Pooh Bear 27
Joined: 5 Feb 05
Posts: 465
Credit: 1,914,189
RAC: 0
Message 35802 - Posted: 3 Jan 2009, 19:08:48 UTC

Look, for example, at the exemplary geophi: I'm a careful Windows/Intel-only participant and I can't match that completion rate.

Iain, you need to compare apples to apples: geophi runs one type of model, and that's all. That is a lot easier to keep stable. You run many different types of models.

I have a pretty good completion rate on my Intels with the same models geophi does. My rate is also pretty good otherwise. I was here when several bad models were distributed, and there is a set of bad units all lumped together during that period.

Some model types are a lot more stable than others. The shorter ones for example. The longer ones can have more reasons for failure.

We are all doing our part for the project. Getting an iceworld is part of the experiment.

ID: 35802
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 35804 - Posted: 4 Jan 2009, 1:37:24 UTC
Last modified: 4 Jan 2009, 1:44:08 UTC

My own 'theory' (=wild idea) about iceworlds is that their cause may be similar to whatever used to make quite a few BBC and CPDN HADCM models unstable and start looping. The occurrence of this HADCM instability and looping was often dependent on the type of CPU (AMD or Intel) with the result that some members got a restored backup past the previous looping point by transferring the backup to a computer with the other type of CPU.

Tolu managed to solve this problem in the HADCMs, first by making looping cause the model to crash instead of continuing indefinitely, and then by optimising HADCMs. How he did this I don't know, but I think it must have involved more than just improving the I/O activity. Unless the previous higher level of I/O activity was the root cause of the instability that caused looping.

The occurrence of iceworlds also often seems to be dependent on the type of CPU. But iceworld occurrence also sometimes seems to be OS-dependent; I don't think this was the case with HADCM loopers. On the other hand, for a long time we couldn't even look for OS-dependency in loopers because only one task from each BBC HADCM workunit was sent out.

What we do know for sure is that HADSMs are also particularly vulnerable to the slightest instability in any type of computer.
ID: 35804
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 35806 - Posted: 4 Jan 2009, 2:09:02 UTC
Last modified: 4 Jan 2009, 2:28:27 UTC

Hi GW3

Because HADSMs only show us their graphs after the completion of whole phases, we usually only see graphs that illustrate the iceworld problem if the iceworld developed a few trickles before the end of a phase and the member persevered with the model until after the phase change.

I posted a couple of such graphs further up this thread but they now won't display. Let's see if one of them will display now. This member persevered because his iceworld was one of the more unusual fast-processing ones. He completed this entire phase, but although the temp graph looks normal, as soon as the iceworld developed the model stopped processing precipitation:

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?field=Precipitation&resultid=6913610

Edit: Sorry, I can't make graph images display at all, but you can go to that page and look at the phase 1 graphs.

(Why doesn't [ img]...[/img] display an image now?)
ID: 35806
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 35807 - Posted: 4 Jan 2009, 2:25:59 UTC

Unless someone says they've already sent PMs to the other people stuck in the same iceworld as GW3, I'll send messages to them on Sunday evening.
ID: 35807
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 35808 - Posted: 4 Jan 2009, 3:05:37 UTC
Last modified: 4 Jan 2009, 3:08:37 UTC

This is what Mo was trying to display:



As you can see, someone has cut the string, and most of the balloons are now up against the ceiling.
ID: 35808
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 35809 - Posted: 4 Jan 2009, 16:16:53 UTC

Running lots of hadsm3 and hadsm3h models over the last couple years, I've had four that "iceworlded". Two were slow-processing iceballs, and two were fast-processing iceballs. All four of those were on Windows PCs that were either running one model at a time, or two models of the same kind. When running hadsm3 type models alongside hadcm3 in Linux, the hadsm3 models never crashed, although, overall, the hadcm3 models had a 30-35% failure rate on my PCs. So, my experience is that I have never seen a Linux iceworld, out of hundreds completed. There has been an occasional crash, as seen in the listing for my Phenom 9600 that Iain linked to (one out of 62), and Phenom 9950 (one out of 32). Perhaps those crashes are when an iceworld would have otherwise happened, but the Linux executable recognized it as a rewindable problem, whereas the Windows one doesn't and just keeps on processing? That would seem rather strange though, as one would think that the code that recognizes unrealistic atmospheric conditions should be carried through from one OS to another.

ID: 35809
JimMcCarthy_StellarSolns
Joined: 3 Sep 08
Posts: 23
Credit: 41,989,607
RAC: 2,734
Message 35813 - Posted: 4 Jan 2009, 22:29:53 UTC

I'm posting to report that the model "hadsm3_ki4g_006004498_3" I've been processing has also turned into an "iceworld". Here is the rest of the information requested:

1. A link to the model/ResultID webpage

This link points to results for my Task ID 8152663.

2. A current timestep of that model (on the globe graphic)

Timestep is 27,555 of 259,248. Model date and time are 05/07/2052 01:30. Last trickle was reported on 09-Dec-2008, at timestep 21,604 (but this machine is only connected to the internet one or two days per week, and not at all between 12-Dec-2008 and 03-Jan-2009). For what it may be worth, I have WinZIP backup archives of my BOINC data (work) folders on 08-Dec-2008, 11-Dec-2008, and 04-Jan-2009. I hadn't noticed the "iceworld" appearance until today (04-Jan-2009), when I checked it *after* I'd re-established the internet connection and completed all pending transfer uploads, suspended all BOINC CPDN tasks, shut down the BOINC client, closed the BOINC manager, backed up my BOINC data (work) folders with WinZIP, defragmented the disk, shut down the computer for a cold reboot, and then resumed the BOINC CPDN tasks.

3. The s/TS value (on the globe graphic. Remember, you can hit the Z key while viewing the globe and it will give you this additional text/status information.)

The ^Z key screen reports Hours Elapsed 1448:02:13 (6.47 s/TS).

4. Whether the temperature display of the globe graphic is blue.

Yes, temperature display is blue. Clouds only appear surrounding the coastline of Antarctica.

5. What your processor/CPU is (i.e. Intel, AMD)

Processor is GenuineIntel, Intel(R) Xeon(TM) CPU 2.80GHz [x86 Family 15 Model 4 Stepping 10]. Operating System is Microsoft Windows XP Professional x86 Edition, Service Pack 3, (05.01.2600.00).

6. Whether you are overclocking.

No. However, this is the 2nd HADSM3 model on this machine to turn into an iceworld. The first instance was described in this message thread.

I've now aborted this model, but if asked, I'd be happy to restore and restart it from one of the earlier backups cited above.

Regards,

--Jim
ID: 35813
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 35815 - Posted: 5 Jan 2009, 0:19:54 UTC
Last modified: 5 Jan 2009, 1:05:23 UTC

Looks like you did the right thing Jim.

From the same work unit, this one went iceworld at exactly the same place as yours did, as did this other one. But they are both still running, very, very slowly.

All three that went iceworld out of that work unit are on Intels with Windows. A Mac in Darwin completed it, as did an AMD Phenom in Windows.

It would be interesting to find out on any AMD iceworld models, if Intels in the same work units were able to complete one.

Edit...I should have just listened to Iain. All iceworld ResultIDs listed in this thread, and other iceworlds in the same work units, were Intel/Windows.
ID: 35815
Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 35816 - Posted: 5 Jan 2009, 0:20:02 UTC - in response to Message 35813.  
Last modified: 5 Jan 2009, 0:43:47 UTC

[JimMcCarthy_NGST wrote:]I've now aborted this model, but if asked, I'd be happy to restore and restart it from one of the earlier backups cited above.

Jim,

[Edit: geophi text supersedes.]

Thanks for reporting that model. I will send a message to the other affected users in that work unit.

Iain

[PS And, Mo, I'll PM the affected users in the WU reported by GW3PRV. Now done.]
ID: 35816
DaveG27
Joined: 8 Nov 06
Posts: 18
Credit: 2,425,895
RAC: 0
Message 35817 - Posted: 5 Jan 2009, 0:43:08 UTC
Last modified: 5 Jan 2009, 0:43:59 UTC

On the question of AMD vs Intel and Linux vs Windows, I have been doing a little research and came up with this. I found two W.U.s that fit the bill (found them surprisingly quickly this time). I have provided links to their P3 temperature graphs.
Because the difference is small, it's best to view them like this: open three new tabs and load one of the three graphs into each tab. You can then switch between tabs to see the difference (works in Firefox; presumably it works in IE).

AMD vs Intel W.U. 6176174 all Windows XP
Task
7609806 AMD comp. 834511
7609800 Intel comp. 206022
7609801 Intel comp. 673899

Windows XP vs Linux W.U. 6176172 all Intel
Task
7609778 Linux comp. 890788
7609779 XP comp. 903134
7609777 XP comp. 878045

Dave
ID: 35817
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 35819 - Posted: 5 Jan 2009, 1:42:03 UTC
Last modified: 5 Jan 2009, 1:42:58 UTC

Here are the graphs Dave found:

XP + AMD


XP + Intel


XP + Intel


Yes, the Intel graphs are more similar to each other and the AMD more different. But Tolu has said he expects almost all models from a single workunit to produce slightly different results.
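"More similar" and "more different" can be put into numbers if the graph values are available as lists. The Python sketch below computes the root-mean-square difference between every pair of series. It assumes equal-length series, and the labels and temperature values are invented for illustration; they are not read from the graphs Dave found.

```python
from itertools import combinations
from math import sqrt


def rms_difference(a, b):
    """Root-mean-square difference between two equal-length series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def pairwise_rms(series):
    """series: {label: list of values}. Returns {(label1, label2): rms}."""
    return {
        (n1, n2): rms_difference(series[n1], series[n2])
        for n1, n2 in combinations(sorted(series), 2)
    }


# Invented values for three tasks from one hypothetical work unit.
series = {
    "XP+AMD": [14.0, 14.3, 14.9, 15.4],
    "XP+Intel A": [14.0, 14.1, 14.6, 15.0],
    "XP+Intel B": [14.0, 14.1, 14.7, 15.0],
}

for pair, value in sorted(pairwise_rms(series).items()):
    print(pair, round(value, 3))
```

A small pairwise figure between the two Intel runs and a larger one against the AMD run would match what the eye sees when flipping between tabs, but it would not by itself say which run is closer to being "right".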
ID: 35819
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 35820 - Posted: 5 Jan 2009, 2:28:07 UTC
Last modified: 5 Jan 2009, 2:46:46 UTC

Here's Dave's second group, all from the same workunit:

Intel + Linux:



Intel + XP:



Intel + XP:



Yes, the two XP graphs are more similar to each other and the Linux graph is more different.

But I don't think this explains why more Intels seem to be generating more iceworlds than AMDs.

It's certainly possible for AMDs to produce iceworlds. My AMD produced two separate iceworlds in a Beta MH model last spring whereas PeteB's AMD completed it without incident. It was the first iceworld that revealed an instability in my AMD (now corrected). That particular Beta version always failed on Intels, so we can't compare Intels and AMDs within the WU. I reported the two Beta iceworlds within the same model here.

Perhaps AMDs usually only produce iceworlds under the extreme circumstances of an unstable computer. Not that mine was very unstable; I hadn't noticed anything abnormal in the computer's behaviour since it had been put together from parts that had belonged to two computers. The overclock I didn't know about turned out to be 2½%.

The same model produced a second iceworld two phases later after the computer had been reset to stock speed. I assume that starting the model on an unstable computer damaged the model in some way and this caused the second iceworld later.
ID: 35820
DaveG27
Joined: 8 Nov 06
Posts: 18
Credit: 2,425,895
RAC: 0
Message 35821 - Posted: 5 Jan 2009, 5:39:11 UTC

In W.U. 6176172 there were 2 more finishers, one of which was an Intel/Vista, which is most similar to XP.
W.U. 6176166 also has 5 finishers: 3 Intel/XP, 1 Intel/Linux and 1 AMD/Linux.
The 3 Intel/XP are very similar. The Intel/Linux is slightly different from the AMD/Linux, and both are slightly different from the other 3.
Dave
ID: 35821
JimMcCarthy_StellarSolns
Joined: 3 Sep 08
Posts: 23
Credit: 41,989,607
RAC: 2,734
Message 35827 - Posted: 5 Jan 2009, 19:18:23 UTC - in response to Message 35815.  
Last modified: 5 Jan 2009, 19:31:10 UTC

geophi replied:

Looks like you did the right thing Jim.
<snip>

Edit...I should have just listened to Iain. All iceworld ResultIDs listed in this thread, and other iceworlds in the same work units, were Intel/Windows.


I have my PCs (including those I'm using to crunch models for CPDN) configured to dual boot WinXP Pro SP3 and Linux. Specifically, the Linux distribution I run is Scientific Linux v4.x, which is built from Red Hat Enterprise Linux v4.x source RPMs (if you click on the link, expect to see a notice about an expired security certificate or something -- I believe it's safe to ignore the warning and proceed to the site). The non-root disk partitions where I keep my BOINC data files are FAT32, and thus are visible (read/write, no problems) on the machine regardless of whether it's running the WinXP or Linux operating system.

Thus far, I've only been running BOINC while the machines are booted and running Windows (in part for convenience, and mostly because I recall reading somewhere that BOINC and/or CPDN on Linux requires RHEL v5, or at least an upgrade to the v4.x glibc library, and who-knows-what-else ... likely much more) -- but the reply posts in this thread raise in my mind the question whether the data files in the BOINC project folders (for models in progress) are OS neutral, and whether a Linux platform could continue to process a model begun on a Windows platform? (Probably not ... and sorry if this is a FAQ ...). I realize that this probably wouldn't be simple to configure and accomplish, even if it were possible, but given the ease of being able to boot the machine to either OS and the desire to restore-from-backup-and-complete my models that became "iceworlds" under WinXP, I thought I'd ask anyway ...

Thanks,

-- Jim
ID: 35827
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 35830 - Posted: 5 Jan 2009, 21:41:33 UTC - in response to Message 35827.  

Thus far, I've only been running BOINC while the machines are booted and running Windows (in part for convenience, and mostly because I recall reading somewhere that BOINC and/or CPDN on Linux requires RHEL v5, or at least an upgrade to the v4.x glibc library, and who-knows-what-else ... likely much more) -- but the reply posts in this thread raise in my mind the question whether the data files in the BOINC project folders (for models in progress) are OS neutral, and whether a Linux platform could continue to process a model begun on a Windows platform? (Probably not ... and sorry if this is a FAQ ...). I realize that this probably wouldn't be simple to configure and accomplish, even if it were possible, but given the ease of being able to boot the machine to either OS and the desire to restore-from-backup-and-complete my models that became "iceworlds" under WinXP, I thought I'd ask anyway ...


It might be possible, with lots and lots of work...but be prepared for failure. An example of the difficulties can be found in the client_state.xml file, which contains references to OS-specific executables and their digital hashes. There would have to be careful copying of just the needed files into the right directories, including the slots directory, etc. My head hurts just thinking about it.
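To see which entries tie a model to one OS, something like the Python sketch below could list each executable together with its platform and checksum. It runs on a small inline sample, not a real file: the element names (`file_info`, `md5_cksum`, `app_version`, `platform`, `file_ref`), the file name and the checksum are assumptions made for illustration, and a real client_state.xml from a given BOINC version may be laid out differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a client_state.xml file.
SAMPLE = """<client_state>
  <file_info>
    <name>hadsm3_example_windows.exe</name>
    <md5_cksum>0123456789abcdef0123456789abcdef</md5_cksum>
  </file_info>
  <app_version>
    <app_name>hadsm3</app_name>
    <platform>windows_intelx86</platform>
    <file_ref>
      <file_name>hadsm3_example_windows.exe</file_name>
    </file_ref>
  </app_version>
</client_state>"""


def platform_bound_files(xml_text):
    """Return (platform, file name, checksum) for every file referenced
    by an app_version element in the given XML text."""
    root = ET.fromstring(xml_text)
    # Map each declared file name to its checksum.
    checksums = {
        info.findtext("name"): info.findtext("md5_cksum")
        for info in root.iter("file_info")
    }
    rows = []
    for app_version in root.iter("app_version"):
        platform = app_version.findtext("platform")
        for ref in app_version.iter("file_ref"):
            name = ref.findtext("file_name")
            rows.append((platform, name, checksums.get(name)))
    return rows


for row in platform_bound_files(SAMPLE):
    print(row)
```

Every row printed this way would have to be swapped for its Linux counterpart, checksum included, which gives some idea of why the hand-editing route is so fragile.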
ID: 35830
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 35839 - Posted: 6 Jan 2009, 1:47:47 UTC

Jim, I was going to say you could transfer the BOINC or BOINC Data folder from its Intel to an AMD with the same OS because the models and server tolerate a change of computer and CPU type. But I see your computers are all Intels.

I think you're going to have to let these iceworlds descend into oblivion. I know it's frustrating.
ID: 35839

©2024 climateprediction.net