Posts by deadsenator


1) Questions and Answers : Windows : Intel Visual Fortan run-time error (Message 46654)
Posted 20 Jul 2013 by deadsenator
Post:
As a rough rule of thumb, the HadCM3N models have in the past been considered to need about 1 GB of RAM each.
So, 20 models, 20 GB, plus some more for the OS.


Les, I did not know that. Apparently 12GB isn't enough, so you've given me a great reason to add more RAM. Thanks!
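
For anyone who wants to run the numbers, here is a rough sketch of that rule of thumb in Python. The 1 GB per model figure is Les's; the 2 GB set aside for the OS is purely my own assumption, and the function names are made up for illustration.

```python
def required_ram_gb(n_models, per_model_gb=1.0, os_reserve_gb=2.0):
    """RAM needed to run n_models at once, plus a reserve for the OS.

    per_model_gb follows the 1 GB-per-model rule of thumb;
    os_reserve_gb is an assumed value, not a measured one.
    """
    return n_models * per_model_gb + os_reserve_gb


def max_models(total_ram_gb, per_model_gb=1.0, os_reserve_gb=2.0):
    """Largest number of models that fits in total_ram_gb under the same rule."""
    usable = max(total_ram_gb - os_reserve_gb, 0.0)
    return int(usable // per_model_gb)


print(required_ram_gb(20))  # RAM suggested for 20 concurrent models
print(max_models(12))       # models that fit in 12 GB under these assumptions
```

Under these assumptions, 20 models want about 22 GB, and 12 GB leaves room for roughly 10.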


Just to be clear, WUs are single-threaded... So with hyperthreading, machines get more done in a year, but each individual WU takes longer to finish.


Thank you, Greg. Yes, I know that WUs are single-threaded, and what you've stated aligns with how I understand it. I consider a "job" to be not just one WU, but the entire batch of work being crunched. So yes, the time per WU increases, but since more WUs are processed at once, the total job time is reduced.
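
To make the trade-off concrete, here is a back-of-the-envelope sketch in Python. It assumes hyperthreading adds roughly 20% to total throughput (the figure Iain quoted earlier in this thread), that two WUs share each physical core when HT is on, and that the machine stays fully loaded. It is a simplified steady-state model, not a measurement, and the function names are mine.

```python
def wu_wall_time(base_hours, use_ht, ht_gain=0.20):
    """Wall-clock hours for one WU.

    With HT, two WUs share a core whose combined throughput is
    (1 + ht_gain) times that of a single WU running alone.
    """
    if not use_ht:
        return base_hours
    return base_hours * 2 / (1 + ht_gain)


def batch_wall_time(n_wus, cores, base_hours, use_ht, ht_gain=0.20):
    """Steady-state wall-clock hours to finish n_wus on `cores` physical cores."""
    throughput = cores / base_hours  # WUs completed per hour without HT
    if use_ht:
        throughput *= 1 + ht_gain
    return n_wus / throughput


# Hypothetical example: 600-hour WUs, 20 of them, on 10 physical cores.
for ht in (False, True):
    per_wu = round(wu_wall_time(600, ht), 1)
    batch = round(batch_wall_time(20, 10, 600, ht), 1)
    print(f"HT={ht}: per-WU hours={per_wu}, batch hours={batch}")
```

In this toy example each WU slows from 600 to about 1000 hours with HT on, while the whole batch drops from about 1200 to about 1000 hours. Both of Greg's points hold at once, provided the extra models don't crash.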

HadCM3Ns seem to be sensitive to disk I/O congestion--"impatient". Running fewer models reduces the probability of a "disk traffic jam" causing a model to crash because a disk read or write didn't complete quickly enough. (I think this is what Iain meant about model completion rates.) The degree of impatience seems to vary between different batches of HadCM3Ns.


Well, I am using an SSD (Samsung 840), so that should help, but your point is a good one. The takeaway for me is that resource contention can occur at each level (CPU, RAM and disk) and that the code is very sensitive to it.

... My own experience is that leaving HADCM3N models entirely undisturbed reduces the error rate to zero


Iain, this echoes similar experiences I have had. After shutting down BOINC, ramping back up can be a delicate process, and that is when I have run into some problems. I have cut back on the number of interruptions, and I have made the memory setting change I stated above in an attempt to quell any potential disturbances. Unfortunately, as a pesky human, I like to use this machine for other things too on occasion. I didn't build it *only* for BOINC.

Your thoughts regarding HT are noted and certainly could come into play with the instability we've discussed. I'll take the opposite track and continue to use it as I have not experienced any consistent instability that I could tie to such a global environment setting. Additionally, it seems to be only this system that experiences these errors. The other two don't seem to crash WUs, but are using HT. Perhaps if the errors continue, I will test your solution.

In addition to leaving the WU in memory, I will look into increasing my RAM and see if this helps with resource contention.

Thank you all for your help. Your input is highly valued.
2) Questions and Answers : Windows : Intel Visual Fortan run-time error (Message 46649)
Posted 19 Jul 2013 by deadsenator
Post:

The machines you have are very powerful ones indeed, but the HADCM3N model is also large. Attempting to run 20 of them on any machine is likely to result in a significant failure rate. This type of model is particularly sensitive at the decade upload point (i.e. 25%, 50% etc.). The FORTRAN error is usually a sign of competition for resources, which will be a precursor to failure for HADCM3N.

The Xeon E5645, for example, has hyperthreading. The model completion rate might improve by limiting the number of CPUs in BOINC to the number of cores, which won't greatly affect the throughput as hyperthreading only gives a 20% or so advantage.


Thank you for your input, Iain. I have never before had any significant error rates and the system normally runs fine, except for the aforementioned spike in errors back in spring and the recent set that I've mentioned. This last error was on a small WU that died at 0%, but that seems to have been the exception for me. I did not do a thorough analysis, but most of my earlier failed WUs were month-long runs that failed towards the end, perhaps because of the sensitive upload point you've mentioned.

I am somewhat confused by your statements above about the model completion rate improving by limiting the cores (I presume you meant to real cores only), given that you then state that HT gives a 20% advantage. I understand how HT works; I am just asking for clarification about climate WU processing efficiency. It is my experience that this type of processing is enhanced by using as many cores as possible. Whether they are HT or not, overall wall-clock time for the job is reduced, by something along the lines of that 20%. This is significant in my opinion, but I also recognize that successful WU completion is the goal.

If the errors persist I will look to implement your advice, but I am initially reluctant to limit the cores and reduce my intended work unit production. I have made one change, and that is to keep the WU in memory when suspended. I feel foolish for not setting this before as some of those earlier errors seem to hit when re-activating the client.
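
For reference, both settings discussed here (limiting the CPUs BOINC uses, and keeping WUs in memory while suspended) can also be set in a `global_prefs_override.xml` file in the BOINC data directory, as far as I understand it. Below is a minimal Python sketch that builds such a file. The tag names `max_ncpus_pct` and `leave_apps_in_memory` are the ones I believe BOINC uses for these preferences; please check them against the BOINC documentation before relying on this. The function name and the 50% figure are only illustrative.

```python
import xml.etree.ElementTree as ET


def build_override(cpu_pct=100.0, leave_in_memory=True):
    """Return the text of a BOINC-style preferences override file.

    cpu_pct: percentage of logical CPUs BOINC may use
             (50.0 on an HT machine is roughly "physical cores only").
    leave_in_memory: keep suspended tasks in RAM instead of unloading them.
    """
    root = ET.Element("global_preferences")
    ET.SubElement(root, "max_ncpus_pct").text = f"{cpu_pct:.1f}"
    ET.SubElement(root, "leave_apps_in_memory").text = "1" if leave_in_memory else "0"
    return ET.tostring(root, encoding="unicode")


# Illustrative values: half of the logical CPUs, tasks kept in memory.
print(build_override(cpu_pct=50.0, leave_in_memory=True))
```

The same two options are available in the BOINC Manager computing preferences, which is where I made my change.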

Thank you again.
3) Questions and Answers : Windows : Intel Visual Fortan run-time error (Message 46643)
Posted 19 Jul 2013 by deadsenator
Post:
After a spate of these a few months ago, I am getting this error again.

The workunits then show up as a computation error in BOINC. Unlike what I have read here, some of my errors come from workunits that fail 600+ hours in. 97% complete and blammo.




©2024 climateprediction.net