Intel Visual Fortran run-time error

Questions and Answers : Windows : Intel Visual Fortran run-time error

Pete(r) van der Spoel

Joined: 5 Aug 04
Posts: 6
Credit: 7,002,751
RAC: 0
Message 45980 - Posted: 19 Apr 2013, 14:13:36 UTC - in response to Message 45979.  

Sorry, my bad for not looking properly. The errors were all about the task I'd just downloaded, which was actually stuck at 0%. The others are running fine, so I'll just abort that one task...
ID: 45980
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 46002 - Posted: 21 Apr 2013, 0:37:25 UTC

Pete, if the model's progress is stuck at 0% please abort it.
Cpdn news
ID: 46002
Stuart

Joined: 2 Jan 11
Posts: 4
Credit: 1,782,807
RAC: 0
Message 46119 - Posted: 29 Apr 2013, 21:51:43 UTC

Hello,

Same error here, been cropping up over the last few weeks - I had been aborting "bad" simulations but now it's happening more often.

I don't appear to be able to copy and paste the error message, and it's a lot to type!

Just aborted another task which was showing "computation error": task hadcm3n_3j00_1980_40_008352515.

It was reported as having had 9h 55m 10s of computation time, which on my PC is around 2% completion.

I've not really looked in detail to see when the others have failed, in case there is a pattern.

Re-installed BOINC 7.0.64 for Windows 64-bit.

I run Windows 7 Ultimate, 18 GB RAM, an i7 930 at 2.8 GHz that has been stable for a year at 3.36 GHz, with an nVidia GTX 570 graphics card.

I only run climateprediction.net and GPUGrid on BOINC.

Hope this helps somebody fix things?

Stuart
ID: 46119
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 46122 - Posted: 29 Apr 2013, 22:27:05 UTC
Last modified: 29 Apr 2013, 22:28:07 UTC

Hi Stuart

Thank you for the report. The model certainly didn't fail because of a shortage of RAM on your computer, did it?

I hate to have to tell you that model hadcm3n_3bzy_1980_40_008349731 on your computer will also have to be aborted. If you see that any model in the same workunit has crashed with Exit status -529697949, please abort it straightaway.
Cpdn news
ID: 46122
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 46135 - Posted: 30 Apr 2013, 20:26:22 UTC

It's thought that the source of the FORTRAN errors has been found, so a small test batch was released.
These were grabbed immediately, and are apparently running OK.


Backups: Here
ID: 46135
old_user447942

Joined: 5 May 07
Posts: 1
Credit: 2,153,004
RAC: 0
Message 46245 - Posted: 16 May 2013, 21:02:12 UTC

I have a Fortran error running
hadcm3n_4db9_1980_40_008348264_2
on dual quad-core Xeon processors, Windows 7 x64, all patched etc. The other 7 projects are running without problems. I aborted this one.
ID: 46245
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 46246 - Posted: 16 May 2013, 22:21:20 UTC - in response to Message 46245.  

I have a Fortran error running
hadcm3n_4db9_1980_40_008348264_2
on dual quad-core Xeon processors, Windows 7 x64, all patched etc. The other 7 projects are running without problems. I aborted this one.


http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8499125

Yes, it shows all the hallmarks of being a bad workunit. Aborting it is the right thing to do :-)




I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 46246
deadsenator

Joined: 6 Aug 08
Posts: 3
Credit: 76,936,965
RAC: 4,846
Message 46643 - Posted: 19 Jul 2013, 3:44:22 UTC

After a spate of these a few months ago, I am getting this error again.

The workunits then show up as a computation error in Boinc. Unlike what I have read here, some of my errors come from failed workunits that are 600+ hours in. 97% complete and blammo.
ID: 46643
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1079
Credit: 6,903,221
RAC: 6,722
Message 46644 - Posted: 19 Jul 2013, 8:31:59 UTC - in response to Message 46643.  

After a spate of these a few months ago, I am getting this error again.

The workunits then show up as a computation error in Boinc. Unlike what I have read here, some of my errors come from failed workunits that are 600+ hours in. 97% complete and blammo.

The machines you have are very powerful ones indeed, but the HADCM3N model is also large. Attempting to run 20 of them on any machine is likely to result in a significant failure rate. This type of model is particularly sensitive at the decade upload point (i.e. 25%, 50% etc.). The FORTRAN error is usually a sign of competition for resources, which will be a precursor to failure for HADCM3N.

The Xeon E5645, for example, has hyperthreading. The model completion rate might improve by limiting the number of CPUs in BOINC to the number of cores, which won't greatly affect the throughput as hyperthreading only gives a 20% or so advantage.
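A rough sketch of that trade-off in Python (illustrative only: the 6-core/12-thread count is the E5645's, and the ~20% gain is the figure quoted above, not a measurement):

```python
# Back-of-the-envelope comparison: run one task per physical core, or one per
# hyperthread. Assumes a 6-core/12-thread Xeon E5645 and a ~20% whole-machine
# throughput gain from hyperthreading; both numbers are assumptions.

PHYSICAL_CORES = 6
HT_THREADS = 12
HT_GAIN = 1.20  # assumed total-throughput advantage when all threads are busy

def per_task_rate(tasks, total_throughput):
    """Progress rate of a single task, in 'core-equivalents' per unit time."""
    return total_throughput / tasks

cores_only = per_task_rate(PHYSICAL_CORES, 1.0 * PHYSICAL_CORES)    # -> 1.00
all_threads = per_task_rate(HT_THREADS, HT_GAIN * PHYSICAL_CORES)   # -> 0.60

print(f"6 tasks on 6 cores:     {cores_only:.2f} core-equivalents per task")
print(f"12 tasks on 12 threads: {all_threads:.2f} core-equivalents per task "
      f"({cores_only / all_threads:.2f}x longer per task)")
```

So, on those assumed numbers, the machine gets about 20% more done overall, but each individual task progresses much more slowly.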
ID: 46644
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 46647 - Posted: 19 Jul 2013, 13:30:59 UTC - in response to Message 45846.  

... On Windows it may be a different matter. It's possible they may sit there pretending to run but not clocking up any progress ...


This will be interesting ... I downloaded a bunch yesterday after the servers came back, and I am away from home for 9 days. Unfortunate timing.

I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 46647
deadsenator

Joined: 6 Aug 08
Posts: 3
Credit: 76,936,965
RAC: 4,846
Message 46649 - Posted: 19 Jul 2013, 16:59:32 UTC - in response to Message 46644.  


The machines you have are very powerful ones indeed, but the HADCM3N model is also large. Attempting to run 20 of them on any machine is likely to result in a significant failure rate. This type of model is particularly sensitive at the decade upload point (i.e. 25%, 50% etc.). The FORTRAN error is usually a sign of competition for resources, which will be a precursor to failure for HADCM3N.

The Xeon E5645, for example, has hyperthreading. The model completion rate might improve by limiting the number of CPUs in BOINC to the number of cores, which won't greatly affect the throughput as hyperthreading only gives a 20% or so advantage.


Thank you for your input, Iain. I have never before had any significant error rates and the system normally runs fine, except for the aforementioned spikes in errors back in spring and the recent set that I've mentioned. This last error was on a small WU and died at 0%, but that seems to have been the exception for me. I did not do a thorough analysis, but most of my previously failed WUs had been month-long exercises that failed towards the end, perhaps because of the sensitive upload point you've mentioned.

I am somewhat confused by your statements above about model completion rate improving by limiting the cores (I presume you mean limiting to real cores only), but then you state that HT gives a 20% advantage. I understand how HT works, but I am just asking for clarification about climate WU processing efficiency. It is my experience that this type of processing is enhanced by using as many cores as possible. Whether they are HT or not, overall wall-clock time is reduced for the job, somewhere along the lines of that 20%. This is significant in my opinion, but I also recognize that successful WU completion is the goal.

If the errors persist I will look to implement your advice, but I am initially reluctant to limit the cores and reduce my intended work unit production. I have made one change, and that is to keep the WU in memory when suspended. I feel foolish for not setting this before as some of those earlier errors seem to hit when re-activating the client.

Thank you again.
ID: 46649
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 46650 - Posted: 19 Jul 2013, 21:42:07 UTC - in response to Message 46649.  

As a rough rule of thumb, it has in the past been considered that the hadcm3n models need 1 gig of RAM each.
So, 20 models, 20 gigs, plus some more for the OS.

ID: 46650
Greg van Paassen
Joined: 17 Nov 07
Posts: 142
Credit: 4,271,370
RAC: 0
Message 46651 - Posted: 19 Jul 2013, 22:55:40 UTC - in response to Message 46649.  

I understand how HT works, but I am just asking for clarification about Climate WU processing efficiency. It is my experience that this type of processing is enhanced by using as many cores as possible. Whether they are HT or not, overall wall-clock time is reduced for the job. Somewhere along the lines of that 20%. This is significant in my opinion, but I also recognize that successful WU completion is the goal.

Just to be clear, WUs are single-threaded.

On my machine (core i7 SNB, 4 cores, 8 threads), with 4 models running concurrently, each takes about 1.0 seconds per time step (s/ts). With 8 running, each takes about 1.5 s/ts. Doing the arithmetic, doubling the number of WUs running concurrently increases total throughput by one third. It also increases the clock time required to complete any one WU by half.

So with hyperthreading, machines get more done in a year, but each individual WU takes longer to finish.
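The same arithmetic in Python, using the s/ts figures quoted above (a sketch only):

```python
# Throughput arithmetic from the measurements above: 4 models at ~1.0 s/ts
# versus 8 models at ~1.5 s/ts on a 4-core/8-thread i7.

configs = {
    "4 models (one per core)":   1.0,   # seconds per timestep, per model
    "8 models (one per thread)": 1.5,
}

for name, s_per_ts in configs.items():
    models = int(name.split()[0])      # 4 or 8, taken from the label
    throughput = models / s_per_ts     # timesteps per second, whole machine
    print(f"{name}: {throughput:.2f} timesteps/s total, {s_per_ts} s/ts per model")

# 8 / 1.5 = 5.33 vs 4 / 1.0 = 4.00 -> about one third more total throughput,
# while each model takes 1.5x as long to finish.
```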

HadCM3Ns seem to be sensitive to disk i/o congestion--"impatient". Running fewer models reduces the probability of a "disk traffic jam" causing a model to crash because a disk read or write didn't complete quickly enough. (I think this is what Iain meant about model completion rates.) The degree of impatience seems to vary between different batches of HadCM3Ns.

(For an idea of the numbers: on my machine, at 1.5 s/ts, each model averages about 0.85 MB/s continual disk writing, with spikes up to 7 MB/s during checkpoints (every 72 time steps). During the decadal zip-file uploads, disk activity goes as high as the disk system will support (over 65 MB/s reads and 35 MB/s writes at the same time) for a few seconds.)
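Putting those disk figures together (simple arithmetic on the numbers quoted in this post; whether overlapping checkpoints actually strain a given disk will depend on the hardware):

```python
# Aggregate disk load implied by the figures above: 8 models, ~0.85 MB/s of
# continual writes each, checkpoint spikes of ~7 MB/s every 72 timesteps at
# ~1.5 s per timestep. All inputs are the measurements quoted in this post.

MODELS = 8
SUSTAINED_MB_S = 0.85        # continual writing, per model
CHECKPOINT_SPIKE_MB_S = 7.0  # per model, during a checkpoint
TIMESTEPS_PER_CHECKPOINT = 72
SECONDS_PER_TIMESTEP = 1.5

total_sustained = MODELS * SUSTAINED_MB_S                              # ~6.8 MB/s
checkpoint_interval = TIMESTEPS_PER_CHECKPOINT * SECONDS_PER_TIMESTEP  # ~108 s

print(f"Sustained writes, all {MODELS} models: ~{total_sustained:.1f} MB/s")
print(f"Each model checkpoints every ~{checkpoint_interval:.0f} s, spiking to "
      f"~{CHECKPOINT_SPIKE_MB_S:.0f} MB/s; several checkpointing at once can "
      f"briefly stress a slower disk.")
```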
ID: 46651
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1079
Credit: 6,903,221
RAC: 6,722
Message 46652 - Posted: 19 Jul 2013, 23:02:56 UTC - in response to Message 46649.  

I am somewhat confused by your statements above about model completion rate improving by limiting the cores (I presume you mean limiting to real cores only), but then you state that HT gives a 20% advantage. I understand how HT works, but I am just asking for clarification about climate WU processing efficiency. It is my experience that this type of processing is enhanced by using as many cores as possible. Whether they are HT or not, overall wall-clock time is reduced for the job, somewhere along the lines of that 20%. This is significant in my opinion, but I also recognize that successful WU completion is the goal.

A distinction needs to be made between the effect that HT has on machine throughput and the effect that it has on the time taken to complete each task. Assuming all the HT pseudo-cores are running tasks, HT increases throughput (for which credits/RAC are a suitable metric) but decreases the rate at which each task progresses - i.e. each task takes longer to complete, almost double the time.

The cause of the HADCM3N decadal sensitivity is thought to be a timing error: in other words, the sequence of actions the various parts of a HADCM3N task need to perform gets messed up, so that when the zip file comes to be created the required files aren't there and the model crashes. Slowing a task down might, in principle, work either way: it could reduce the probability of the timing/sequencing error or it could increase it, depending on what precisely the error is.

So when I say "completion rate" I don't mean the rate of progress a model makes whether it completes or not; I mean the proportion of viable models that actually finish. If all models completed, then throughput would measure both the rate of progress and the rate of completion; in the presence of errors the two rates diverge.
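One hypothetical way to put numbers on that distinction; the completion probabilities below are invented purely for illustration and are not project statistics:

```python
# Sketch of "rate of progress" versus "rate of completion". The per-task rates
# reuse the earlier 6-core/12-thread example; the finish probabilities are
# made-up placeholders, only there to show how the two metrics can diverge.

def progress_rate(tasks, per_task_rate):
    # Work done per unit time, counting progress on models that later crash.
    return tasks * per_task_rate

def completion_rate(tasks, per_task_rate, p_finish):
    # Finished models per unit time: only tasks that survive to the end count.
    return tasks * per_task_rate * p_finish

cores_only  = (6, 1.0, 0.95)   # 6 tasks at full speed, assumed 95% finish
all_threads = (12, 0.6, 0.75)  # 12 slower tasks, assumed 75% finish

for label, cfg in (("cores only ", cores_only), ("all threads", all_threads)):
    print(f"{label}: progress {progress_rate(*cfg[:2]):.1f}, "
          f"completion {completion_rate(*cfg):.2f}")

# cores only : progress 6.0, completion 5.70
# all threads: progress 7.2, completion 5.40  (more raw progress, fewer finished models)
```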

My own experience is that leaving HADCM3N models entirely undisturbed reduces the error rate to zero - i.e. I have had no failures at all since leaving them alone. (There will, of course, be "invalid theta" and other physics errors, and download errors on occasion too; there's nothing we volunteers can do about that.) My prejudice is therefore that the HT process represents "disturbance" and is to be discouraged. It is, however, merely a prejudice: almost all the machines to which I have access have been running work simulations solidly for six months, so I have simply not been running CPDN, nor have I tested HT/multi-core/HADCM3N interactions since being told about the timing error (at the Guardian University Awards in February). Unfortunately, the real world does intrude from time to time.

I know we say that incomplete models are useful to the project: it's a true but nonetheless rather lawyerly evasion - it's just got to be better for the project to get complete models.
ID: 46652
deadsenator

Joined: 6 Aug 08
Posts: 3
Credit: 76,936,965
RAC: 4,846
Message 46654 - Posted: 20 Jul 2013, 1:44:38 UTC - in response to Message 46652.  

As a rough rule of thumb, it has in the past been considered that the hadcm3n models need 1 gig of RAM each.
So, 20 models, 20 gigs, plus some more for the OS.


Les, I did not know that. Apparently 12GB isn't enough, so you've given me a great reason to add more RAM. Thanks!


Just to be clear, WUs are single-threaded...So with hyperthreading, machines get more done in a year, but each individual WU takes longer to finish.


Thank you, Greg. Yes, I know about WUs being single threaded and what you've stated aligns with how I understand it. I consider a "job" not to be just one WU, but the entire model being crunched. So, yes the time per WU increases, but since you are processing more WUs overall, the total job time will be reduced.

HadCM3Ns seem to be sensitive to disk i/o congestion--"impatient". Running fewer models reduces the probability of a "disk traffic jam" causing a model to crash because a disk read or write didn't complete quickly enough. (I think this is what Iain meant about model completion rates.) The degree of impatience seems to vary between different batches of HadCM3Ns.


Well, I am using an SSD drive (Samsung 840), so that should help, but your point is a good one. The takeaway for me is that resource contention can occur at each level (CPU, RAM and disk) and that the code is very sensitive to this.

... My own experience is that leaving HADCM3N models entirely undisturbed reduces the error rate to zero


Iain, this is echoing similar experiences I have had. After shutting down Boinc, ramping back up can be a tenuous experience and this is when I have experienced some problems. I have cut back on the number of interruptions and I have made the memory setting change I stated above in the attempt to quell any potential disturbances. Unfortunately, as a pesky human, I like to use this machine for other things too on occasion. I didn't build it *only* for Boinc.

Your thoughts regarding HT are noted and certainly could come into play with the instability we've discussed. I'll take the opposite track and continue to use it as I have not experienced any consistent instability that I could tie to such a global environment setting. Additionally, it seems to be only this system that experiences these errors. The other two don't seem to crash WUs, but are using HT. Perhaps if the errors continue, I will test your solution.

In addition to leaving the WU in memory, what I will do is look at increasing my RAM and see if this helps with resource contention.

Thank you all for your help. Your input is highly valued.
ID: 46654
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,888,554
RAC: 1,481,373
Message 46656 - Posted: 20 Jul 2013, 5:51:55 UTC

I can't prove this (still working on the stats), but preliminary indications here are that if you have more than four cores, leaving one of them free for the OS might help total throughput. Just a thought; I'm not sure, but it seems to work here.
ID: 46656
Norman Guinasso
Joined: 28 Jan 05
Posts: 2
Credit: 993,922
RAC: 0
Message 46687 - Posted: 24 Jul 2013, 1:47:51 UTC

Getting a Visual Fortran run-time error on Windows 7.
I have not changed anything.
I cannot delete the error window.
ID: 46687
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,053,321
RAC: 4,417
Message 46689 - Posted: 24 Jul 2013, 3:29:58 UTC - in response to Message 46687.  

Getting a Visual Fortran run-time error on Windows 7.
I have not changed anything.
I cannot delete the error window.


Have you tried exiting the model or models, closing down BOINC and rebooting? The WU that is throwing errors will most likely crash, but there is nothing that can be done about that. At least it will allow you to delete the error message window. The WU is most likely non-viable anyway.

ID: 46689
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1079
Credit: 6,903,221
RAC: 6,722
Message 46693 - Posted: 24 Jul 2013, 12:01:27 UTC - in response to Message 46687.  

Getting a Visual Fortran run-time error on Windows 7.
I have not changed anything.
I cannot delete the error window.

The only time a model running on a machine of mine produced a sequence of these FORTRAN errors, there was an unrelated process running at 100% in the background (a berserk printer driver). Killing that other process first saved the CPDN model, though that was pre-HADCM3N. HADCM3N models do not seem very robust, so JIM is probably right: the model may now fail whatever you do ...
ID: 46693
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 46697 - Posted: 24 Jul 2013, 16:55:31 UTC - in response to Message 46687.  

Getting a Visual Fortran run-time error on Windows 7.
I have not changed anything.
I cannot delete the error window.

A friend had two tasks throw Fortran errors at about the same time. She held them for me to see. We tried to salvage the tasks, in case they were the 'soft' type (irritations, but not fatal), to no avail. Both failed.

Six Fortran error popups are thrown by each failed task of that type. It seemed, at the time, that we couldn't get rid of the things (doubly so for twelve popups with two simultaneous failures).

'Luck of the draw' whether we inherit reruns of old, flawed tasks.

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 46697