climateprediction.net home page
Intel Visual Fortran run-time error

Questions and Answers : Windows : Intel Visual Fortran run-time error
Pete(r) van der Spoel

Joined: 5 Aug 04
Posts: 6
Credit: 6,745,129
RAC: 0
Message 45980 - Posted: 19 Apr 2013, 14:13:36 UTC - in response to Message 45979.  

Sorry, my bad for not looking properly. The errors were all about the task I'd just downloaded, which was actually stuck at 0%. The others are running fine, so I'll just abort that one task...
ID: 45980
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 13,527,369
RAC: 5,528
Message 46002 - Posted: 21 Apr 2013, 0:37:25 UTC

Pete, if the model's progress is stuck at 0% please abort it.
Cpdn news
ID: 46002
Stuart
Joined: 2 Jan 11
Posts: 4
Credit: 1,782,807
RAC: 0
Message 46119 - Posted: 29 Apr 2013, 21:51:43 UTC

Hello,

Same error here; it's been cropping up over the last few weeks. I had been aborting "bad" simulations, but now it's happening more often.

I don't appear able to copy and paste the error message, and it's a lot to type!

Just aborted another task which was showing "computation error": task hadcm3n_3j00_1980_40_008352515.

It was reported as having had 9h 55m 10s computation time, which on my PC is around 2% completion.
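A rough extrapolation from those figures (illustrative arithmetic only, not from the post):

```python
# Extrapolate the implied full runtime from the figures above:
# 9 h 55 m 10 s of computation at roughly 2% complete.
elapsed_s = 9 * 3600 + 55 * 60 + 10  # 35,710 seconds
progress = 0.02                      # ~2% complete
total_hours = elapsed_s / progress / 3600
print(round(total_hours))  # -> 496, i.e. roughly 500 hours for the whole task
```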

I've not really looked in detail to see when the others failed, in case there is a pattern.

Re-installed BOINC 7.0.64 for Windows 64-bit.

I run Windows 7 Ultimate with 18 GB RAM and an i7 930 2.8 GHz (stable for a year at 3.36 GHz), with an nVidia GTX 570 graphics card.

I only run climateprediction.net and GPUGRID on BOINC.

Hope this helps somebody fix things.

Stuart
ID: 46119
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 13,527,369
RAC: 5,528
Message 46122 - Posted: 29 Apr 2013, 22:27:05 UTC
Last modified: 29 Apr 2013, 22:28:07 UTC

Hi Stuart

Thank you for the report. The model certainly didn't fail because of a shortage of RAM on your computer.

I hate to have to tell you that model hadcm3n_3bzy_1980_40_008349731 on your computer will also have to be aborted. If you see that any model in the same workunit has crashed with Exit status -529697949, please abort it straightaway.
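For anyone matching that number against their own task list: the exit status is just the signed 32-bit rendering of a Windows exception code. A quick way to see which one (a small illustrative snippet, not part of BOINC):

```python
# BOINC reports the exit status as a signed 32-bit integer; Windows
# exception codes are usually quoted as unsigned hex. Convert between them:
def exit_status_to_hex(status: int) -> str:
    return hex(status & 0xFFFFFFFF)

print(exit_status_to_hex(-529697949))  # -> 0xe06d7363
```

0xE06D7363 is the code Windows uses for an unhandled C++/runtime exception, which fits the Visual Fortran run-time error popups described in this thread.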
Cpdn news
ID: 46122
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7087
Credit: 21,589,812
RAC: 11,741
Message 46135 - Posted: 30 Apr 2013, 20:26:22 UTC

It's thought that the source of the FORTRAN errors has been found, so a small test batch was released.
These were grabbed immediately, and are apparently running OK.


Backups: Here
ID: 46135
old_user447942
Joined: 5 May 07
Posts: 1
Credit: 2,153,004
RAC: 0
Message 46245 - Posted: 16 May 2013, 21:02:12 UTC

I have a Fortran error running
hadcm3n_4db9_1980_40_008348264_2
on dual quad-core Xeon processors, Windows 7 x64, fully patched etc. The other 7 projects are running without problems. I aborted this one.
ID: 46245
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 46246 - Posted: 16 May 2013, 22:21:20 UTC - in response to Message 46245.  

I have a Fortran error running
hadcm3n_4db9_1980_40_008348264_2
on dual quad-core Xeon processors, Windows 7 x64, fully patched etc. The other 7 projects are running without problems. I aborted this one.


http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8499125

Yes, it shows all the hallmarks of being a bad workunit. Aborting it is the right thing to do :-)

I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 46246
deadsenator
Joined: 6 Aug 08
Posts: 3
Credit: 68,916,795
RAC: 64,310
Message 46643 - Posted: 19 Jul 2013, 3:44:22 UTC

After a spate of these a few months ago, I am getting this error again.

The workunits then show up as a computation error in Boinc. Unlike what I have read here, some of my errors come from failed workunits that are 600+ hours in. 97% complete and blammo.
ID: 46643
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 993
Credit: 3,673,787
RAC: 12,354
Message 46644 - Posted: 19 Jul 2013, 8:31:59 UTC - in response to Message 46643.  

After a spate of these a few months ago, I am getting this error again.

The workunits then show up as a computation error in Boinc. Unlike what I have read here, some of my errors come from failed workunits that are 600+ hours in. 97% complete and blammo.

The machines you have are very powerful ones indeed, but the HADCM3N model is also large. Attempting to run 20 of them on any machine is likely to result in a significant failure rate. This type of model is particularly sensitive at the decade upload point (i.e. 25%, 50% etc.). The FORTRAN error is usually a sign of competition for resources, which will be a precursor to failure for HADCM3N.

The Xeon E5645, for example, has hyperthreading. The model completion rate might improve by limiting the number of CPUs in BOINC to the number of cores, which won't greatly affect the throughput as hyperthreading only gives a 20% or so advantage.
ID: 46644
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 46647 - Posted: 19 Jul 2013, 13:30:59 UTC - in response to Message 45846.  

... On Windows it may be a different matter. It's possible they may sit there pretending to run but not clocking up any progress ...


This will be interesting ... I downloaded a bunch yesterday after the servers came back, and I am away from home for 9 days. Unfortunate timing.

I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 46647
deadsenator
Joined: 6 Aug 08
Posts: 3
Credit: 68,916,795
RAC: 64,310
Message 46649 - Posted: 19 Jul 2013, 16:59:32 UTC - in response to Message 46644.  


The machines you have are very powerful ones indeed, but the HADCM3N model is also large. Attempting to run 20 of them on any machine is likely to result in a significant failure rate. This type of model is particularly sensitive at the decade upload point (i.e. 25%, 50% etc.). The FORTRAN error is usually a sign of competition for resources, which will be a precursor to failure for HADCM3N.

The Xeon E5645, for example, has hyperthreading. The model completion rate might improve by limiting the number of CPUs in BOINC to the number of cores, which won't greatly affect the throughput as hyperthreading only gives a 20% or so advantage.


Thank you for your input, Iain. I have never before had any significant error rate, and the system normally runs fine, apart from the aforementioned spike in errors back in the spring and the recent set I've mentioned. This last error was on a small WU that died at 0%, but that seems to have been the exception for me. I have not done a thorough analysis, but most of my previously failed WUs were month-long exercises that failed towards the end, perhaps because of the sensitive upload point you've mentioned.

I am somewhat confused by your statements above about the model completion rate improving by limiting the cores (I presume you mean limiting BOINC to the physical cores), given that you also state that HT gives a 20% advantage. I understand how HT works; I am just asking for clarification about climate WU processing efficiency. It is my experience that this type of processing is enhanced by using as many cores as possible: whether they are HT or not, overall wall-clock time for the job is reduced, somewhere along the lines of that 20%. This is significant in my opinion, but I also recognize that successful WU completion is the goal.

If the errors persist I will look to implement your advice, but I am initially reluctant to limit the cores and reduce my intended work unit production. I have made one change, and that is to keep the WU in memory when suspended. I feel foolish for not setting this before as some of those earlier errors seem to hit when re-activating the client.

Thank you again.
ID: 46649
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7087
Credit: 21,589,812
RAC: 11,741
Message 46650 - Posted: 19 Jul 2013, 21:42:07 UTC - in response to Message 46649.  

As a rough rule of thumb, it has in the past been reckoned that the hadcm3n models need 1 GB of RAM each.
So: 20 models, 20 GB, plus some more for the OS.
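That rule of thumb is easy to turn into a quick budget check (the 1 GB per model figure is from this thread; the 2 GB OS allowance is only an assumption for illustration):

```python
# Rough RAM budget for running N hadcm3n models at once, using the
# ~1 GB per model rule of thumb quoted in this thread. The 2 GB OS
# allowance is an assumption, not a measured figure.
def ram_needed_gb(n_models: int, per_model_gb: float = 1.0, os_gb: float = 2.0) -> float:
    return n_models * per_model_gb + os_gb

print(ram_needed_gb(20))  # -> 22.0 (GB), more than either machine in this thread
```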

ID: 46650
Greg van Paassen
Joined: 17 Nov 07
Posts: 142
Credit: 4,271,370
RAC: 0
Message 46651 - Posted: 19 Jul 2013, 22:55:40 UTC - in response to Message 46649.  

I understand how HT works, but I am just asking for clarification about Climate WU processing efficiency. It is my experience that this type of processing is enhanced by using as many cores as possible. Whether they are HT or not, overall wall-clock time is reduced for the job. Somewhere along the lines of that 20%. This is significant in my opinion, but I also recognize that successful WU completion is the goal.

Just to be clear, WUs are single-threaded.

On my machine (core i7 SNB, 4 cores, 8 threads), with 4 models running concurrently, each takes about 1.0 seconds per time step (s/ts). With 8 running, each takes about 1.5 s/ts. Doing the arithmetic, doubling the number of WUs running concurrently increases total throughput by one third. It also increases the clock time required to complete any one WU by half.

So with hyperthreading, machines get more done in a year, but each individual WU takes longer to finish.

HadCM3Ns seem to be sensitive to disk i/o congestion--"impatient". Running fewer models reduces the probability of a "disk traffic jam" causing a model to crash because a disk read or write didn't complete quickly enough. (I think this is what Iain meant about model completion rates.) The degree of impatience seems to vary between different batches of HadCM3Ns.

(For an idea of the numbers: on my machine, at 1.5 s/ts, each model averages about 0.85 MB/s continual disk writing, with spikes up to 7 MB/s during checkpoints (every 72 time steps). During the decadal zip-file uploads, disk activity goes as high as the disk system will support (over 65 MB/s reads and 35 MB/s writes at the same time) for a few seconds.)
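Greg's seconds-per-time-step figures can be checked with a few lines (numbers from the post above; the helper function is only illustrative):

```python
# Machine throughput in time steps per second for Greg's two
# configurations: 4 concurrent models at 1.0 s/ts vs 8 at 1.5 s/ts.
def throughput_ts_per_s(n_models: int, s_per_ts: float) -> float:
    return n_models / s_per_ts

four_up = throughput_ts_per_s(4, 1.0)   # 4.0 ts/s across the machine
eight_up = throughput_ts_per_s(8, 1.5)  # ~5.33 ts/s across the machine
print(eight_up / four_up)  # ~1.33: about one third more total throughput
print(1.5 / 1.0)           # 1.5: each individual WU takes half as long again
```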
ID: 46651
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 993
Credit: 3,673,787
RAC: 12,354
Message 46652 - Posted: 19 Jul 2013, 23:02:56 UTC - in response to Message 46649.  

I am somewhat confused by your statements above about the model completion rate improving by limiting the cores (I presume you mean limiting BOINC to the physical cores), given that you also state that HT gives a 20% advantage. I understand how HT works; I am just asking for clarification about climate WU processing efficiency. It is my experience that this type of processing is enhanced by using as many cores as possible: whether they are HT or not, overall wall-clock time for the job is reduced, somewhere along the lines of that 20%. This is significant in my opinion, but I also recognize that successful WU completion is the goal.

A distinction needs to be made between the effect that HT has on machine throughput and the effect it has on the time taken to complete each task. Assuming all the HT pseudo-cores are running tasks, HT increases throughput (for which credits/RAC are a suitable metric) but decreases the rate at which each task progresses: each task takes longer to complete, almost double the time.

The cause of the HADCM3N decadal sensitivity is thought to be a timing error. In other words, the sequence of actions the various parts of a HADCM3N task need to perform gets messed up, so when the Zip file comes to be created the required files aren't there, and the model crashes. Slowing a task down might, in principle, work either way: it could reduce the probability of the timing/sequencing error or it could increase it, depending on what precisely the error is.

So when I say "completion rate" I don't mean the rate of progress a model makes whether it completes or not; I mean the proportion of viable models that actually finish. If all models completed, throughput would measure both the rate of progress and the rate of completion; in the presence of errors the two rates diverge.

My own experience is that leaving HADCM3N models entirely undisturbed reduces the error rate to zero; I have had no failures at all since leaving them alone. (There will, of course, be "invalid theta" and other physics errors, and download errors on occasion too; there's nothing we volunteers can do about that.) My prejudice is therefore that the HT process represents "disturbance" and is to be discouraged. It is, however, merely a prejudice: almost all the machines to which I have access have been running work simulations solidly for six months, so I have simply not been running CPDN, nor have I tested HT/multi-core/HADCM3N interactions since being told about the timing error (at the Guardian University Awards in February). Unfortunately, the real world does intrude from time to time.

I know we say that incomplete models are useful to the project: it's a true but nonetheless rather lawyerly evasion - it's just got to be better for the project to get complete models.
ID: 46652
deadsenator
Joined: 6 Aug 08
Posts: 3
Credit: 68,916,795
RAC: 64,310
Message 46654 - Posted: 20 Jul 2013, 1:44:38 UTC - in response to Message 46652.  

As a rough rule of thumb, it has in the past been reckoned that the hadcm3n models need 1 GB of RAM each.
So: 20 models, 20 GB, plus some more for the OS.


Les, I did not know that. Apparently 12 GB isn't enough, so you've given me a great reason to add more RAM. Thanks!


Just to be clear, WUs are single-threaded...So with hyperthreading, machines get more done in a year, but each individual WU takes longer to finish.


Thank you, Greg. Yes, I know WUs are single-threaded, and what you've stated aligns with my understanding. I consider a "job" to be not just one WU but the entire model being crunched. So yes, the time per WU increases, but since you are processing more WUs overall, the total job time is reduced.

HadCM3Ns seem to be sensitive to disk i/o congestion--"impatient". Running fewer models reduces the probability of a "disk traffic jam" causing a model to crash because a disk read or write didn't complete quickly enough. (I think this is what Iain meant about model completion rates.) The degree of impatience seems to vary between different batches of HadCM3Ns.


Well, I am using an SSD (a Samsung 840), so that should help, but your point is a good one. The takeaway for me is that resource contention can occur at every level (CPU, RAM and disk) and that the code is very sensitive to it.

... My own experience is that leaving HADCM3N models entirely undisturbed reduces the error rate to zero


Iain, this echoes similar experiences I have had. After shutting down BOINC, ramping back up can be a precarious business, and this is when I have experienced some problems. I have cut back on the number of interruptions, and I have made the memory setting change mentioned above in an attempt to quell any potential disturbances. Unfortunately, as a pesky human, I like to use this machine for other things too on occasion. I didn't build it *only* for BOINC.

Your thoughts regarding HT are noted and certainly could come into play with the instability we've discussed. I'll take the opposite track and continue to use it as I have not experienced any consistent instability that I could tie to such a global environment setting. Additionally, it seems to be only this system that experiences these errors. The other two don't seem to crash WUs, but are using HT. Perhaps if the errors continue, I will test your solution.

In addition to leaving the WU in memory, I will look at increasing my RAM and see if this helps with resource contention.

Thank you all for your help. Your input is highly valued.
ID: 46654
Eirik Redd
Joined: 31 Aug 04
Posts: 365
Credit: 118,401,030
RAC: 109,869
Message 46656 - Posted: 20 Jul 2013, 5:51:55 UTC

I can't prove this yet (I'm still working on the stats), but preliminary indications here are that if you have more than four cores, leaving one of them free for the OS might help total throughput. Just a thought; I'm not sure, but it seems to work here.
ID: 46656
Norman Guinasso
Joined: 28 Jan 05
Posts: 2
Credit: 993,922
RAC: 0
Message 46687 - Posted: 24 Jul 2013, 1:47:51 UTC

Getting a Visual Fortran run-time error on Windows 7.
I have not changed anything.
I cannot close the error window.
ID: 46687
JIM
Joined: 31 Dec 07
Posts: 1093
Credit: 19,706,764
RAC: 3,990
Message 46689 - Posted: 24 Jul 2013, 3:29:58 UTC - in response to Message 46687.  

Getting a Visual Fortran run-time error on Windows 7.
I have not changed anything.
I cannot close the error window.


Have you tried exiting the model or models, closing down BOINC and rebooting? The WU that is throwing errors will most likely crash, but there is nothing that can be done about that. At least it will allow you to close the error message window. The WU is most likely non-viable anyway.

ID: 46689
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 993
Credit: 3,673,787
RAC: 12,354
Message 46693 - Posted: 24 Jul 2013, 12:01:27 UTC - in response to Message 46687.  

Getting a Visual Fortran run-time error on Windows 7.
I have not changed anything.
I cannot close the error window.

The only time a model running on one of my machines produced a sequence of these FORTRAN errors, there was an unrelated process running at 100% in the background (a berserk printer driver). Killing that other process first saved the CPDN model, though that was pre-HADCM3N. HADCM3N models do not seem very robust, so JIM is probably right: the model may now fail whatever you do...
ID: 46693
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 93,520,069
RAC: 25,565
Message 46697 - Posted: 24 Jul 2013, 16:55:31 UTC - in response to Message 46687.  

Getting a Visual Fortran run-time error on Windows 7.
I have not changed anything.
I cannot close the error window.

A friend had two tasks throw Fortran errors at about the same time. She held them for me to see. We tried to salvage the tasks, in case they were the 'soft' type (irritations, but not fatal), but to no avail. Both failed.

Six Fortran error popups are thrown by each failed task of that type. It seemed, at the time, that we couldn't get rid of the things (doubly so for the twelve popups from two simultaneous failures).

'Luck of the draw' whether we inherit reruns of old, flawed tasks.

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 46697

©2019 climateprediction.net