1)
Message boards :
Number crunching :
'Maximum CPU time exceeded' crash fixed
(Message 33498)
Posted 21 Apr 2008 by old_user428438 Post: Could this be another side-effect of the 8-zip 160-year models? I have one of these models here which I suspended several weeks ago. I had edited the client_state file as documented here and it happily crunched to the 80 year mark and returned 8 zip files. Shortly thereafter it reset to zero CPU time. Since there was a wingman who seemed to be progressing satisfactorily, I suspended that WU and allowed a shorter model to crunch through and then started on another 160 year model. It now appears that my co-cruncher on this model may have hit a problem since his last trickle was on 20 March (see here). Once I have completed my current model (in just over a month), I could go back to this one but I guess that, since the time was reset, that would give 35 days of duplicate trickles before it reaches the point at which it reset and then I find out if it resets again!! My current target is to have at least as many successful runs as failed runs on my account so I am looking for any thoughts on anything further that I can check to increase the likelihood that my next attempt with this model will succeed. F. |
2)
Message boards :
Number crunching :
Model crashes
(Message 33421)
Posted 18 Apr 2008 by old_user428438 Post:
My reading of this is that the "shared memory" is at the server end, not the host. And it is similar to messages I have had from Oracle server when working with large databases. But I may be totally wrong - that would not be unusual. F. |
3)
Message boards :
Number crunching :
Timestamps on Trickles
(Message 33272)
Posted 9 Apr 2008 by old_user428438 Post: Or change Sent to Received. Oh I do like a bit of lateral thinking ;) F. |
4)
Message boards :
Number crunching :
Timestamps on Trickles
(Message 33262)
Posted 9 Apr 2008 by old_user428438 Post:
That is what I surmised was happening, rather than displaying in users' local time. In which case, someone needs to do something more fundamental to get the timestamp to match the column header rather than vice versa. F. |
5)
Message boards :
Number crunching :
Timestamps on Trickles
(Message 33257)
Posted 9 Apr 2008 by old_user428438 Post: I have just checked the Trickle info for the model that I am currently running. The leftmost column in the table is headed "Time Sent (UTC)" and the most recent entry is for 09 Apr 2008 09:24:16. As I look at my clock the time is 09 Apr 2008 09:39:00 BST which equals 08:39:00 UTC. It appears that I am reporting in the future (or could the server be reporting receipt in local time??) F. |
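The arithmetic in the post above can be checked with a short sketch. It assumes BST is a fixed UTC+1 offset and compares the two readings of the displayed timestamp: genuinely UTC, as the column header says, or server local time. The variable names are illustrative only and are not taken from the CPDN site.

```python
from datetime import datetime, timedelta, timezone

# Assumption: British Summer Time modelled as a fixed UTC+1 offset.
BST = timezone(timedelta(hours=1), "BST")

# The poster's clock: 09 Apr 2008 09:39:00 BST.
local_now = datetime(2008, 4, 9, 9, 39, 0, tzinfo=BST)
now_utc = local_now.astimezone(timezone.utc)  # 08:39:00 UTC

# Reading 1: the displayed 09:24:16 really is UTC, as the header claims.
trickle_as_utc = datetime(2008, 4, 9, 9, 24, 16, tzinfo=timezone.utc)
lead_if_utc = trickle_as_utc - now_utc  # positive means "in the future"

# Reading 2: the displayed 09:24:16 is actually server local time (BST).
trickle_as_bst = datetime(2008, 4, 9, 9, 24, 16, tzinfo=BST)
age_if_bst = now_utc - trickle_as_bst.astimezone(timezone.utc)

print("If UTC, trickle is ahead of now by:", lead_if_utc)
print("If BST, trickle was sent this long ago:", age_if_bst)
```

Under the first reading the trickle is 45 minutes 16 seconds in the future; under the second it was sent 14 minutes 44 seconds ago, which is the only plausible one of the two.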
6)
Message boards :
Number crunching :
Iceworlds & Slowdowns hadsm3/mh - Closed - Discussion
(Message 32924)
Posted 11 Mar 2008 by old_user428438 Post: ... @Mike: V25.5 of Prime95 automatically loads all the cores that you have - so no need to run multiple instances. F. |
7)
Message boards :
Number crunching :
Trickles not showing.
(Message 32918)
Posted 11 Mar 2008 by old_user428438 Post: I just had a huge jump in RAC, so they must have processed my trickles and zip files last night. :) All my trickles caught up yesterday (but not in time for the overnight update of stats - that comes tonight :) Perfect timing as my latest model is due to complete in about 20 hours, then I can let the one that I paused off the leash to finish its last couple of trickles and the zip. So, by my calculation, that has taken about 10 days to get things sorted out. F. |
8)
Message boards :
Number crunching :
Lost trickles after cpu merge?
(Message 32903)
Posted 10 Mar 2008 by old_user428438 Post: After I merged two computer id's for the same CPU, I noticed CPDN now thinks I'm not sending up trickles. My log says I am. It looks like the client may be using the old computer id. Restarting the client doesn't fix it. Hold off for a couple of days and you may well see the trickles registered. The system is still in the process of catching up with trickles after its problems a week or so ago. It has suddenly started populating my backlog of about 50 trickles today (about another 40 left to catch up). F. |
9)
Message boards :
Number crunching :
Trickles not showing.
(Message 32875)
Posted 6 Mar 2008 by old_user428438 Post: I've run my 80yr coupled model to within 3hrs of completion and, with the end of the week coming, I wondered how things were getting on and whether it is deemed safe now to send the final trickle and result upload zips? Trickles are still not getting through promptly. I would hang off until the middle of next week and then assess the situation again if I were you. F. |
10)
Message boards :
Number crunching :
Trickles not showing.
(Message 32854)
Posted 4 Mar 2008 by old_user428438 Post: Ok thanks, since I'm running various projects, suspending the task is the better option here although I did suspend network activity for a couple of days last week when the server probs were at their worst. Indeed I also backed up the model at 98% to protect the final lunge to the finish post. I second the "keep us posted" sentiment - pretty please? I've just suspended a model that is a couple of trickles and the zip away from completion. F. |
11)
Message boards :
Number crunching :
Trickles not showing.
(Message 32750)
Posted 27 Feb 2008 by old_user428438 Post: Thanks for the prompt replies, all. I was not really whinging (well... I suppose I was, really!!). I guess my real beef is that I would have expected some explanation of what had caused the system to go down to be one of the first items to appear on these boards as an announcement once the service returned, rather than as a cryptic response to a post here in NC. Don't get me wrong - I'm not knocking the valuable service provided by the mods here but, being used to the rather more anarchic boards on SETI where any unannounced downtime creates a storm of posts, I was surprised at the lack of obvious information. I have now bookmarked the alternative pages for announcements so should be more relaxed in the event of similar occurrences in the future. F. |
12)
Message boards :
Number crunching :
Trickles not showing.
(Message 32746)
Posted 26 Feb 2008 by old_user428438 Post:
Have I missed something? You say "filled up" as though it is common knowledge why my browser could not connect. Could you point to somewhere that explains why the service was down, please? F. |
13)
Message boards :
Number crunching :
hadcm slight deceleration of secs/TS
(Message 32737)
Posted 26 Feb 2008 by old_user428438 Post: My 80yr coupled model is running fine, under a week away from finishing. I'm just slightly curious at noticing on the full trickles results page that the seconds per timestep (s/TS) is gradually slowing down throughout the model. It started at abt 3.4s/TS then settled around 3.79s/TS for much of the long haul and during the last month it has slipped further to 3.96s/TS. Typically I find that the s/TS decreases consistently throughout a model run. I am surprised when I see a value that is greater than its predecessor. F. |
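To put the slowdown quoted above in proportion, here is a minimal sketch that computes the relative change between the three s/TS figures mentioned in the post. The helper name `percent_change` is hypothetical; only the three values come from the post.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new in percent (positive means slower)."""
    return (new - old) / old * 100.0

# Seconds per timestep quoted in the post: start, long haul, last month.
s_per_ts = [3.4, 3.79, 3.96]

# Change between each consecutive pair of readings.
step_changes = [percent_change(a, b) for a, b in zip(s_per_ts, s_per_ts[1:])]

# Change from the first reading to the last.
overall_change = percent_change(s_per_ts[0], s_per_ts[-1])

print([round(c, 1) for c in step_changes], round(overall_change, 1))
```

On these figures the run slowed by roughly 11.5% early on and a further 4.5% in the last month, about 16.5% overall.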
14)
Message boards :
Number crunching :
Work done reverted back to Zero
(Message 32703)
Posted 22 Feb 2008 by old_user428438 Post:
Fair enough. I guess my machine is now booked for a thorough health check early next week. F. |
15)
Message boards :
Number crunching :
Work done reverted back to Zero
(Message 32701)
Posted 22 Feb 2008 by old_user428438 Post: Fred, Thanks for the concern. I am hoping, one day, to have more completed models than failed ones but won't hold my breath waiting for that - perhaps a dream more than an expectation! I had not noticed that the 80 year run was short on trickles, just celebrated a "Success", but I guess there is nothing I can do about that now. I can track pretty well all of the failed WU's to specific events in my overclocking adventures on my earlier E6400 or my current Q6600 setup. I do run at least 2 complete cycles of Memtest and 8 hours of Prime95 v25.x (i.e. all 4 cores) after any hardware changes and continuously monitor core temps but occasionally things still happen - e.g. temporarily attach a laptop SATA HD to copy some data and the machine won't boot, NB VERY hot; eventually have to replace MoBo; or Windoze decides it is corrupted after an AV update triggered reboot; etc. These events, and others similar, have required a re-load of or re-attach to Boinc and have resulted in the error reports or WU's that the system thinks I have, but my machine has lost them. I was getting quite excited, relatively speaking, at the prospect of completing a 160 year model - machine has been totally stable at 3.336GHz with core temps of 51C since Xmas - and having got past half way I was beginning to feel I was on the downhill. Then it errors and restarts from zero! Still, another 4 days should see me through the HADAM3 model I am currently crunching and I guess I should schedule a day out to re-run Memtest and Prime95 to be on the safe side (and blow out the dust bunnies while I am at it). F. |
16)
Message boards :
Number crunching :
Work done reverted back to Zero
(Message 32697)
Posted 21 Feb 2008 by old_user428438 Post: You wouldn't read about it (well you will after I finish writing this). I had exactly the same message a couple of days ago, also on a 160 year model (hadcm3istd_4439_1920_160_15921780_5) but it also set my "Progress" back to 0%. It had uploaded 83 trickles, so was just over half way through and I don't fancy waiting another 40 days for it to catch up with itself. I am crunching CPDN on only 1 core of my quaddy and the other 3 WU's that have been downloaded are relative "quickies" so, since I had a wingman with a faster box who was already a few years ahead of me, I have suspended that WU (and set "No New Tasks") and will crunch the other 3 tasks by which time he should just about have completed it. If he returns an error, meantime, then I will return to it on completion of any of my "shorties". I made no changes whatsoever to my machine that could have caused this error - its occurrence is a total mystery. F. |
17)
Message boards :
Number crunching :
Hadsm3 - memory required
(Message 32674)
Posted 19 Feb 2008 by old_user428438 Post:
Gotcha. Thanks for the explanation. F. |
18)
Message boards :
Number crunching :
Hadsm3 - memory required
(Message 32665)
Posted 19 Feb 2008 by old_user428438 Post:
So the Boinc code interrogates the Science App to determine the amount of memory that may be required? That's too clever by half... F. |
19)
Message boards :
Number crunching :
Hadsm3 - memory required
(Message 32657)
Posted 19 Feb 2008 by old_user428438 Post: I can remove this message from the scheduler if it would help. Would it not be possible to modify the message to "Your computer has 1023.53MB of memory. Please ensure your preference settings exclude HADAM3 type jobs which require 1464.84MB" F. |
20)
Message boards :
Number crunching :
Modelcrash?
(Message 32552)
Posted 9 Feb 2008 by old_user428438 Post: When you restore the contents of the BOINC folder there aren't many extra steps compared with backing up. This assumes you are running only CPDN. It's when you are running multiple projects that it gets "tricky". F. |
©2024 climateprediction.net