Posts by old_user428438

1) Message boards : Number crunching : 'Maximum CPU time exceeded' crash fixed (Message 33498)
Posted 21 Apr 2008 by old_user428438
Post:
Could this be another side-effect of the 8-zip 160-year models?
...


That's an interesting point, we could be seeing a lot more of these over the next few months ... it may be worth a few extra stickies?

Iain has identified the right model. It has always run on this computer; I haven’t noticed any abnormal benchmarks.
...


I think Les has identified the cause - there were a bunch of models produced which thought they were 80-year models when they were really 160-year models. So they had the lower limits on CPU usage and so forth. Fortunately, being coupled models, most of the important data is uploaded in the trickles as the model runs, so reaching the end, while ideal, is not critical.

I have one of these models here, which I suspended several weeks ago. I had edited the client_state file as documented here and it happily crunched to the 80-year mark and returned 8 zip files. Shortly thereafter it reset to zero CPU time. Since there was a wingman who seemed to be progressing satisfactorily, I suspended that WU and allowed a shorter model to crunch through and then started on another 160-year model.

It now appears that my co-cruncher on this model may have hit a problem, since his last trickle was on 20 March (see here).

Once I have completed my current model (in just over a month), I could go back to this one. But I guess that, since the time was reset, that would give 35 days of duplicate trickles before it reaches the point at which it reset, and only then would I find out whether it resets again!

My current target is to have at least as many successful runs as failed runs on my account, so I am looking for any thoughts on anything further that I can check to increase the likelihood that my next attempt with this model will succeed.

F.
2) Message boards : Number crunching : Model crashes (Message 33421)
Posted 18 Apr 2008 by old_user428438
Post:

...
<snip>
4/18/2008 9:08:12 AM|climateprediction.net|Message from server: Project encountered internal error: shared memory


Why shared memory should be a problem I don't know.
</snip>
...

My reading of this is that the "shared memory" is at the server end, not the host. And it is similar to messages I have had from Oracle server when working with large databases.

But I may be totally wrong - that would not be unusual.

F.
3) Message boards : Number crunching : Timestamps on Trickles (Message 33272)
Posted 9 Apr 2008 by old_user428438
Post:
Or change Sent to Received.


Oh I do like a bit of lateral thinking ;)

F.
4) Message boards : Number crunching : Timestamps on Trickles (Message 33262)
Posted 9 Apr 2008 by old_user428438
Post:

<snip>
Ergo, I reckon that what we're seeing is a timestamp in the server's local (Oxford or Milton Keynes) wall-clock time. If someone can match up their message log 'trickle-up' entries with trickle timestamps, we could even work out where the server is (Oxford's are properly synchronised with an NTP time server: Milton Keynes, notoriously, is not, and the php board is currently about seven minutes slow).
</snip>

That is what I surmised was happening, rather than displaying in users' local time. In which case, someone needs to do something more fundamental to get the timestamp to match the column header rather than vice versa.

F.
5) Message boards : Number crunching : Timestamps on Trickles (Message 33257)
Posted 9 Apr 2008 by old_user428438
Post:
I have just checked the Trickle info for the model that I am currently running. The leftmost column in the table is headed "Time Sent (UTC)" and the most recent entry is for 09 Apr 2008 09:24:16. As I look at my clock the time is 09 Apr 2008 09:39:00 BST, which equals 08:39:00 UTC. It appears that I am reporting in the future (or could the server be reporting receipt in local time?).
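
As a rough check of the arithmetic above, here is a minimal Python sketch. It assumes BST is a fixed UTC+1 offset and tests both readings of the trickle timestamp (as UTC, per the column header, and as UK local time). The variable names are illustrative only and have nothing to do with the project's server code.

```python
from datetime import datetime, timedelta, timezone

# Assumption: BST modelled as a fixed UTC+1 offset (valid for April 2008).
BST = timezone(timedelta(hours=1), "BST")

# Wall-clock reading quoted in the post.
local_now = datetime(2008, 4, 9, 9, 39, 0, tzinfo=BST)
utc_now = local_now.astimezone(timezone.utc)

# Reading 1: the trickle timestamp really is UTC, as the column header claims.
trickle_utc = datetime(2008, 4, 9, 9, 24, 16, tzinfo=timezone.utc)
minutes_ahead = (trickle_utc - utc_now).total_seconds() / 60

# Reading 2: the same digits are actually server-local (BST) time.
trickle_local = trickle_utc.replace(tzinfo=BST)
minutes_behind = (local_now - trickle_local).total_seconds() / 60

print(f"Clock in UTC: {utc_now:%H:%M:%S}")
print(f"If the stamp is UTC, it is {minutes_ahead:.1f} min in the future")
print(f"If the stamp is BST, it is {minutes_behind:.1f} min in the past")
```

Only the second reading puts the trickle in the past, which is consistent with the server stamping receipt in local time.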

F.
6) Message boards : Number crunching : Iceworlds & Slowdowns hadsm3/mh - Closed - Discussion (Message 32924)
Posted 11 Mar 2008 by old_user428438
Post:
...
My hardware ran Einstein@home just fine, but with these climate models I have had some equipment freezes (but no ice world freezes, thank goodness! ;), and have had to throttle back my overclocking several times. The freezes are less and less frequent with each throttle back, and hopefully my latest larger throttle back will work on a long-term basis! ;)


For overclocking, I'd recommend a full 24 hours of Prime95 before running the climate model (one copy running for each core you have - so on a quad core you'd have 4 copies, using -A0, -A1, -A2, and -A3). It took a lot of work before I could get my Q6600 working properly overclocked.



@Mike: V25.5 of Prime95 automatically loads all the cores that you have - so no need to run multiple instances.

F.
7) Message boards : Number crunching : Trickles not showing. (Message 32918)
Posted 11 Mar 2008 by old_user428438
Post:
I just had a huge jump in RAC, so they must have processed my trickles and zip files last night. :)

All my trickles caught up yesterday (but not in time for the overnight update of stats - that comes tonight :)

Perfect timing as my latest model is due to complete in about 20 hours, then I can let the one that I paused off the leash to finish its last couple of trickles and the zip. So, by my calculation, that has taken about 10 days to get things sorted out.

F.
8) Message boards : Number crunching : Lost trickles after cpu merge? (Message 32903)
Posted 10 Mar 2008 by old_user428438
Post:
After I merged two computer IDs for the same CPU, I noticed CPDN now thinks I'm not sending up trickles. My log says I am. It looks like the client may be using the old computer ID. Restarting the client doesn't fix it.
Two questions: is there a way to get the client to use the new computer ID? And will the "lack" of trickles cause CPDN to do something like give the same WU to someone else?

==Mike

Hold off for a couple of days and you may well see the trickles registered. The system is still in the process of catching up with trickles after its problems a week or so ago. It has suddenly started populating my backlog of about 50 trickles today (about another 40 left to catch up).

F.
9) Message boards : Number crunching : Trickles not showing. (Message 32875)
Posted 6 Mar 2008 by old_user428438
Post:
I've run my 80yr coupled model to within 3hrs of completion and, with the end of the week coming, I wondered how things were getting on and whether it is deemed safe now to send the final trickle and result upload zips?

Trickles are still not getting through promptly. I would hang off until the middle of next week and then assess the situation again if I were you.

F.
10) Message boards : Number crunching : Trickles not showing. (Message 32854)
Posted 4 Mar 2008 by old_user428438
Post:
Ok thanks, since I'm running various projects, suspending the task is the better option here although I did suspend network activity for a couple of days last week when the server probs were at their worst. Indeed I also backed up the model at 98% to protect the final lunge to the finish post.

I can see there is still quite a delay before trickles are showing in the model results. Keep us posted with Milo's progress - we're all cheering him on.

/pg


I second the "keep us posted" sentiment - pretty please?

I've just suspended a model that is a couple of trickles and the zip away from completion.

F.
11) Message boards : Number crunching : Trickles not showing. (Message 32750)
Posted 27 Feb 2008 by old_user428438
Post:
Thanks for the prompt replies, all.

I was not really whinging (well... I suppose I was, really!!). I guess my real beef is that I would have expected some explanation of what had caused the system to go down to be one of the first items to appear on these boards as an announcement once the service returned, rather than as a cryptic response to a post here in NC.

Don't get me wrong - I'm not knocking the valuable service provided by the mods here but, being used to the rather more anarchic boards on SETI, where any unannounced downtime creates a storm of posts, I was surprised at the lack of obvious information.

I have now bookmarked the alternative pages for announcements so should be more relaxed in the event of similar occurrences in the future.

F.
12) Message boards : Number crunching : Trickles not showing. (Message 32746)
Posted 26 Feb 2008 by old_user428438
Post:

The trickle server was one of two that had filled up, so it may take a while for trickles to appear on accounts.
As long as the upload message was followed by a "succeeded" message, everything will be fine.



Have I missed something? You say "filled up" as though it is common knowledge why my browser could not connect.

Could you point to somewhere that explains why the service was down, please?

F.
13) Message boards : Number crunching : hadcm slight deceleration of secs/TS (Message 32737)
Posted 26 Feb 2008 by old_user428438
Post:
My 80yr coupled model is running fine, under a week away from finishing. I'm just slightly curious at noticing on the full trickles results page that the seconds per timestep (s/TS) is gradually slowing down throughout the model. It started at about 3.4s/TS then settled around 3.79s/TS for much of the long haul and during the last month it has slipped further to 3.96s/TS.

Do models typically slow up slightly or could my pc have built up sludge processes using up crunching resources? I'm not very techie about 'looking under the hood'... lol

for the power of my pc it also seems a bit slower than others who typically report speeds under 3s/TS.

I've opened the gate and let a new model download (a 160yr one this time) and run in the other core; it's kicked out the other projects and started crunching - probably some long term debt on cpdn. This is going a similar speed to the old 80yr model - currently 3.98s/TS.


Typically I find that the s/TS decreases consistently throughout a model run. I am surprised when I see a value that is greater than its predecessor.

F.
14) Message boards : Number crunching : Work done reverted back to Zero (Message 32703)
Posted 22 Feb 2008 by old_user428438
Post:

When I'm overclocking I use 24 hours of Prime95 (one per core). My Q6600 took a lot to get it completely stable, it had to go down a long way from being 'nearly stable' to being 'entirely stable'. The AMDs I've overclocked were much easier, since the grey area between stable and unstable was much narrower.

Fair enough. I guess my machine is now booked for a thorough health check early next week.

F.
15) Message boards : Number crunching : Work done reverted back to Zero (Message 32701)
Posted 22 Feb 2008 by old_user428438
Post:
Fred,

Your Quad had 37 Models. As best I can tell, two short Runs of that lot completed successfully. An 80-year Run, though logged a success, is short some Trickles.

How far overclocked is your machine?

Whether overclocked or not, some stability tests are due. (Hours of Memtest-86+, plus a full day of Prime95 Torture Test (four simultaneous copies).)

Your results are not usual for such a machine.

Thanks for the concern. I am hoping, one day, to have more completed models than failed ones but won't hold my breath waiting for that - perhaps a dream more than an expectation!

I had not noticed that the 80-year run was short on trickles, just celebrated a "Success", but I guess there is nothing I can do about that now.

I can track pretty well all of the failed WUs to specific events in my overclocking adventures on my earlier E6400 or my current Q6600 setup. I do run at least 2 complete cycles of Memtest and 8 hours of Prime95 v25.x (i.e. all 4 cores) after any hardware changes and continuously monitor core temps, but occasionally things still happen - e.g. temporarily attach a laptop SATA HD to copy some data and the machine won't boot, NB VERY hot; eventually have to replace MoBo; or Windoze decides it is corrupted after an AV update triggered reboot; etc. These events, and others similar, have required a re-load of or re-attach to Boinc and have resulted in the error reports, or in WUs that the system thinks I have but my machine has lost.

I was getting quite excited, relatively speaking, at the prospect of completing a 160 year model - machine has been totally stable at 3.336GHz with core temps of 51C since Xmas - and having got past half way I was beginning to feel I was on the downhill. Then it errors and restarts from zero!

Still, another 4 days should see me through the HADAM3 model I am currently crunching and I guess I should schedule a day out to re-run Memtest and Prime95 to be on the safe side (and blow out the dust bunnies while I am at it).

F.
16) Message boards : Number crunching : Work done reverted back to Zero (Message 32697)
Posted 21 Feb 2008 by old_user428438
Post:
You wouldn't read about it (well you will after I finish writing this).
It has happened again, this time during a WU on which the stats had already reset once; now they have reset again.

The percentage done did not change (around 69%) but hours processed and hours to go changed, hours done reset to zero and hours to go reset to a new value of about 100 hours less than before.

It may have something to do with this message :-
(This all appears to have happened as the WU was getting its information ready to send a trickle up message and I recall this is when this problem happened last time as well)


2008-02-20 01:43:05 [climateprediction.net] Task hadcm3inct_cmf7_1920_160_55869263_1 exited with zero status but no 'finished' file
2008-02-20 01:43:05 [climateprediction.net] If this happens repeatedly you may need to reset the project.
2008-02-20 01:43:05 [climateprediction.net] Restarting task hadcm3inct_cmf7_1920_160_55869263_1 using hadcm3i version 544
2008-02-20 01:43:08 [climateprediction.net] Sending scheduler request: To send trickle-up message
2008-02-20 01:43:08 [climateprediction.net] (not requesting new work or reporting completed tasks)
2008-02-20 01:43:13 [climateprediction.net] Scheduler RPC succeeded [server version 509]

It appears that the WU started again from last checkpoint and in the process Boinc Manager resets the time counters but the progress stays the same.

I did not notice the last time it did this if the WU 'restarted' or 'resumed'. If it restarted then that is why the counters reset. If it resumed then it was a Boinc Manager thing?

The WU is still going and should have trickled again since this hiccup but it remains a mystery.

I am unsure if I have changed the Boinc Client version since the last time this happened.
I still think it is a Boinc thing as no other project has had any trouble.


I had exactly the same message a couple of days ago, also on a 160-year model (hadcm3istd_4439_1920_160_15921780_5), but it also set my "Progress" back to 0%. It had uploaded 83 trickles, so was just over half way through, and I don't fancy waiting another 40 days for it to catch up with itself. I am crunching CPDN on only 1 core of my quaddy and the other 3 WUs that have been downloaded are relative "quickies" so, since I had a wingman with a faster box who was already a few years ahead of me, I have suspended that WU (and set "No New Tasks") and will crunch the other 3 tasks, by which time he should just about have completed it. If he returns an error, meantime, then I will return to it on completion of any of my "shorties".

I made no changes whatsoever to my machine that could have caused this error - its occurrence is a total mystery.
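
For anyone wanting to check how often this message turns up in their own message log, here is a minimal Python sketch. It assumes log lines in exactly the layout quoted above (timestamp, project name in square brackets, then the message text); other client versions format their logs differently, so the pattern would need adjusting. The function name and sample list are hypothetical, not part of BOINC.

```python
import re

# Assumption: lines look like the excerpt quoted in this thread, e.g.
# "2008-02-20 01:43:05 [project] Task <name> exited with zero status ..."
UNFINISHED_EXIT = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<project>[^\]]+)\] "
    r"Task (?P<task>\S+) exited with zero status but no 'finished' file"
)


def find_unfinished_exits(lines):
    """Return (timestamp, project, task) for every matching log line."""
    hits = []
    for line in lines:
        match = UNFINISHED_EXIT.match(line.strip())
        if match:
            hits.append(
                (match.group("stamp"), match.group("project"), match.group("task"))
            )
    return hits


# Two lines copied from the log excerpt in the post above.
SAMPLE_LOG = [
    "2008-02-20 01:43:05 [climateprediction.net] Task "
    "hadcm3inct_cmf7_1920_160_55869263_1 exited with zero status "
    "but no 'finished' file",
    "2008-02-20 01:43:13 [climateprediction.net] Scheduler RPC succeeded "
    "[server version 509]",
]

for stamp, project, task in find_unfinished_exits(SAMPLE_LOG):
    print(stamp, project, task)
```

Comparing the timestamps of any hits with the trickle times would show whether the restarts really do coincide with trickle-up preparation, as suggested above.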

F.
17) Message boards : Number crunching : Hadsm3 - memory required (Message 32674)
Posted 19 Feb 2008 by old_user428438
Post:

Nope, that's just a configured entry in a database table. The wording of the message, and when it appears, is the problem...

Gotcha. Thanks for the explanation.

F.
18) Message boards : Number crunching : Hadsm3 - memory required (Message 32665)
Posted 19 Feb 2008 by old_user428438
Post:

The trouble is that this is code which is maintained by Berkeley, it can be modified here but the change would disappear again when the server-side code is updated.

So the Boinc code interrogates the Science App to determine the amount of memory that may be required? That's too clever by half...

F.
19) Message boards : Number crunching : Hadsm3 - memory required (Message 32657)
Posted 19 Feb 2008 by old_user428438
Post:
I can remove this message from the scheduler if it would help.
But this would apply in all situations.


Would it not be possible to modify the message to "Your computer has 1023.53MB of memory. Please ensure your preference settings exclude HADAM3 type jobs which require 1464.84MB"?
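
To illustrate the kind of wording I mean, here is a minimal Python sketch. It is not the BOINC scheduler code (which is maintained by Berkeley); the function name and parameters are hypothetical, and the figures are simply the ones from the message.

```python
def memory_warning(host_mb, required_mb, model_type):
    """Return the suggested warning text, or None if the host has enough memory.

    Hypothetical helper for illustration only; not part of BOINC.
    """
    if host_mb >= required_mb:
        return None
    return (
        f"Your computer has {host_mb:.2f}MB of memory. "
        f"Please ensure your preference settings exclude {model_type} "
        f"type jobs which require {required_mb:.2f}MB"
    )


# Figures taken from the message discussed in this thread.
print(memory_warning(1023.53, 1464.84, "HADAM3"))
```

The point of the design is that the message only appears when the comparison fails, and it tells the user what to change rather than just reporting a shortfall.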

F.
20) Message boards : Number crunching : Modelcrash? (Message 32552)
Posted 9 Feb 2008 by old_user428438
Post:
When you restore the contents of the BOINC folder there aren't many extra steps compared with backing up.

* In BOINC manager suspend all work in progress. Close the BOINC manager window with the X button.

* Exit from BOINC manager by right-clicking on the BOINC icon & selecting Exit.

* Go to your BOINC folder, probably C:\Program files\BOINC.

* Double-click on the BOINC folder to open up its contents.

* Now the apparently scary bit. You have to empty the BOINC folder to make room for what you're going to restore. Edit > Select all > Edit > Delete. Everything disappears. What you've deleted will now be in the Recycle bin and in a worst-case scenario (e.g. the restore of your backup didn't work) you could send the BOINC files in the Recycle bin back to where they came from, and they'd work again. So you never empty the Recycle bin until the restore is up and running.

* Go back one page so you see the BOINC folder in the list again.

* Keep that window open, make it half-size.

* In a new window, go to your backup, wherever you saved it.

* Double-click on the backup to open its contents.

* Edit > Select all > Edit > Copy

* Make this second window half-size.

* Take your mouse cursor over to the first window, right-click on the BOINC folder and in the menu that opens up, select Paste.

* When all the files have finished transferring, close the 2 windows.

* Start > Programs > click on the BOINC shortcut to start BOINC up again. You'll need to open the BOINC manager if it doesn't open up automatically, then resume tasks.
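
The file-shuffling part of the steps above can also be sketched in Python, for anyone who prefers a script to dragging windows around. This is a minimal sketch under several assumptions: BOINC has already been fully exited, the backup is a plain copy of the folder contents, and a separate holding folder (outside the BOINC folder) stands in for the Recycle bin. The function name and all paths are hypothetical.

```python
import shutil
from pathlib import Path


def restore_boinc_folder(boinc_dir, backup_dir, holding_dir):
    """Swap the contents of boinc_dir for a copy of backup_dir.

    The old contents are moved to holding_dir rather than deleted, so they
    can be put back if the restored copy does not work.
    Assumes BOINC is not running and holding_dir is outside boinc_dir.
    """
    boinc_dir = Path(boinc_dir)
    backup_dir = Path(backup_dir)
    holding_dir = Path(holding_dir)

    if not backup_dir.is_dir():
        raise FileNotFoundError(f"backup folder not found: {backup_dir}")

    # Refuse to reuse an existing holding folder, so old files never mix.
    holding_dir.mkdir(parents=True, exist_ok=False)

    # "Empty" the BOINC folder by moving everything aside.
    for item in list(boinc_dir.iterdir()):
        shutil.move(str(item), str(holding_dir / item.name))

    # Copy the backup contents into the now-empty BOINC folder.
    shutil.copytree(backup_dir, boinc_dir, dirs_exist_ok=True)
```

A call might look like `restore_boinc_folder(r"C:\Program files\BOINC", r"D:\boinc_backup", r"D:\boinc_old")`, with both `D:` paths being made-up examples. As with the Recycle bin, the holding folder should be kept until the restored model is seen to be running.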



This assumes you are running only CPDN. It's when you are running multiple projects that it gets "tricky".

F.


