Posts by MikeMarsUK

81) Message boards : Number crunching : More Work (Message 46921)
Posted 31 Aug 2013 by MikeMarsUK
Post:

Look on the bright side. One day, hundreds of years hence, our distant descendants will get a surprise present from the past - a whole bunch of credits :-)


Just kidding. Sooner or later all the credits will suddenly appear; the admins are still looking at it when they can (between firefighting).
82) Message boards : Number crunching : Both tasks crashed with no heartbeat problem. (Message 46889)
Posted 27 Aug 2013 by MikeMarsUK
Post:
For the past two days I have also had weird and not so wonderful things happening: computer errors, tasks ended without any credit claimed or otherwise, one new task downloaded. We will probably be told to grin and bear it...


* computer errors
We'd have to look through them - some computer errors can be resolved at your end, some are actually model errors. What I tend to do is have a look at the previous runs of that workunit. If they all crashed in the same way, then it is not a problem at your end. If, however, a different user got further than you did on the same workunit, then it indicates that something happened at your end.

* tasks ended without any credit claimed
This will be fixed sooner or later... the credit processing hasn't been run yet. Once it has, then everyone will suddenly catch up with all their credit (& end up with crazily high RAC)

* one new task downloaded
New work was generated last week, but it was all picked up over the weekend. Currently everyone is hunting for reissued tasks.

* We will probably be told to grin and bear it...
Yeah that's about right :-)
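
The workunit comparison described under 'computer errors' above can be sketched in a few lines. This is only a rough illustration, not anything the project provides: the `Run` record, its fields, and the idea of using the last timestep reached as a stand-in for "crashed in the same way" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    host: str           # hypothetical label for the computer that ran the task
    last_timestep: int  # how far the model got before it stopped
    crashed: bool       # True if the run ended in an error


def diagnose(my_run: Run, other_runs: List[Run]) -> str:
    """Rough triage of a crashed task against other runs of the same workunit."""
    if not my_run.crashed:
        return "no crash"
    if not other_runs:
        return "unknown: no other runs to compare against"
    # Someone else finished, or got further than we did: suspect the local PC.
    if any((not r.crashed) or r.last_timestep > my_run.last_timestep
           for r in other_runs):
        return "probably local"
    # Everyone else also crashed, no later than we did: suspect the model itself.
    return "probably model"
```

For example, `diagnose(Run("me", 1000, True), [Run("a", 5000, True)])` points at the local machine, because another host got further on the same workunit.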



Going back to the original query about 'no heartbeat' - this is a warning which appears when Boinc loses contact with the project application (i.e., the model in this case). If there are 100 consecutive 'no heartbeat' messages, Boinc itself will abort the job - this is very irritating and several of us complained to Berkeley back when the heartbeat feature was first introduced. The heartbeat can be lost for several reasons - the app might be stuck on something, the PC might be busy doing something I/O intensive, or the network stack on your PC might have frozen.
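
To make the 'consecutive' part concrete, here is a minimal sketch of that kind of counter. It is not Boinc's actual code; the class name and the `tick` method are invented for illustration, and the only figure taken from the description above is the threshold of 100.

```python
MAX_MISSED = 100  # threshold quoted above; everything else here is illustrative


class HeartbeatMonitor:
    """Counts consecutive missed heartbeats and flags an abort at the limit."""

    def __init__(self, max_missed: int = MAX_MISSED):
        self.max_missed = max_missed
        self.missed = 0
        self.aborted = False

    def tick(self, heartbeat_seen: bool) -> bool:
        """Record one check; return True once the job should be aborted."""
        if self.aborted:
            return True
        if heartbeat_seen:
            # Any heartbeat resets the run of misses.
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.aborted = True
        return self.aborted
```

The point of the sketch is that a single heartbeat getting through resets the count, so 99 misses followed by one success is harmless, while 100 misses in a row is fatal to the task.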

(Here's a ticket from 6 years ago regarding the vulnerability of Boinc to network interruptions).
https://boinc.berkeley.edu/trac/ticket/113#
83) Message boards : Number crunching : Waiting to run (scheduler wait) (Message 46877)
Posted 26 Aug 2013 by MikeMarsUK
Post:


'Waiting to run' usually simply means that the Boinc task is waiting for an available core to run on, while the other cores are busy with other tasks.

Jord says on a different forum:
The "scheduler wait" message in 6.13 is essentially the same as the "waiting for memory" message in 6.12; a GPU does not have enough memory to continue the work.

But in 6.12 it would also come up when other causes left no memory to be used. Essentially, what this message now means is that the application was temporarily exited by BOINC and is waiting to be rescheduled, to run again at a later time when we hope enough memory is available.
...

In that context, he is talking about the GPU, but I presume that the same applies for CPU tasks.




I have something similar, which I also haven't seen before. Four tasks running normally, and one that looks as though it is running, but it is stuck at timestep 670536 and using no CPU time. ...
Update: see also this thread in the Unix section. Same problem.



The 25/50/75% stuff is something else ... these points are where the model does some extra work - firstly validation, to make sure that nothing has gone out of realistic bounds, and secondly when it generates extra output files. Note that one of the things that the project is trying to find out is which parameter sets are viable and which will end up with unrealistic models. So a crash at this point is not necessarily bad (depending on whether the error came from the original input parameters, or the PC).

So if something had gone wrong earlier, this is the point where the task is supposed to crash out. However, some people find that rather than crashing out, it gets stuck & needs to be aborted.

Also, the model does not like to be interrupted at this point either - on my PC, I have changed the boinc settings so that it does not try to suspend the job when it sees CPU activity, and it stays in memory rather than being migrated to disk.
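
For anyone who prefers a file to the manager's preference screens, the sketch below writes out roughly equivalent settings in the shape of a Boinc `global_prefs_override.xml`. Treat it as an assumption-laden example: the tag names (`run_if_user_active`, `suspend_cpu_usage`, `leave_apps_in_memory`) are the ones I understand the client to use, the function name is made up, and you should check the client documentation for your Boinc version before relying on it.

```python
import xml.etree.ElementTree as ET


def build_override() -> str:
    """Build override preferences as an XML string (tag names are assumptions)."""
    root = ET.Element("global_preferences")
    # Keep computing while the computer is in use.
    ET.SubElement(root, "run_if_user_active").text = "1"
    # 0 is taken here to mean "never suspend because of other CPU activity".
    ET.SubElement(root, "suspend_cpu_usage").text = "0"
    # Keep suspended tasks in memory instead of unloading them.
    ET.SubElement(root, "leave_apps_in_memory").text = "1"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_override())
```

The same options can be set from the manager or the website preferences, which is the safer route if you are unsure about the file format.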

If you see a lot of problems at these points (rather than just the occasional model), then it is worth running a stability check (& also dialling down overclocking if you are O/Ced).
84) Message boards : Number crunching : Trickle-up message (Message 46856)
Posted 23 Aug 2013 by MikeMarsUK
Post:
Have you rebooted your PC recently? Sometimes the IP addresses are cached locally & so won't be updated until a reboot.
85) Message boards : Number crunching : Trickle-up message (Message 46824)
Posted 21 Aug 2013 by MikeMarsUK
Post:
...

Has anyone been able to change these settings from the account page, or is this function broken?


I just changed it from (blank) to 99% and back to 0 (blank) again, seemed OK.

Have you got multiple preferences set up? i.e., home/school/work, or just the default?

86) Questions and Answers : Unix/Linux : Unable to verify using certificates (Message 46822)
Posted 20 Aug 2013 by MikeMarsUK
Post:

I wonder if the certificate is for the old URL...
87) Questions and Answers : Windows : Project has no tasks available (Message 46819)
Posted 20 Aug 2013 by MikeMarsUK
Post:
I got a bunch of workunits last night (I think they started generating them late yesterday), and they are available in the queue.


If you aren't getting any, make sure that your climateprediction.net preferences (via the 'your account' link in the blue bar) allow HadCM3 jobs to be picked up.

If that is OK, then check your Boinc manager ... there are a number of reasons why it might not want to ask for more work. Make sure that it is allowed to pick up work from the project (first tab). Then if that does not pick up anything, click 'update', and do 'advanced/event log'. The log will show what happens when (if) your manager is asking for work.
88) Message boards : Number crunching : Trickle-up message (Message 46812)
Posted 19 Aug 2013 by MikeMarsUK
Post:
I did try entering zero, it seemed to work OK (i.e., still showed ---/blank afterwards).

At one point I had multiple sets of preferences (default + home) which was confusing. I deleted all but the default preferences.
89) Message boards : Number crunching : Trickle-up message (Message 46809)
Posted 19 Aug 2013 by MikeMarsUK
Post:
Mine shows '--- %' when viewing preferences, and (empty)% when editing preferences, which I presume indicates that it has no limit. I set that up a few months ago. What happens if you clear the field, rather than entering a zero?
90) Message boards : Number crunching : Trickle-up message (Message 46807)
Posted 19 Aug 2013 by MikeMarsUK
Post:
The credit processing is a significant overhead, so perhaps the admins were/are waiting for the backlog of all the uploads etc to go away before running it (after all, there was nearly 2 weeks of work waiting to be uploaded & imported).
91) Message boards : Number crunching : Minus points (Message 46786)
Posted 15 Aug 2013 by MikeMarsUK
Post:

Someone may have changed teams (their points will follow them).
92) Message boards : Number crunching : Trickle-up message (Message 46777)
Posted 15 Aug 2013 by MikeMarsUK
Post:
That can only be fixed on the browser side, not the server side. The temporary (blank) page is in the browser cache (without an 'Expires' header) & hence the browser will keep hold of it until it gets bored or the user refreshes. Nothing on the server side can prompt the browser to reload the page if the original cached copy did not have an expiry hint.



http://gtmetrix.com/add-expires-headers.html
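
As a small illustration of what an expiry hint looks like, the sketch below builds the headers a server could attach to a temporary page so that browsers know when to fetch it again. The function name and the choice to include `Cache-Control` alongside `Expires` are my own assumptions for the example; it says nothing about how the project's servers are actually configured.

```python
import time
from email.utils import formatdate
from typing import Dict, Optional


def cache_headers(max_age_seconds: int, now: Optional[float] = None) -> Dict[str, str]:
    """Return response headers telling a browser how long a page stays fresh."""
    if now is None:
        now = time.time()
    return {
        # HTTP dates are always expressed in GMT.
        "Expires": formatdate(now + max_age_seconds, usegmt=True),
        # Same hint in relative form, added here as an extra assumption.
        "Cache-Control": "max-age=%d" % max_age_seconds,
    }
```

A holding page sent with something like `cache_headers(300)` would be considered stale after five minutes, so browsers would pick up the real page without the user having to force a refresh.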
93) Message boards : Number crunching : Trickle-up message (Message 46775)
Posted 15 Aug 2013 by MikeMarsUK
Post:
...
Yes, I have been able to download one work unit on each of my two computers, but they have been stuck in "Transfer" for more than 12 hours now. "Retry" shows anywhere from 10 minutes to more than 2 hours. Should I go back into patience mode?


My guess is that because everyone has 12 days of uploads to transfer, it has temporarily flooded the bandwidth to the servers. (I have not yet been home to see if my PC also has a backlog of things to upload, but I suspect it does).


PS ... if anyone has bookmarked the old server status page / forums etc, you will need to update your bookmarks because the address climateapps2.oucs.ox.ac.uk has changed to climateapps2.oerc.ox.ac.uk.
94) Message boards : Number crunching : Trickle-up message (Message 46750)
Posted 13 Aug 2013 by MikeMarsUK
Post:

I had two models crash during the downtime (no idea why, it's been rock solid until now). I've been checking every few minutes in hope that the problem is fixed, so my home PC can connect & upload the error logs...
95) Questions and Answers : Getting started : Cannot attach to climateprediction (Message 46746)
Posted 13 Aug 2013 by MikeMarsUK
Post:
Yes, the hope is that the new server will solve all the problems the old server was having (which, as you noticed, have been causing a lot of downtime recently).
96) Questions and Answers : Windows : Server can't find key file (Message 46745)
Posted 13 Aug 2013 by MikeMarsUK
Post:
Yes, they're still working on it.
97) Questions and Answers : Getting started : Cannot attach to climateprediction (Message 46741)
Posted 13 Aug 2013 by MikeMarsUK
Post:
Today I've tested it again. Adding CP using boinc-client (linux) and boincmgr.
And it worked!
Unfortunately I got the message: "Server can't find key file".

What's wrong?


The original server had to be replaced in a big rush because it was falling to pieces (multiple failures on the RAID array, probably due to the controller). The new server is up but not yet working quite correctly. The 'server can't find key file' message basically says that the server configuration is still wrong.
98) Message boards : Number crunching : Nice To See CPDN Is Back Online - More Or Less (Message 46648)
Posted 19 Jul 2013 by MikeMarsUK
Post:

It is probably OK. From what I can see, the trickle job has not run since the downtime. Once it does (probably tonight), you should suddenly see them all.
99) Questions and Answers : Windows : Intel Visual Fortan run-time error (Message 46647)
Posted 19 Jul 2013 by MikeMarsUK
Post:
... On Windows it may be a different matter. It's possible they may sit there pretending to run but not clocking up any progress ...


This will be interesting ... I downloaded a bunch yesterday after the servers came back, and I am away from home for 9 days. Unfortunate timing.
100) Message boards : Number crunching : Servers up, trickles up, downloads going (Message 46635)
Posted 18 Jul 2013 by MikeMarsUK
Post:
Yeah, it wasn't a planned downtime. As Mo said in the other post (linked above), there was a problem with a RAID array. Jonathan said that it had to be rebuilt twice (once to replace the failed drive, and then again to proactively replace a second drive which was suspect).



