climateprediction.net home page
Posts by MikeMarsUK

41) Message boards : Number crunching : Trickle echo (Message 47102)
Posted 18 Sep 2013 by Profile MikeMarsUK
Post:

The moderators have found some other examples also, all from the 17th onwards. It's been reported to the administrators.

42) Message boards : Number crunching : Task won't finish? (Message 47099)
Posted 18 Sep 2013 by Profile MikeMarsUK
Post:

The model has to do work at the end (packaging up the files for upload etc), so it is normal for there to be a period when it is still running at 100%. However, if it is still running more than a couple of hours after this point, then it may have got stuck, and you'll have to abort it. This does sometimes happen (although it is rare).

43) Questions and Answers : Wish list : Single status page (Message 47097)
Posted 18 Sep 2013 by Profile MikeMarsUK
Post:
...
BTW Mike, The News, Announcements and README posts on your signature are 404.


Ah, OK, I didn't even realise that I was still posting with a signature! (I had hide-signatures turned on). They were pointing to the old phpBB forum, so I have adjusted them to point to the appropriate places on the Boinc forum.

I have passed on the suggestion.

Cheers.
44) Message boards : Number crunching : Trickle echo (Message 47093)
Posted 18 Sep 2013 by Profile MikeMarsUK
Post:
Strange. I am wondering if it is somehow related to the trickle->credit process (there is a new version of the credit script).
45) Message boards : Number crunching : still don't get credits since last breakdown (Message 47090)
Posted 18 Sep 2013 by Profile MikeMarsUK
Post:
It's looking promising ... the trickles are now being converted to credits from what I can see. It will still take a day or two for the external credit statistics sites to pick up the new numbers.



46) Message boards : Number crunching : News and Announcements (Message 47088)
Posted 18 Sep 2013 by Profile MikeMarsUK
Post:

The credit system is running now: models are being marked with credit based on trickles, and all work since the old server crashed looks like it has been credited. The export process will send this to external statistics sites within the next day or so.


  • There is an anomaly with Beta-project credits, which is being looked at (EDIT - appears to be resolved now).
  • Duplicated trickles are appearing for a few models. This is being looked at.

47) Questions and Answers : Wish list : Single status page (Message 47087)
Posted 18 Sep 2013 by Profile MikeMarsUK
Post:

There is the news & announcements thread:

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=5447

If you bookmark it, it's then easier to find.

48) Questions and Answers : Wish list : Longer Deadlines - That's What I'd Like To See (Message 47080)
Posted 17 Sep 2013 by Profile MikeMarsUK
Post:

BOINC is a work in progress obviously.


There are a lot of things about Boinc that I am personally unhappy with. It was designed for running short, disposable jobs which can be validated by bytewise comparison of result files, and does not cope well with jobs which can last for weeks or even months of CPU time.

However, it's simple, widespread, and easy to set up, and that's why CPDN uses it. The bottom line is that CPDN benefits overall from using Boinc, despite the various issues.
49) Questions and Answers : Windows : Optimise PC build for CPDN (Message 47077)
Posted 17 Sep 2013 by Profile MikeMarsUK
Post:
I do have a UPS also - an APC SmartUPS-2200 which can run my PC for 20-30 minutes. Second-hand, and very cheap from ebay because it was so heavy (had to collect it). But if you get an ebay one you will need to replace the batteries.

At the time I was getting 10 powercuts / month. I'm not currently running it because the power supply has been much improved and I no longer get powercuts.
50) Message boards : Number crunching : Best practices running CPDN as a BOINC project (Message 47076)
Posted 17 Sep 2013 by Profile MikeMarsUK
Post:
The main problem is that admins have been unable to restart the process to calculate the credit based on the trickles since the big server crash. However, once it runs, everyone will get all outstanding credit. It sounds like there is progress.

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=5447&nowrap=true#46955

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=7649


For example, despite the lack of external reporting, are incremental "trickle-up" messages of any value, and are completed tasks correctly reported at this time?


Yes


Do repeated trickle-up messages impair the recovery process?


No


Would an option to disable trickle-up messages facilitate the recovery process?


No


-- Edit: Dave is quicker than me :-)
51) Questions and Answers : Wish list : Longer Deadlines - That's What I'd Like To See (Message 47064)
Posted 16 Sep 2013 by Profile MikeMarsUK
Post:

A lot of it depends on how many Boinc projects you are running simultaneously. The more there are on a single machine, the more likely that work units are going to go into high priority mode.

52) Questions and Answers : Wish list : Longer Deadlines - That's What I'd Like To See (Message 47062)
Posted 16 Sep 2013 by Profile MikeMarsUK
Post:
What I meant was this:

It'll grab more than its fair share until the workunit has finished, but then other projects will get the priority for a long while until it has evened out again.


Boinc has a complicated 'debt' system which means that CPDN will 'owe' the other projects a lot of CPU time. Until that debt has been paid back, Boinc should prevent CPDN from downloading new units.

But it has been years since I last looked at this 'debt' functionality. It may have changed.
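
To make the 'debt' idea a bit more concrete, here is a toy model in Python. It is only a sketch of the general principle and is not Boinc's actual scheduler: the function name, the hourly time step and the two-project setup are all assumptions made for illustration, and it models CPU time only, not work fetch. Each hour every project is 'owed' CPU time in proportion to its resource share, and whichever project runs has that hour deducted.

```python
def simulate(hours, high_priority_hours, shares):
    """Toy debt model (hypothetical, not Boinc's real algorithm)."""
    debt = {name: 0.0 for name in shares}
    total = sum(shares.values())
    schedule = []
    for hour in range(hours):
        # Every project is owed CPU time in proportion to its share.
        for name, share in shares.items():
            debt[name] += share / total
        if hour < high_priority_hours:
            running = "CPDN"  # high priority: runs regardless of debt
        else:
            running = max(debt, key=debt.get)  # most-owed project runs
        debt[running] -= 1.0  # one hour of CPU time is paid off
        schedule.append(running)
    return schedule, debt


# Two projects with equal shares; CPDN is in high priority for 10 hours.
schedule, debt = simulate(30, 10, {"CPDN": 1, "Other": 1})
print(schedule[:10])    # CPDN takes every hour while in high priority
print(schedule[10:20])  # the other project is then paid back
print(schedule.count("CPDN"), schedule.count("Other"))
```

In this toy run CPDN grabs the first 10 hours, the other project gets the next 10, and after that the two alternate, so the totals even out in the long run.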
53) Questions and Answers : Wish list : Longer Deadlines - That's What I'd Like To See (Message 47060)
Posted 16 Sep 2013 by Profile MikeMarsUK
Post:
... However, contrary to an earlier statement, ...


Which statement are you referring to? If it's mine, 'short term' = duration of the WU, 'long term' is the year or so afterwards that it will take the processing-debt to sort itself out.

54) Message boards : Number crunching : still don't get credits since last breakdown (Message 47056)
Posted 16 Sep 2013 by Profile MikeMarsUK
Post:
... p.s.: If my information is still valid, credits calculation in this project is a full run over all models and all trickles ever crunched anyway, so a normal standard call will collect everything from the beginning of (CPDN) time up to now. This is why (unlike in other projects) team member movements always move all member credits to the new team btw.


Yes, it still works like that. Once the correct part of the credit generation process is fixed, everyone will get all outstanding credit.

One of the moderators (mo.v) met up with Andy & some of the senior researchers on Friday to discuss several issues including credits, and it looks like there is progress. Among other things we are hoping that a science update can be published showing what the current projects are.
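
As a rough illustration of why a full recalculation behaves this way, here is a small Python sketch. It is not the CPDN credit script: the function name, the data layout, and the user names, team names and credit values are all made up for the example. The point is only that if totals are rebuilt from every trickle on each run, using the team membership current at that moment, then nothing is lost while the process is down, and a team move carries all of a member's credit with it.

```python
from collections import defaultdict


def compute_credit(trickles, team_of):
    """Rebuild user and team totals from scratch (hypothetical layout)."""
    user_credit = defaultdict(float)
    for user, credit in trickles:
        user_credit[user] += credit
    team_credit = defaultdict(float)
    for user, credit in user_credit.items():
        # Team totals use whatever team the user is in right now.
        team_credit[team_of[user]] += credit
    return dict(user_credit), dict(team_credit)


# Illustrative data only: (user, credit per trickle).
TRICKLES = [("alice", 10.0), ("bob", 10.0), ("alice", 20.0)]

before = compute_credit(TRICKLES, {"alice": "Team A", "bob": "Team A"})
after = compute_credit(TRICKLES, {"alice": "Team B", "bob": "Team A"})
print(before[1])  # all credit sits with Team A
print(after[1])   # alice's whole history has moved to Team B
```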

55) Questions and Answers : Wish list : Longer Deadlines - That's What I'd Like To See (Message 47051)
Posted 15 Sep 2013 by Profile MikeMarsUK
Post:

Keep in mind that:
a) the deadlines aren't enforced - the models will still be accepted regardless.
b) in the long run, CPDN will still end up with your preferred resource-share even if it has a high-priority task. It'll grab more than its fair share in the short term, but then other projects will get the priority for a long while until it has evened out again.


56) Message boards : Number crunching : failed upload: can't resolve hostname (Message 47041)
Posted 13 Sep 2013 by Profile MikeMarsUK
Post:


Well, what I was worried about was if the server was checking the incoming host name somehow (some web servers do this).


57) Message boards : Number crunching : still don't get credits since last breakdown (Message 47039)
Posted 13 Sep 2013 by Profile MikeMarsUK
Post:
... The project admins/developers have been aware of this for several weeks.

So far, the answer is that they simply don't know the cause of the problem.

Frankly this extended unresolved issue suggests to me that the people who need to research and resolve the issue either are not available to do this, or have other higher priority tasks which they are focused on. ...


Richard's post in another thread reminded me that one of the things that the admins are working on is replacing the CPDN back end with an up-to-date version of the Boinc server (the installed version is both ancient & highly customised, therefore it is a big job).

Among other things, it would require that the CPDN credit system be either rewritten or migrated.
58) Message boards : Number crunching : Why do I keep getting a 'Computation Error'? (Message 47034)
Posted 13 Sep 2013 by Profile MikeMarsUK
Post:
The computer with the issue is the one with ID: 1047569 running Windows 7.

The other computer is aborted often when it d/l data, and says it won't complete until after the deadline, even if it's run 24/7 (that computer is the family computer). It is never run 24/7 since three other people use it. I should just remove Climate from that one. Thanks for the reply.

The link to the computer is:
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1047569

The link to the most recent task is:
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=15802084



What I am seeing from this crash is exit code 193 and signal 11. The other crashes on the same PC seem similar. The following thread is from someone who was having the same combination of error codes and seems to have fixed them now. The two things that he did were to exclude the temporary directories from the A/V scan and to replace an elderly disk drive.
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=7602&nowrap=true#46964


Also, I can see that your models in 2012 were running OK, but it started crashing in 2013. Can you think of anything which changed then?



(unknown error) - exit code 193 (0xc1)
...
andled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x77E843D0 read attempt to address 0x40CDE394

Engaging BOINC Windows Runtime Debugger...



Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x77E83AB3 read attempt to address 0x40CDE390

Engaging BOINC Windows Runtime Debugger...

Cannot serialize file C:\ProgramData\BOINC/projects/climateprediction.net/hadcm3n_o6zx_2020_40_008373608/dataout/shmem_restart.day
Signal 11 received, exiting...
Called boinc_finish
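
For anyone trying to match up the numbers in a log like this, the following Python snippet shows how the values relate. It only does the number conversions and does not diagnose the cause of the crash; the variable names are just labels chosen for this example.

```python
import signal

exit_code = 193
print(hex(exit_code))  # 0xc1, the same value the log shows in brackets

# Signal 11 is the segmentation-fault signal (an invalid memory access).
print(signal.Signals(11).name)

# The two faulting read addresses from the log are only 4 bytes apart.
read_a = 0x40CDE390
read_b = 0x40CDE394
print(read_b - read_a)
```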



Could I suggest the following as a starting point:

* Change your settings to 'Leave tasks in memory when suspended' = Y, 'suspend if CPU usage is above %' to 0%, 'Use at most ... % of CPU' to 100.00. This will prevent the model being swapped out of memory.

* Make sure you shut down Boinc first prior to shutting down windows (right-click on the Boinc icon, snooze, wait for a few seconds, then right-click and exit). Similarly if you are about to do something CPU intensive, such as gaming, put it into snooze mode.

* Make sure that the Boinc data directories, and also temporary directories (c:\temp, c:\windows\temp, c:\users\your-user-id\appdata\local\temp or whatever) are excluded from any antivirus scans




Feel free to ask for help if you have trouble with any of these. Once you've made the change, monitor it for a while to see if it has helped or not. Either way, we would appreciate knowing the outcome.
59) Message boards : Number crunching : failed upload: can't resolve hostname (Message 47026)
Posted 12 Sep 2013 by Profile MikeMarsUK
Post:



Well, let's see if the address is accessible for you. Try visiting both of these in turn from your PC ... in theory you should get the same (minimal) web page on both.

http://apid-wattch.badc.rl.ac.uk/
http://rapid-watch.badc.rl.ac.uk/

If I look at them, only the second will work (since I have not touched 'hosts' on my PC).
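
If you would rather check from a script than a browser, something along these lines should work. It is only a sketch: it tests whether each name resolves to an IP address and does not fetch the web page. The check_hosts helper and the stub resolver are names invented for this example; socket.gethostbyname is the standard-library call used by default.

```python
import socket


def check_hosts(names, lookup=socket.gethostbyname):
    """Return {name: ip or None}; None means the name did not resolve."""
    results = {}
    for name in names:
        try:
            results[name] = lookup(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = None
    return results


# Offline stand-in for the resolver (an assumption for this demo): it
# knows only the correctly spelled name, as on a PC with an untouched
# 'hosts' file.
def stub_lookup(name):
    if name == "rapid-watch.badc.rl.ac.uk":
        return "130.246.191.84"
    raise socket.gaierror("name not known")


NAMES = ["apid-wattch.badc.rl.ac.uk", "rapid-watch.badc.rl.ac.uk"]
for name, ip in check_hosts(NAMES, stub_lookup).items():
    print(name, "->", ip or "does not resolve")
```

Calling check_hosts(NAMES) without the second argument uses your PC's real resolver, so the result will reflect whatever is in your own 'hosts' file and DNS.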
60) Message boards : Number crunching : failed upload: can't resolve hostname (Message 47021)
Posted 12 Sep 2013 by Profile MikeMarsUK
Post:


Thanks, then it is clear why the upload does not work. Can I fix it on my end somehow?



Yes ... you can map apid-wattch.badc.rl.ac.uk onto the IP address 130.246.191.84 (= rapid-watch.badc.rl.ac.uk)



First, I will try to explain how this works: Addresses on the internet are actually all numeric, even though we see textual names. When your computer wants to know what apid-wattch.badc.rl.ac.uk means, it will ask a DNS (Domain Name System) server to translate it into a numeric internet protocol address (IP address). In this case, the server will reply that the address is unknown.

However, if it had been supplied with rapid-watch.badc.rl.ac.uk instead, the DNS server would have replied with the magic number 130.246.191.84.

Prior to asking the DNS server, the computer actually first checks a local list of hostnames & their IP addresses. We can add apid-wattch to this list on your computer.


On Windows, this is done by finding the file 'hosts', and editing it. On my PC, it is in the location C:\WINDOWS\system32\drivers\etc

Add the following line to the end of the file:

130.246.191.84 apid-wattch.badc.rl.ac.uk # redirecting apid-wattch to rapid-watch for CPDN


(The bit after the # is just a comment).
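
To show the lookup order described above, here is a small Python sketch. It is a simplified model rather than how Windows actually implements name resolution: parse_hosts, resolve and the fake DNS table are all invented for this example. It reads the hosts line, drops the comment, and then checks the local list before asking 'DNS'.

```python
def parse_hosts(text):
    """Parse hosts-file text into {hostname: ip}, ignoring comments."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the comment part
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            table[name.lower()] = ip
    return table


def resolve(name, hosts_table, dns_lookup):
    """Check the local hosts table first, then fall back to DNS."""
    ip = hosts_table.get(name.lower())
    if ip is not None:
        return ip
    return dns_lookup(name)  # raises if the name is unknown


# Stand-in for a real DNS server: it only knows the correct name.
FAKE_DNS = {"rapid-watch.badc.rl.ac.uk": "130.246.191.84"}


def fake_dns_lookup(name):
    try:
        return FAKE_DNS[name]
    except KeyError:
        raise LookupError(f"unknown host: {name}")


HOSTS_TEXT = (
    "130.246.191.84 apid-wattch.badc.rl.ac.uk "
    "# redirecting apid-wattch to rapid-watch for CPDN\n"
)
hosts = parse_hosts(HOSTS_TEXT)
print(resolve("apid-wattch.badc.rl.ac.uk", hosts, fake_dns_lookup))
```

With the extra line in place the misspelled name resolves locally; with an empty hosts table the same call falls through to the fake DNS and fails, which matches the upload error.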

Note that this 'hosts' file is a system file, and it may be hidden (depending on the options in your Windows Explorer). Because it is a system file, a firewall / antivirus may also try to prevent you from changing it.


However... I think you have something like 2 weeks before the upload fails. So you can simply sit back & hopefully the project might make this same change at Rutherford.


