climateprediction.net home page
Posts by EclipseHA

1) Message boards : Number crunching : Why is Climate Prediction grabbing all the CPU time (Message 29633)
Posted 20 Jul 2007 by EclipseHA
Post:
Good idea. The current BS can't take into account that cpdn is the only project that accepts results past their deadline. Nor would we want the BS to take decisions about how far past the deadline is acceptable to Oxford.

If Berkeley can't offer a choice of BSs for a while, could the cpdn deadline be lengthened? That would surely make climate models demand a bit less time when they're first started by multi-project crunchers?



Thank you mo.v and Carl. Extending the deadline (as it's not really used anyway) could really help this problem.

As far as the CC being a bit smarter, it wouldn't have to be CPDN specific. It could, for example, be based simply on the size of the WU. If the WU will take more than 30(?) days from beginning to end, round robin it, otherwise use the normal scheduling.
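
To make the suggestion above concrete, here is a minimal Python sketch of a size-based rule. It is only an illustration, not BOINC code: the `WorkUnit` class, the `choose_policy` function, and the 30-day cutoff (the post itself marks it with a question mark) are all assumptions.

```python
from dataclasses import dataclass

# Assumed cutoff, taken from the "30(?) days" guess in the post.
LONG_WU_DAYS = 30.0


@dataclass
class WorkUnit:
    """Hypothetical stand-in for a work unit; not a real BOINC structure."""
    project: str
    est_total_days: float  # estimated crunch time from beginning to end


def choose_policy(wu: WorkUnit, cutoff_days: float = LONG_WU_DAYS) -> str:
    """Return "round_robin" for very long WUs, "normal" otherwise."""
    if wu.est_total_days > cutoff_days:
        return "round_robin"
    return "normal"


if __name__ == "__main__":
    # Example durations are made up for illustration.
    for wu in (WorkUnit("CPDN", 300.0), WorkUnit("Einstein", 0.5)):
        print(wu.project, choose_policy(wu))
```

The rule keys only on the estimated length of the WU, so nothing in it is specific to CPDN.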

Like many here, I started CPDN as a "fallback" project - when there were only 4-5 projects, there were times when work was scarce. It had a low resource share, and would crunch full bore when the other projects had issues. But it still would get a slice at other times. Now I have a share of 25% to 60% for CPDN, with CPDN getting a much bigger chunk than some projects, but I'm not willing to have CPDN-only machines.
2) Message boards : Number crunching : Why is Climate Prediction grabbing all the CPU time (Message 29614)
Posted 19 Jul 2007 by EclipseHA
Post:
Hey folks, I'm giving you real world data, for a LINUX box..

Nowhere in this thread has "Linux" even been acknowledged. We also have mods with an RAC=0 on the production code, and while they may be running the beta, the term "linux" again is missing.

Here's the full story:

Been running CPDN for years on Linux and Windows, and all has been fine.

Been updating the Boinc CC regularly, and all has been fine.

Been mixing CPDN, Einstein, etc, with CPDN at about 25%, and all has been fine.

I had a CPDN WU finish a couple months back on the LINUX box <-note "LINUX".

The new WU on LINUX started using all the available CPU about a week after it started. The only way new work from other projects would be downloaded was to suspend CPDN. I tried all that was suggested (resetting stuff in client_state.xml), and the end result was that CPDN would dominate the machine within a few days.

Ok, so what changed in my config? The HW? nope... The OS? nope... The Boinc CC? nope... A new CPDN WU? Yes!!!!!!

The LINUX box is running FC3, and is my webserver. It's been constant for a long, long time, including the level of hits I get.

Seems with the new CPDN WU, something has changed. Could it be the deadline estimate has "shrunk" since the last CPDN WU I'd processed on this LINUX box? (early 2006 was when I got it, IIRC), or could it be a change in the LINUX application? Or the way the "benchmark code" plays with the CPDN app...

Anyway, it's not BS that this is what I've seen.

What I'm sensing from you all is that you really don't know what's happening, and I think LINUX is a missing variable... Any mods running current production code on LINUX?

As I've said, I'll provide you any additional information you may need, but the "CPDN Centric" behavior started with the last WU I got on this LINUX box - and that's not BS....

I've killed the CPDN WU on the LINUX box, and plan on crunching to completion the CPDN WUs I have on Windows (where everything is fine).
3) Message boards : Number crunching : Why is Climate Prediction grabbing all the CPU time (Message 29579)
Posted 17 Jul 2007 by EclipseHA
Post:
You folks at CPDN are lost...

The Mods don't even run the "production code" and provide answers based on theory, but not on how this project ran since the beginning...... And a machine at 1.8 GHz is now slow????????????

I've been running CPDN with the same share (Linux FC3) for years, and even with updating the CC from BOINC, task switching was happening on a "normal" basis - that is until I had a WU complete (It did take about a year).

The machine is on 24/7, and it's been 25% CPDN and 50% Einstein for a long time (and 750% LHC but there's never work there, so it's a non-issue....)
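
For reference, a rough Python sketch of how those shares work out, assuming BOINC treats resource shares as relative weights and that a project with no work gives up its slice to the others. The function name and the `has_work` argument are made up for illustration.

```python
def share_fractions(shares, has_work=None):
    """Normalize resource shares to fractions of CPU time.

    shares:   dict of project name -> resource share (relative weight)
    has_work: optional set of projects that currently have work;
              projects outside it are left out (an assumption of this sketch)
    """
    active = {p: s for p, s in shares.items()
              if has_work is None or p in has_work}
    total = sum(active.values())
    if total == 0:
        return {}
    return {p: s / total for p, s in active.items()}


# Shares quoted in the post.
shares = {"CPDN": 25, "Einstein": 50, "LHC": 750}

# If every project had work, CPDN's nominal slice would be 25/825.
all_busy = share_fractions(shares)

# With LHC dry, CPDN's slice is 25/75 and Einstein's is 50/75.
lhc_dry = share_fractions(shares, has_work={"CPDN", "Einstein"})
```

On these numbers CPDN's nominal slice is about 3% with LHC busy and about a third with LHC dry.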

Reality is that all worked fine until I got a new CPDN WU a couple months back. Now, that's all that crunches! Seems to me it is a CPDN problem, in that the date for "completion" is now set in such a way that for a machine that's a year or two old (on Linux at least), the only way for BOINC to schedule CPDN is to give it all resources for weeks or months!

You guys seem to assume that everyone is running the latest HW, but you're wrong - The mods are ignoring the problem - that is clear.

As I said, I will be more than willing to provide any information you need to debug the problem you're having with CPDN NOW consuming all the BOINC time, where it hasn't done so for most of the time I've been crunching CPDN.

Pointing me at the main BOINC site doesn't help, as your admins need to work with the BOINC developers to solve YOUR problem.

Not a problem for me as your attitude led me to just abort the WU that's sucking all the time!

Hey guys, I've got almost 300K credits with CPDN, and been contributing for a long time.. This isn't a "newbie" reporting a problem, but someone that has used "production code" for a long time....
4) Message boards : Number crunching : Why is Climate Prediction grabbing all the CPU time (Message 29555)
Posted 16 Jul 2007 by EclipseHA
Post:
On my Linux box, CP has been the only running project for weeks now.

I've zapped the long term/short term debts, reset the "correction factor" but still only CP!

(the resource share for CP is much lower than what I use for other projects, as CP doesn't have a fixed deadline....)

This is crazy!

Something has been lost in the BOINC scheduler, in that it used to have a specific "case" for CPDN, where deadlines didn't throw it into panic mode for CPDN. It could be the CPDN cruncher changed something too, as I noticed the "CPDN" centric behavior not long after getting a new WU.

For me, this is ONLY happening on my Linux box.. The windoz boxes are load leveling between projects just fine.

Les... Please in the future try running the project (RAC=0 I note) before you offer stock answers. Seems others are seeing this "CPDN" centric crunching, and all we get is "that's the way BOINC works". I'd kind of expect a forum moderator to be running the software. Be a help and not a hindrance to others looking for a fix.

Let me know if anyone connected with CPDN wants additional information from me to help resolve this problem.
5) Message boards : Number crunching : Why is Climate Prediction grabbing all the CPU time (Message 29232)
Posted 13 Jun 2007 by EclipseHA
Post:
I believe that the next major release of BOINC will show debt on the project tab.


If it doesn't, it should. It might help people understand how it allocates time.

Not in the 5.10.x versions, maybe in the 6.x.x ones. The first of the RPCs needed was added recently but the manager does not use it yet.


Boincview has displayed debt for something like a year. The API has been around for a long time, but maybe not in a way that boincmgr could use it!
6) Message boards : Number crunching : Why is Climate Prediction grabbing all the CPU time (Message 29231)
Posted 13 Jun 2007 by EclipseHA
Post:
Just rethink how it all works.
Which is that BOINC will run climate exclusively for some time, perhaps as long as a week. (It depends on your computer speed, the percentage of the day the computer is on, etc.)
Then it will realise that there isn't a problem after all, and go back to sharing the cpu among the projects.




Prior to 5.8.x, the CC knew that CPDN wasn't ever a problem. Like I said in another post, this was also when there was a "new and improved" scheduler in the CC.

Les, I do notice that your RAC is zero, and that kind of indicates that you're not even crunching CPDN right now!
7) Message boards : Number crunching : Why is Climate Prediction grabbing all the CPU time (Message 29230)
Posted 13 Jun 2007 by EclipseHA
Post:
This does seem to be a "twist" starting with Boinc CC 5.8.x.

BTW folks, Boinc Manager is only a tool to display what's happening. The Core Client is what's doing the work.

The reason I say this is that I don't run Boinc Manager - I run the CC with BoincView. BV has displayed short/long term debt for a long time!

For some time, the CC "knew" about CPDN, and ignored the "due date" for the WU. That way the CC didn't devote all its time to CPDN if there was a chance to miss the "unused" deadline.

Since moving to the 5.8.16 CC, on a slower box of mine, I do notice the CC really favoring CPDN due to debt (which it shouldn't, as the CPDN deadline is kind of meaningless, and logic in the code knew this). I'll see nothing but CPDN for days on that box, where project switching used to work quite well.

I suspect that somewhere between 5.4.x and 5.8.x this "knowledge" of CPDN was removed in the CC - someone probably wanted to clean up project specific conditions in the standard code (not a bad thing by itself, but not understanding why the code was there resulted in a problem with the local scheduler). Wasn't there a new scheduler introduced in 5.8.x of the CC?

Seems there are two things that could be done to make multi-project scheduling work better:

1) put back in the CPDN specific code in the CC (not the cleanest solution, but read on!)

2) Increase the deadlines for CPDN WUs (double them to allow for slower machines - BTW, by "slower", I'm seeing the problem on a 1.8 GHz machine)

The thing is, #2 won't take effect until a new CPDN WU is downloaded, which could be a few months. While #1 isn't the clean solution, it's something that can be done in the middle of a WU, with an update to the CC, and could be done in a much shorter time.
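
A back-of-the-envelope sketch (Python) of why option #2 would ease the pressure, assuming the client has to give a WU at least remaining-work divided by time-to-deadline of the CPU to finish on time. The function name and the example numbers are made up for illustration; they are not real CPDN figures.

```python
def required_cpu_fraction(remaining_cpu_days, days_to_deadline):
    """Fraction of one CPU needed to finish before the deadline.

    A value above 1.0 means the deadline cannot be met even when the
    work unit gets the whole CPU.
    """
    if days_to_deadline <= 0:
        raise ValueError("deadline must be in the future")
    return remaining_cpu_days / days_to_deadline


# Hypothetical example: 300 CPU-days of work left on a slower box.
remaining = 300.0
current = required_cpu_fraction(remaining, 330.0)  # deadline as issued
doubled = required_cpu_fraction(remaining, 660.0)  # deadline doubled
```

Doubling the deadline halves the required fraction, so in this made-up case a WU that needed most of the machine would need under half of it.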

I use CPDN as a "backup project", in that if LHC, for example, has work, I want to crunch LHC as much as possible (CPDN has a 25% share, LHC has a 750% share on my machine). With the current scheduler scheme in the CC, even if LHC had work (any day now they say!), my machine wouldn't even request work from LHC, but would keep on doing my 10 month long CPDN WU!

BTW, boincview is a pretty good "viewer" for what's happening with the CC, and can deal with multiple clients at once. I've tried Boinc Manager a few times, and it's really limited (things like debts, for example)

See http://boincview.amanheis.de/
8) Message boards : Number crunching : Error on File Upload (Message 28249)
Posted 29 Apr 2007 by EclipseHA
Post:

There's already a couple of automatic backup scripts.



Sorry, but doing a regular backup on my local system's CPDN data in the event there is a problem with the CPDN servers makes little sense to me. (backwards, in fact!) Do I also do backups on the other five projects I run in the event their servers fail? What about the folks that are crunching on 5, 10, 25, or 50 systems?

I'm looking forward to the edit required in the xml files, as I've been using editors on many different OSes for 30 years.

"I am not a number (or a geek) but a free man!" :) (as big balloons chase me down a beach)
9) Message boards : Number crunching : Error on File Upload (Message 28232)
Posted 28 Apr 2007 by EclipseHA
Post:
azwoody
And I'd be interested in knowing how you would restart a model from the very beginning, if this is what you think I meant. And why everyone would assume that's what I meant.



Some may see this as "de-attach" followed by "attach" (i.e. "restarting" the project)

You and the other moderators seem as confused as the rest of us on what this will really do..

I could "restart" a model by hacking the XML!
10) Message boards : Number crunching : Error on File Upload (Message 28231)
Posted 28 Apr 2007 by EclipseHA
Post:
Les, isn't there one way to recreate a 10-year zip file in a worst-case scenario? Even if the model hasn't crashed, restore a backup made before the file creation point (Dec of year ending ...0).

For example, I have a model crunching 2007. I must be certain to back up the complete contents of the boinc folder before the model reaches Autumn 2010.

Backup and restore instructions available through my sig. Les's method there, item #1 in the README about avoiding crashes, is really easy to follow.



Guys... No backups for 99.9999999999999% of the folks here.

They assume that with BOINC, a project will recover from its own problems, and not require crunchers to do a darn thing.

So what can I hack in the XML to help the project with their problem?
11) Message boards : Number crunching : Error on File Upload (Message 28229)
Posted 28 Apr 2007 by EclipseHA
Post:
Hi Azwoody

Trickles are getting through to cpdn in Oxford. The problem is only with the 10-year zip uploads. These appear in the Transfers tab of boinc manager at the beginning of the December of every model year ending in 0, e.g. 1980, 2030. During all the other model years, as long as there's no cpdn zip file waiting in the Transfers tab, we can allow network activity.

But as our models reach a year ending in 0, Les suggests for multi-project crunchers

For those APPROACHING a 10 year point in their crunching, I\'d recommend that:
1) Make sure that the project is set to "No new tasks" in the Projects tab
2) Suspend the model in the Tasks tab
3) Wait until the problem is resolved
4) Then restart the model




Seems you also don't understand that "restart vs resume" is kind of bogus. I do understand, but it's bad info to others, IMHO!



Doing things this way allows other projects to crunch and contact their server. It avoids the zip file problem by stopping the climate model before this file is produced.

However, single-project cpdn crunchers can, if they wish, allow the 10-year file to be produced but avoid problems by suspending network activity before the file tries to upload, i.e. before Dec of any year ending in 0.

Anyone (single or multi-project) with a zip file already waiting in the Transfers tab to upload (whether it's already produced upload error messages or not) should suspend network activity until the problem is solved. But the model may continue running. This should keep most computers busy until the extra space becomes available, ?next Thursday? Workunits from other projects could be suspended to keep the computer busy with the climate model.


But most won't check for a problem at CPDN until there is a zip file waiting to upload! You say I can't do any work on the machine, for any project that requires network access, until CPDN gets things fixed? ("suspend network activity") That's nuts, and not the way BOINC was designed! Seems it's a server problem! I've got a dual core machine, with zip files waiting to send, and only a cache of other projects for .75 days!

What we must try to avoid is multiple attempts to upload the same zip file. If we can avoid ALL attempts to upload them, that's better still. Every failed upload puts the zip file at risk.

Why? - do you expect folks to keep an eye on boinc 24/7?? Things should work unattended 24/7/365.25. If the zip gets lost, it's a server problem! Seems CPDN needs more help than just new HW!


The 10-year zip files can lie in the Transfers tab for up to two weeks after the first attempted upload and still be accepted by the server.


So, how do "geeks" extend this????


If no attempt is made to upload the zip files, I think they will be accepted up to 6 weeks later.

If we can avoid editing the xml files, this is preferable.



Come on, get real... Most folks (like me) won't know there is a problem UNTIL there's a zip file stuck in the transfers tab!

There's bogus code on the server - or within the CPDN app, in dealing with a problem like this...

What needs to be hacked in the xml, so the WU I've got on one box that has been crunching for over a year, and with days to complete, won't get trashed?

I think we need to hear from someone that really understands the code, and not a moderator, to chime in on this one...

That way I can "resume" the discussion and not "restart" it!
12) Message boards : Number crunching : Error on File Upload (Message 28225)
Posted 28 Apr 2007 by EclipseHA
Post:

And for those who are "a bit geeky", it's apparently possible to edit an xml file and alter some slot files, to extend the 14 day time limit.



Ok Les, if there's a real problem if the 10 year zip gets lost (and that's still a big question to me!), why not post the info on how the xml might be modified to extend the limit? (what needs to be changed and where...)

"suspend everything CPDN and network access, and wait" seems to be what you're saying, but that means no other projects can "phone home" to request work or report results...

As you are a Moderator here, I'd really hoped your posts were informational.....
13) Message boards : Number crunching : Error on File Upload (Message 28223)
Posted 28 Apr 2007 by EclipseHA
Post:
There are 2 issues:
2) People who have been doing nothing for several days before posting here are in danger of losing their 10 year zip before the new server is in place.
This has already happened to at least one person.


But again, what is the result? Will the WU crash and burn? Will the next 10 year zip fill in the missing data?

As you said, most folks don't check here, so will there be major confusion when all kinds of WUs "crap out"? I'm still not clear why it's not just recommended to "sit back and don't worry"? Isn't that the way Boinc was designed to work when a project has problems or is down?


Restart / Resume / Get-it-going-again, or whatever wording is used in whichever version of BOINC people are using.
The BOINC people keep changing the wording, and I've given up trying to keep up with it.


There's a big difference - in common terms of everyday people. "Restart" is "start again, from the beginning". "Resume" is "pick up from where you left off". "Pause/resume" is a function that most know from their DVD/VCR/TIVO... "restart" means watching the recording from the beginning.

14) Message boards : Number crunching : Error on File Upload (Message 28221)
Posted 28 Apr 2007 by EclipseHA
Post:
If people already have the failing upload problem, then it's too late to do much about it.

For those APPROACHING a 10 year point in their crunching, I'd recommend that:
....
4) Then restart the model.



I sure hope you meant to say "resume" and not "restart"!
15) Message boards : Number crunching : Error on File Upload (Message 28220)
Posted 28 Apr 2007 by EclipseHA
Post:
What exactly is the long-term issue if a user does nothing? I can see this could be a pain for dialup users, but I see no impact for non-dialup.

Seems there might be log messages that the upload failed, but won't things recover cleanly when the servers are upgraded? (might be multiple uploads, I understand that).

If stuff is still crunching correctly, why do anything?

16) Message boards : Number crunching : Hot to make climae run on my dual core mobile? (Message 27727)
Posted 5 Apr 2007 by EclipseHA
Post:
Also, be careful of Vista...

If you reboot, Vista may trash the running WU (Vista shuts down too fast and stuff isn't saved correctly). You may have hit your daily limit and not know it!

See this thread on the boinc forums....

http://boinc.berkeley.edu/dev/forum_thread.php?id=1689

BTW, I'm running with 60000 as the value discussed.....
17) Message boards : Number crunching : Looks like something went wrong at CP (Message 15925)
Posted 12 Sep 2005 by EclipseHA
Post:
OK.. Here's a specific. When I go to "your account", trickles no longer appear at the bottom, or in fact, anyplace. I'm running NS 7.1 under W2k.
18) Message boards : Number crunching : Looks like something went wrong at CP (Message 15878)
Posted 11 Sep 2005 by EclipseHA
Post:
As of tonight (US MST) things are not working when reviewing "my account".

For "results" all trickles seem to be lost.

My guess is that they are upgrading the server software. Hope this doesn't last too long!
19) Message boards : Number crunching : Unable to upload? (Message 15558)
Posted 31 Aug 2005 by EclipseHA
Post:
> I've just posted advice on the phpBB forum
> (http://www.climateprediction.net/board/viewtopic.php?p=29490#29490) on how
> to ensure you don't lose a result because it hasn't been able to upload for 2 weeks.
>
> http://www.teampicard.net/
> Join us here: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/team_display.php?teamid=3

Any ETA for when things might be back to normal? Just finished a WU today. Sulphur downloaded fine and is running just fine. I'd not paid much attention, but it seems the upload problem has been going on for a couple weeks. Now I have a set of results that can't be sent back. (I'm in the US and it seems that I'm trying to hit a server in the UK)

CPDN really needs to get everyone posting in one forum (IMHO). I don't care if it's the boinc stuff, or phpbb, but it seems info doesn't make it to both forums, unless someone posts a cross link...
20) Message boards : Number crunching : Possible hadsm4.13 question. (Message 15509)
Posted 29 Aug 2005 by EclipseHA
Post:
> I've had a lot of 4.13 WUs crashing one after the other at less than 1%
> crunched. Using BOINC 4.45. So I suspended the project for about 3 weeks. I
> re-activated it today and got 8/28/2005 3:31:19
> PM|climateprediction.net|Finished download of
> sulphur_se_4.19_windows_intelx86.zip
> I hope this won't have as many problems as 4.13! :)
> ***********************************
> Win2KPro, P4 1.8GHz, 512Mb RAM. Running Folding
> WinXP Home, P4 3.2GHz HT, 512Mb RAM. Running SETI, CPDN, Predictor, LHC,
> Einstein, Orbit and Folding
>

"sulphur" is a new cruncher (new science). From what I can see it takes about 2x the disk space and runs even longer! I hope to download my first Linux/sulphur in the next day or so (only a few trickles left on the current one), but won't be asking for a win2k version for another 1-2 weeks (still in phase 2).

If you still have problems, something you may want to check is your available disk space, as well as your available "virtual" memory. It could be your VM is getting tight with 7 projects, especially if you have "keep app in memory" selected for Boinc.

If you do get errors, please post them here. It might not be that other crunchers will tell you what's wrong, but if we see the same errors, we can send a "me too" to the project folks, and that might help to narrow it down.



©2024 climateprediction.net