Some PCs Trickling, Some Not

geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2168
Credit: 64,535,199
RAC: 6,573
Message 41396 - Posted: 31 Dec 2010, 1:44:19 UTC

I'm a bit confused. The PCs that I'm talking about in the title are all running Linux. They are also all running multiple FAMOUS tasks on BOINC 6.10.58.

Some of these PCs are trickling fine, despite the fact that the 10 year zip file uploads are stuck in the Transfers tab. Other PCs are not trickling (and not contacting any cpdn server) at all, despite progressing along on their remaining FAMOUS models.

Any ideas on why certain PCs are not trickling, or even trying to contact the trickle server, while others are?

I probably should know the answer to this, but obviously don't.

Thanks.
ID: 41396
tullio

Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 41397 - Posted: 31 Dec 2010, 3:43:06 UTC - in response to Message 41396.  

My Linux box is trickling but the zip files are stuck.
Tullio
ID: 41397
old_user294426

Joined: 20 Feb 06
Posts: 158
Credit: 1,251,176
RAC: 0
Message 41398 - Posted: 31 Dec 2010, 4:19:39 UTC
Last modified: 31 Dec 2010, 4:35:04 UTC

Look at the News post by Mo at the top of the Number Crunching forum and the Server Status page.

You will see that there are now 3 Upload servers down.
There may be a reason why your tasks are "allocated" to one of these servers and are therefore reacting differently.

Take a look back on News Announcements and you will see details of the new staff that will be taking over in the New Year. Also Milo has said he would do his best to call in over the holiday period to see if he can restart some of the Servers by moving data from those that have filled.

Keith
ID: 41398
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2168
Credit: 64,535,199
RAC: 6,573
Message 41399 - Posted: 31 Dec 2010, 4:49:30 UTC

Believe me, I know the project problems, Keith.

I'm thinking something got corrupted with all the communications difficulties. By random chance, this happened on a couple of my PCs, and not others. I was able to get the trickles uploaded (most of them anyway). I accidentally deleted some that had a ".sent" suffix on them, but they actually hadn't been sent, or at least hadn't been stored on the server. There were hundreds of trickle files in each of those PCs' climateprediction.net folders.

I suspended BOINC, exited, then deleted sched_request_climateprediction.net.xml, sched_reply_climateprediction.net.xml, stdoutdae.txt, master_climateprediction.net.xml and job_log_climateprediction.net.xml (I know, more files than I needed to), then restarted BOINC and Resumed. Somehow the trickles went up, but those PCs are still acting differently than the others that are working automatically. They immediately go to a communication backoff of 24 hrs when they can't upload their zip files, unlike the ones that are working correctly.

Oh well, maybe I can nurse these through to the end. I'd hate to lose 12 or more models to these problems.
ID: 41399
3rkko

Joined: 12 Feb 08
Posts: 66
Credit: 4,877,652
RAC: 0
Message 41401 - Posted: 31 Dec 2010, 8:34:13 UTC

I have the same problem with my Linux box. It has not trickled since 28 Dec, even though the FAMOUS models are still running fine. The Transfers tab is full of stuck uploads, just like on my Windows machine, which is successfully trickling.
ID: 41401
Darmok

Joined: 29 Dec 09
Posts: 34
Credit: 18,395,130
RAC: 0
Message 41402 - Posted: 31 Dec 2010, 10:15:39 UTC - in response to Message 41401.  
Last modified: 31 Dec 2010, 10:38:29 UTC

On Windows, my last trickle was on 12-28 at 1pm UTC on FAMOUS, while the last zip upload occurred on 12-27 at 2am UTC on a hadcm zip. All are on backoff regardless of the model type. To paraphrase Geophi, I, and surely many others, would also dislike losing 3000 hours of runtime, as the hadcm models are close to their end.
ID: 41402
Ingleside

Joined: 5 Aug 04
Posts: 108
Credit: 19,169,869
RAC: 30,585
Message 41404 - Posted: 31 Dec 2010, 12:49:44 UTC - in response to Message 41399.  
Last modified: 31 Dec 2010, 12:57:52 UTC

I suspended BOINC, exited, then deleted sched_request_climateprediction.net.xml, sched_reply_climateprediction.net.xml, stdoutdae.txt, master_climateprediction.net.xml and job_log_climateprediction.net.xml (I know, more files than I needed to), then restarted BOINC and Resumed.

sched_request* and sched_reply* are generated anew for each scheduler request, so manually deleting them should have an effect only if they had somehow been write-protected in such a way that BOINC couldn't delete them...

master* is the home page, and it is also regenerated each time the master URL is tried, so again deleting it has no effect.

stdoutdae.txt is the log that contains all the various info, such as communication errors, and it is automatically recycled as needed. So deleting it makes it impossible to look up any errors that could be relevant to tracking down the problem.
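Since the log can be very long, a small filter may help when hunting for communication errors. The following is only a rough sketch: it assumes each line of stdoutdae.txt looks like "31-Dec-2010 01:44:19 [project name] message text", and the keyword list and the function names find_problem_lines and scan_file are made up for illustration, not taken from BOINC.

```python
import re
from typing import Iterable, List, Sequence

# Assumed layout of a log line: "<date> <time> [<project>] <message>".
# Check this against a real stdoutdae.txt before relying on it.
LINE_RE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] (.*)$")

# Hypothetical keywords; adjust them to whatever the real log contains.
KEYWORDS = ("fail", "error", "defer", "backoff")


def find_problem_lines(
    lines: Iterable[str],
    project: str = "climateprediction.net",
    keywords: Sequence[str] = KEYWORDS,
) -> List[str]:
    """Return the lines for one project whose message contains a keyword."""
    hits = []
    for raw in lines:
        line = raw.strip()
        match = LINE_RE.match(line)
        if match is None:
            continue  # line does not follow the assumed layout
        _date, _time, line_project, message = match.groups()
        lowered = message.lower()
        if line_project == project and any(k in lowered for k in keywords):
            hits.append(line)
    return hits


def scan_file(path: str, project: str = "climateprediction.net") -> List[str]:
    """Read a log file from the given path and filter it."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_problem_lines(handle, project)
```

Calling scan_file with the path to stdoutdae.txt in the BOINC data directory would return only the matching lines for the project, which is easier to read than the full listing.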



As to why many have problems trickling... something that's easy to overlook if you're going directly to these forums is that THE HOMEPAGE IS DOWN. If the BOINC client has had 10 failed scheduler requests in a row, it tries to re-download the homepage (the master URL), and if this fails, you'll immediately get a 24-hour deferral. So, until the homepage is back up and running again, you can't do any scheduler requests, and this includes uploading trickles.
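The rule described above can be pictured with a toy model. This is a minimal sketch of the behaviour as stated in this post, not the BOINC client's actual code: the names ProjectState and on_scheduler_result are hypothetical, and what the real client does with its failure counter after a master URL check is an assumption here (the sketch only resets it on a successful request).

```python
from dataclasses import dataclass

MAX_FAILED_REQUESTS = 10   # failures in a row before the homepage is re-checked
MASTER_FAIL_DEFERRAL = 24  # hours of deferral if the homepage fetch also fails


@dataclass
class ProjectState:
    failed_requests: int = 0  # consecutive failed scheduler requests
    deferral_hours: int = 0   # current communication deferral


def on_scheduler_result(state: ProjectState, request_ok: bool,
                        master_url_ok: bool) -> str:
    """Update the state after one scheduler request and report what happened.

    master_url_ok says whether the homepage (master URL) could be fetched;
    it only matters once the failure threshold has been reached.
    """
    if request_ok:
        # Trickles travel with scheduler requests, so they go up here.
        state.failed_requests = 0
        state.deferral_hours = 0
        return "ok"
    state.failed_requests += 1
    if state.failed_requests < MAX_FAILED_REQUESTS:
        return "retry"
    # Threshold reached: the client tries to re-download the master URL.
    if master_url_ok:
        return "master_ok"
    state.deferral_hours = MASTER_FAIL_DEFERRAL
    return "deferred"
```

In this model a PC that has failed fewer than 10 times keeps retrying and trickles as soon as one request succeeds, while a PC that has hit the threshold with the homepage down sits in a 24-hour deferral, which would match some machines trickling and others not.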

Anyone who hasn't had so many scheduling errors in a row that the client needs to re-check the homepage won't be aware of any problems with trickles. ;)


Edit - it seems Milo has fixed the problem, so the homepage is finally up and running again. So, whether you manually do a scheduler request or just let the up-to-24-hour deferral count down, everyone should upload their waiting trickles within 24 hours.
ID: 41404
Milo Thurston
Volunteer moderator
Volunteer developer

Joined: 2 Mar 06
Posts: 253
Credit: 363,646
RAC: 0
Message 41405 - Posted: 31 Dec 2010, 13:21:18 UTC

The home page is indeed up (there was a faulty network switch, which I've replaced), although the phpBB board has developed some sort of bizarre fault and is still down. Other servers are full but I am emptying them as fast as I can. My rough calculations were that they would last until next year but they have filled up more rapidly than planned.
ID: 41405
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2168
Credit: 64,535,199
RAC: 6,573
Message 41407 - Posted: 31 Dec 2010, 14:21:31 UTC - in response to Message 41404.  

sched_request* and sched_reply* are generated anew for each scheduler request, so manually deleting them should have an effect only if they had somehow been write-protected in such a way that BOINC couldn't delete them...

master* is the home page, and it is also regenerated each time the master URL is tried, so again deleting it has no effect.

stdoutdae.txt is the log that contains all the various info, such as communication errors, and it is automatically recycled as needed. So deleting it makes it impossible to look up any errors that could be relevant to tracking down the problem.

Not surprising. I did look back as far as I could in stdoutdae.txt and didn't see anything that indicated what the error was, but I easily could have overlooked it in that long, long, long listing. I was gone for 8 days, and those two PCs stopped trickling in the middle of those 8 days.

As to why many have problems trickling... something that's easy to overlook if you're going directly to these forums is that THE HOMEPAGE IS DOWN. If the BOINC client has had 10 failed scheduler requests in a row, it tries to re-download the homepage (the master URL), and if this fails, you'll immediately get a 24-hour deferral. So, until the homepage is back up and running again, you can't do any scheduler requests, and this includes uploading trickles.


This, I didn't know and is very useful to know. In all these years, I hadn't had this type of problem before, at least not to the point that it became really evident. Certainly I was aware of the homepage being down, just not the 10 consecutive failed download request limit. Perhaps it's been explained in these forums a while ago, but I failed to comprehend it because I hadn't been affected by it to any noticeable extent.

Thanks for the explanation, Ingleside.
ID: 41407
Ingleside

Joined: 5 Aug 04
Posts: 108
Credit: 19,169,869
RAC: 30,585
Message 41414 - Posted: 1 Jan 2011, 16:06:15 UTC - in response to Message 41407.  

This, I didn't know and is very useful to know. In all these years, I hadn't had this type of problem before, at least not to the point that it became really evident. Certainly I was aware of the homepage being down, just not the 10 consecutive failed download request limit. Perhaps it's been explained in these forums a while ago, but I failed to comprehend it because I hadn't been affected by it to any noticeable extent.

It's possible this has been an issue before, but I don't remember any such outages, though granted it can take months between each time I look at the "main" CPDN webpage. The "main" CPDN webpage being on a separate server is an advantage, since even with the frequent outages on the BOINC side of things, the "main" webpage is normally still up and there aren't any problems accessing the "master" URL.

ID: 41414
