Some PCs Trickling, Some Not

Author	Message
geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2168 Credit: 64,535,199 RAC: 6,573	Message 41396 - Posted: 31 Dec 2010, 1:44:19 UTC
I'm a bit confused. The PCs that I'm talking about in the title are all running Linux. They are also all running multiple FAMOUS tasks on BOINC 6.10.58. Some of these PCs are trickling fine, despite the fact that the 10-year zip file uploads are stuck in the Transfers tab. Other PCs are not trickling (and not contacting any CPDN server) at all, despite progressing along on their remaining FAMOUS models. Any ideas on why certain PCs are not trickling, or even trying to contact the trickle server, while others are? I probably should know the answer to this, but obviously don't. Thanks.

tullio Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0	Message 41397 - Posted: 31 Dec 2010, 3:43:06 UTC - in response to Message 41396.
My Linux box is trickling, but the zip files are stuck. Tullio

old_user294426 Joined: 20 Feb 06 Posts: 158 Credit: 1,251,176 RAC: 0	Message 41398 - Posted: 31 Dec 2010, 4:19:39 UTC Last modified: 31 Dec 2010, 4:35:04 UTC
Look at the News post by Mo at the top of the Number Crunching forum and at the Server Status page. You will see that there are now 3 upload servers down. There may be a reason why your tasks are "allocated" to one of these servers and are therefore reacting differently. Take a look back through the News Announcements and you will see details of the new staff who will be taking over in the New Year. Also, Milo has said he would do his best to call in over the holiday period to see if he can restart some of the servers by moving data off those that have filled. Keith

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2168 Credit: 64,535,199 RAC: 6,573	Message 41399 - Posted: 31 Dec 2010, 4:49:30 UTC
Believe me, I know about the project's problems, Keith. I'm thinking something got corrupted with all the communication difficulties. By random chance, this happened on a couple of my PCs, and not on others. I was able to get the trickles uploaded (most of them, anyway). I accidentally deleted some that had a ".sent" suffix on them, but they actually hadn't been sent, or at least hadn't been stored on the server. There were hundreds of trickle files in each of those PCs' climateprediction.net folders. I suspended BOINC, exited, then deleted sched_request_climateprediction.net.xml, sched_reply_climateprediction.net.xml, stdoutdae.txt, master_climateprediction.net.xml and job_log_climateprediction.net.xml (I know, more files than I needed to), then restarted BOINC and Resumed. Somehow the trickles went up, but those PCs are still acting differently from the others that are working automatically. They immediately go to a communication backoff of 24 hours when they can't upload their zip files, unlike the ones that are working correctly. Oh well, maybe I can nurse these through to the end. I'd hate to lose 12 or more models to these problems.

3rkko Joined: 12 Feb 08 Posts: 66 Credit: 4,877,652 RAC: 0	Message 41401 - Posted: 31 Dec 2010, 8:34:13 UTC
I have the same problem with my Linux box. It has not trickled since 28 Dec, even though its FAMOUS models are still running fine. The Transfers tab is full of stuck uploads, just like on my Windows machine, which is trickling successfully.

Darmok Joined: 29 Dec 09 Posts: 34 Credit: 18,395,130 RAC: 0	Message 41402 - Posted: 31 Dec 2010, 10:15:39 UTC - in response to Message 41401. Last modified: 31 Dec 2010, 10:38:29 UTC
On Windows, my last trickle was on 28 Dec at 1 pm UTC on a FAMOUS model, while the last zip upload occurred on 27 Dec at 2 am UTC on a hadcm zip. All are on backoff regardless of the model type. To paraphrase Geophi, I, and surely many others, would also hate to lose 3000 hours of runtime, as the hadcm models are close to their end.

Ingleside Joined: 5 Aug 04 Posts: 108 Credit: 19,169,869 RAC: 30,585	Message 41404 - Posted: 31 Dec 2010, 12:49:44 UTC - in response to Message 41399. Last modified: 31 Dec 2010, 12:57:52 UTC
Quote: "I suspended BOINC, exited, then deleted sched_request_climateprediction.net.xml, sched_reply_climateprediction.net.xml, stdoutdae.txt, master_climateprediction.net.xml and job_log_climateprediction.net.xml (I know, more files than I needed to), then restarted BOINC and Resumed."
sched_request* and sched_reply* are generated anew for each scheduler request, so manually deleting them should only have any effect if they had somehow been write-protected in such a way that BOINC couldn't delete them. master* is the homepage, and it is also regenerated each time the master URL is tried, so again, deleting it has no effect. stdoutdae.txt is the log that contains all the various info, such as communication errors, and it is automatically recycled as needed. So deleting it makes it impossible to look up any errors that might be relevant for tracking down the problem.
As to why many people have problems trickling: something that's easy to overlook, if you're going directly to these forums, is that THE HOMEPAGE IS DOWN. If the BOINC client has had 10 failed scheduler requests in a row, it tries to re-download the homepage (the master URL), and if this fails, you immediately get a 24-hour deferral. So, until the homepage is back up and running again, you can't do any scheduler requests, and this includes uploading trickles. Anyone who hasn't had so many scheduling errors in a row that the client needs to re-check the homepage won't be aware of any problems with trickles. ;)
Edit - it seems Milo has fixed the problem, so the homepage is finally up and running again. So, whether you manually do a scheduler request or just let the up-to-24-hour deferral count down, everyone should upload their waiting trickles within 24 hours.
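The backoff rule Ingleside describes can be sketched roughly as follows. This is a simplified illustration of the behaviour as stated in the post, not the actual BOINC client code: the names ProjectState, record_scheduler_failure, MAX_FAILED_REQUESTS and MASTER_FAIL_DEFERRAL_HOURS are hypothetical, and resetting the failure counter after a successful master URL fetch is an assumption.

```python
from dataclasses import dataclass

# Values taken from the post above; the constant names are hypothetical.
MAX_FAILED_REQUESTS = 10         # failed scheduler requests in a row
MASTER_FAIL_DEFERRAL_HOURS = 24  # deferral when the master URL fetch also fails


@dataclass
class ProjectState:
    """Per-project bookkeeping on one host (illustrative only)."""
    consecutive_failures: int = 0
    deferral_hours: float = 0.0


def record_scheduler_failure(state: ProjectState, master_url_reachable: bool) -> str:
    """Register one failed scheduler request and report what happens next."""
    state.consecutive_failures += 1
    if state.consecutive_failures < MAX_FAILED_REQUESTS:
        # Below the threshold the homepage is never checked.
        return "retry"
    if master_url_reachable:
        # Assumption: a successful master fetch clears the failure count.
        state.consecutive_failures = 0
        return "master_ok"
    # Homepage down as well: long deferral, so no trickles go out either.
    state.deferral_hours = MASTER_FAIL_DEFERRAL_HOURS
    return "deferred"


# Two hosts during the same homepage outage.
unlucky = ProjectState()
unlucky_results = [record_scheduler_failure(unlucky, False) for _ in range(10)]

lucky = ProjectState()
lucky_results = [record_scheduler_failure(lucky, False) for _ in range(9)]
```

Under these assumptions, the host with ten failures in a row ends up deferred for 24 hours, while the host with only nine never checks the homepage and keeps retrying. That would be consistent with some PCs trickling and others not during the same outage.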

Milo Thurston Volunteer moderator Volunteer developer Joined: 2 Mar 06 Posts: 253 Credit: 363,646 RAC: 0	Message 41405 - Posted: 31 Dec 2010, 13:21:18 UTC
The home page is indeed up (there was a faulty network switch, which I've replaced), although the phpBB board has developed some sort of bizarre fault and is still down. Other servers are full, but I am emptying them as fast as I can. My rough calculations were that they would last until next year, but they have filled up more rapidly than planned.

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2168 Credit: 64,535,199 RAC: 6,573	Message 41407 - Posted: 31 Dec 2010, 14:21:31 UTC - in response to Message 41404.
Quote: "sched_request* and sched_reply* are generated anew for each scheduler request, so manually deleting them should only have any effect if they had somehow been write-protected in such a way that BOINC couldn't delete them. master* is the homepage, and it is also regenerated each time the master URL is tried, so again, deleting it has no effect. stdoutdae.txt is the log that contains all the various info, such as communication errors, and it is automatically recycled as needed. So deleting it makes it impossible to look up any errors that might be relevant for tracking down the problem."
Not surprising. I did look back as far as I could in stdoutdae.txt and didn't see anything that indicated what the error was, but I could easily have overlooked it in that long, long, long listing. I was gone for 8 days, and those two PCs stopped trickling in the middle of those 8 days.
Quote: "As to why many people have problems trickling: something that's easy to overlook, if you're going directly to these forums, is that THE HOMEPAGE IS DOWN. If the BOINC client has had 10 failed scheduler requests in a row, it tries to re-download the homepage (the master URL), and if this fails, you immediately get a 24-hour deferral. So, until the homepage is back up and running again, you can't do any scheduler requests, and this includes uploading trickles."
This I didn't know, and it is very useful to know. In all these years, I hadn't had this type of problem before, at least not to the point that it became really evident. Certainly I was aware of the homepage being down, just not of the limit of 10 consecutive failed requests. Perhaps it was explained in these forums a while ago, but I failed to take it in because I hadn't been affected by it to any noticeable extent. Thanks for the explanation, Ingleside.

Ingleside Joined: 5 Aug 04 Posts: 108 Credit: 19,169,869 RAC: 30,585	Message 41414 - Posted: 1 Jan 2011, 16:06:15 UTC - in response to Message 41407.
Quote: "This I didn't know, and it is very useful to know. In all these years, I hadn't had this type of problem before, at least not to the point that it became really evident. Certainly I was aware of the homepage being down, just not of the limit of 10 consecutive failed requests. Perhaps it was explained in these forums a while ago, but I failed to take it in because I hadn't been affected by it to any noticeable extent."
It's possible this has been an issue before, but I don't remember any such outages, though granted, months can pass between the times I look at the "main" CPDN webpage. The "main" CPDN webpage being on a separate server is an advantage, since even with the frequent outages on the BOINC side of things, the "main" webpage is normally still up and there aren't any problems accessing the "master" URL.