Message boards : Number crunching : The uploads are stuck


nairb

Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 67592 - Posted: 12 Jan 2023, 1:15:23 UTC

Good job it's not the weekend... At least I got 2 complete w/u uploaded.
ID: 67592
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1061
Credit: 16,546,621
RAC: 2,321
Message 67593 - Posted: 12 Jan 2023, 1:36:38 UTC - in response to Message 67591.  

climateprediction.net 12-01-2023 01:22 [error] Error reported by file upload server: Server is out of disk space

Yes, I've seen it and reported it back. I believe that's 27 TB filled then. The uploaded tasks should be moving off to a transfer server; maybe that's not working.


I got my 40 tasks all uploaded before this happened.
I then started work on 5 new tasks and two or three of those 14 MB files accumulated on my machine, but they went up in a bunch. Now I have two screens full to send up with a 35-minute backoff. Glad my disk space is very large, and was just cleaned up.
ID: 67593
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 67597 - Posted: 12 Jan 2023, 6:44:28 UTC - in response to Message 67572.  

Glenn Carver wrote:
xii5ku wrote:
FWIW, when I came home from work 3 hours ago it worked somewhat. It quickly worsened, and now _all_ transfer attempts fail. I've got the exact same situation now as the one @Stony666 described.

The bad: The fact that _nothing_ is moving anymore for myself and evidently for some others doesn't make me optimistic that the backlog (however modest or enormous it might be) will clear anytime soon. And as long as nothing is moving, folks like myself can't resume computation.
There may be some boinc-ness things going on. I vaguely remember Richard saying something about uploads stopping if the client tries and fails 3 times? Or something like that? Uploads are still OK for me, I've got another 1000 to do.
It's not for a lack of the client's trying. First, I still have 1 running task on each of my two active computers, which causes a new untried file to be produced every ~7.5 minutes. Second, even if I intervene when the client applies an increasing "project backoff", i.e. make it retry regardless, nothing changes. Earlier yesterday, while I was still able to transfer some (with decreasing success), the unsuccessful transfers got stuck at random transfer percentages. Later on, still long before the new 'server is out of disk space' issue came up, eventually all of the transfer attempts got stuck at 0 bytes transferred.

Unlike some other contributors here, I have a somewhat limited upload bandwidth of about 8 Mbit/s. Maybe this plays a role why I have been less successful than others, maybe not.
ID: 67597
Glenn Carver

Joined: 29 Oct 17
Posts: 807
Credit: 13,593,584
RAC: 7,495
Message 67602 - Posted: 12 Jan 2023, 10:16:41 UTC - in response to Message 67597.  

It's not for a lack of the client's trying. First, I still have 1 running task on each of my two active computers, which causes a new untried file to be produced every ~7.5 minutes. Second, even if I intervene when the client applies an increasing "project backoff", i.e. make it retry regardless, nothing changes. Earlier yesterday, while I was still able to transfer some (with decreasing success), the unsuccessful transfers got stuck at random transfer percentages. Later on, still long before the new 'server is out of disk space' issue came up, eventually all of the transfer attempts got stuck at 0 bytes transferred.

Unlike some other contributors here, I have a somewhat limited upload bandwidth of about 8 Mbit/s. Maybe this plays a role why I have been less successful than others, maybe not.
Do you have these issues with other projects? And, when CPDN isn't playing catch-up with their uploads, do you have the problem then? Just trying to understand if this is a general problem you have or whether it's just related to what's going on with CPDN at the moment.
ID: 67602
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 67604 - Posted: 12 Jan 2023, 10:45:57 UTC - in response to Message 67597.  

Unlike some other contributors here, I have a somewhat limited upload bandwidth of about 8 Mbit/s. Maybe this plays a role why I have been less successful than others, maybe not.
8 Mbit/s is 80 times faster than what I have. I don't have problems when the servers are working properly. No issues with recent batches of Hadley models, but problems yesterday evening, even when uploads were going through, due to the congestion. Now we are waiting for data to be moved off the server, which should have happened automatically to prevent it filling up. Those above my pay grade (£0/hour) are investigating this.
ID: 67604
Glenn Carver

Joined: 29 Oct 17
Posts: 807
Credit: 13,593,584
RAC: 7,495
Message 67605 - Posted: 12 Jan 2023, 11:16:30 UTC

Update on the upload server, 11:15 GMT

I had an email from CPDN saying they are moving data off the upload server; it will be some time before they can enable httpd again. I wasn't given a time estimate, but they have to move 25 TB, and during the last downtime it took them the best part of a day to move the data from the broken upload server.
ID: 67605
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,307,352
RAC: 11,277
Message 67606 - Posted: 12 Jan 2023, 11:21:48 UTC - in response to Message 67605.  

Surely, for the long-term future (and with a weekend coming up), they have to configure a solution where the forwards pipe (upload server --> backing storage) runs faster than the inwards pipe (users --> upload server)?

Even if that involves throttling the inwards pipe ...
ID: 67606
wateroakley

Joined: 6 Aug 04
Posts: 186
Credit: 27,123,458
RAC: 3,218
Message 67607 - Posted: 12 Jan 2023, 12:06:18 UTC - in response to Message 67605.  

Thank you for the update, Glenn.
Update on the upload server, 11:15 GMT

I had an email from CPDN saying they are moving data off the upload server; it will be some time before they can enable httpd again. I wasn't given a time estimate, but they have to move 25 TB, and during the last downtime it took them the best part of a day to move the data from the broken upload server.

ID: 67607
Glenn Carver

Joined: 29 Oct 17
Posts: 807
Credit: 13,593,584
RAC: 7,495
Message 67608 - Posted: 12 Jan 2023, 12:42:08 UTC - in response to Message 67606.  
Last modified: 12 Jan 2023, 12:42:26 UTC

Surely, for the long-term future (and with a weekend coming up), they have to configure a solution where the forwards pipe (upload server --> backing storage) runs faster than the inwards pipe (users --> upload server)?

Even if that involves throttling the inwards pipe ...
That is how it works when it's functioning normally: the transfer server runs to keep the upload server under quota.
ID: 67608
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,307,352
RAC: 11,277
Message 67610 - Posted: 12 Jan 2023, 13:47:38 UTC - in response to Message 67608.  

That is how it works when it's functioning normally: the transfer server runs to keep the upload server under quota.
Judging by the timestamps in this thread, the upload server was open to users between about 10:30 and 00:30 yesterday - around 14 hours. We don't know how much was transferred to backing store in that time, but the excess of incoming over outgoing was enough to fill intermediate storage. If it's going to take 24 hours to transfer that excess, then the two rates - in practice, before final tuning - are seriously out of alignment.
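
As a rough sanity check, using only the round figures quoted in this thread (assumptions, not measured values, and assuming most of the backlog arrived during that 14-hour window), the implied rates look something like this:

    # Back-of-the-envelope rates from the figures quoted in this thread:
    # ~25 TB backlog, ~14 hours of open uploads, ~1 day quoted to drain it.
    TB = 1e12  # bytes

    backlog = 25 * TB
    hours_accepting = 14   # approx. time the upload server was open to clients
    hours_to_drain = 24    # approx. time quoted to move the data off again

    net_ingress = backlog / (hours_accepting * 3600)  # bytes/s building up
    drain_rate = backlog / (hours_to_drain * 3600)    # bytes/s needed to clear it

    print(f"net build-up: ~{net_ingress / 1e6:.0f} MB/s (~{net_ingress * 8 / 1e9:.1f} Gbit/s)")
    print(f"drain rate:   ~{drain_rate / 1e6:.0f} MB/s (~{drain_rate * 8 / 1e9:.1f} Gbit/s)")

If those round numbers are anywhere near right, that is just a numeric restatement of the point above: under backlog conditions the outgoing pipe falls well short of the incoming one.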
ID: 67610
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 67612 - Posted: 12 Jan 2023, 14:55:21 UTC - in response to Message 67610.  

I am guessing the transfer server could probably cope with normal operation, just not with the number of backed-up computers throwing zips at the upload server.
ID: 67612
Glenn Carver

Joined: 29 Oct 17
Posts: 807
Credit: 13,593,584
RAC: 7,495
Message 67614 - Posted: 12 Jan 2023, 16:50:11 UTC - in response to Message 67610.  

That is how it works when it's functioning normally: the transfer server runs to keep the upload server under quota.
Judging by the timestamps in this thread, the upload server was open to users between about 10:30 and 00:30 yesterday - around 14 hours. We don't know how much was transferred to backing store in that time, but the excess of incoming over outgoing was enough to fill intermediate storage. If it's going to take 24 hours to transfer that excess, then the two rates - in practice, before final tuning - are seriously out of alignment.
There's been a period of disruption because of the difficulty with filesystems that's led to terabytes of files being where they shouldn't. I don't think there are any serious issues, from the bits I know.

I'm told that all project data transfers (including the Had* models) are complete except the OpenIFS ones, which have yet to get to the point where the upload can be opened up again.
ID: 67614
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 67616 - Posted: 12 Jan 2023, 18:21:08 UTC - in response to Message 67602.  
Last modified: 12 Jan 2023, 18:44:40 UTC

Glenn Carver wrote:
xii5ku wrote:
It's not for a lack of the client's trying. First, I still have 1 running task on each of my two active computers, which causes a new untried file to be produced every ~7.5 minutes. Second, even if I intervene when the client applies an increasing "project backoff", i.e. make it retry regardless, nothing changes. Earlier yesterday, while I was still able to transfer some (with decreasing success), the unsuccessful transfers got stuck at random transfer percentages. Later on, still long before the new 'server is out of disk space' issue came up, eventually all of the transfer attempts got stuck at 0 bytes transferred.

Unlike some other contributors here, I have a somewhat limited upload bandwidth of about 8 Mbit/s. Maybe this plays a role why I have been less successful than others, maybe not.
Do you have these issues with other projects? And, when CPDN isn't playing catch-up with their uploads, do you have the problem then? Just trying to understand if this is a general problem you have or whether it's just related to what's going on with CPDN at the moment.
It's specific to CPDN. IIRC I had basically flawless transfers early after the start of the current OpenIFS campaign, but soon this changed to a state in which a certain number of transfers got stuck at a random transfer percentage. I worked around this by increasing max_file_xfers_per_project a little. This period ended at the server's X-Mas outage. When I came home from work yesterday (evening in the UTC+0100 zone), I found one of the two active computers transferring flawlessly, the other one having frequent but not too many transfers getting stuck at a random percentage. This worsened within a matter of hours on both computers, to the point that all transfer attempts got stuck at 0%. (Another few hours later, the server ran out of disk space, which changed the client log messages accordingly.)
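
(For anyone who wants to try the same workaround: that option lives in cc_config.xml in the BOINC data directory. A minimal sketch, where the values are only examples, not recommendations:

    <cc_config>
      <options>
        <!-- total simultaneous file transfers across all projects -->
        <max_file_xfers>12</max_file_xfers>
        <!-- simultaneous transfers per project (BOINC's default is 2) -->
        <max_file_xfers_per_project>6</max_file_xfers_per_project>
      </options>
    </cc_config>

The client picks it up after "Options -> Read config files" in the Manager, or boinccmd --read_cc_config.)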

I just came home from work again and am seeing "connect() failed" messages now. Gotta read up in the message board for the current server status.

As for other projects: I don't have projects with large uploads active at the moment. TN-Grid and Asteroids with infrequent small transfers and Universe with frequent smallish transfers are working fine right now on the same two computers, in the same boinc client instances as CPDN. Earlier in December, I took part in the PrimeGrid challenge which had one large result file for each task, and they transferred flawlessly too.

So to summarize: It's a problem specifically with CPDN's upload server, and this problem existed with lower severity before the Holidays outage.
________

Edit: So… is the new "connect() failed" failure mode expected due to the current need to move data off of the upload server?

Since last night, I have been logging the number of files to transfer on both of my active computers at 30-minute intervals, and these numbers have been monotonically increasing. (Each computer has got one OIFS task running.) I haven't done the math to check whether the growth of the backlog exactly matches the file creation rate of the running tasks, but I guess it does. The client logs of the past 2 or 3 hours don't show a single success.
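
(The logging itself is nothing sophisticated; a minimal sketch of the idea, assuming boinccmd is on PATH and that its --get_file_transfers output prints one "name:" line per pending transfer - check your client version's output, as the field layout may differ:

    #!/usr/bin/env python3
    # Log the number of pending BOINC file transfers every 30 minutes.
    import subprocess, time, datetime

    INTERVAL = 30 * 60  # seconds

    while True:
        out = subprocess.run(["boinccmd", "--get_file_transfers"],
                             capture_output=True, text=True).stdout
        pending = sum(1 for line in out.splitlines()
                      if line.strip().startswith("name:"))
        stamp = datetime.datetime.now().isoformat(timespec="seconds")
        with open("transfer_backlog.log", "a") as log:
            log.write(f"{stamp} {pending}\n")
        time.sleep(INTERVAL)

Nothing more than counting entries and appending a timestamped line.)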
ID: 67616
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,307,352
RAC: 11,277
Message 67618 - Posted: 12 Jan 2023, 19:07:22 UTC - in response to Message 67616.  

Edit: So… is the new "connect() failed" failure mode expected due to the current need to move data off of the upload server?
I'd say that the answer is embedded in message 67605, but you have to decode it.

I had an email from CPDN saying they are moving data off the upload server; it will be some time before they can enable httpd again. I wasn't given a time estimate, but they have to move 25 TB, and during the last downtime it took them the best part of a day to move the data from the broken upload server.
The upload server filled up overnight.
CPDN staff are moving files to another place, but it can't handle new files while the old ones are in the way.
So staff have disabled our ability to upload files for the time being - they've "disabled httpd".
The disabling of httpd has the effect of blocking our attempts to connect to the server when we want to upload files. Hence, the message "connect() failed".
It's similar to the message you sometimes see, "Project is down for maintenance" - a planned stoppage, rather than an unplanned one (this time at least).
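
If you want to see the same thing outside of BOINC, a plain TCP connection attempt to the upload host shows it too. A small sketch, where the hostname is a placeholder; use whatever host appears in your client's upload URL:

    # Try a bare TCP connection to port 80 of the upload host.
    # While httpd is disabled this should fail, much like the client's
    # "connect() failed" message.
    import socket

    HOST = "upload.example.cpdn.org"   # placeholder hostname

    try:
        with socket.create_connection((HOST, 80), timeout=10):
            print("TCP connect OK - httpd is listening again")
    except OSError as err:
        print(f"connect() failed: {err}")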
ID: 67618
Glenn Carver

Joined: 29 Oct 17
Posts: 807
Credit: 13,593,584
RAC: 7,495
Message 67619 - Posted: 12 Jan 2023, 21:13:14 UTC - in response to Message 67616.  

xii5ku wrote:
It's specific to CPDN. IIRC I had basically flawless transfers early after the start of the current OpenIFS campaign, but soon this changed to a state in which a certain number of transfers got stuck at a random transfer percentage. I worked around this by increasing max_file_xfers_per_project a little. This period ended at the server's X-Mas outage. When I came home from work yesterday (evening in the UTC+0100 zone), I found one of the two active computers transferring flawlessly, the other one having frequent but not too many transfers getting stuck at a random percentage.
Maybe the first is occupying the available bandwidth/slots. I see similar with my faster broadband. Getting stuck is perhaps to be expected given your remote location?

As for other projects: I don't have projects with large uploads active at the moment. TN-Grid and Asteroids with infrequent small transfers and Universe with frequent smallish transfers are working fine right now on the same two computers, in the same boinc client instances as CPDN. Earlier in December, I took part in the PrimeGrid challenge which had one large result file for each task, and they transferred flawlessly too.
Depends what you mean by 'large' and 'small' in relation to transfer size. We opted for more 'smaller' files to upload rather than fewer 'larger' files, precisely for people further afield. Each upload file is ~15 MB for this project. I still think that's the right approach, even though I recognise people got a bit alarmed by the number of upload files that was building up. Despite the problem with 'too many uploads' blocking new tasks, I am reluctant to up the maximum upload size limit, especially if you are having problems now.
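
To put those numbers together, a rough calculation using only figures quoted in this thread (roughly 15 MB per file, about one file every 7.5 minutes as reported above, a 12-14 hour run, and an 8 Mbit/s uplink; these are thread figures, not project specifications):

    # Rough per-task upload arithmetic from figures quoted in this thread.
    file_mb = 15            # approx. size of one upload file
    minutes_per_file = 7.5  # cadence reported earlier in the thread
    run_hours = 13          # middle of the 12-14 hour range
    uplink_mbit_s = 8       # slowest uplink mentioned in the thread

    files_per_run = run_hours * 60 / minutes_per_file      # ~104 files
    total_gb = files_per_run * file_mb / 1000               # ~1.6 GB per task
    seconds_per_file = file_mb * 8 / uplink_mbit_s          # ~15 s per file
    upload_minutes = files_per_run * seconds_per_file / 60  # ~26 min of line time

    print(f"~{files_per_run:.0f} files, ~{total_gb:.1f} GB per task")
    print(f"~{seconds_per_file:.0f} s per file, ~{upload_minutes:.0f} min total at {uplink_mbit_s} Mbit/s")

So even on a slow link the data volume per task is fairly modest; the pain comes from the number of queued files once the server stops accepting them.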

So to summarize: It's a problem specifically with CPDN's upload server, and this problem existed with lower severity before the Holidays outage.
I think that's part of your answer there. When there isn't a backlog of transfers clogging the upload server, it can tick along fine.

Since last night, I have been logging the number of files to transfer on both of my active computers at 30-minute intervals, and these numbers have been monotonically increasing. (Each computer has got one OIFS task running.) I haven't done the math to check whether the growth of the backlog exactly matches the file creation rate of the running tasks, but I guess it does. The client logs of the past 2 or 3 hours don't show a single success.
It probably is as the upload server is not allowing any connections at present.

If it really troubles people to see so many uploads building up in number we can modify it for these longer model runs (3 month forecasts).
ID: 67619
Yeti
Joined: 5 Aug 04
Posts: 171
Credit: 10,307,111
RAC: 26,815
Message 67620 - Posted: 12 Jan 2023, 21:44:50 UTC - in response to Message 67619.  

If it really troubles people to see so many uploads building up in number we can modify it for these longer model runs (3 month forecasts).
Hm, 121 trickle files for a job that lasts between 12 and 14 hours is way too much in my view.

Does each trickle contain really useful data, or only the sign "WU is still alive"?


Supporting BOINC, a great concept !
ID: 67620
wateroakley

Joined: 6 Aug 04
Posts: 186
Credit: 27,123,458
RAC: 3,218
Message 67622 - Posted: 12 Jan 2023, 22:32:46 UTC - in response to Message 67619.  

I'm OK with the current mix of task run time, file size, file numbers and our upload broadband speed; even with the outage, it's manageable for me. If needed, I can increase the VM disc for Ubuntu to over 900 GB.

If it really troubles people to see so many uploads building up in number we can modify it for these longer model runs (3 month forecasts).

ID: 67622
AndreyOR

Joined: 12 Apr 21
Posts: 247
Credit: 12,036,705
RAC: 22,374
Message 67623 - Posted: 12 Jan 2023, 22:36:14 UTC - in response to Message 67606.  

Surely, for the long-term future (and with a weekend coming up), they have to configure a solution where the forwards pipe (upload server --> backing storage) runs faster than the inwards pipe (users --> upload server)?
Even if that involves throttling the inwards pipe ...

I'd have also thought that from the time the uploads started flowing again, there would be hawk-eyes on that server until everything is back to normal. It kind of seems like we've been reacting to problems rather than proactively trying to prevent them.
ID: 67623
wujj123456

Joined: 14 Sep 08
Posts: 87
Credit: 32,981,759
RAC: 14,695
Message 67626 - Posted: 13 Jan 2023, 1:08:38 UTC - in response to Message 67623.  

I'd have also thought that from the time the uploads started flowing again, there would be hawk-eyes on that server until everything is back to normal. It kind of seems like we've been reacting to problems rather than proactively trying to prevent them.

This is the most worrying part, honestly. The disk filling up should be easily predictable from the ingress and egress rates. It's quite clear by now that no one is monitoring the system, or has alerting, or is trying to get ahead of trouble. While I know this is not a highly available service, that doesn't mean more care isn't needed when it's having real trouble. If no one is bothering to watch more carefully after almost three weeks of downtime and multiple failed recoveries, it starts to feel like keeping it up is not a priority at all.
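
As a sketch of how simple that prediction could be (purely illustrative; the path and interval below are placeholders, and it only takes two samples of disk usage):

    # Project when a filesystem will fill up from two usage samples.
    import shutil, time

    PATH = "/srv/uploads"   # placeholder for the upload partition
    INTERVAL = 15 * 60      # seconds between samples

    first = shutil.disk_usage(PATH)
    time.sleep(INTERVAL)
    second = shutil.disk_usage(PATH)

    net_rate = (second.used - first.used) / INTERVAL   # bytes/s, ingress minus egress
    if net_rate > 0:
        hours_left = second.free / net_rate / 3600
        print(f"filling at {net_rate / 1e6:.1f} MB/s, ~{hours_left:.1f} h until full")
    else:
        print("usage flat or falling over this interval")

Hook something like that into whatever alerting is available and the 'server is out of disk space' errors stop being a surprise.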
ID: 67626
wujj123456

Joined: 14 Sep 08
Posts: 87
Credit: 32,981,759
RAC: 14,695
Message 67628 - Posted: 13 Jan 2023, 1:36:20 UTC - in response to Message 67620.  

Does each trickle contain really useful data, or only the sign "WU is still alive"?

I am curious about this too, if anyone cares to educate a bit. From reading other posts, my wild guess is that all trickles contain real results. It might just be a means of breaking up the otherwise huge final result and spreading the upload across the lifetime of a task. Otherwise, a whole lot of partial 2 GB files could be really problematic for the server to hold onto. I could be totally wrong though.
ID: 67628

©2024 climateprediction.net