Message boards : Number crunching : The uploads are stuck

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4472
Credit: 18,448,326
RAC: 22,385
Message 67980 - Posted: 23 Jan 2023, 7:40:52 UTC

Hi,
We have had to pause uploads: the upload server is full, the JASMIN group workspace (GWS) that we transfer everything to is full, and the tape drives we use to move data off the GWS are offline and have been for four days. We should have 37TB of space available once the tape system is back.
Jamie's batches look like they'll be >50TB if they all come in. Sarah is going to talk to him about thinning them from 1,000 per year to 500 per year.
Kind Regards
David


And
I have also requested another 25TB of emergency space from JASMIN in our GWS.
Kind Regards
David
ID: 67980
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1032
Credit: 36,231,429
RAC: 15,603
Message 67981 - Posted: 23 Jan 2023, 9:22:16 UTC

I started building a reserve of tasks to run when the uploads began to stack up on Saturday evening.

But I stopped again when I saw the first of those emails last night. It feels like we need to wait while they fix the tape drives; wait while they transfer data from the workspace to tape; wait while they transfer data from the upload server to the workspace; and wait while our machines upload their data to the upload server. I'm not going to add any more until they're ready for it.
ID: 67981
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4472
Credit: 18,448,326
RAC: 22,385
Message 67982 - Posted: 23 Jan 2023, 10:36:27 UTC - in response to Message 67981.  

I started building a reserve of tasks to run when the uploads began to stack up on Saturday evening.

But I stopped again when I saw the first of those emails last night. It feels like we need to wait while they fix the tape drives; wait while they transfer data from the workspace to tape; wait while they transfer data from the upload server to the workspace; and wait while our machines upload their data to the upload server. I'm not going to add any more until they're ready for it.
I have enough to keep going, given that I am only running two tasks at a time because uploading them takes so long even on a good day with a following wind. I am also turning the machine off at night while nothing is moving.
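
For anyone wanting to throttle things the same way, a minimal sketch of a BOINC app_config.xml that caps the project at two concurrent tasks; drop it in the climateprediction.net project directory and re-read config files. The value 2 is just an assumed example, not a recommendation.

    <app_config>
        <!-- Run at most two CPDN tasks at a time; adjust to suit your upload bandwidth. -->
        <project_max_concurrent>2</project_max_concurrent>
    </app_config>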
ID: 67982
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1032
Credit: 36,231,429
RAC: 15,603
Message 67983 - Posted: 23 Jan 2023, 11:29:37 UTC - in response to Message 67982.  

I'm still running five tasks at a time on both machines, because my faster upload line isn't stressed at that level. The machine I upgraded last week has about a week's worth of tasks, and that can continue indefinitely when we get the nod from the project that the cavalry are riding over the hill.

But my second machine is quota-limited to one task per day, because of the tasks I lost before Christmas and which time-expired yesterday. And I can't buy my way out of jail until I can report fully-uploaded tasks ...

That machine has enough work to last until Thursday, when I was planning to repeat the upgrade process (now I know how to do it!). We'll see how we're placed after that.
ID: 67983
wateroakley

Joined: 6 Aug 04
Posts: 194
Credit: 27,820,073
RAC: 7,344
Message 67984 - Posted: 23 Jan 2023, 14:24:57 UTC - in response to Message 67982.  

The Ubuntu VM here has nine tasks waiting to upload their 122 transfers and a task queue of about 12 days. At the present task rate, the Ubuntu VM's disk is estimated to fill up in four to five days.
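
As a back-of-the-envelope check of that kind of estimate, a sketch in Python with assumed figures (not measurements from this VM):

    free_gb = 80.0        # free disk space on the VM (assumed)
    gb_per_task = 3.5     # data retained per finished task while uploads are stuck (assumed)
    tasks_per_day = 5.0   # completion rate (assumed)

    days_until_full = free_gb / (gb_per_task * tasks_per_day)
    print(f"Disk full in about {days_until_full:.1f} days")   # ~4.6 days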
ID: 67984
Glenn Carver

Joined: 29 Oct 17
Posts: 975
Credit: 15,712,094
RAC: 15,796
Message 67985 - Posted: 23 Jan 2023, 14:46:03 UTC

CPDN update.
Due to a failure of the tape archive at the JASMIN site, CPDN are not able to offload any more results to the archive, and both the upload and transfer server disks (~50TB) are now full. The batch server has been paused, so no further workunits will go out, and the upload server will stay disabled, probably for a few days at the very least, until capacity is restored.
ID: 67985
gemini8

Joined: 4 Dec 15
Posts: 52
Credit: 2,373,395
RAC: 4,037
Message 67986 - Posted: 23 Jan 2023, 14:55:50 UTC

Well, BOINC can be some kind of stress test for all sorts of computer-related functions, be it cooling, space in memory, space on disks, or network throughput, among others. ;-)
- - - - - - - - - -
Greetings, Jens
ID: 67986
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4472
Credit: 18,448,326
RAC: 22,385
Message 67989 - Posted: 23 Jan 2023, 15:18:03 UTC - in response to Message 67985.  

CPDN update.
Due to a failure of the tape archive at the JASMIN site, CPDN are not able to offload any more results to the archive, and both the upload and transfer server disks (~50TB) are now full. The batch server has been paused, so no further workunits will go out, and the upload server will stay disabled, probably for a few days at the very least, until capacity is restored.
I will turn off network activity and, if there is still no joy when the queue is finished, suspend the project to allow work elsewhere, though my main alternative, ARP on WCG, is often short of work at the moment.
ID: 67989
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1032
Credit: 36,231,429
RAC: 15,603
Message 67991 - Posted: 23 Jan 2023, 15:34:24 UTC - in response to Message 67989.  

I will turn off network activity and, if there is still no joy when the queue is finished, suspend the project to allow work elsewhere, though my main alternative, ARP on WCG, is often short of work at the moment.
That makes sense - no point in trying to upload each file as it's created when we know this will be a multi-day outage. I'll finish my current hoard in peace, and then switch the GPUs back on for other projects. That will need network activity, of course, but uploads won't be retried every couple of minutes.
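
For reference, both approaches can be scripted with boinccmd (the BOINC Manager exposes the same options in its GUI). The project URL shown is an assumption; it should match whatever boinccmd --get_project_status reports.

    # Stop all BOINC network traffic, including the upload retries:
    boinccmd --set_network_mode never

    # Or stop fetching new CPDN work while letting running tasks finish:
    boinccmd --project http://climateprediction.net/ nomorework

    # Or suspend the project entirely once the queue is done:
    boinccmd --project http://climateprediction.net/ suspend

    # Undo later:
    boinccmd --set_network_mode auto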
ID: 67991
wateroakley

Joined: 6 Aug 04
Posts: 194
Credit: 27,820,073
RAC: 7,344
Message 68006 - Posted: 23 Jan 2023, 18:42:41 UTC - in response to Message 67985.  

Thank you for the update, Glenn. I have suspended network activity until CPDN storage capacity becomes available.
ID: 68006
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1109
Credit: 17,121,631
RAC: 5,430
Message 68011 - Posted: 23 Jan 2023, 21:15:18 UTC
Last modified: 23 Jan 2023, 21:51:53 UTC

There is a software update to my machine that involves rebooting, so now (when three CPDN tasks finish later today) seems like a good time to install it.

This morning, I set all my projects to NoNewTasks.
I then suspended all CPDN tasks on my machine that had not started yet. 15 tasks are waiting to upload, not counting the three that are about 70% done.
I also did that to Einstein and WCG.
I also did that to all but a few hours' worth of MilkyWay and Universe. MW tasks take a little less than an hour to run, and Universe tasks take a little more than an hour. (Might as well keep those cores busy.)
ID: 68011
AndreyOR

Joined: 12 Apr 21
Posts: 293
Credit: 14,174,140
RAC: 21,622
Message 68012 - Posted: 24 Jan 2023, 8:56:06 UTC

JASMIN, these people are like bad referees or umpires. They should be in the background doing their work; we shouldn't even know they exist. Like sports officials, if they do their job well, they're just part of the background and never mentioned on any sports talk show. If they don't, they're the talk of the show instead of the players and the game.

Tape drive system down for 4 days?!

To repeat myself from a previous post, this OIFS thing is no joke, with its demanding requirements for both users and the project. That's part of its appeal. I'm looking forward to the really demanding, high-resolution ones, but hopefully not on VBox. With all these issues and the time it'll take to fully resolve them and make the necessary infrastructure changes, Glenn should have plenty of time to get native Linux versions working first.

If Hadley models weren't also being uploaded to upload11, I'd say release any of them that are ready, but alas.

I guess it's back to LHC full time; I have under 700k points to go to get the highest badge in ATLAS. I should be there by the time the OIFS tasks I got over the last couple of days are due.
ID: 68012
Glenn Carver

Joined: 29 Oct 17
Posts: 975
Credit: 15,712,094
RAC: 15,796
Message 68013 - Posted: 24 Jan 2023, 10:18:21 UTC - in response to Message 68012.  
Last modified: 24 Jan 2023, 10:19:42 UTC

It's unfair to say JASMIN should be 'doing their work'. They are, but they have support contracts in place for equipment they do not maintain themselves, and they have to wait like we do.

With very large systems, failures happen every day: server/compute blades, disk arrays, tape drives (usually mechanical failures). If the site is running 24/7 operations and can afford it, they will have vendor engineers on-site, but sometimes they have to wait for a specialist engineer or a replacement part to arrive. I've worked in HPCF environments for most of my career and it's nothing new.

I could be somewhat cheeky and say: if you are complaining about the current batch, why are you looking forward to an OpenIFS configuration that needs >16GB of RAM and 2-4x the I/O and upload file size... ;) (I am only joking, good to know people are interested.)

I'm actually impressed that the model, designed to run on supercomputers with 100,000s of cores, ran as well as it did across a wide range of hardware and OSes. Yes, there were problems, but most were on the BOINC side rather than the model failing.

ID: 68013
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4472
Credit: 18,448,326
RAC: 22,385
Message 68014 - Posted: 24 Jan 2023, 10:52:12 UTC

They should be in the background doing their work; we shouldn't even know they exist.

In an ideal world that might be true, and when things are working it mostly is. However, I suspect the project is not financed well enough to have the level of redundancy to achieve that.
ID: 68014
AndreyOR

Joined: 12 Apr 21
Posts: 293
Credit: 14,174,140
RAC: 21,622
Message 68015 - Posted: 24 Jan 2023, 12:22:41 UTC - in response to Message 68013.  

It's unfair to say JASMIN should be 'doing their work'. They are, but they have support contracts in place for equipment they do not maintain themselves, and they have to wait like we do.

Hold on, so there's a layer beyond JASMIN?! They don't maintain their own equipment?! That explains a few things, but the arrangement seems puzzling.

With very large systems, failures happen every day ...

I understand that things break down and can't always be fixed in a day or so, but this has been quite a lengthy affair with no clear end in sight. That's the part that's confusing and frustrating. By the time this is fully resolved, it will have gone on so long that the name someone came up with for it, The Great Holiday Outage, will no longer be appropriate. I've been through a couple of lengthy problem periods at MilkyWay. I believe they only had a PhD student to keep the server going and things functioning properly. The solution was, for the most part, to just crunch through the excessive backlogs and queues. Here there's nothing we can do, and on top of that we're acting as storage for dozens or even hundreds of GB of data per user. It's an uneasy feeling knowing you have so much of someone's stuff, can't get rid of it, and have to hope nothing goes wrong with it.

I could be somewhat cheeky and say: if you are complaining about the current batch, why are you looking forward to an OpenIFS configuration that needs >16GB of RAM and 2-4x the I/O and upload file size... ;) (I am only joking, good to know people are interested.)

I guess I'm a "complaining" resident; it'll take a lot more for me to leave. "Complaining" because there's nothing I can do but voice things. My complaints are not at all about OIFS and how it performs; I don't believe I have a single such post. I had a suspicion that things wouldn't be as trouble-free as they originally seemed to be presented; things almost always take longer and have more problems than people plan for. My complaints are about JASMIN and, seemingly, another layer. I'm also still somewhat suspicious that CPDN didn't properly calculate and prepare for the amount and speed of data that was to come from a full run of OIFS. I actually trust you more with OIFS than CPDN with the Hadley models, based on their high failure rate and seemingly no attempts at reducing it. I also have higher hopes for the more demanding models, because I believe that after the current problems there's no way CPDN wouldn't make sure to be prepared. I had better not be wrong; otherwise that would be embarrassing for the project.
ID: 68015
Richard Haselgrove

Joined: 1 Jan 07
Posts: 1032
Credit: 36,231,429
RAC: 15,603
Message 68016 - Posted: 24 Jan 2023, 13:12:03 UTC

I guess in that context I'd place myself as a "critical resident". I mean 'criticism' in the sense of analysis of problems encountered, and participation in the formulation of plans to avoid them in the future. I hope that comes across as a positive contribution.

A number of things have happened in the last month. The first is that we placed, with hindsight, an enormous stress test on the CPDN and JASMIN infrastructure. And we did so, by accident of timing, just at the start of the extended holiday break when both groups had limited support cover available. Shit happens - it was ready when it was ready. The lessons learned from that one are probably (a) pay more attention to the calendar when launching major projects, and (b) be poised to monitor the infrastructure readouts after a launch, and be ready to hit the emergency pause button if things get out of balance.

The second outage had a very different genesis - the failure of the tape archiving system. We don't know (yet?) exactly what went wrong, and what degree of repair is needed - it may, or may not, require the intervention of external bodies like the suppliers of spare parts. But the time delay seems strange: I first heard about it late on Sunday evening, by which time the tape archiving system had already been offline for four days (according to that initial report). Was any information passed back from JASMIN to CPDN during working hours on Thursday or Friday, and did it reach the right people? If not, that would be another point for a lesson learned.

It all raises questions about the state of the UK's scientific and economic infrastructure, but I'd better stop there.
ID: 68016
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1109
Credit: 17,121,631
RAC: 5,430
Message 68017 - Posted: 24 Jan 2023, 15:36:02 UTC - in response to Message 68011.  

There is a software update to my machine that involves rebooting, so now (when three CPDN tasks finish later today) seems like a good time to install it.


Well, that took about 15 minutes or less. Just the latest version of Firefox web browser, it turns out.

I now have WCG, Einstein, MilkyWay, and Universe running. There have been problems with Rosetta for a long time. And I have one CPDN task running instead of my usual five, with 18 waiting to upload their results.

I share the disappointment of others. I have found in life that blame does not solve problems, so I will not try to assign any.
ID: 68017
wujj123456

Joined: 14 Sep 08
Posts: 110
Credit: 38,736,362
RAC: 57,563
Message 68019 - Posted: 24 Jan 2023, 19:15:35 UTC - in response to Message 68014.  

In an ideal world that might be true, and when things are working it mostly is. However, I suspect the project is not financed well enough to have the level of redundancy to achieve that.

Pretty much summed it up. 24/7 support and multiple levels of redundancy to move failures out of the critical path cost a lot of money. At least when I was in grad school, I never heard of research projects running systems that way. Their funding is better put into actual research, while tolerating some downtime or even data loss when the unfortunate happens.

On the other hand, for the past month or so, the bottleneck for OpenIFS might actually be the server-side infrastructure, not volunteers' compute power. If such failures keep happening, it could justify changing where money is spent to address the real bottleneck that's slowing down progress. Or the next iteration of the app could consider tuning the compute/network ratio if possible. I guess all we can do is wait, and it would probably take a while for these improvements to materialize even if the team is already working on them.
ID: 68019
SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 34,376,460
RAC: 20,753
Message 68021 - Posted: 24 Jan 2023, 20:40:21 UTC
Last modified: 24 Jan 2023, 20:41:08 UTC

I guess I would mostly expect better reliability from a cloud provider than from a random 4U server stuck in a rack somewhere, and... I don't see it. Newegg can get you most parts you need (at least in the US) more or less overnight, and certainly within a week. A cloud provider without something resembling 24/7 support is quite absurd, and it sounds to me, from what I'm hearing, like they mostly don't know what they're doing with regard to the data drives, OS drives, etc. Bits and pieces speak of that with regard to the server failures. What I hear doesn't speak to CPDN failing to spec things properly; it's the cloud provider failing to provide what they said they would.

Then tape failures, OK, but... USB3 external drives are a thing. With a 10G network link to the server you can start shoveling data off at a couple of hundred MB a second, and 8-12TB externals aren't that expensive.

Yes, it's dealing with a lot of data, but most of the actual compute volunteers aren't set up to deal with tens or hundreds of GB of data just lying around either, on machines that typically empty out as soon as a task is done. A stack of USB external hard drives in a software RAID mirror and a backup upload server would solve a lot of problems and let things be resolved without blocking work. I'd been looking at upgrading the RAM in some of my compute nodes to handle the bigger stuff, but I honestly don't see the point until I can actually get things offloaded reliably.
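
For what it's worth, a minimal sketch of such a mirror on Linux with mdadm; the device names and mount point are placeholders, not a statement about CPDN's actual setup.

    # Mirror two equal-sized USB drives (device names are placeholders):
    sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc

    # Create a filesystem and mount it as overflow space for uploads:
    sudo mkfs.ext4 /dev/md0
    sudo mkdir -p /mnt/upload-spill
    sudo mount /dev/md0 /mnt/upload-spill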
ID: 68021
wujj123456

Joined: 14 Sep 08
Posts: 110
Credit: 38,736,362
RAC: 57,563
Message 68024 - Posted: 24 Jan 2023, 23:27:42 UTC - in response to Message 68021.  

USB3 external drives are a thing.

You should probably just give up on this idea, no matter how frustrated you are. The server is clearly hosted in a real data center or hosting facility, and none of them AFAIK will ever take non-rackable hardware. If they allowed any random customer to add funny hardware that they then have to help maintain, the whole operation would fall apart very fast. It's not that hard to procure proper server gear and deploy it quickly; it's likely process or funding stopping the team from doing that. Trying to prevent people from adding random USB drives might be one of the good reasons why such a process exists. :-)
ID: 68024
Message boards : Number crunching : The uploads are stuck
