Message boards : Number crunching : The uploads are stuck
Joined: 15 May 09 Posts: 4472 Credit: 18,448,326 RAC: 22,385
Hi, and I have also requested another 25TB of emergency storage from JASMIN in our GWS.
Joined: 1 Jan 07 Posts: 1032 Credit: 36,231,429 RAC: 15,603
I started building a reserve of tasks to run when the uploads began to stack up on Saturday evening. But I stopped again when I saw the first of those emails last night. It feels as if we need to wait while they fix the tape drives; wait while they transfer data from workspace to tape; wait while they transfer data from the upload server to the workspace; wait while our machines upload their data to the upload server. I'm not going to add any more until they're ready for it.
Joined: 15 May 09 Posts: 4472 Credit: 18,448,326 RAC: 22,385
> I started building a reserve of tasks to run when the uploads began to stack up on Saturday evening.
I have enough to keep going, given that I am just running two tasks at a time because uploading them, even on a good day with a following wind, takes so long. And I am turning the machine off at night while nothing is moving as well.
Joined: 1 Jan 07 Posts: 1032 Credit: 36,231,429 RAC: 15,603
I'm still running five tasks at a time on both machines, because my faster upload line isn't stressed at that level. The machine I upgraded last week has about a week's worth of tasks, and that can continue indefinitely when we get the nod from the project that the cavalry are riding over the hill.

But my second machine is quota-limited to one task per day, because of the tasks I lost before Christmas and which time-expired yesterday. And I can't buy my way out of jail until I can report fully-uploaded tasks ... That machine has enough work to last until Thursday, when I was planning to repeat the upgrade process (now I know how to do it!). We'll see how we're placed after that.
Joined: 6 Aug 04 Posts: 194 Credit: 27,820,073 RAC: 7,344
The Ubuntu VM here has nine tasks waiting to upload their 122 transfers, and a task queue of about 12 days. At the present task rate, the Ubuntu VM's disc is estimated to fill up in four to five days.
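For anyone wanting to make the same estimate on their own machine, here is a back-of-the-envelope sketch in Python; the free-space, per-task and task-rate figures below are purely illustrative assumptions, not the numbers behind the estimate above:

    # Rough days-until-full estimate. All three figures are assumptions;
    # substitute values measured on your own host.
    free_gb = 60.0            # free space left on the disc holding BOINC data
    upload_per_task_gb = 2.0  # data each finished task leaves waiting to upload
    tasks_per_day = 6.0       # completion rate across all running tasks
    days_until_full = free_gb / (upload_per_task_gb * tasks_per_day)
    print(f"Disc full in about {days_until_full:.1f} days")  # -> about 5.0 days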
Joined: 29 Oct 17 Posts: 975 Credit: 15,712,094 RAC: 15,796
CPDN update. Due to a failure of the tape archive at the JASMIN site, CPDN are not able to offload any more results to the archive, and both the upload & transfer server disks (~50TB) are now full. The batch server has been paused so no further workunits will go out, and the upload server will stay disabled for probably a few days at the very least until capacity is restored.
Joined: 4 Dec 15 Posts: 52 Credit: 2,373,395 RAC: 4,037
Well, BOINC can be some kind of stress test for all sorts of computer-related functions, be it cooling, space in memory, space on disks, or network throughput, among others. ;-)
- - - - - - - - - -
Greetings, Jens
Joined: 15 May 09 Posts: 4472 Credit: 18,448,326 RAC: 22,385
> CPDN update.
I will turn off network activity and, if there is still no joy when the queue is finished, suspend the project to allow work elsewhere, though my main alternative, ARP on WCG, is often lacking in work at the moment.
Joined: 1 Jan 07 Posts: 1032 Credit: 36,231,429 RAC: 15,603
> I will turn off network activity and, if there is still no joy when the queue is finished, suspend the project to allow work elsewhere, though my main alternative, ARP on WCG, is often lacking in work at the moment.
That makes sense - no point in trying to upload each file as it's created, when we know this will be a multi-day outage. I'll finish my current hoard in peace, and then switch the GPUs back on for other projects. That will need network activity, of course, but it won't be retried every couple of minutes.
Joined: 6 Aug 04 Posts: 194 Credit: 27,820,073 RAC: 7,344
Thank you for the update, Glenn. I have suspended network activity until CPDN storage capacity becomes available.
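For anyone managing clients from the command line, here is a minimal sketch of pausing BOINC network activity by calling boinccmd (the command-line tool shipped with the BOINC client); the small Python wrapper is only an illustrative helper, not part of BOINC itself:

    # Sketch: stop BOINC network activity (and its upload retries) via boinccmd.
    import subprocess

    def set_network_mode(mode: str) -> None:
        # mode is "always", "auto" or "never"; "never" halts all transfers,
        # "auto" restores the normal behaviour later.
        subprocess.run(["boinccmd", "--set_network_mode", mode], check=True)

    set_network_mode("never")   # while the CPDN upload server is down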
Joined: 5 Aug 04 Posts: 1109 Credit: 17,121,631 RAC: 5,430
There is a software update to my machine that involves rebooting, so now (when three CPDN tasks finish later today) seems like a good time to install it.

This morning, I set all my projects to NoNewTasks. I then suspended all CPDN tasks on my machine that had not started yet. 15 tasks are waiting to upload, not counting the three that are about 70% done. I also did that to Einstein and WCG, and to all but a few hours' worth of MilkyWay and Universe. MW tasks take a little less than an hour to run, and Universe tasks take a little more than an hour to run. (Might as well keep those cores busy.)
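The "No New Tasks" and suspend steps described above can also be scripted with the same boinccmd tool; a hedged sketch, where the project URL is only a placeholder to be replaced by the master URL shown in your own client:

    # Sketch: set "No New Tasks" for a project, and optionally suspend it.
    import subprocess

    PROJECT_URL = "https://example-project-url/"   # placeholder, not a real URL

    def project_op(url: str, op: str) -> None:
        # op can be "nomorework", "allowmorework", "suspend" or "resume"
        subprocess.run(["boinccmd", "--project", url, op], check=True)

    project_op(PROJECT_URL, "nomorework")   # stop fetching new tasks
    # project_op(PROJECT_URL, "suspend")    # pause the project's tasks as well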
Joined: 12 Apr 21 Posts: 293 Credit: 14,174,140 RAC: 21,622
JASMIN, these people are like bad referees or umpires. They should be in the background doing their work; we shouldn't even know they exist. Like sports officials, if they do their job well, they're just part of the background and never mentioned on any sports talk shows. On the contrary, if they don't - they're the talk of the show instead of the players and the gameplay. Tape drive system down for 4 days?!

To repeat myself from a previous post, this OIFS thing is no joke with its demanding requirements for both users and the project. That's part of its appeal. I'm looking forward to the really demanding, high resolution ones, but hopefully not on VBox. With all these issues and the time it'll take to fully resolve them and make the necessary infrastructure changes, Glenn should have plenty of time to get native Linux versions working first. If Hadley models weren't also being uploaded to upload11, I'd say release any of them that are ready, but alas.

I guess it's back to LHC full time; I have under 700k points to go to get the highest badge in ATLAS. I should be there by the time the OIFS tasks I got over the last couple of days are due.
Joined: 29 Oct 17 Posts: 975 Credit: 15,712,094 RAC: 15,796
> JASMIN, these people are like bad referees or umpires. They should be in the background doing their work; we shouldn't even know they exist. Like sports officials, if they do their job well, they're just part of the background and never mentioned on any sports talk shows. On the contrary, if they don't - they're the talk of the show instead of the players and the gameplay.
It's unfair to say JASMIN should be 'doing their work'. They are, but they have support contracts in place for equipment they are not maintaining themselves, and have to wait like we do. When there are very large systems to deal with, failures happen every day: server/compute blades, disk arrays, tape drives (usually mechanical failures). If the site is running 24/7 operations and can afford it, they will have vendor engineers on-site, but sometimes they have to wait for a specialist engineer or a replacement part to arrive. I've worked in HPCF environments for most of my career and it's nothing new.

I could be somewhat cheeky and say: if you are complaining about the current batch, why are you looking forward to an OpenIFS configuration that needs >16GB RAM and 2-4x the I/O & upload filesize... ;) (I am only joking, good to know people are interested.) I'm actually impressed that the model, designed to run on supercomputers with 100,000s of cores, ran as well as it did across a wide range of hardware and OSes. Yes, there were problems, but most were because of the BOINC side and not the model failing.
Joined: 15 May 09 Posts: 4472 Credit: 18,448,326 RAC: 22,385
> They should be in the background doing their work; we shouldn't even know they exist.
In an ideal world that might be true, and when things are working it mostly is. However, the project is, I suspect, not financed well enough to have the level of redundancy needed to achieve that.
Joined: 12 Apr 21 Posts: 293 Credit: 14,174,140 RAC: 21,622
> It's unfair to say JASMIN should be 'doing their work'. They are, but they have support contracts in place for equipment they are not maintaining themselves, and have to wait like we do.
Hold on, so there's a layer beyond JASMIN?! They don't maintain their own equipment?! That kind of explains some things a bit, but the arrangement seems puzzling.

> When there are very large systems to deal with, failures happen every day ...
I understand that things break down and can't always be fixed in a day or so, but this has been quite a lengthy affair with no clear end in sight. That's the part that's confusing and frustrating. It'll have gone on for so long by the time this is fully resolved that the name someone came up with for it, The Great Holiday Outage, will no longer be appropriate. I've been through a couple of lengthy problem times at MilkyWay. I believe they only had a PhD student to keep the server going and things functioning properly. The solution was for the most part to just crunch to get through the excessive backlogs and queues. Here there's nothing we can do, and on top of that we're storing dozens or even hundreds of GB of data per user. It's an uneasy feeling knowing you have so much of someone's stuff, you can't get rid of it, and hopefully nothing goes wrong with it.

> I could be somewhat cheeky and say: if you are complaining about the current batch, why are you looking forward to an OpenIFS configuration that needs >16GB RAM and 2-4x the I/O & upload filesize... ;) (I am only joking, good to know people are interested.)
I guess I'm a "complaining" resident; it'll take a lot more for me to leave. "Complaining" because there's nothing I can do but voice things. My complaints are not at all about OIFS and how it performs; I don't believe I have a single such post. I had a suspicion that things wouldn't be as trouble-free as they originally were seemingly presented. Things almost always take longer and have more problems than people plan for. My complaints are about JASMIN and, seemingly, another layer. I'm also still somewhat suspicious that CPDN didn't properly calculate and prepare for the amount and speed of data that was to come from a full run of OIFS. I actually trust you more with OIFS than CPDN with Hadleys, based on the high failure rate of Hadleys and seemingly no attempts at reducing them. I also have higher hopes for the more demanding models, because I believe that after the current problems there's no way CPDN wouldn't make sure to be prepared. I had better not be wrong, otherwise that would be embarrassing for the project.
Joined: 1 Jan 07 Posts: 1032 Credit: 36,231,429 RAC: 15,603
I guess in that context I'd place myself as a "critical resident". I mean 'criticism' in the sense of analysis of problems encountered, and participation in the formulation of plans to avoid them in the future. I hope that comes across as a positive contribution.

A number of things have happened in the last month. The first is that we placed, with hindsight, an enormous stress test on the CPDN and JASMIN infrastructure. And we did so, by accident of timing, just at the start of the extended holiday break when both groups had limited support cover available. Shit happens - it was ready when it was ready. The lessons learned from that one are probably (a) pay more attention to the calendar when launching major projects, and (b) be poised to monitor the infrastructure readouts after a launch, and be ready to hit the emergency pause button if things get out of balance.

The second outage had a very different genesis - the failure of the tape archiving system. We don't know (yet?) exactly what went wrong, and what degree of repair is needed - it may, or may not, require the intervention of external bodies like the suppliers of spare parts. But the time delay seems strange: I first heard about it late on Sunday evening, by which time the tape archiving system had already been offline for four days (according to that initial report). Was any information passed back from JASMIN to CPDN during working hours on Thursday or Friday, and did it reach the right people? If not, that would be another point for a lesson learned.

It all raises questions about the state of the UK's scientific and economic infrastructure, but I'd better stop there.
Joined: 5 Aug 04 Posts: 1109 Credit: 17,121,631 RAC: 5,430
> There is a software update to my machine that involves rebooting, so now (when three CPDN tasks finish later today) seems like a good time to install it.
Well, that took about 15 minutes or less. Just the latest version of the Firefox web browser, it turns out. I now have WCG, Einstein, MilkyWay, and Universe running; Rosetta has had problems for a long time. And one CPDN task running, instead of my usual five, with 18 waiting to upload their results. I share the disappointment of others. I have found in life that blame does not solve problems, so I will not try to assign any.
Joined: 14 Sep 08 Posts: 110 Credit: 38,736,362 RAC: 57,563
> In an ideal world that might be true, and when things are working it mostly is. However, the project is, I suspect, not financed well enough to have the level of redundancy needed to achieve that.
Pretty much summed it up. 24/7 support and multiple levels of redundancy to move failure out of the critical path cost a lot of money. At least when I was in grad school, I never heard of research projects running systems that way. Their funding is better spent on actual research, while tolerating some downtime or even data loss when the unfortunate happens.

On the other hand, for the past month or so, the bottleneck for OpenIFS might actually be the server-side infrastructure, not volunteers' compute power. If such failures keep happening, it could justify changing where money is spent to address the real bottleneck that's slowing down the progress. Or the next iteration of the app could consider tuning the compute/network ratio if possible. I guess all we can do is wait, and it would probably take a while for these improvements to materialize even if the team is working on them already.
Joined: 7 Sep 16 Posts: 262 Credit: 34,376,460 RAC: 20,753
I guess I mostly would expect better reliability from a cloud provider than from a random 4U server stuck in a rack somewhere, and... I don't see it. NewEgg can get you most parts you need (at least in the US) more or less overnight, and certainly within a week. A cloud provider without something resembling 24/7 support is quite absurd, and it sounds to me, from what I'm hearing, like they mostly don't know what they're doing with regard to the data drives, OS drives, etc. Bits and pieces speak of that with regard to the server failures. What I hear doesn't speak to CPDN failing to spec stuff properly; it's the cloud provider failing to provide what they said they would.

Then tape failures, OK, but... USB3 external drives are a thing. With a 10G network link to the server you can start shoveling off data at a couple of hundred meg a second, and 8-12TB externals aren't that expensive. Yes, it's dealing with a lot of data, but most of the actual compute volunteers aren't set up to deal with tens or hundreds of GB of data just lying around either, on machines that typically empty out as soon as the task is done. A stack of USB external hard drives in a software RAID mirror and a backup upload server would solve a lot of problems and let things be resolved without blocking work.

I'd been looking at upgrading RAM in some of my compute nodes to handle the bigger stuff, but I honestly don't see the point until I can actually get things offloaded reliably.
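As a rough sanity check on that suggestion, here is a short sketch using the ~50TB backlog figure from the earlier CPDN update and the "couple of hundred meg a second" rate above; the sustained rate is an assumption, and real USB3 and filesystem overheads would slow it down:

    # Back-of-the-envelope: time to drain a ~50 TB backlog onto external
    # drives at an assumed sustained ~200 MB/s.
    backlog_bytes = 50e12        # ~50 TB, per the earlier CPDN update
    rate_bytes_per_s = 200e6     # assumed sustained write rate to USB3 externals
    days = backlog_bytes / rate_bytes_per_s / 86_400
    print(f"About {days:.1f} days of continuous copying")   # roughly 3 days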
Joined: 14 Sep 08 Posts: 110 Credit: 38,736,362 RAC: 57,563
> USB3 external drives are a thing.
You probably should just give up this idea no matter how frustrated you are. The server is clearly hosted in a real data center or hosting facility, and none of them AFAIK will ever take non-rackable hardware. If they allow any random customer to add funny hardware they have to help maintain later, the whole operation will fall apart very fast. It's not that hard to procure proper server gear and deploy it quickly; it's likely the process or funding that's stopping the team from doing that. Trying to prevent people from adding random USB drives might be one of the good reasons why such a process exists. :-)