OpenIFS Discussion

Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,888,554
RAC: 1,481,373
Message 67098 - Posted: 28 Dec 2022, 14:27:16 UTC - in response to Message 67087.  

... I was sooo happy the CPDN had an abundance of jobs and joined the party - only to then find out I can't get rid of my results.
Two machines crunching, two harddrives slowly filling up.

Thanks, that's funny. :-) Initially it's "Where's the work?!", now it's "How do I get rid of the results?!"

Me too. Locally it was even funnier, because a severe cold snap hit just when my fastest, hottest CPUs ran out of work. Had to burn a lot of methane to keep my house warm. Murphy's law.
And it happened over the winter holidays, when tech support for the infrastructure of low-budget volunteer projects is minimal or overpriced.
It's funny, ironic, anti-serendipitous, and another example of the famous Murphy's law.
keep on crunching, people. Patience pays. Thanks to all.
E
ID: 67098

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1055
Credit: 16,518,458
RAC: 1,068
Message 67100 - Posted: 28 Dec 2022, 19:13:13 UTC - in response to Message 67098.  


keep on crunching, people. Patience pays. Thanks to all.


I had turned off new tasks a few days ago, and they all ran out around lunch time (my local time) today.
I waited a bit, and resumed crunching by allowing new tasks. I got three and they started running.
But all the accumulated uploads failed to upload (no surprise), so I am now adding to the list.
CPDN is now using about 38 GBytes of disk. Luckily I have about 380 GBytes of disk space still available for Boinc. But even that will run out sometime.
We were told not to expect the problem to be fixed Monday (Boxing Day in England), but maybe after 9AM on Tuesday.
It is now after 9AM on Wednesday and still no uploads. Has anyone a clue what the problem is and when they expect it to be fixed?
Sigh! 8-(

Wed 28 Dec 2022 01:49:48 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_0500_1992050100_123_961_12178144_1_r680267451_2.zip
Wed 28 Dec 2022 01:49:50 PM EST |  | Internet access OK - project servers may be temporarily down.
Wed 28 Dec 2022 01:50:15 PM EST |  | Project communication failed: attempting access to reference site
Wed 28 Dec 2022 01:50:15 PM EST | climateprediction.net | Temporarily failed upload of oifs_43r3_ps_0099_2014050100_123_983_12199743_0_r1848880848_2.zip: transient HTTP error
Wed 28 Dec 2022 01:50:15 PM EST | climateprediction.net | Backing off 00:02:31 on upload of oifs_43r3_ps_0099_2014050100_123_983_12199743_0_r1848880848_2.zip
Wed 28 Dec 2022 01:50:17 PM EST |  | Internet access OK - project servers may be temporarily down.
Wed 28 Dec 2022 01:51:49 PM EST |  | Project communication failed: attempting access to reference site
Wed 28 Dec 2022 01:51:49 PM EST | climateprediction.net | Temporarily failed upload of oifs_43r3_ps_0500_1992050100_123_961_12178144_1_r680267451_2.zip: transient HTTP error
Wed 28 Dec 2022 01:51:49 PM EST | climateprediction.net | Backing off 00:03:36 on upload of oifs_43r3_ps_0500_1992050100_123_961_12178144_1_r680267451_2.zip
Wed 28 Dec 2022 01:51:50 PM EST |  | Internet access OK - project servers may be temporarily down.

ID: 67100

Alan K
Joined: 22 Feb 06
Posts: 484
Credit: 29,595,844
RAC: 1,821
Message 67106 - Posted: 28 Dec 2022, 23:29:58 UTC - in response to Message 67100.  

Copied from "uploads stuck"

From Andy

Hi Dave,

Thanks. I have looked at this. This machine keeps losing its SSH port and HTTP port. I reset it and it keeps losing them again. I am going to look at this further tomorrow.

Best wishes,

Andy

and

Update to this: I have made a request to the JASMIN cloud service where this machine resides to look into this.
ID: 67106

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4341
Credit: 16,497,933
RAC: 6,477
Message 67115 - Posted: 29 Dec 2022, 17:06:09 UTC
Last modified: 29 Dec 2022, 17:54:24 UTC

Any updates I get regarding stuck uploads will go here rather than in the thread of that name, as it has gone off on a tangent and I can't be bothered to move all the offending posts.

Beginning to look like JASMIN support are either not working 24/7 or have not been able to work out what the issue is.

Edit:
Support will be provided during normal working hours, defined as between 0900 and 1700 on Monday to Thursday and between 0900 and 1630 on Friday, excluding Public Holidays and STFC Privilege Days. Note that times are given in UK time.
So if not sorted by 16:00 UTC tomorrow we will almost certainly have to wait till at least Tuesday, with Monday being a public holiday in the UK.
ID: 67115

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4341
Credit: 16,497,933
RAC: 6,477
Message 67117 - Posted: 29 Dec 2022, 18:02:53 UTC

So if not sorted by 16:00 UTC tomorrow we will almost certainly have to wait till at least Tuesday, with Monday being a public holiday in the UK.


From Andy,

Thanks @dave
I have contacted JASMIN support, however their support is closed at present for the holiday period, I understand they are back on the 3rd January.
ID: 67117

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1055
Credit: 16,518,458
RAC: 1,068
Message 67122 - Posted: 29 Dec 2022, 22:36:07 UTC - in response to Message 67115.  
Last modified: 29 Dec 2022, 22:42:38 UTC

Beginning to look like JASMIN support are either not working 24/7 or have not been able to work out what the issue is.

Edit:

Support will be provided during normal working hours, defined as between 0900 and 1700 on Monday to Thursday and between 0900 and 1630 on Friday, excluding Public Holidays and STFC Privilege Days. Note that times are given in UK time.

So if not sorted by 16:00 UTC tomorrow we will almost certainly have to wait till at least Tuesday, with Monday being a public holiday in the UK.


Are they not the new, professionally managed, cloud-based server farm? IIRC, they worked very well the first few days, with extremely fast Internet transmission rates (over 7 megabytes/second). It is a shame they should be down for well over a week without technical support.

Now maybe CPDN is not one of their important clients, the kind dealing with critical information (banking, law enforcement, medical facilities, and G.O.K. what else). Even so, can they afford to have no technical support for over a week?
ID: 67122

Richard Haselgrove
Joined: 1 Jan 07
Posts: 942
Credit: 34,170,570
RAC: 5,819
Message 67124 - Posted: 30 Dec 2022, 9:46:30 UTC - in response to Message 67122.  

Are they not the new, professionally managed, cloud-based server farm? IIRC, they worked very well the first few days, with extremely fast Internet transmission rates (over 7 megabytes/second). It is a shame they should be down for well over a week without technical support.
Wasn't that when we were still running small test batches? Now there are 12326 tasks in progress, there are potentially 1,516,098 files to be uploaded, or around 22 terabytes. I wonder if anyone did that sort of a back-of-the-envelope calculation, and checked the aggregate bandwidth of that link - or possibly the terms of service?
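For anyone wanting to redo that back-of-the-envelope sum, here is a minimal Python sketch using only the three figures quoted above. It assumes "22 terabytes" means decimal terabytes (10^12 bytes), and the variable names are purely illustrative.

```python
# Figures quoted in the post above.
tasks_in_progress = 12_326
files_to_upload = 1_516_098
total_bytes = 22e12  # "around 22 terabytes", taken as decimal TB (assumption)

# Derived quantities: how much each task contributes to the backlog.
files_per_task = files_to_upload / tasks_in_progress
avg_file_mb = total_bytes / files_to_upload / 1e6
per_task_gb = total_bytes / tasks_in_progress / 1e9

print(f"files per task:    {files_per_task:.0f}")
print(f"average file size: {avg_file_mb:.1f} MB")
print(f"upload per task:   {per_task_gb:.2f} GB")
```

On those figures each task accounts for 123 upload files averaging roughly 14.5 MB each.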
ID: 67124

Glenn Carver
Joined: 29 Oct 17
Posts: 802
Credit: 13,560,429
RAC: 6,808
Message 67143 - Posted: 30 Dec 2022, 14:19:49 UTC - in response to Message 67124.  

Are they not the new, professionally managed, cloud-based server farm? IIRC, they worked very well the first few days, with extremely fast Internet transmission rates (over 7 megabytes/second). It is a shame they should be down for well over a week without technical support.
Wasn't that when we were still running small test batches? Now there are 12326 tasks in progress, there are potentially 1,516,098 files to be uploaded, or around 22 terabytes. I wonder if anyone did that sort of a back-of-the-envelope calculation, and checked the aggregate bandwidth of that link - or possibly the terms of service?
I might be wrong but I think the test batches go direct to CPDN and not via JASMIN.

The support CPDN get from JASMIN will depend on their service contract. But it's laughable that JASMIN pressured CPDN to get off the older unmanaged cloud server because of support issues (which delayed the release of these batches), to then get stuffed by lack of support when the server goes down on the new cloud. Still, a backup upload server (or two) might have helped. I'm not familiar with the BOINC side but it looks like there is only one in place. The next CPDN technical meeting will be interesting.
ID: 67143

Richard Haselgrove
Joined: 1 Jan 07
Posts: 942
Credit: 34,170,570
RAC: 5,819
Message 67145 - Posted: 30 Dec 2022, 14:28:48 UTC - in response to Message 67143.  

I might be wrong but I think the test batches go direct to CPDN and not via JASMIN.
That idea came about because we tried tracert to the upload server's IP address when this error first came about, and the last routing hop that responded had a .ja.net suffix.
ID: 67145

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1055
Credit: 16,518,458
RAC: 1,068
Message 67147 - Posted: 30 Dec 2022, 15:24:28 UTC - in response to Message 67124.  

Now there are 12326 tasks in progress, there are potentially 1,516,098 files to be uploaded, or around 22 terabytes. I wonder if anyone did that sort of a back-of-the-envelope calculation, and checked the aggregate bandwidth of that link - or possibly the terms of service?


I do not know about anyone else, but my one Linux machine has about 3200 CPDN .zip files to upload. That is about 28 tasks of output.

It has been up for a little over two weeks.
ID: 67147

AndreyOR
Joined: 12 Apr 21
Posts: 247
Credit: 11,816,348
RAC: 19,973
Message 67164 - Posted: 31 Dec 2022, 8:05:08 UTC - in response to Message 67143.  
Last modified: 31 Dec 2022, 8:12:12 UTC

I might be wrong but I think the test batches go direct to CPDN and not via JASMIN. ...

It sounds like Richard maybe was referring to the test runs on the main site, not the dev. site.

Richard, I do think that those questions are very valid, whether enough attention to detail was paid. I definitely hope that the current issue is not due to the new server not being up to par to handle the amount of uploads (which I assume will only increase with higher resolution models).
ID: 67164

Glenn Carver
Joined: 29 Oct 17
Posts: 802
Credit: 13,560,429
RAC: 6,808
Message 67171 - Posted: 31 Dec 2022, 16:22:14 UTC - in response to Message 67164.  

Richard, I do think that those questions are very valid, whether enough attention to detail was paid. I definitely hope that the current issue is not due to the new server not being up to par to handle the amount of uploads (which I assume will only increase with higher resolution models).
You have to remember there is only 1 full-time paid IT person at CPDN, Andy, who does a great job but is usually juggling 10 things at once. Andy is actually very good at detail (far better than my hacking about...), but there was a lot of time pressure because of contract commitments from the Perturbed Surface project.

There's a bit of a back story. JASMIN (the cloud provider) wanted CPDN to move to their newer managed server before the end of the year, which, because of other commitments, was not done until the test batches were complete. That made it rather a rush. I suspect it's a software issue that maybe got missed, but IMHO it is also crap timing, and rather poor from JASMIN that they can't provide any support between Christmas and New Year.

The new server (when it works) has a much improved capacity, that's not the issue. Knowing the Prof. in charge of CPDN I suspect strong words will be sent to JASMIN....

I'm more worried the results won't be available in time as the PS contract finishes end of Feb.
ID: 67171

SolarSyonyk
Joined: 7 Sep 16
Posts: 254
Credit: 31,658,560
RAC: 33,321
Message 67172 - Posted: 31 Dec 2022, 16:41:02 UTC - in response to Message 67171.  

I'm more worried the results won't be available in time as the PS contract finishes end of Feb.


The upload time for existing tasks shouldn't be a server network capacity issue: 22 TB over a 1 Gbit/s link is only about 2 days. I expect it will take far longer for a lot of users to upload their caches, though...
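A quick way to check that estimate, as a rough sketch: the helper name `transfer_days` and its `efficiency` parameter are made up for illustration, 22 TB is taken as 22 × 10^12 bytes, and the link is assumed to be fully dedicated to uploads.

```python
def transfer_days(total_bytes: float, link_bits_per_s: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to move total_bytes over a link of the given speed.

    efficiency is the fraction of the nominal line rate actually achieved
    (1.0 = ideal; real links lose some capacity to protocol overhead).
    """
    seconds = total_bytes * 8 / (link_bits_per_s * efficiency)
    return seconds / 86_400  # seconds per day


backlog_bytes = 22e12   # "around 22 terabytes" from earlier in the thread
link_speed = 1e9        # 1 Gbit/s

print(f"ideal:          {transfer_days(backlog_bytes, link_speed):.2f} days")
print(f"50% efficiency: {transfer_days(backlog_bytes, link_speed, 0.5):.2f} days")
```

At full line rate this comes to just over 2 days, and halving the effective throughput doubles it to about 4, so the server's link itself is unlikely to be the bottleneck.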

Is it possible to get exemptions to the "tasks per day" ramp? It seems like it takes a long time to get a new machine "up to capacity," even if it's returning valid results. Some of my compute boxes weren't able to get to full capacity for a while, though... at this point, some of them are idle on lack of upload slots or something (max uploads in progress). These tasks look like they'd be a good fit for preemptible compute instances on GCE or some other cloud platform if one wanted to throw a bunch of cores at them, but that's wasted if the machines can't stay busy.
ID: 67172

OliverF
Joined: 23 Nov 19
Posts: 4
Credit: 6,597,088
RAC: 79,816
Message 67282 - Posted: 4 Jan 2023, 9:25:42 UTC

Is there any update on this topic?
I still have about 120 GB of results clogging my drives and keep getting transient HTTP errors.
Would love to help crunch, but if this is doing nothing but blocking space on my disks, what's the point?
ID: 67282

Richard Haselgrove
Joined: 1 Jan 07
Posts: 942
Credit: 34,170,570
RAC: 5,819
Message 67283 - Posted: 4 Jan 2023, 9:42:51 UTC - in response to Message 67282.  

Just hang on to them for the time being. BOINC will hold on to them for up to 90 days (provided you've got the space), and I'm sure the project and JASMIN will have sorted this out by then.
ID: 67283

gemini8
Joined: 4 Dec 15
Posts: 52
Credit: 2,182,959
RAC: 836
Message 67285 - Posted: 4 Jan 2023, 9:46:10 UTC

It might be a good idea to extend the deadlines on the server side.
- - - - - - - - - -
Greetings, Jens
ID: 67285

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4341
Credit: 16,497,933
RAC: 6,477
Message 67287 - Posted: 4 Jan 2023, 10:03:34 UTC - in response to Message 67283.  

Just hang on to them for the time being. BOINC will hold on to them for up to 90 days (provided you've got the space), and I'm sure the project and JASMIN will have sorted this out by then.
I am hoping all will be resolved by the end of play today or at least a significant dent made in the number of tasks needing to be uploaded. Over 300 went through in the short time the gate was open yesterday, I haven't checked exactly how many.
ID: 67287

OliverF
Joined: 23 Nov 19
Posts: 4
Credit: 6,597,088
RAC: 79,816
Message 67288 - Posted: 4 Jan 2023, 10:06:49 UTC - in response to Message 67287.  

Thanks!
Fingers crossed then.
I promise to open my gates for new tasks once the upload queue here is gone.
ID: 67288

Glenn Carver
Joined: 29 Oct 17
Posts: 802
Credit: 13,560,429
RAC: 6,808
Message 67324 - Posted: 4 Jan 2023, 17:46:21 UTC - in response to Message 67288.  

Thanks!
Fingers crossed then.
I promise to open my gates for new tasks once the upload queue here is gone.
One of my machines has started downloading again, so that's a good sign the others will be soon (as long as the server stays up).
ID: 67324

OliverF
Joined: 23 Nov 19
Posts: 4
Credit: 6,597,088
RAC: 79,816
Message 67351 - Posted: 5 Jan 2023, 8:16:36 UTC - in response to Message 67287.  
Last modified: 5 Jan 2023, 8:31:36 UTC

Just hang on to them for the time being. BOINC will hold on to them for up to 90 days (provided you've got the space), and I'm sure the project and JASMIN will have sorted this out by then.
I am hoping all will be resolved by the end of play today or at least a significant dent made in the number of tasks needing to be uploaded. Over 300 went through in the short time the gate was open yesterday, I haven't checked exactly how many.

One of my boxes completed all uploads; the second, however, is now stuck with 2 (sic) of those thousands of result files. Is the "gate" now open permanently, or will it be? After all, being open for uploads seems to be the sole purpose of an upload server.
ID: 67351
©2024 climateprediction.net