Posts by David Wallom

1) Message boards : Number crunching : Upload server is out of disk space (Message 67808)
Posted 17 Jan 2023 by David Wallom
Post:
Hello Everyone,

We increased the number of concurrent uploads allowed from 50 to 150, and the server did indeed end up running out of space. This is with 5 parallel transfers and deletions of successful workunits from jasmin-upload to the analysis space. We have temporarily restricted it back to 100 and are seeing free space increasing, currently 1.5TB out of 24TB. Of the OpenIFS@Home batches, each has up to 800GB of successful workunits that we are transferring off, and there are 44 batches.
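A rough back-of-the-envelope illustration of those numbers (the 800GB figure is an upper bound per batch, so this is a worst case, not a project measurement):

```python
# Back-of-the-envelope arithmetic on the space pressure described above
# (illustrative only; actual per-batch sizes vary).
TB = 1000  # work in GB

batches = 44
max_batch_gb = 800            # up to 800 GB of successful workunits per batch
upload_volume_gb = 24 * TB    # total space on the upload volume

worst_case_gb = batches * max_batch_gb
print(f"Worst-case backlog: {worst_case_gb / TB:.1f} TB "
      f"vs {upload_volume_gb / TB:.0f} TB capacity")
# Worst-case backlog: 35.2 TB vs 24 TB capacity
# -> the backlog alone can exceed the volume, hence the throttling while
#    the parallel transfers to the analysis space drain it.
```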

Thanks for your contributions

David
2) Message boards : Number crunching : The uploads are stuck (Message 67649)
Posted 13 Jan 2023 by David Wallom
Post:
Hi,

The current limit is 50 concurrent connections.

Cheers

David
3) Message boards : Number crunching : The uploads are stuck (Message 67636)
Posted 13 Jan 2023 by David Wallom
Post:
Hello All,

Brief update on status.

The upload server is back running and we are currently transferring ~24TB of built-up project results from that system to the analysis datastores. This process is going to take ~5 days running 5 parallel streams (the files are all OpenIFS workunits).

I have asked Andy to restart uploads but to throttle them to ensure that our total stored volume keeps decreasing, i.e. that our upload rate doesn't exceed our transfer rate. As such we'll be slow for a while, but we will gradually increase the upload server bandwidth to you guys as we clear batches.
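For a rough sense of the sustained rates this implies (illustrative arithmetic only; real throughput will fluctuate):

```python
# Rough throughput estimate for draining the backlog (illustrative only).
backlog_tb = 24
days = 5
streams = 5

total_mb = backlog_tb * 1_000_000
seconds = days * 24 * 3600

aggregate_mb_s = total_mb / seconds       # MB/s across all streams
per_stream_mb_s = aggregate_mb_s / streams

print(f"Aggregate: ~{aggregate_mb_s:.0f} MB/s, "
      f"per stream: ~{per_stream_mb_s:.0f} MB/s")
# Aggregate: ~56 MB/s, per stream: ~11 MB/s
# Volunteer uploads have to stay below this drain rate for the stored
# volume to keep decreasing.
```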

The issue was caused by an initial instability brought about because the system disks for the VMs that run the upload server and the data storage volumes are all hosted on the same physical storage system. When the data volumes fill, they affect the performance of the other disks as well. This was exacerbated because we had been allowed to create extremely large volumes that were really beyond the capability of the storage system, so we have to move the data internally as well. Not an ideal solution, and we have told JASMIN this.

Thank you for your understanding in what's been a difficult few days.

David
4) Message boards : Number crunching : Completed task fails to upload several times over last few days (Message 64136)
Posted 6 Jul 2021 by David Wallom
Post:
Hi,

Indeed very odd, as all of the other uploads for that WU are sitting waiting in the in_progress folder...?

Can you forward that zip to me directly by email please? david.wallom at oerc.ox.ac.uk

regards

David
5) Message boards : Number crunching : BOINC Client Improvements (Message 60633)
Posted 11 Jul 2019 by David Wallom
Post:
Hello,

The BOINC community has been offered assistance from a design studio to improve the look, feel and functionality of the BOINC client. As such, part of this work would involve user studies/interaction, i.e. with you, the volunteers. Would there be interest in participating in this?

Regards

David
6) Message boards : Number crunching : Upload failures (Message 60526)
Posted 1 Jul 2019 by David Wallom
Post:
There are now 140+ parallel uploads onto the system.

David
7) Message boards : Number crunching : Upload failures (Message 60525)
Posted 1 Jul 2019 by David Wallom
Post:
Hello All,

Apologies for the continued unavailability of the jasmin-upload system, which we have been clearing out over the weekend. We have cleared 5TB of space since Thursday, so we will be re-enabling uploads imminently.

We are going to be reconfiguring the data transfer from the upload server to the project storage over this week so that we will be able to take advantage of new capability within the JASMIN system to speed up these transfers in future.

Regards

David
8) Message boards : Number crunching : Upload failures (Message 60475)
Posted 27 Jun 2019 by David Wallom
Post:
Hello All,

We have currently stopped uploads to the JASMIN upload server to allow the backlog to clear from the system. I will update when it is clearer at what rate this is occurring. One issue alongside this that we are trying to debug in parallel is a bandwidth limitation that we have run into on this system. The operators of the system are struggling to debug it from their side, since our use case is so far outside the normal operating region for the system as a whole: no-one else is generating between 5TB and 6TB per day and trying not only to receive this onto a system but also, in parallel, to then move it off the system into other parts of the storage.
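As a rough illustration of the sustained bandwidth that volume implies (back-of-the-envelope only, not a measured figure):

```python
# Rough sustained-bandwidth estimate for the upload server (illustrative only).
daily_ingest_tb = 6            # upper end of the 5-6 TB/day estimate
seconds_per_day = 24 * 3600

ingest_mb_s = daily_ingest_tb * 1_000_000 / seconds_per_day
# The same data also has to be moved off to other parts of the storage
# system in parallel, roughly doubling the sustained I/O on the server.
total_io_mb_s = 2 * ingest_mb_s

print(f"Ingest: ~{ingest_mb_s:.0f} MB/s, combined in+out: ~{total_io_mb_s:.0f} MB/s")
# Ingest: ~69 MB/s, combined in+out: ~139 MB/s
```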

Once today's processing has completed, we can give a firm timeline for when the system will return to operation.

regards

David
9) Message boards : Number crunching : Credits (Message 59629)
Posted 13 Feb 2019 by David Wallom
Post:
Hello All,

The issue with credit has been traced to the credit script not having been correctly installed following the rebuild of the primary database server after its latest failure before Christmas. We then ran on the backup system for well over a month, but one of the consequences of that failure was the enforced downtime on the project due to the newly introduced dump schedule, which ensures we have a usable database backup (unlike previously). Therefore, when we moved back to the primary DB around the 13th of January, it wasn't noticed that the credit script wasn't operating correctly. This should be fixed now, so credit should appear on a regular basis again.

regards

David
10) Message boards : Number crunching : Credits (Message 59619)
Posted 12 Feb 2019 by David Wallom
Post:
Hello,

Apologies that the credit issue has still not been fixed since Les last notified the project team. We will be investigating tomorrow, as I had wrongly assumed this would have been fixed by now. Our volunteers are important to us and we understand your frustration. Please bear with us while we try to fix this.

Whilst under the constraints of project deliverables and time, we are reviewing how we as a project interact with these boards to ensure there is a more regular project presence here.

David
11) Message boards : Number crunching : Credits (Message 59618)
Posted 12 Feb 2019 by David Wallom
Post:
Hi,

The issue yesterday and over the last few days with accessing the servers appeared to be a rogue process from a client somewhere that had launched well over 100 queries of the database searching for successful workunits on that particular host system. Following a restart of both the DB and the scheduler, those queries have disappeared and not returned so far. If (when) they return, we will have to work out which machine is the offender, but this is tricky with the large number of active WUs we have at the moment, as we have to identify the offending httpd process which is generating the DB query in order to find the IP, etc.
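A minimal sketch of one way to spot such queries, assuming shell access to the database server and the standard mysql client; the result/workunit table names are the usual BOINC schema, but the exact query text and thresholds here are assumptions, not the project's actual tooling:

```python
# Flag long-running "successful workunit" lookups via SHOW FULL PROCESSLIST
# (column order: Id, User, Host, db, Command, Time, State, Info).
import subprocess

out = subprocess.run(
    ["mysql", "-e", "SHOW FULL PROCESSLIST"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines()[1:]:          # skip the header row
    cols = line.split("\t")
    if len(cols) < 8:
        continue
    host, time_s, info = cols[2], cols[5], cols[7]
    # The Host column points at the scheduler/httpd machine running the
    # query; its access logs are then needed to recover the client IP.
    if time_s.isdigit() and int(time_s) > 10 and "result" in info.lower():
        print(host, time_s, info[:100])
```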

David
12) Message boards : climateprediction.net Science : CPDN in 2016 – a look back over the last year (Message 55644)
Posted 3 Feb 2017 by David Wallom
Post:
Hello All,

As we start 2017, the Science and Technical teams within CPDN & W@H thought it would be good to summarise the year past, to detail the work done and to thank you, the volunteers and moderators, for your contributions.

http://www.climateprediction.net/cpdn-in-2016-a-look-back-over-the-last-year/

Kind Regards

The Oxford Team



