Thread 'The uploads are stuck'

wujj123456
Joined: 14 Sep 08
Posts: 127
Credit: 43,319,469
RAC: 71,134
Message 67468 - Posted: 9 Jan 2023, 19:44:41 UTC - in response to Message 67451.  
Last modified: 9 Jan 2023, 20:09:27 UTC

Do you want to continue crunching and generating more files, or just to be able to write the state files and wait for the upload server to recover? If it's the latter, you just need to free up enough space for the state file, which is only tens of MB. You could wrap up and remove other BOINC projects if you have them, abandon one WU, or even remove some host applications that you won't need for the next few days.

Edit: I just realized that if you can't write the state file, any messing around within BOINC might be hopeless, so you'll have to find the space elsewhere on the system.

xii5ku
Joined: 27 Mar 21
Posts: 79
Credit: 78,322,658
RAC: 1,085
Message 67478 - Posted: 9 Jan 2023, 22:10:49 UTC - in response to Message 67437.  
Last modified: 9 Jan 2023, 22:13:09 UTC

wujj123456 wrote:
There are completed WUs approaching their deadlines. I have WUs due in 11 days, and 11 days no longer looks that long given that the upload server has been down for two weeks. If the new storage array has problems again, or some new issue shows up (not uncommon for new systems), we'll probably need the server to extend deadlines so work isn't wasted.
After the reporting deadline, your work isn't obsolete right away. The server creates a replica task from the same workunit, has to wait for a work-requesting host to which it can assign this new task, and then waits for that host to return a valid result. Until that happens (which can potentially be a long time after your reporting deadline), the server will still opportunistically accept a result from your original task. (And normally give it credit if valid; I'm not sure about CPDN, where credit is assigned separately.)

PS: if you return a valid result for the original task after the server has already assigned a replica task to another host, three things can follow: 1) The other host returns a result too; AFAIK it will get credit if valid. 2) The other host issues an unrelated scheduler request, and in the response the server informs it that the replica is no longer needed. 2.a) If that host hasn't started the replica task yet, the task is aborted, and no CPU cycles are wasted on it. 2.b) If the host has already started the replica task, same as 1: it will finish it, report it, and AFAIK get credit if valid.

wujj123456
Joined: 14 Sep 08
Posts: 127
Credit: 43,319,469
RAC: 71,134
Message 67485 - Posted: 10 Jan 2023, 0:59:37 UTC - in response to Message 67478.  
Last modified: 10 Jan 2023, 1:01:53 UTC

Thanks. That's in line with what I observe on other projects. When I said wasted work, I mostly meant the unnecessary replicas being sent out, especially given that only the upload server is down. It could end up being a lot of duplicates if the upload server is not restored before many WUs time out. Thanks to Glenn's constant updates, though, I'm hopeful we won't reach that point.

leloft
Joined: 7 Jun 17
Posts: 23
Credit: 44,434,789
RAC: 2,600,991
Message 67488 - Posted: 10 Jan 2023, 8:11:17 UTC - in response to Message 67468.  

wujj123456 wrote:
Edit: I just realized that if you can't write the state file, any messing around within BOINC might be hopeless, so you'll have to find the space elsewhere on the system.


Indeed, that's what I've done. The loss of the state file has caused problems: presumably the .old state file was accessed as the client downloaded some hadam files; it also couldn't locate some of the oifs files, and so 20 or so were abandoned as errors, with the loss of 20 results.

My next move is to split the /boinc-client folder: I'm thinking of leaving the boinc-client directory on the /var/lib partition but mounting the /projects folder on a separate partition. At the moment, the whole boinc-client folder is on a separate partition. The new arrangement would have meant that the state file could still have been written, much like mounting /var/log separately from /var. Any thoughts?
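For illustration, that split could be done with a single extra mount. A minimal /etc/fstab sketch, assuming the Debian-style data directory /var/lib/boinc-client (adjust to your distro) and a hypothetical spare partition /dev/sdb1; stop the client and copy the existing projects folder across before enabling it:

        # hypothetical spare partition mounted over the projects folder
        /dev/sdb1  /var/lib/boinc-client/projects  ext4  defaults  0  2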

DJStarfox
Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 67496 - Posted: 10 Jan 2023, 13:05:44 UTC - in response to Message 67488.  

leloft wrote:
My next move is to split the /boinc-client folder [...] Any thoughts?


Under your account's computing preferences, you can set "Leave at least x GB free" (of disk space) to make sure there is enough left for uploads, etc.
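For a per-host setting, the same limit can also be pinned locally. A minimal sketch of a global_prefs_override.xml in the BOINC data directory (the 10 GB figure is just an example; the client picks it up after re-reading preferences or a restart):

        <global_preferences>
            <disk_min_free_gb>10</disk_min_free_gb>
        </global_preferences>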

DJStarfox
Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 67497 - Posted: 10 Jan 2023, 13:07:36 UTC - in response to Message 67454.  

Glenn Carver wrote:
Upload server update 9/1/23 10:49GMT
From a meeting this morning with CPDN: they do not expect the upload server to be available until 17:00GMT TOMORROW (10th) at the earliest. The server itself is running, but they have many TB of data to move, and they also want to monitor the newly configured server to check it is stable. As already said, these are issues caused by the cloud provider, not CPDN themselves.


Thanks for the update, Glenn.

FYI... I set my "max uploads per project" to 1 in the cc_config.xml, which is what I recommend for everyone.

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 67501 - Posted: 10 Jan 2023, 13:57:08 UTC - in response to Message 67497.  

DJStarfox wrote:
FYI... I set my "max uploads per project" to 1 in the cc_config.xml, which is what I recommend for everyone.


Why? What is it supposed to do? I see no "max uploads per project" in cc_config.xml.

Do you mean <max_file_xfers_per_project>? If your Internet connection can do more than one, why not do it?
        <max_event_log_lines>5000</max_event_log_lines>
        <!-- total simultaneous file transfers across all projects -->
        <max_file_xfers>8</max_file_xfers>
        <!-- simultaneous file transfers per project -->
        <max_file_xfers_per_project>2</max_file_xfers_per_project>
        <max_stderr_file_size>0.000000</max_stderr_file_size>
        <max_stdout_file_size>0.000000</max_stdout_file_size>
        <max_tasks_reported>0</max_tasks_reported>

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,750,587
RAC: 2,505
Message 67503 - Posted: 10 Jan 2023, 14:19:28 UTC - in response to Message 67501.  

Jean-David Beyer wrote:
If your Internet connection can do more than one, why not do it?
Because the project's server probably only has one internet connection, too.

We don't know what type, how fast, or how it's configured, but we all have to share it. And it's going to be very, very busy. Spread the love, eh?

DJStarfox
Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 67504 - Posted: 10 Jan 2023, 15:27:18 UTC - in response to Message 67501.  

Jean-David Beyer wrote:
Do you mean <max_file_xfers_per_project>? If your Internet connection can do more than one, why not do it?


Yes, that's what I mean:
        <max_file_xfers>4</max_file_xfers>
        <max_file_xfers_per_project>1</max_file_xfers_per_project>


For normal HTTPS traffic, yes, you want about 4 connections per server, and most browsers open 4 to 8 connections at a time anyway, because most big websites are server farms (multiple servers that can all work in parallel). However, file transfers are a different beast, and BOINC projects in particular tend to run on minimal hardware, as most are grant funded. Your 1 allowed file transfer will still download or upload at the maximum possible speed, limited by the project's internet connection. It does no good to hammer the same project file server with multiple connections if connection #2 runs at half speed, connection #3 at 1/3 speed, etc. In other words, it won't take longer for YOU, but it will help the project server by only needing to serve 1 connection per client x 1000 active users, etc.
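For anyone who wants to apply this, a minimal complete cc_config.xml might look like the sketch below; it goes in the BOINC data directory, and the client re-reads it on restart or via Options > Read config files in the Manager:

        <cc_config>
            <options>
                <max_file_xfers>4</max_file_xfers>
                <max_file_xfers_per_project>1</max_file_xfers_per_project>
            </options>
        </cc_config>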

Glenn Carver
Joined: 29 Oct 17
Posts: 1052
Credit: 16,671,175
RAC: 11,014
Message 67505 - Posted: 10 Jan 2023, 15:54:37 UTC - in response to Message 67504.  

For CPDN I have my xfers_per_project set to 10 on each of my machines, because I know my fibre can handle it and so can the CPDN upload server (when it's working). Their upload server is on a big UK cloud server that was handling tens of thousands of uploads per hour for the OpenIFS tasks. The Weather@Home ones are more of an issue, because they go to a server in New Zealand which, I understand, doesn't have quite the same capacity.

Glenn Carver
Joined: 29 Oct 17
Posts: 1052
Credit: 16,671,175
RAC: 11,014
Message 67506 - Posted: 10 Jan 2023, 15:59:36 UTC

Upload server status: 10/Jan 16:00GMT
Just spoken with CPDN. They have successfully migrated 1,000,000 of the 1,300,000 files onto the new block storage for the upload server. That process should be complete today, but they will run checks before opening up the upload server. I'll get an update tomorrow.

[SG]Felix
Joined: 4 Oct 15
Posts: 34
Credit: 9,075,151
RAC: 374
Message 67512 - Posted: 10 Jan 2023, 18:36:25 UTC

I am getting a bit nervous right now, since the date was moved again. One of my VMs, which luckily runs Hadam right now, only has 3 GB of space left.
The VM with oIFS still has about 14 GB left.

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 67513 - Posted: 10 Jan 2023, 19:19:33 UTC - in response to Message 67512.  
Last modified: 10 Jan 2023, 19:21:08 UTC

[SG]Felix wrote:
I am getting a bit nervous right now, since the date was moved again. One of my VMs, which luckily runs Hadam right now, only has 3 GB of space left.
The VM with oIFS still has about 14 GB left.


Luckily, I do not run VMs, so I am not getting nervous right now.

My machine has a 512 GB SSD for the root, home, boot, and swap partitions, and I have two 4 TB spinning hard drives. Since I did not want to run out of space, envisioning large OpenIFS job requirements, I made a 512 GB partition on one of them and mounted it on /var/lib/boinc, which is where my distro's BOINC client stuff goes.

CPDN is currently using 55 GB of this, there are 364 GB free for BOINC, and there are another 64 GB I could give to BOINC if it is ever needed.

SolarSyonyk
Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 67514 - Posted: 10 Jan 2023, 19:27:56 UTC - in response to Message 67512.  

[SG]Felix wrote:
I am getting a bit nervous right now, since the date was moved again. One of my VMs, which luckily runs Hadam right now, only has 3 GB of space left.
The VM with oIFS still has about 14 GB left.


So just shut the tasks down until you've got space free.

It's not your problem to solve that the upload server is down. I've got a few machines halted on "Too many uploads in progress" (yes, I know how to fix it, I just don't see the point right now), and a few others are running out of disk because I put cheap 128 GB M.2 SSDs in my compute rigs. "Designing for upload servers being down for weeks with huge tasks" was not a design criterion I considered, and it will remain one I won't consider, given the relative rarity of the problem. If the machines are full due to things out of my control, they're full. And if contracts aren't met because a lot of machines are unable to compute because they can't return results, that's similarly not my problem. It'll still take me a week or more to upload my pending results, unless I can get some good overnight bandwidth out of Starlink (and that doesn't help the machines that are solar powered and don't run overnight). I've got hundreds of gigabytes to upload, and I simply can't do that quickly.

Glenn Carver
Joined: 29 Oct 17
Posts: 1052
Credit: 16,671,175
RAC: 11,014
Message 67521 - Posted: 10 Jan 2023, 22:29:25 UTC
Last modified: 10 Jan 2023, 23:23:59 UTC

Update. 22:30. 10/Jan
Update from CPDN. The data move has been completed and the upload server has been enabled. Uploads should get moving again.

Edit: I'm seeing 'no route to host' errors. Maybe something at the upload server needs re-enabling. Anyway, I'm told the data has been successfully migrated and the upload server has been enabled. Anything amiss can be dealt with quickly come office hours tomorrow, I would think.

Conan
Joined: 6 Jul 06
Posts: 147
Credit: 3,615,496
RAC: 420
Message 67525 - Posted: 10 Jan 2023, 23:04:06 UTC
Last modified: 10 Jan 2023, 23:06:23 UTC

Yes, I am still seeing "connect(): failed" messages on all upload tries.

But I still have 4 work units running, and I am nowhere near filling up any disks, so no problem here.

Conan

DJStarfox
Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 67528 - Posted: 11 Jan 2023, 2:21:23 UTC - in response to Message 67521.  

Might be a hung process taking over port 80. They'll have to stop the service, kill any orphaned processes, and restart the service.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 67532 - Posted: 11 Jan 2023, 10:32:19 UTC

My uploads have now started. At 100 KB/second it will be a while!

Glenn Carver
Joined: 29 Oct 17
Posts: 1052
Credit: 16,671,175
RAC: 11,014
Message 67533 - Posted: 11 Jan 2023, 10:42:30 UTC - in response to Message 67532.  

I've just had confirmation from CPDN that the upload server is now fully functional.

Dave Jackson wrote:
My uploads have now started. At 100 KB/second it will be a while!

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,750,587
RAC: 2,505
Message 67536 - Posted: 11 Jan 2023, 11:01:37 UTC

Mine have started too. Reasonable speeds (for my line) of around 1,000 KB/sec once they've latched on, but occasional files pause and need to retry. That's expected at this stage of a recovery. And I've been able to report several tasks, with just a few loose ends to tidy up.