The uploads are stuck

Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,270,013
RAC: 10,988
Message 67787 - Posted: 17 Jan 2023, 9:25:21 UTC

All my stacked-up uploads have cleared, and I just have four tasks in the final stages. So today is maintenance day, and afterwards I have a plan to try and grab a memory usage log to illustrate the startup problem. I'll be using a machine with 6 cores and 64 GB RAM, so no CPDN work should be harmed in the process (though it may take a couple of tries to get it right), and then we'll have something to show CPDN staff in the first instance, and BOINC developers later on.
ID: 67787 · Report as offensive     Reply Quote
ncoded.com

Joined: 16 Aug 16
Posts: 73
Credit: 52,932,477
RAC: 8,823
Message 67789 - Posted: 17 Jan 2023, 12:23:48 UTC
Last modified: 17 Jan 2023, 12:49:36 UTC

Dave, Richard, et al..

Can I ask, where have these completed tasks gone?

https://www.cpdn.org/results.php?hostid=1535374&offset=0&show_names=0&state=1&appid=

It says In-Progress but most (if not all) of these have already been completed, uploaded, and reported?

eg I just uploaded and reported one task just a few minutes ago which I downloaded 15 hours ago, but there is nothing showing in the list, just 'in-progress'.

**

I think the problem is that CPDN is treating two hosts as a single host.

E.g., L-7113-1 and L-7113-2 are two different hosts, but CPDN sees just one host. If I swap out the hard drive, all CPDN does is change the hostname of this device, rather than see it as a separate device (host).

The two disks are completely separate. Both have a full install of Ubuntu and BOINC. Only one drive is inserted into the server at any one time.

I have run this server on many BOINC projects, at different times with each drive, and all projects (except CPDN) see it as two different hosts.

**

Are all the tasks reported from this server since Dec 24th now orphaned? If so, that would mean it's not just the 17 tasks on this list, but also the 250+ tasks that this server has reported and is currently reporting?

If I report a task, should the In-progress count not decrease by 1, and the Valid, Invalid, or Error count increase by 1?

This has not happened for ANY of the 250+ tasks reported (or being reported) by this server since Dec 24th.

However it has and does happen for all our other hosts (devices) from before and after this date.
ID: 67789 · Report as offensive     Reply Quote
PDW

Joined: 29 Nov 17
Posts: 55
Credit: 6,503,405
RAC: 1,318
Message 67790 - Posted: 17 Jan 2023, 12:45:57 UTC - in response to Message 67789.  

Dave, Richard, et al..

Can I ask, where have these completed tasks gone?

https://www.cpdn.org/results.php?hostid=1535374&offset=0&show_names=0&state=1&appid=

It says In-Progress but most (if not all) of these have already been completed, uploaded, and reported?

eg I just uploaded and reported one task just a few minutes ago which I downloaded 15 hours ago, but there is nothing showing in the list, just 'in-progress'.

I think the problem is that CPDN is treating two hosts as a single host.

E.g., L-7113-1 and L-7113-2 are two different hosts, but CPDN sees just one host. If I swap out the hard drive, all CPDN does is change the hostname of this device, rather than see it as a separate device (host).

The two disks are completely separate. Both have a full install of Ubuntu and BOINC. Only one drive is inserted into the server at any one time.

I have run this server on many BOINC projects, at different times with each drive, and all projects (except CPDN) see it as two different hosts.

The link you give shows ONLY in-progress tasks; this link shows all tasks for that host: https://www.cpdn.org/results.php?hostid=1535374
There is a server setting that doesn't allow multiple clients. The way you have your 2 drives setup means BOINC sees them as the same, so when you swap them over the tasks get abandoned, as shown in your full list.

Go and try GPUGrid: it does not allow multiple clients either, and results there will be abandoned if you swap your drives over whilst tasks are still active on the drive you swap out.
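
To make the difference between the two links concrete, here is a minimal Python sketch that strips the filter from a results URL. It assumes, based only on the two links in this thread, that the `state=1` query parameter is what restricts the list to in-progress tasks; the function name `drop_state_filter` is made up for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_state_filter(url: str) -> str:
    """Return the results URL without the 'state' filter and without empty parameters."""
    parts = urlsplit(url)
    # keep_blank_values=True so that empty parameters such as 'appid=' are seen
    # and can be dropped explicitly.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "state" and value != ""
    ]
    return urlunsplit(parts._replace(query=urlencode(params)))


# The filtered link posted earlier in the thread.
filtered = (
    "https://www.cpdn.org/results.php"
    "?hostid=1535374&offset=0&show_names=0&state=1&appid="
)
print(drop_state_filter(filtered))
```

The result still points at the same host, just without the assumed in-progress filter.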
ID: 67790 · Report as offensive     Reply Quote
ncoded.com

Joined: 16 Aug 16
Posts: 73
Credit: 52,932,477
RAC: 8,823
Message 67791 - Posted: 17 Jan 2023, 12:54:53 UTC - in response to Message 67790.  
Last modified: 17 Jan 2023, 13:10:16 UTC

Thank you PDW,

That is my point. There are no tasks in progress on this server. It is in jail so we only get 1 or 2 tasks per day. We complete them in around 15 hours, upload and report them. Yet this is showing 17 In-progress.

Thank you also for confirming there is some kind of lock. I guessed there was, and it was using the Mac address and/or the local IP.

Obviously I am not sure what to do about the missing tasks now.

I have run 300 threads on CPDN since the OpenIFS large batch(es) were issued a couple of weeks ago. It's going to be a hard and bitter pill to swallow if it turns out that was all for nothing and wasted.
ID: 67791 · Report as offensive     Reply Quote
Glenn Carver

Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 67792 - Posted: 17 Jan 2023, 12:58:52 UTC - in response to Message 67790.  
Last modified: 17 Jan 2023, 13:00:53 UTC

ncoded.com wrote:
I think the problem is that CPDN is treating two hosts as a single host.
E.g., L-7113-1 and L-7113-2 are two different hosts, but CPDN sees just one host. If I swap out the hard drive, all CPDN does is change the hostname of this device, rather than see it as a separate device (host).

PDW wrote:
The link you give shows ONLY in-progress tasks; this link shows all tasks for that host: https://www.cpdn.org/results.php?hostid=1535374
There is a server setting that doesn't allow multiple clients. The way you have your 2 drives setup means BOINC sees them as the same, so when you swap them over the tasks get abandoned, as shown in your full list.

This should work - it's equivalent to running two clients on the same machine, and just shutting one down whilst the 2nd drive is in the machine. It's perfectly possible to run 2 clients on the same host for CPDN (I do it), but there must be two separate client ids. CPDN's server does not see the MAC address, only your external (router) IP.

To swap out the disks you'd need to have created a new client instance on the 2nd disk, whilst keeping the original client on the first disk without detaching from the project. That way, CPDN's server will see two clients, one for each disk, and that should work. If each disk's BOINC client datadir has the same client id (check the 'client_state.xml' file) then I suspect you'll get the behaviour you describe.
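
The check described above can be sketched in a few lines of Python. This is only an illustration, not an official tool: it assumes each `<project>` block in `client_state.xml` carries a `<master_url>` and a `<hostid>` element, and the sample XML below is made up (it reuses the host id from the link in this thread). The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def host_ids(client_state_xml: str) -> dict:
    """Map each project's master_url to the hostid found in client_state.xml text."""
    root = ET.fromstring(client_state_xml)
    ids = {}
    for project in root.iter("project"):
        url = project.findtext("master_url", default="").strip()
        ids[url] = project.findtext("hostid", default="").strip()
    return ids


def shared_host_ids(state_a: str, state_b: str) -> dict:
    """Return the projects for which both data directories report the same hostid."""
    a, b = host_ids(state_a), host_ids(state_b)
    return {url: a[url] for url in a if a[url] and b.get(url) == a[url]}


# Made-up stand-ins for the client_state.xml found on each disk.
DISK_1 = """<client_state>
<project>
    <master_url>https://www.cpdn.org/</master_url>
    <hostid>1535374</hostid>
</project>
<project>
    <master_url>https://example.org/other/</master_url>
    <hostid>111</hostid>
</project>
</client_state>"""
DISK_2 = DISK_1.replace("<hostid>111</hostid>", "<hostid>222</hostid>")

print(shared_host_ids(DISK_1, DISK_2))
```

In real use you would read the text of each disk's `client_state.xml` from its BOINC data directory instead of using the sample strings. Any project listed in the result is one where the server would treat both disks as the same host.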
ID: 67792 · Report as offensive     Reply Quote
Glenn Carver

Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 67793 - Posted: 17 Jan 2023, 13:08:10 UTC - in response to Message 67791.  

All the tasks on your machine (https://www.cpdn.org/results.php?hostid=1535374) are showing either as Abandoned on 31st Dec, which is before the deadline, so I'm not sure what happened there, or as failed because they hit their disk quota limit (this is probably because 'leave non-GPU tasks in memory' was not enabled). I can't see any tasks that have worked on this host after listing several pages :(

The disk limit error is what I'm working on improving now. To work around it, make sure 'leave non-GPU tasks in memory' is enabled and try not to shut down the BOINC client too many times whilst the task is running (no more than 2). That should help.

So I think those have already been lost.

ncoded.com wrote:
Thank you PDW,

That is my point. There are no tasks in progress on this server. It is in jail so we only get 1 or 2 tasks per day. We complete them in around 15 hours, upload them and report. Yet this is showing 17 In-progress.

Thank you also for confirming there is some kind of lock. I guessed there was, and it was using the Mac address and/or the local IP.

Obviously I am not sure what to do about the missing tasks now.

I have run 300 vcores on CPDN since the OpenIFS large batch(es) were issued a couple of weeks ago. It's going to be a hard and bitter pill to swallow if it turns out that was all for nothing and wasted.
ID: 67793 · Report as offensive     Reply Quote
ncoded.com

Joined: 16 Aug 16
Posts: 73
Credit: 52,932,477
RAC: 8,823
Message 67794 - Posted: 17 Jan 2023, 13:11:15 UTC
Last modified: 17 Jan 2023, 13:22:35 UTC

I have to wonder why I am bothering to upload 500GB worth of uploads, if they are going to just be abandoned as soon as they get reported.
ID: 67794 · Report as offensive     Reply Quote
PDW

Joined: 29 Nov 17
Posts: 55
Credit: 6,503,405
RAC: 1,318
Message 67795 - Posted: 17 Jan 2023, 13:17:51 UTC - in response to Message 67792.  

ncoded.com wrote:
I think the problem is that CPDN is treating two hosts as a single host.
E.g., L-7113-1 and L-7113-2 are two different hosts, but CPDN sees just one host. If I swap out the hard drive, all CPDN does is change the hostname of this device, rather than see it as a separate device (host).

PDW wrote:
The link you give shows ONLY in-progress tasks; this link shows all tasks for that host: https://www.cpdn.org/results.php?hostid=1535374
There is a server setting that doesn't allow multiple clients. The way you have your 2 drives setup means BOINC sees them as the same, so when you swap them over the tasks get abandoned, as shown in your full list.

Glenn Carver wrote:
This should work - it's equivalent to running two clients on the same machine, and just shutting one down whilst the 2nd drive is in the machine. It's perfectly possible to run 2 clients on the same host for CPDN (I do it), but there must be two separate client ids. CPDN's server does not see the MAC address, only your external (router) IP.

To swap out the disks you'd need to have created a new client instance on the 2nd disk, whilst keeping the original client on the first disk without detaching from the project. That way, CPDN's server will see two clients, one for each disk, and that should work. If each disk's BOINC client datadir has the same client id (check the 'client_state.xml' file) then I suspect you'll get the behaviour you describe.

I didn't know how CPDN was using the setting. I do know there is one, but I wasn't going to try running multiple clients to test it before posting.

As I said, "The way you have your 2 drives setup means BOINC sees them as the same", so ncoded could change their setup to make it work, if you say it works for you.
ID: 67795 · Report as offensive     Reply Quote
ncoded.com

Joined: 16 Aug 16
Posts: 73
Credit: 52,932,477
RAC: 8,823
Message 67796 - Posted: 17 Jan 2023, 13:26:20 UTC
Last modified: 17 Jan 2023, 13:28:57 UTC

I have to wonder why I am bothering to upload 500GB worth of uploads, if they are going to just be abandoned as soon as they get reported.

Also I am not sure if people realise what I am saying here. ANY task I crunch on this server now will just disappear and get stuck In-progress, even after it gets uploaded and reported.
ID: 67796 · Report as offensive     Reply Quote
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,270,013
RAC: 10,988
Message 67797 - Posted: 17 Jan 2023, 13:29:08 UTC

The actual field you'll be looking for is <hostid>. You have a different one for each project: make sure you look at the right one.

This is a known, and deliberate, design feature in BOINC. It's more commonly encountered when people clone an existing BOINC installation to a new machine, either because they didn't know how to do it safely, or because they want one hostid to rack up all the points from several separate bits of hardware. The latter would be regarded as cheating, and is discouraged.

The best way, as Glenn has described, is to enable the 'Allow multiple clients' flags on both client instances, and keep both attached and visible to the project. BOINC should keep both hostids separate, and allow both to communicate (but do check that has worked properly).

If you have run two separate clones of the same hostid, perhaps because of the restriction on fetching new work while uploads are stuck, you can retrieve the situation with care.

1) Let the currently-active instance complete all outstanding tasks, and report them. Shut down that instance completely, so it doesn't contact the server again.
2) Before you do anything else, look on this website for the details of the computer, and find the line "Number of times client has contacted server". Make a note of that number.
3) Before you start the second instance, look for the tag <rpc_seqno> in the second client_state.xml file. Edit the number to be one greater than the one you just noted. Save the file, making sure you don't change the file type from plain text.

It should now be safe to re-start the second instance, without causing the associated tasks to be abandoned.
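
Step 3 can also be sketched as a small Python helper, for anyone who would rather not edit the file by hand. Treat it as an illustration under assumptions: it relies on each `<project>` block in `client_state.xml` containing both the project's URL and an `<rpc_seqno>` tag, it works on the file's text rather than touching any file itself, and the function name and sample data are made up. Only run anything like this while the BOINC client is shut down, and keep a backup copy of the file.

```python
import re


def set_rpc_seqno(client_state_text: str, project_url: str, contacts_on_website: int) -> str:
    """Set <rpc_seqno> for one project to the website's contact count plus one."""
    new_tag = f"<rpc_seqno>{contacts_on_website + 1}</rpc_seqno>"
    # Walk through the <project> blocks and edit only the one for this project,
    # because every attached project has its own rpc_seqno.
    for match in re.finditer(r"<project>.*?</project>", client_state_text, flags=re.DOTALL):
        block = match.group(0)
        if project_url not in block:
            continue
        new_block, count = re.subn(r"<rpc_seqno>\s*\d+\s*</rpc_seqno>", new_tag, block, count=1)
        if count != 1:
            raise ValueError("project block has no <rpc_seqno> tag")
        return client_state_text[: match.start()] + new_block + client_state_text[match.end():]
    raise ValueError("no <project> block mentions " + project_url)


# Made-up sample with two attached projects.
SAMPLE = """<client_state>
<project>
    <master_url>https://example.org/other/</master_url>
    <rpc_seqno>7</rpc_seqno>
</project>
<project>
    <master_url>https://www.cpdn.org/</master_url>
    <rpc_seqno>10</rpc_seqno>
</project>
</client_state>"""

# 42 stands in for "Number of times client has contacted server" noted in step 2.
updated = set_rpc_seqno(SAMPLE, "https://www.cpdn.org/", 42)
print(updated)
```

Working on the text instead of re-serialising the XML leaves the rest of the file byte-for-byte unchanged, which matches the advice to keep it as plain text.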
ID: 67797 · Report as offensive     Reply Quote
ncoded.com

Joined: 16 Aug 16
Posts: 73
Credit: 52,932,477
RAC: 8,823
Message 67798 - Posted: 17 Jan 2023, 13:35:14 UTC
Last modified: 17 Jan 2023, 13:51:41 UTC

Okay, the easy solution is just to remove CPDN from every device.

Do you want me to abort all the uploads?

Or do nothing and just remove the project?

Or let the uploads complete, and then remove the Project?

I will let any running tasks complete before removing anything.

Let me know if you have a specific preference, Glenn et al.
ID: 67798 · Report as offensive     Reply Quote
Yeti
Joined: 5 Aug 04
Posts: 171
Credit: 10,252,420
RAC: 30,461
Message 67800 - Posted: 17 Jan 2023, 13:56:56 UTC

It is not easy to run more than one instance of BOINC on the same hardware.

When you create the new instance, the server checks whether it has already seen this machine before; it does this by name and IP address. As they are the same, the server assumes that you lost your last instance and cancels all previously assigned tasks. You may upload already crunched results, but as the server has already cancelled these tasks, it cannot use your uploads.

In the latest BOINC versions you can set an instance name in cc_config.xml to prevent the server from assigning the old ID to a new instance: <device_name>HereTheNameForTheNewInstance</device_name>

What to do now? I would cancel both instances, remove them, delete them and then set up the first one.

Then create a second one, with a different name via cc_config.xml and in a different directory than instance 1.

Before you start time-consuming crunching, check that the server has recognized both instances as separate machines.
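
As a rough illustration of the cc_config.xml change, here is a Python sketch that builds such a file as text. It assumes that `<device_name>` and the `allow_multiple_clients` flag mentioned elsewhere in this thread both sit inside the `<options>` section of cc_config.xml; the function name is made up, and "L-7113-2" is just the second drive's name from earlier in the thread.

```python
import xml.etree.ElementTree as ET


def build_cc_config(device_name: str, allow_multiple_clients: bool = False) -> str:
    """Build cc_config.xml text that gives this BOINC instance its own device name."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")  # assumed location of both settings
    ET.SubElement(options, "device_name").text = device_name
    if allow_multiple_clients:
        ET.SubElement(options, "allow_multiple_clients").text = "1"
    return ET.tostring(root, encoding="unicode")


config_text = build_cc_config("L-7113-2", allow_multiple_clients=True)
print(config_text)
```

The output would go into the cc_config.xml of the second instance's data directory before that instance first contacts the project. Check the host list on the website afterwards, as advised above.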


Supporting BOINC, a great concept !
ID: 67800 · Report as offensive     Reply Quote
ncoded.com

Joined: 16 Aug 16
Posts: 73
Credit: 52,932,477
RAC: 8,823
Message 67801 - Posted: 17 Jan 2023, 14:09:03 UTC - in response to Message 67800.  
Last modified: 17 Jan 2023, 14:17:47 UTC

The thing is, none of this problem is about BOINC instances.

All I did in this case was buy a new drive so I could continue crunching for CPDN, as the old drive was full of uploads. I then swapped the drives over, and did a fresh install of Ubuntu and BOINC on the new drive.

As that is now causing loads of problems then clearly CPDN is not the right project for us at this time.
ID: 67801 · Report as offensive     Reply Quote
PDW

Joined: 29 Nov 17
Posts: 55
Credit: 6,503,405
RAC: 1,318
Message 67802 - Posted: 17 Jan 2023, 14:17:36 UTC - in response to Message 67801.  

As you didn't make an effort to make the second OS drive look different from the first, when you installed BOINC it came up with the same (or possibly a very similar) identifier for that new host. When the host was attached to CPDN it was recognised as the same host that you had been using, resulting in abandonment of the old results. This is much like running multiple clients on the same disk without using the allow_multiple_clients flag in cc_config.xml.
ID: 67802 · Report as offensive     Reply Quote
ncoded.com

Joined: 16 Aug 16
Posts: 73
Credit: 52,932,477
RAC: 8,823
Message 67803 - Posted: 17 Jan 2023, 14:22:29 UTC
Last modified: 17 Jan 2023, 14:58:57 UTC

Okay thanks
ID: 67803 · Report as offensive     Reply Quote
Boone

Joined: 8 Aug 05
Posts: 2
Credit: 12,903,264
RAC: 4,089
Message 67811 - Posted: 17 Jan 2023, 17:11:08 UTC - in response to Message 67745.  

Hi,
I would like to inform you that all my WUs have been uploaded, so far 88GB :-)

I am glad that I made it in time.

Thank you for making this possible.
ID: 67811 · Report as offensive     Reply Quote
gemini8

Joined: 4 Dec 15
Posts: 52
Credit: 2,182,959
RAC: 836
Message 67812 - Posted: 17 Jan 2023, 17:22:36 UTC

ncoded.com:
The point is that BOINC has this feature, which CPDN can't circumvent, and neither can any other project.
You as the user could have, but you didn't know about it.
This is quite a bitter pill, and I'm sorry you have to gulp it down.
- - - - - - - - - -
Greetings, Jens
ID: 67812 · Report as offensive     Reply Quote
wujj123456

Joined: 14 Sep 08
Posts: 87
Credit: 32,981,759
RAC: 14,695
Message 67813 - Posted: 17 Jan 2023, 17:22:46 UTC - in response to Message 67785.  
Last modified: 17 Jan 2023, 17:30:01 UTC

I'd argue against doing this or that there's even a need. It seems to me that BOINC upload is for the most part a background process that does its job relatively well. Pretty much the only times uploading generates user complaints are when upload servers aren't working right. The length of this upload outage is rather unique but even so the progress has been very good so far. Even the users who've had a hard time getting a connection slot are starting to report completed uploads. Even though CPDN put in a due date grace period I suspect it'll hardly be needed, which I believe was also Glenn's position in an earlier post.

Unfortunately, not everyone's upload is that fast relative to their compute, and I will very likely need the grace period. I have had a good connection for the past day and a half. So far, I've uploaded around 60 with 170 pending. I have a few WUs due in 2-3 days and they seem to be determined to be the very last to go. If the BOINC client ordered uploads properly, they would all have been reported by now, removing the need to extend the grace period.

On the other hand, I don't agree this should require user intervention either. The BOINC client should simply order this correctly by itself, just like how it prioritizes compute deadlines. After all, the goal that matters is to get the WUs reported by the deadline, and upload is part of the process to get there.

Edit: Thinking more, I realized it's possible BOINC would order this properly as the deadline approaches, just like how it does with compute. Perhaps I will learn whether that's the case in two days...
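
The ordering being asked for here is essentially "earliest deadline first". A minimal Python sketch of that idea follows; it is not how the BOINC client actually schedules uploads, and the class, function, task names and dates are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PendingUpload:
    task_name: str       # hypothetical task label
    deadline: datetime   # report deadline of the task the upload belongs to


def upload_order(pending):
    """Order pending uploads so that tasks with the earliest deadline go first."""
    return sorted(pending, key=lambda upload: upload.deadline)


pending = [
    PendingUpload("task_c", datetime(2023, 2, 10)),
    PendingUpload("task_a", datetime(2023, 1, 19)),
    PendingUpload("task_b", datetime(2023, 1, 20)),
]
for upload in upload_order(pending):
    print(upload.task_name, upload.deadline.date())
```

With this ordering the tasks due in 2-3 days would be uploaded before those due weeks later, regardless of when their files were created.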
ID: 67813 · Report as offensive     Reply Quote
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 67815 - Posted: 17 Jan 2023, 19:31:30 UTC - in response to Message 67785.  
Last modified: 17 Jan 2023, 19:37:00 UTC

AndreyOR wrote:
It seems to me that BOINC upload is for the most part a background process that does its job relatively well. Pretty much the only times uploading generates user complaints are when upload servers aren't working right. The length of this upload outage is rather unique but even so the progress has been very good so far.
It's not only the length of the server outage which is a one-off edge case here.
The extreme ratio of result data size to CPU time is also unique.
AFAIK, it's very unlike any of the current active projects. (And it's atypical for Distributed Computing which requires client-server communications to be minimal to be effective. Client-server bandwidth and latency in Distributed Computing are, obviously, worlds apart from an HPC cluster.)


(On a positive note, both the result data size and the CPU time of oifs_43r3_ps tasks are very predictable, making it easy for users to control their output accordingly, if they care.)
ID: 67815 · Report as offensive     Reply Quote
[SG]Felix

Joined: 4 Oct 15
Posts: 34
Credit: 9,069,332
RAC: 14,637
Message 67821 - Posted: 17 Jan 2023, 21:12:37 UTC - in response to Message 67813.  


wujj123456 wrote:
On the other hand, I don't agree this should require user intervention either. The BOINC client should simply order this correctly by itself, just like how it prioritizes compute deadlines. After all, the goal that matters is to get the WUs reported by the deadline, and upload is part of the process to get there.

Edit: Thinking more, I realized it's possible BOINC would order this properly as the deadline approaches, just like how it does with compute. Perhaps I will learn whether that's the case in two days...


As far as I could see on my machine, the zips were uploaded in the order in which they were created. If one failed, it was simply retried at the end, as if it had been put at the end of the queue again. If many failed, they were retried in the order in which they failed.
So it should work out that the oldest WUs are uploaded first.
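
The behaviour observed above can be modelled as a first-in, first-out queue where a failed upload rejoins the back of the queue. The Python sketch below only illustrates that observation; it is not taken from the BOINC client's code, and the file names are made up.

```python
from collections import deque


def simulate_uploads(files_in_creation_order, fails_once):
    """Return the order in which files finish uploading.

    Files are tried oldest first; a file listed in fails_once fails on its
    first attempt and is put back at the end of the queue.
    """
    queue = deque(files_in_creation_order)
    still_to_fail = set(fails_once)
    completed = []
    while queue:
        name = queue.popleft()
        if name in still_to_fail:
            still_to_fail.discard(name)
            queue.append(name)  # retry later, behind everything already queued
        else:
            completed.append(name)
    return completed


order = simulate_uploads(["a.zip", "b.zip", "c.zip", "d.zip"], fails_once={"b.zip"})
print(order)
```

In this model the oldest files do go first, but a file whose upload fails can end up behind newer ones.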

Greets
Felix
ID: 67821 · Report as offensive     Reply Quote

©2024 climateprediction.net