The uploads are stuck

Stony666
Joined: 9 Feb 21
Posts: 9
Credit: 10,334,808
RAC: 880,522
Message 67824 - Posted: 18 Jan 2023, 7:41:54 UTC - in response to Message 67821.  

Hi again,

Has something changed?
The upload of finished WUs stopped yesterday for me. I still have 243 WUs to upload on one host.
I restarted work on some other boxes once they had uploaded all their pending results. Those boxes are not uploading now either.

Regards Jörg
ID: 67824

Richard Haselgrove
Joined: 1 Jan 07
Posts: 943
Credit: 34,187,965
RAC: 6,888
Message 67826 - Posted: 18 Jan 2023, 7:57:44 UTC - in response to Message 67824.  
Last modified: 18 Jan 2023, 8:12:56 UTC

Seeing the same thing. One task stopped uploading around a quarter to four this morning (according to the log), and all the others have now stopped as well. Looks like the server may have gone unresponsive again - the effect is a lengthy pause, followed by an HTTP error. Haven't investigated further yet.

Edit - they've started running again.
ID: 67826

Stony666
Joined: 9 Feb 21
Posts: 9
Credit: 10,334,808
RAC: 880,522
Message 67831 - Posted: 18 Jan 2023, 9:17:36 UTC - in response to Message 67826.  

Funny...

I have six hosts with CPDN work to upload.
Each one attempts its uploads at a different time. They are contacting the server every 5 minutes to try an upload.
Only transient HTTP errors.
ID: 67831

AndreyOR
Joined: 12 Apr 21
Posts: 247
Credit: 11,843,683
RAC: 20,418
Message 67832 - Posted: 18 Jan 2023, 9:18:06 UTC

As far as I could observe on my machine, the zips were uploaded in the order in which they were created. If one failed, it was simply retried at the end, as if it had been put at the back of the queue again. If many failed, they were retried in the order in which they failed.
So the oldest WUs should be uploaded first.
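
The behaviour described in the quote above can be modelled roughly as a first-in, first-out queue where a failed file goes to the back. This is only a sketch of what was observed, not BOINC's actual transfer code; the function name `upload_order` and the `fails_once` set are made up for illustration.

```python
from collections import deque


def upload_order(files, fails_once):
    """Return the order in which files finish uploading.

    files      -- file names in creation order (oldest first)
    fails_once -- names whose first upload attempt fails (hypothetical input)
    """
    queue = deque(files)
    pending_failures = set(fails_once)
    finished = []
    while queue:
        name = queue.popleft()
        if name in pending_failures:
            # First attempt failed: put it at the back of the queue.
            pending_failures.discard(name)
            queue.append(name)
        else:
            finished.append(name)
    return finished


print(upload_order(["zip1", "zip2", "zip3", "zip4"], {"zip2"}))
# ['zip1', 'zip3', 'zip4', 'zip2']
```

With several failures, the failed files come out at the end in the order in which they failed, which matches the observation.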

This has also been my observation and is the basis of my posts saying that the way BOINC handles uploads works just fine and is not at fault in the current situation. Even with slow upload speeds, the 30-day deadline gives everyone enough time to upload everything. For example, at 200 Kbps it'll take less than a day to upload the almost 2 GB of files for 1 task.
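
A quick check of that arithmetic, as a sketch only. It assumes "Kbps" means kilobits per second and that 2 GB means 2 × 10^9 bytes; the helper name `upload_hours` is made up for illustration.

```python
def upload_hours(size_gb: float, rate_kbps: float) -> float:
    """Hours needed to upload size_gb gigabytes at rate_kbps kilobits per second."""
    bits = size_gb * 1e9 * 8             # decimal gigabytes to bits
    seconds = bits / (rate_kbps * 1e3)   # kilobits/s to bits/s
    return seconds / 3600


hours = upload_hours(2.0, 200.0)
print(f"{hours:.1f} hours")  # 22.2 hours
```

So one task's worth of files fits into a day at that rate, with a little room to spare. If the figure were kilobytes per second instead, the time would be eight times shorter.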


On the other hand, I don't agree this should require user intervention either.

I agree and under normal circumstances it's not needed. However, under these very abnormal circumstances user intervention might be needed by some users. I'd emphasize might and some as I think almost everyone will probably be just fine letting BOINC do its thing. The biggest problem is server availability and that's completely out of our control.

It's not only the length of the server outage which is a one-off edge case here.
The extreme ratio of result data size to CPU time is also unique.
AFAIK, it's very unlike any of the current active projects. (And it's atypical for Distributed Computing which requires client-server communications to be minimal to be effective. Client-server bandwidth and latency in Distributed Computing are, obviously, worlds apart from an HPC cluster.)

I very much agree. The amount of data these things produce is unlike anything I'm aware of when it comes to BOINC projects. I'm expecting it'll increase when higher resolution models come out.
ID: 67832

ncoded.com
Joined: 16 Aug 16
Posts: 73
Credit: 52,932,477
RAC: 8,823
Message 67834 - Posted: 18 Jan 2023, 9:40:08 UTC
Last modified: 18 Jan 2023, 10:34:30 UTC

Thank you gemini8. It says a lot when you get more empathy from an average user than you do from the project team.

The truth is I posted that I had just bought a new HD and was about to swap it out due to so many uploads. Not one mod or team member mentioned the problems that would happen if I did this.

I guess if you want a case study in how to lose one of your most loyal supporters (and probably your largest contributor), then this is it.

In terms of this problem, from what I gather this "option" has been turned on by CPDN and GPUGrid, but has been left off by most other projects.

The issue isn't that some projects have settings or workarounds you have to be aware of, it's the fact that all this information is not clearly stated in a single place. Instead, supporters are expected to go and trawl the forums for this information.

PDW can make out this is all obvious stuff and I should have been fully aware of it, but so far no one that we have reached out to, both personally and across social media, has ever heard of this issue. So clearly that is untrue.

Either way it is a bitter pill to swallow which is why there is no point us running this project now.

And obviously now we will also no longer spend money promoting CPDN across social media, although we will continue for all the other projects.
ID: 67834

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4342
Credit: 16,501,246
RAC: 5,648
Message 67835 - Posted: 18 Jan 2023, 10:59:54 UTC

The truth is I posted that I had just bought a new HD and was about to swap it out due to so many uploads. Not one mod or team member mentioned the problems that would happen if I did this.


If I had been aware of the problems, I would certainly have posted something. I have never done it myself so didn't feel I had anything to contribute on the subject. Richard, who has provided some understanding on this (for me at least), is much more of an expert on some of these matters. I suspect the same is true of my fellow mods.
ID: 67835

AndreyOR
Joined: 12 Apr 21
Posts: 247
Credit: 11,843,683
RAC: 20,418
Message 67836 - Posted: 18 Jan 2023, 11:13:35 UTC - in response to Message 67834.  

ncoded.com,
If it's not too much, considering the following might be of some value:

Making comments about what went wrong after the fact and after you provided the result is much easier than foreseeing it beforehand. Could it be at all possible that foreseeing the problem with swapping HDs was not a simple or easy thing for anyone to do from just reading the forum post, especially if no one else has tried it here before?

It is a common human trait to usually trust our own experiences over suggestions of others. Is it possible that based on your experience with other projects you'd have probably tried the HD swap anyway even if someone posted that it might cause problems?

How would you describe your level of interest and care about the type of research CPDN is making available for volunteers to participate in? Could it be unique enough and interesting enough that writing off this experience as a loss and continuing to participate in the project is a reasonable idea to consider?
ID: 67836

ncoded.com
Joined: 16 Aug 16
Posts: 73
Credit: 52,932,477
RAC: 8,823
Message 67837 - Posted: 18 Jan 2023, 11:18:50 UTC
Last modified: 18 Jan 2023, 11:38:35 UTC

Remember this isn't just an issue about losing 500GB of tasks due to switching out drives, it's also about all the other issues.

However perhaps one good thing can come out of all this?

Please think about having one section on your website that lists all the major issues that one could have, with clear solutions.

These are 3 issues I have hit in the last week or so, all of which stopped us crunching:

1) 100GB UI Limit
2) 2x core_count limit (can't download, too many uploads)
3) Invalidated tasks by switching out HDD

If we have had 3 major issues in such a short amount of time, then clearly there must be many more issues out there.

If you want to keep things simple, easy, and fun for crunchers, then have a list of these issues in one place, along with their solution(s).
ID: 67837

xii5ku
Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 67838 - Posted: 18 Jan 2023, 12:01:58 UTC - in response to Message 67834.  
Last modified: 18 Jan 2023, 12:06:24 UTC

ncoded.com wrote:
[...] I posted that I had just bought a new HD and was about to swap it out due to so many uploads. Not one mod or team member mentioned the problems that would happen if I did this.
Your problems were unexpected.

ncoded.com wrote:
[...] In terms of this problem, from what I gather this "option" has been turned on by CPDN and GPUGrid, but has been left off by most other projects.
No. As mentioned by others before, multiple boinc client instances on a single physical host *are* in fact treated as separate boinc host instances. (Unless these client instances are created such that they make themselves look the same to the project server.) Just follow the widely available guides for the setup and operation of multiple boinc client instances, and you are fine at CPDN.

GPUGrid and a few other projects collapse such host entries into a single one. (Or attempt to collapse them. There are still workarounds to prevent this.) But CPDN, like the majority of other BOINC projects, does not do this (again, *if* the client instances don't make themselves look identical to the server).

ncoded.com wrote:
[...] Either way it is a bitter pill to swallow which is why there is no point us running this project now.
The trouble you encountered is not specific to CPDN, as others mentioned before.
ID: 67838

ncoded.com
Joined: 16 Aug 16
Posts: 73
Credit: 52,932,477
RAC: 8,823
Message 67839 - Posted: 18 Jan 2023, 12:08:42 UTC
Last modified: 18 Jan 2023, 12:11:16 UTC

Okay just to stop any further replies.

Yes you are right. I am wrong. I apologise.

I have made my suggestion on making CPDN better. Either take it on board, or don't.
ID: 67839

xii5ku
Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 67840 - Posted: 18 Jan 2023, 12:22:36 UTC - in response to Message 67837.  
Last modified: 18 Jan 2023, 12:23:55 UTC

ncoded.com wrote:
Please think about having one section on your website that lists all the major issues that one could have, with clear solutions.

These are 3 issues I have hit in the last week or so, all of which stopped us crunching

1) 100GB UI Limit
2) 2x core_count limit (can't download, too many uploads)
3) Invalidated tasks by switching out HDD
1) That'd be good to have in a FAQ, although this issue is shared with other projects with workunits with large rsc_disk_bound.

2) A generic problem, mostly encountered in corner cases like server outages. However, in the case of oifs_43r3_ps with its extremely large result data size per task, why were people so keen on downloading more new work while it was clear that the upload file server had been down for more than a week, or rather that recovery of the upload server would take more than a week and its success was entirely uncertain?

3) A corner case which works trouble-free at CPDN as long as either the filesystem is enlarged while the client is down, or a second client instance is created according to the guidelines for multiple client instances per physical host.

[I am of course speaking just for myself; I am not suggesting what CPDN should or shouldn't do WRT user communications.]
ID: 67840

Glenn Carver
Joined: 29 Oct 17
Posts: 804
Credit: 13,568,734
RAC: 7,025
Message 67841 - Posted: 18 Jan 2023, 12:48:54 UTC - in response to Message 67840.  
Last modified: 18 Jan 2023, 12:55:25 UTC

I agree that the information on using BOINC and working with the CPDN models should be better. This is something I've (slowly) been working to improve. FAQs are common on forums, but not here, not sure why. I have started an 'OpenIFS FAQ' which I am slowly adding to as I come to understand the common questions and solutions. I'd also like to see FAQs on the other models, and an FAQ on BOINC-specific issues that arise from the types of tasks CPDN runs (large memory, disk etc.), which is where the disk swap issue would go, for example.

I've worked in community modelling projects for over 35 years now, and the best ones are where the core team concentrates on systems development, new contracts etc., and the forums are well run and organised by volunteers, with core staff (if resources allow), who collate the information into easy-to-find, easy-to-read sections either on the website or in FAQs on the forums. Maybe there are some volunteers willing to start putting other FAQs together? And the FAQs could then be pinned to the top (much like the 32-bit lib thread)? My 2p worth.

xii5ku wrote:
ncoded.com wrote:
Please think about having one section on your website that lists all the major issues that one could have, with clear solutions.

These are 3 issues I have hit in the last week or so, all of which stopped us crunching

1) 100GB UI Limit
2) 2x core_count limit (can't download, too many uploads)
3) Invalidated tasks by switching out HDD
1) That'd be good to have in a FAQ, although this issue is shared with other projects with workunits with large rsc_disk_bound.

2) A generic problem, mostly encountered in corner cases like server outages. However, in the case of oifs_43r3_ps with its extremely large result data size per task, why were people so keen on downloading more new work while it was clear that the upload file server had been down for more than a week, or rather that recovery of the upload server would take more than a week and its success was entirely uncertain?

3) A corner case which works trouble-free at CPDN as long as either the filesystem is enlarged while the client is down, or a second client instance is created according to the guidelines for multiple client instances per physical host.

[I am of course speaking just for myself; I am not suggesting what CPDN should or shouldn't do WRT user communications.]
ID: 67841

ncoded.com
Joined: 16 Aug 16
Posts: 73
Credit: 52,932,477
RAC: 8,823
Message 67842 - Posted: 18 Jan 2023, 13:35:54 UTC
Last modified: 18 Jan 2023, 13:36:21 UTC

Thank you Glenn.
ID: 67842

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4342
Credit: 16,501,246
RAC: 5,648
Message 67843 - Posted: 18 Jan 2023, 14:04:18 UTC

Back to the uploads: presumably because only a limited number of connections is being allowed, I am getting transient upload errors, and most uploads are taking two or three retries before starting. Once started they are fine, but given the slowness of my ADSL I will be sticking to one task running at a time rather than two, which it can keep up with only when everything else is working flawlessly.
ID: 67843

ncoded.com
Joined: 16 Aug 16
Posts: 73
Credit: 52,932,477
RAC: 8,823
Message 67844 - Posted: 18 Jan 2023, 14:32:54 UTC
Last modified: 18 Jan 2023, 14:49:11 UTC

Dave, just so you know in case I had not made it clear, you and Richard have been really helpful. I really mean this.

Without you mods it would be complete anarchy, both in terms of behavior but more importantly in terms of help and content. So to do this unpaid for years is quite remarkable.

The same thing goes for Glenn et al. On most projects you NEVER see the end researchers, so to see this level of involvement is impressive.
ID: 67844

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4342
Credit: 16,501,246
RAC: 5,648
Message 67846 - Posted: 18 Jan 2023, 15:14:03 UTC - in response to Message 67844.  

Thank you.

It is probably worth pointing out that there are some queries where the BOINC forums are a place where it is quicker to get an answer than here. There are a number of people with experience of many projects who frequent those boards and can come up with answers a lot quicker than we can stumble towards them here.

Mods there are quicker/more active in deleting/moving posts thought of as off topic or against forum rules than here which is a mixed blessing.
ID: 67846

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1056
Credit: 16,521,771
RAC: 1,278
Message 67847 - Posted: 18 Jan 2023, 15:31:43 UTC - in response to Message 67837.  

These are 3 issues I have hit in the last week or so, all of which stopped us crunching

1) 100GB UI Limit
2) 2x core_count limit (can't download, too many uploads)
3) Invalidated tasks by switching out HDD


I do not understand #1: What is the 100GB UI Limit, provided I have enough disk space available to the BOINC client? In the last five days, my machine has had no trouble uploading 84 gigabytes of stuff, about 16 GB/day. I think it did a lot more when recovering from the upload server downtime(s).

I do not understand #2: What is the 2x core_count? I have a 12-core limit for BOINC tasks. I have 20 CPDN tasks on my machine, of which 5 are running. Why would I want a greater number? Or why would I want a smaller number, since I could control this, indirectly, with the BOINC manager?

Re #3: All my boinc stuff is in a partition all its own, mounted at /var/lib/boinc, except for
/usr/bin/boinc
/usr/bin/boinc_client
/usr/bin/boinccmd
/usr/bin/boincmgr
/usr/bin/boincscr


If I wanted to move the boinc stuff to another disk drive, I think it would be pretty simple unless it were the drive with the OS itself on it (typically /, /boot, /home, /boot/efi on my machine).
ID: 67847

Richard Haselgrove
Joined: 1 Jan 07
Posts: 943
Credit: 34,187,965
RAC: 6,888
Message 67848 - Posted: 18 Jan 2023, 15:55:32 UTC - in response to Message 67844.  

Thanks for the namecheck. I do try to help where I can, but I'm not omniscient (see my recent post in the Linux area). I try to keep out of the way if I can't add anything constructive.

I'd just like to add that BOINC is multi-faceted: the skillset and experience of a BOINC project administrator is very different from the equivalent perspective of BOINC volunteer crunchers. In general, project administrators are not the best placed to address the issues faced by users. I once expressed that thought out loud during a teleconference call with BOINC developers, and was gratified to hear an enthusiastic endorsement down the line, from one of the most respected BOINC admins.

The position of 'Moderator', as the name suggests, was originally created to maintain order on the somewhat anarchic, 'Wild West', message boards of SETI@home. That need has died down significantly over the years, and fortunately never got established at many, if any, of the other projects. Instead, volunteers and projects have successfully subverted the role into a channel of (filtered) communication between project and volunteers. You mention Dave Jackson here, and I'd add Gary Roberts at Einstein as examples which stand out.

The drawback of this approach is that it tends to compartmentalise the skills: I don't think Gary would be much use here, and I wonder what Dave would think about the issues raised at Einstein? I tend to take a more roving brief: I see myself tramping the moors, pausing at the top of each mineshaft in the landscape, to listen to the muffled curses of the labourers down below. Sometimes it pays off: I've been able to match up the curses of two project administrators each trying to solve what seemed to be a very similar problem at their respective project, and suggest instead that there seemed to be a common cause in the BOINC code, and redirect them accordingly.

Many of the problems you've itemised perhaps fall into that category: the default 100 GB limit on disk usage has already been resolved. But getting a solution at the centre is only part of the problem: it also has to be migrated out onto the machines of volunteers around the world. And with Linux, that brings another group into the fray: the maintainers of the software repositories of all the slightly differing Linux variants in play. I haven't cracked that one yet.
ID: 67848

Richard Haselgrove
Joined: 1 Jan 07
Posts: 943
Credit: 34,187,965
RAC: 6,888
Message 67849 - Posted: 18 Jan 2023, 16:09:15 UTC - in response to Message 67847.  

I do not understand #1: What is the 100GB UI Limit?
Issue 4643 (comment)
PR 4923
ID: 67849

ncoded.com
Joined: 16 Aug 16
Posts: 73
Credit: 52,932,477
RAC: 8,823
Message 67850 - Posted: 18 Jan 2023, 16:18:54 UTC
Last modified: 18 Jan 2023, 16:42:44 UTC

I agree with Glenn.

Just create sticky posts for all these issues that stop people crunching.

It shouldn't really matter if it's a BOINC issue or a CPDN issue; if it stops someone crunching then it should be in the list.

If you don't install the 32-bit libraries your tasks will eventually keep crashing, and hence your device will get jailed. That stops you crunching so it makes sense it should be in this "essential section".

There cannot be that many issues that stop you crunching. And I would guess most have already been answered in the forums (somewhere). So if these issues could be collated into sticky posts (in this special section) wouldn't this make life much easier for the mods? In the long run you wouldn't have to keep answering some of the same questions, or explanations.

I shudder to think how many times you must have explained "trickles" lol.

If someone does not understand the 100GB limit, then the post should explain what this limit is, when it affects the user, and how it is resolved. It should also say what the long-term plans are for removing the issue. To me this is what would be useful for understanding and resolving the issue.

Unfortunately, sending someone to a version control site (GitHub) does not do that.

This is how we explained the 100GB limit, and its solution on Twitter. Hopefully Jean-David Beyer finds this slightly more informative.

https://twitter.com/ncoded/status/1608650579412398080

If you're running CPDN on BOINC and you're getting the 'out of disk storage notification', then (for now) you will need to make sure that each of the 3 options for Disk has a non-zero, non-empty value, as shown in the image.
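
For anyone who prefers to check this outside the Manager, here is a rough sketch of the "non-zero, non-empty" rule. The three option names are my assumption of how BOINC labels these disk preferences (`disk_max_used_gb`, `disk_min_free_gb`, `disk_max_used_pct`), and the function `unset_disk_options` is made up for illustration; it only checks values you pass in and does not read or change any BOINC file.

```python
# Assumed names of the three disk preferences; verify against your own setup.
DISK_OPTIONS = ("disk_max_used_gb", "disk_min_free_gb", "disk_max_used_pct")


def unset_disk_options(prefs: dict) -> list:
    """Return the disk options that are missing, empty, or zero."""
    missing = []
    for name in DISK_OPTIONS:
        raw = str(prefs.get(name, "")).strip()
        try:
            value = float(raw) if raw else 0.0
        except ValueError:
            value = 0.0  # treat unparseable text like an empty field
        if value <= 0:
            missing.append(name)
    return missing


example = {"disk_max_used_gb": "500", "disk_min_free_gb": "", "disk_max_used_pct": "90"}
print(unset_disk_options(example))  # ['disk_min_free_gb']
```

An empty list back means all three fields have a usable value, which is the state the tweet describes.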


ID: 67850

©2024 climateprediction.net