Should full credit be given for time on non successful tasks?

Message boards : Number crunching : Should full credit be given for time on non successful tasks?
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 70675 - Posted: 27 Mar 2024, 10:25:48 UTC

I think that giving credit for non-completed tasks based on trickle-up messages is unique to CPDN. It originated when tasks taking four months or more were not unusual. Now the longest tasks still complete in under a month on a reasonably fast machine, most within two weeks running 24/7.

This was prompted by seeing a resend where one of the failures was on a machine that completed only about one in twenty tasks, yet sent a trickle-up message every few days. I know the credit system has only just been rejigged so that the credit script runs daily, but I would like to pose the question: should we move to a system of granting credit only for completed tasks? Do those who crash everything, or almost everything, ever look at their credit? I don't know. If they do, not getting any might prompt them to visit the fora to find out why everything is crashing. Just a thought.
ID: 70675
SolarSyonyk

Send message
Joined: 7 Sep 16
Posts: 257
Credit: 31,980,996
RAC: 35,252
Message 70679 - Posted: 27 Mar 2024, 16:32:03 UTC

I don't mind a system that's credit only for completed tasks, once the tasks are well enough behaved that they don't regularly crash of their own accord, and assuming that "world goes physically impossible" is still considered a completed task.

"Me getting no credit for a failure of one of my machines in terms of hardware or configuration" is totally fine with me, but I don't think "The task crashes because of code issues or world physics issues" should lead to no credit.

How scientifically useful are trickles of partially completed tasks? If they're of no substantial value, then not giving credit until the work is properly done makes sense to me.
ID: 70679
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 70680 - Posted: 27 Mar 2024, 16:57:02 UTC - in response to Message 70679.  

What gets me is the way, for some Linux tasks, over half the credit is quite often given out to two or more hosts before someone finally completes the work. Really I'm just floating it to get an idea of whether it would deter those who crash most of their tasks, even though they gain a substantial amount of credit first. I certainly don't see it happening any time soon, given the work Andy and Richard have put into sorting out the current system.
ID: 70680
Glenn Carver

Send message
Joined: 29 Oct 17
Posts: 807
Credit: 13,593,584
RAC: 7,495
Message 70681 - Posted: 27 Mar 2024, 18:27:28 UTC - in response to Message 70680.  
Last modified: 27 Mar 2024, 18:28:25 UTC

Scrapping trickles is a bad idea. Some of the work is long-running, particularly on slower hardware. Can you imagine the comments on the forums if tasks using over a week of resource got no reward at all? There have been posts recently from disgruntled volunteers whose tasks failed (for whatever reason) before they finished, complaining that they could have run other projects: and that's with credit granted for work done.

I've been looking over the various reasons for tasks failing. Around 50% of the recent Windows WaH2-ri batch failures are outside the control of CPDN (and boinc).

If it ain't broke - why fix it?
---
CPDN Visiting Scientist
ID: 70681
Richard Haselgrove

Send message
Joined: 1 Jan 07
Posts: 943
Credit: 34,304,039
RAC: 11,195
Message 70683 - Posted: 27 Mar 2024, 18:45:19 UTC

I think CPDN is a particularly difficult case for BOINC. Although it seems straightforward from the outside, this is the sort of project that doesn't respond well to the "Fish some second-hand heavy metal out of a skip, power it up, turn all the knobs up to 11, and walk away" approach to crunching.

I've mentioned host 1549227 on the board before: this week, host 1548623 came to my attention as well. They both fall into the 'Heavy Metal' category, with 16-thread and 32-thread capacity respectively. But they're a bit skinny on memory, with just 1 GB per thread, and the hyperthreading will hit floating-point speed something rotten. You can't just walk away from a machine like that and think "job done".

Part of the problem is that BOINC Central - the main developers - seem to have adopted an approach, over the last 10 years or so, that human volunteers just get in the way: the computers know how it all works, right out of the box, and can be left to work it out for themselves. The most egregious example of this, of course, is Science United.

But BOINC isn't perfect, and doesn't cope perfectly under all conditions. Any programmer will know that instinctively, even if they don't want to talk about it in polite company. Glenn drew my attention to 1548623, and asked if I could throw some light on the tasks which had been aborted with 'EXIT_TIME_LIMIT_EXCEEDED'. I ran the numbers yesterday, and the problem turns out to be that BOINC has benchmarked the CPU at 47.15 GFlops. That's something that CPDN can't cure, and I strongly suspect that BOINC won't cure either (and because the machines are anonymously registered, we - as users - have no way of making contact and offering to help).

No, I think that the original idea of BOINC - to get volunteers interested and involved in the science, as well as enjoying the competition and the social sides - was on the right lines. But there's a lot of noise out there competing for our attention.
ID: 70683
SolarSyonyk

Send message
Joined: 7 Sep 16
Posts: 257
Credit: 31,980,996
RAC: 35,252
Message 70684 - Posted: 27 Mar 2024, 19:10:12 UTC - in response to Message 70680.  

Really just floating it to get an idea of whether it would deter those who crash most tasks even if they gain a substantial amount of credit first.


Given how hard it was to get people to just install 32-bit libraries, I don't think most people interact with a project beyond "Select it in the BOINC add project interface and let it run."

I would doubt that anyone crashing many tasks has ever even posted on the forum, or even gone to look at the forum threads. The forum is filled with the sort of people who don't breathe on their computers too hard, in case it crashes a W@H task (seriously, Glenn, thank you so much for your work on improving reliability!). ;) So I'm not sure if it really matters - will crashers even notice zero credit for tasks?

I support aligning "credit rewarded" with "scientific usefulness of results", but I also don't think it's worth a lot of effort to change things up in an attempt to get through to people whose computers simply don't work.
ID: 70684
Glenn Carver

Send message
Joined: 29 Oct 17
Posts: 807
Credit: 13,593,584
RAC: 7,495
Message 70685 - Posted: 27 Mar 2024, 20:21:22 UTC - in response to Message 70684.  

I come from a High Performance Computing background and I'm used to paying for computer resource (cpu, memory, disk, runtime), regardless of whether the jobs run successfully or not. BOINC seems analogous to this, so I see 'credit via trickles' as the equivalent of CPDN payment for resources used on volunteer machines.
ID: 70685
AndreyOR

Send message
Joined: 12 Apr 21
Posts: 247
Credit: 12,035,877
RAC: 23,095
Message 70686 - Posted: 1 Apr 2024, 1:30:42 UTC - in response to Message 70683.  

Richard, could you please explain the time limit exceeded error a bit? From the error log of the last failed task of host 1548623:
exceeded elapsed time limit 823339.74 (38013881.53G/46.17G)
That's ~9.5 days, which is about when the task failed.

The numbers in the fraction seem off though. The numerator is 10x the estimated computation size from task properties and the denominator is ~2% lower than the reported number of 47.15. Is that what BOINC does to set time limits, multiply the estimated computation size by 10 and somewhat reduce the measured floating point speed?
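The arithmetic in that error line can be checked directly. A quick sketch (plain Python, using only the numbers quoted in the message; the divisor shown is rounded, so the result only approximately matches the reported 823339.74 s):

```python
# Reproducing BOINC's elapsed-time limit from the error message
# "exceeded elapsed time limit 823339.74 (38013881.53G/46.17G)".
# The 46.17G figure is rounded in the log, hence the small discrepancy.

rsc_fpops_bound = 38013881.53e9   # FP-ops bound from the message (10x the estimate)
benchmark_speed = 46.17e9         # host benchmark at allocation time, ops/s

limit_seconds = rsc_fpops_bound / benchmark_speed
limit_days = limit_seconds / 86400

print(f"{limit_seconds:.0f} s, i.e. about {limit_days:.1f} days")
```

This confirms the ~9.5 day figure quoted above.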
ID: 70686
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 70687 - Posted: 1 Apr 2024, 9:29:10 UTC - in response to Message 70686.  

I looked around a bit and several projects seem to have had people with this issue. If the benchmark for your machine is optimistic for any reason, it can cause this error. Manually rerunning the benchmarks should fix it for future tasks, but not for those already downloaded.
ID: 70687
Richard Haselgrove

Send message
Joined: 1 Jan 07
Posts: 943
Credit: 34,304,039
RAC: 11,195
Message 70688 - Posted: 1 Apr 2024, 9:57:09 UTC - in response to Message 70686.  

Sure - well, I'll give it a try, anyway.

You're on the right lines with your maths. The error message is actually badly worded. BOINC doesn't really deal with time (because it's meant to cope with computers of widely differing speeds), so the key figure is the estimated 'size' of any given task. That's expressed as the number of floating-point arithmetic operations the task will take to complete, as <rsc_fpops_est> in the description of each task. That figure is set by the project team for each task type: it's the one thing they have total control over. I think this project gets that one pretty much correct - it was 3,801,388 billion operations for the last WaH2 batch I looked at.

From that basic figure, the BOINC server calculates a <rsc_fpops_bound> - by default, ten times larger than the estimate. Some projects in the past have got the estimate badly wrong, and pushed up the bound to 100x or even 1000x to escape from their own error, but I'd urge against that.

The other factor in the time limit is the speed of the computer. This is where host 1548623 went wrong.

That computer was only attached to the project on 16 Jan 2024, and we haven't had very much work since then. So the only information BOINC has available is the machine's self-reported benchmark. That's currently reported as 47.15 billion ops/sec, but it may have been slightly different when the task you're looking at was allocated on 29 Feb 2024. That could be a random fluctuation - not significant.

But look a bit further down the host page, at the Application details for that computer. It has processed tasks for application versions 8.24 and 8.29, at 4.93 GFLOPS and 5.26 GFLOPS respectively. Those are much more realistic values, but BOINC will be ignoring them completely.

The figures are real, actual, values, calculated from tasks running on that individual computer and reaching a successful conclusion. But BOINC doesn't trust them until it has a minimum of 11 completed, valid, tasks. I suspect this machine will vanish from the project long before it reaches that target - it only has one qualifying task so far for v8.29.
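Putting those numbers together shows why the task hit the limit (a sketch using the figures quoted in this thread: the 47.15 GFLOPS benchmark, the ~4.93 GFLOPS measured speed, and the default 10x bound):

```python
# Why an over-optimistic benchmark trips EXIT_TIME_LIMIT_EXCEEDED:
# the time limit is derived from the (inflated) benchmark, while the
# real runtime is governed by the much lower actual throughput.

rsc_fpops_est = 3_801_388e9           # project's per-task estimate, FP ops
rsc_fpops_bound = 10 * rsc_fpops_est  # server default: 10x headroom

benchmark_speed = 47.15e9             # self-reported Whetstone figure, ops/s
actual_speed = 4.93e9                 # measured from completed v8.24 tasks

limit_days = rsc_fpops_bound / benchmark_speed / 86400
expected_days = rsc_fpops_est / actual_speed / 86400

print(f"limit {limit_days:.1f} days vs expected runtime {expected_days:.1f} days")
# The intended 10x safety margin shrinks to roughly 1.05x, so any task
# running even slightly over the estimate is aborted before it finishes.
```

In other words, the benchmark overstating real single-thread throughput by roughly tenfold almost entirely consumes the headroom the bound is supposed to provide.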

We can only guess at the motivation of the user who attached the machine to the project in the first place. The processor was introduced in late 2019, so it could be up to 4 years old: maybe it's a rebuilt or rescued machine, and he wanted to test it out? If so, it wasn't a very well designed test, and a scientifically useless one at that.
ID: 70688
Richard Haselgrove

Send message
Joined: 1 Jan 07
Posts: 943
Credit: 34,304,039
RAC: 11,195
Message 70689 - Posted: 1 Apr 2024, 10:03:33 UTC - in response to Message 70687.  

I looked around a bit an several projects seem to have had people with this issue. If for any reason the benchmark for your machine is optimistic it can cause this error. Manually rerunning benchmarks should solve it for future tasks but not those already downloaded.
I've been toying with asking BOINC if they need to re-validate the benchmarking code for these massive CPUs. The benchmark is supposed to saturate all available cores - 32 threads, in this case - and use the average for a single-thread task, but my head hurts when I try to read the code.
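For what it's worth, the intended scheme can be sketched conceptually. This is not the actual BOINC benchmark code (and Python threads are only a structural stand-in, not a realistic FP workload); it just illustrates "saturate every logical CPU, then report the per-thread mean":

```python
# Conceptual sketch of the benchmarking scheme described above:
# run a benchmark kernel on all logical CPUs simultaneously and
# report the average per-thread score, which is then taken as the
# speed a single-thread task should achieve.
import concurrent.futures as cf
import time

def whetstone_like(seconds=0.1):
    """Tiny FP busy-loop standing in for the real Whetstone kernel."""
    end = time.perf_counter() + seconds
    x, ops = 1.0001, 0
    while time.perf_counter() < end:
        x = x * 1.0000001 + 0.5   # two FP ops per iteration
        ops += 2
    return ops / seconds          # per-thread ops/sec under full load

def benchmark_all_threads(nthreads):
    # Saturate all nthreads at once, then take the per-thread mean.
    with cf.ThreadPoolExecutor(max_workers=nthreads) as ex:
        scores = list(ex.map(lambda _: whetstone_like(), range(nthreads)))
    return sum(scores) / len(scores)

print(f"{benchmark_all_threads(4):.3g} ops/s per thread")
```

If the kernel fails to saturate the chip properly, or the averaging goes wrong, the reported figure could end up well above real single-thread floating-point throughput, which would be consistent with what host 1548623 is showing.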
ID: 70689
Glenn Carver

Send message
Joined: 29 Oct 17
Posts: 807
Credit: 13,593,584
RAC: 7,495
Message 70690 - Posted: 1 Apr 2024, 12:03:45 UTC - in response to Message 70689.  
Last modified: 1 Apr 2024, 12:05:20 UTC

It's good to know BOINC uses a factor of 10x for the fpops bound. The estimated fpops is a fixed value regardless of the domain size, i.e. it's the same for NZ25 as for EAS25, even though the EAS25 domain is significantly larger and will therefore need more fpops. That's why I first contacted you, Richard, as I wondered if we needed to up the fpops_est. But I don't think we do.
---
CPDN Visiting Scientist
ID: 70690
klepel

Send message
Joined: 9 Oct 04
Posts: 77
Credit: 68,028,602
RAC: 9,346
Message 70691 - Posted: 1 Apr 2024, 15:59:12 UTC - in response to Message 70685.  
Last modified: 1 Apr 2024, 16:00:56 UTC

I come from a High Performance Computing background and I'm used to paying for computer resource (cpu, memory, disk, runtime), regardless of whether the jobs run successfully or not. BOINC seems analogous to this, so I see 'credit via trickles' as the equivalent of CPDN payment for resources used on volunteer machines.

+1
I see it like this as well. And historically CPDN gave credit for partial work, so why change it after 20 years?
I cannot see how this would change user behaviour; we will always have some “broken” computers in the CPDN world. Let’s keep the smaller base of hardcore crunchers, who build computers just for BOINC and maintain them for years.
An aside: Glenn, when will the announced RAM-heavy CPDN WUs really come? 2024? 2025? I increased RAM for these WUs in October last year and still have not seen any… ;-)
ID: 70691
Jean-David Beyer

Send message
Joined: 5 Aug 04
Posts: 1061
Credit: 16,544,964
RAC: 2,285
Message 70692 - Posted: 2 Apr 2024, 0:23:22 UTC - in response to Message 70691.  

Note besides: Glenn, when will the announced RAM heavy CPDN WUs really come? 2024? 2025? I increased RAM for these WUs in October last year and still have not seen any… ;-)


My big Linux box came with 32 GBytes of RAM with two memory modules. There are 8 memory slots in my box. As RAM prices came down, I bought two more memory modules and put them in, raising the RAM to 64 GBytes. Then I started getting a few 8GByte CPDN tasks with more to come. So when RAM prices dropped again, I got four more memory modules so I have 128GBytes in there now. DDR4 modules.

My machine would take up to 512 GBytes of RAM were I to replace all the modules with the largest size. My processor chip has only 16896 KBytes of cache, so I do not think it would make sense to run more tasks like this at the same time. 16896 KBytes is fairly large for this kind of processor chip, but I run only 13 BOINC tasks at a time in winter and 8-12 in the summer. I have no AC.

Actually, I have gotten no new work for CPDN since last June. ;=(

Computer 1511241

CPU type 	GenuineIntel
Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors 	16

Operating System 	Linux Red Hat Enterprise Linux
Red Hat Enterprise Linux 8.9 (Ootpa) [4.18.0-513.18.1.el8_9.x86_64|libc 2.28]
BOINC version 	7.20.2
Memory 	125.07 GB
Cache 	16896 KB

ID: 70692


©2024 climateprediction.net