Thread 'Receiving new tasks with impossible to meet deadlines'

Forrest
Joined: 19 May 06
Posts: 9
Credit: 4,294,690
RAC: 11,165
Message 71281 - Posted: 16 Aug 2024, 16:42:54 UTC
Last modified: 16 Aug 2024, 16:43:27 UTC

I'm getting new tasks with 32-61 days remaining to complete, but 6-17 day completion deadlines. In the last few days, several tasks that I had been processing for 14+ days have disappeared, and I received new ones in their place. One new task in particular will take 61 days to complete, but the deadline is 7 days. I run this project 24/7 on a fairly new machine. Why am I receiving tasks with deadlines that are impossible to meet? The link below is a screenshot of my current task list. I don't see the image in the preview, so I'm not sure I formatted the image tags correctly.



https://www.dropbox.com/scl/fi/71proyqq8hlpqxhsxoohs/deadlines.png?rlkey=dw7v6k1p8uaolpenocyqeebvb&dl=0
ID: 71281

Glenn Carver
Joined: 29 Oct 17
Posts: 1051
Credit: 16,656,265
RAC: 10,640
Message 71282 - Posted: 16 Aug 2024, 17:45:00 UTC - in response to Message 71281.  
Last modified: 16 Aug 2024, 17:45:50 UTC

OK, first off, there are way too many tasks running at the same time on your machine. It's a 16-core machine and, although that's 32 threads, 2 threads share a single core, so if you try to run more than 16 tasks at once they will slow down massively. By massively I mean taking anywhere from 3 to 5x longer. Keep the number of running tasks equal to the number of cores, not threads, on the machine. That will give you much better throughput. If you do this, each task should be done in about 6-8 days.

Next, the time given by the BOINC monitor is an estimate and it's usually wrong, especially when the client has not received tasks from that app before. You'll find the estimated time goes down as the BOINC client runs these tasks.

But please reduce the number of tasks. Set the 'no. of cpus' to 50% in BOINC Manager. That will make it use just the cores, not the threads. The technical reason for this is that the CPDN models are very floating point intensive. The CPU only has 1 set of floating point units per core, not per thread. So if you try to run 32 tasks, they will end up competing for the floating point units and run very, very slowly.
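As a rough illustration of that rule of thumb, here is a minimal Python sketch that works out the percentage to enter so that at most one task runs per physical core. The function name and the `spare_cores` parameter are made up for this example; the 16-core, 32-thread figures are the ones quoted in this thread.

```python
def boinc_cpu_percent(physical_cores, logical_threads, spare_cores=0):
    """Percentage of logical CPUs that gives at most one task per physical core.

    spare_cores is a hypothetical knob for leaving a core or two idle.
    """
    if physical_cores <= 0 or logical_threads < physical_cores:
        raise ValueError("expected 0 < physical_cores <= logical_threads")
    usable = max(physical_cores - spare_cores, 1)
    return 100.0 * usable / logical_threads


print(boinc_cpu_percent(16, 32))                  # 50.0
print(boinc_cpu_percent(16, 32, spare_cores=2))   # 43.75
```

On a machine without SMT/HT the two counts are equal and the same function simply returns 100.0.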
---
CPDN Visiting Scientist
ID: 71282

wujj123456
Joined: 14 Sep 08
Posts: 127
Credit: 43,261,443
RAC: 72,712
Message 71283 - Posted: 16 Aug 2024, 18:17:22 UTC - in response to Message 71281.  
Last modified: 16 Aug 2024, 18:20:50 UTC

Actually I think you are reading the columns wrong. The "completion before deadline" means that it's estimated that you will complete the work that number of days before the deadline. So you are fine even with the current state and exaggerated estimate.

However, please still follow the suggestions above, because CPDN workloads are quite different from those of other BOINC projects. They generally don't benefit from SMT/HT and are memory bandwidth intensive. When memory bandwidth is exhausted, latency rises non-linearly and running more tasks can easily hurt performance. My previous host was a 5950X too, so I know you can finish a WU in <10 days if you only assign one WU per core.
ID: 71283

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,748,059
RAC: 5,647
Message 71284 - Posted: 16 Aug 2024, 18:50:21 UTC

Also, I think you're mis-interpreting the task deadlines.

The actual deadline is fixed, and is shown in the right-hand column of your screenshot. The tasks in that list have deadlines between October 2nd (bottom) and October 25th (top). Those are still all well over a month away, so you have plenty of time to work through the other suggestions.
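To make the two columns concrete, a small Python sketch of the date arithmetic follows. It assumes the deadlines fall in 2024 (the screenshot dates are quoted here without a year) and counts from the date of the first post. Pairing the 61-day estimate with the October 25th deadline is purely hypothetical, and the function names are made up for illustration.

```python
from datetime import date


def days_until(deadline, today):
    """Whole days between today and a fixed deadline."""
    return (deadline - today).days


def completion_margin(deadline, today, estimated_days_remaining):
    """Days to spare: positive means the estimate finishes before the deadline."""
    return days_until(deadline, today) - estimated_days_remaining


posted = date(2024, 8, 16)  # date of the first post in this thread
print(days_until(date(2024, 10, 2), posted))               # 47
print(days_until(date(2024, 10, 25), posted))              # 70
print(completion_margin(date(2024, 10, 25), posted, 61))   # 9
```

A positive margin is what a "completion before deadline" style column reports; it is not the deadline itself.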
ID: 71284

AndreyOR
Joined: 12 Apr 21
Posts: 318
Credit: 15,000,104
RAC: 9,568
Message 71286 - Posted: 16 Aug 2024, 19:32:38 UTC - in response to Message 71281.  

The biggest reason you're getting such high estimates to completion is that you haven't run BOINC benchmarks. Thus BOINC thinks your PC is really slow. Go to Tools --> Run CPU benchmarks. It might or might not change the estimates on the current tasks but will provide pretty accurate ones on any future ones.

So far your machine hasn't returned a single valid task and the failed ones don't upload Stderr data, which is odd, so there's no info as to why they're all failing. Are you using your machine heavily for other things?

Like others have said, don't run more than 16 CPDN tasks at a time. You do have sufficient time to complete all current tasks; the deadline, I believe, is 70 days from the time you get the tasks. There are several ways to limit the task count: website preferences, an app_config file, suspending tasks manually, or changing the CPU usage setting.
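For the app_config route, a minimal Python sketch that generates such a file is below. It assumes the `project_max_concurrent` element described in the BOINC client configuration documentation, and that the result is saved as app_config.xml in the CPDN project folder inside the BOINC data directory; check both against the documentation for your client version. The helper name `build_app_config` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_tasks):
    """Return app_config.xml text capping concurrent tasks for one project."""
    if max_tasks < 1:
        raise ValueError("max_tasks must be at least 1")
    root = ET.Element("app_config")
    # Assumed element name, taken from the BOINC client configuration docs.
    limit = ET.SubElement(root, "project_max_concurrent")
    limit.text = str(max_tasks)
    return ET.tostring(root, encoding="unicode")


# One task per physical core on the 16-core machine discussed here.
print(build_app_config(16))
```

The client normally has to re-read its config files, or be restarted, before a new limit takes effect.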
ID: 71286

Glenn Carver
Joined: 29 Oct 17
Posts: 1051
Credit: 16,656,265
RAC: 10,640
Message 71287 - Posted: 16 Aug 2024, 19:51:31 UTC - in response to Message 71286.  
Last modified: 16 Aug 2024, 19:55:37 UTC

AndreyOR wrote: "So far your machine hasn't returned a single valid task and the failed ones don't upload Stderr data, which is odd, so there's no info as to why they're all failing. Are you using your machine heavily for other things?"

I had a look at the tasks from this machine. I think we are looking at a system that's too heavily loaded. The most common task failure message was this one: "The system cannot find the drive specified." Probably in this case the storage is not responding quickly enough, maybe because it's swapping? Tasks that don't have a specific fail message are, in my experience of looking at these, usually memory related. The tasks can ask for big chunks of memory, and if the memory gets too fragmented, big chunks become unavailable and the process fails.

Bottom line is, reduce the number of tasks to match the 16 cores you have (I typically run 1 or 2 fewer than my core count). That should cure all the issues you have.

Have a look at the plot on this message: https://www.cpdn.org/forum_thread.php?id=9184&postid=68081. The blue curve shows tasks completed per day and shows how going beyond the no. of cores doesn't help because the model runtime slows down considerably.
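The shape of that argument can be sketched with a deliberately crude step model in Python. It is not the data behind the linked plot. It assumes 7 days per task (the midpoint of the 6-8 day figure given earlier in this thread) and applies the 3x to 5x slowdown as soon as the task count exceeds the core count; the function name is made up for illustration.

```python
def tasks_per_day(running_tasks, cores, days_per_task, slowdown):
    """Crude step model: runtime is multiplied by `slowdown` once tasks exceed cores."""
    if running_tasks <= 0 or cores <= 0 or days_per_task <= 0:
        raise ValueError("all inputs must be positive")
    runtime = days_per_task * (slowdown if running_tasks > cores else 1.0)
    return running_tasks / runtime


# One task per core: no slowdown applies.
print(f"16 tasks: {tasks_per_day(16, 16, 7.0, 3.0):.2f} tasks/day")

# One task per thread, at both ends of the 3x to 5x range.
for slowdown in (3.0, 5.0):
    rate = tasks_per_day(32, 16, 7.0, slowdown)
    print(f"32 tasks, {slowdown}x slower: {rate:.2f} tasks/day")
```

Even at the optimistic 3x end, doubling the task count lowers the number of tasks finished per day in this model.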
---
CPDN Visiting Scientist
ID: 71287

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 71288 - Posted: 16 Aug 2024, 20:00:26 UTC

I would echo Glenn's advice to run one or two tasks fewer than the core count, as opposed to the thread count. I am at the moment running rather fewer than that as my new box isn't clearing heat fast enough. I expect that as the weather cools down I will be able to go up to 14 or 15 out of 16 cores, except on the rare occasions I am rendering video at the same time.
ID: 71288

AndreyOR
Joined: 12 Apr 21
Posts: 318
Credit: 15,000,104
RAC: 9,568
Message 71289 - Posted: 16 Aug 2024, 20:31:48 UTC - in response to Message 71287.  

Glenn Carver wrote: "Tasks that don't have a specific fail message, in my experience looking at these are usually memory related. The tasks can ask for big chunks of memory and if the memory gets too fragmented big chunks become unavailable and the process fails."

What's the typical and peak memory usage of these tasks? That PC has 32GB which is 1GB per thread so if too many tasks are run and enough ask for those big chunks at the same time, what you say definitely makes sense, that tasks fail due to memory issues.
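The per-task arithmetic is easy to check with a few lines of Python. This sketch only divides the installed RAM by the task count; it says nothing about what a CPDN task actually needs, which is the open question above. The `reserved_gb` parameter is a hypothetical allowance for the OS and other programs.

```python
def ram_per_task_gb(total_ram_gb, running_tasks, reserved_gb=0.0):
    """Average RAM available to each running task, in GB."""
    if running_tasks < 1:
        raise ValueError("running_tasks must be at least 1")
    if reserved_gb >= total_ram_gb:
        raise ValueError("reserved_gb must be smaller than total_ram_gb")
    return (total_ram_gb - reserved_gb) / running_tasks


print(ram_per_task_gb(32, 32))  # 1.0, one task per thread
print(ram_per_task_gb(32, 16))  # 2.0, one task per core
```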
ID: 71289

Forrest
Joined: 19 May 06
Posts: 9
Credit: 4,294,690
RAC: 11,165
Message 71291 - Posted: 17 Aug 2024, 0:01:31 UTC - in response to Message 71282.  

Thanks for the very helpful input. I had "in use" set to 70% CPUs & 70% time / "not in use" at 90% & 90%. Both are now at 50% & 50%. I've logged tasks & progress as of today & will see where I am in a week or so.
ID: 71291

Forrest
Joined: 19 May 06
Posts: 9
Credit: 4,294,690
RAC: 11,165
Message 71292 - Posted: 17 Aug 2024, 0:18:30 UTC - in response to Message 71284.  

Yes, I see it. Not sure how I managed to look past that column and only focus on the "completion before deadline".
ID: 71292

Forrest
Joined: 19 May 06
Posts: 9
Credit: 4,294,690
RAC: 11,165
Message 71293 - Posted: 17 Aug 2024, 0:19:47 UTC - in response to Message 71283.  

That makes sense...thanks!
ID: 71293

Forrest
Joined: 19 May 06
Posts: 9
Credit: 4,294,690
RAC: 11,165
Message 71294 - Posted: 17 Aug 2024, 0:22:41 UTC - in response to Message 71288.  

I have kicked it down to 50% as Glenn suggested. I had some fans I bought last year & hadn't got around to installing...they helped a lot.
ID: 71294

ChelseaOilman
Joined: 24 Dec 19
Posts: 32
Credit: 41,428,499
RAC: 21,960
Message 71295 - Posted: 17 Aug 2024, 13:12:10 UTC - in response to Message 71291.  

Forrest wrote: "Thanks for the very helpful input. I had 'in use' set to 70% CPUs & 70% time / 'not in use' at 90% & 90%. Both are now at 50% & 50%. I've logged tasks & progress as of today & will see where I am in a week or so."

I use an app_config.xml file to set max number of tasks running so I'm not familiar with this method of controlling the number of tasks running. I feel like you should be running 50% CPU and 100% time. Why do you only want your tasks to run half the time? That's just going to double your task times.
ID: 71295

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,748,059
RAC: 5,647
Message 71296 - Posted: 17 Aug 2024, 15:06:20 UTC - in response to Message 71295.  

I would agree with the suggestion of using 50% of cores [or maybe, just a tiny fraction below], but running them for 100% of the time.

Throttling by time is only really useful if you are suffering from severe CPU overheating - and then, only use it for as long as it takes to fix the cooling problem.

Here at CPDN, make sure that you also check "Leave non-GPU tasks in memory while suspended?" - especially with a high core and task count, there's a big overhead to shuffling data to and from the hard drive.
ID: 71296

Forrest
Joined: 19 May 06
Posts: 9
Credit: 4,294,690
RAC: 11,165
Message 71297 - Posted: 17 Aug 2024, 20:10:13 UTC - in response to Message 71287.  

I did follow your earlier suggestion & kicked the CPUs down to 50%. I suspect you're right, my machine was too heavily loaded. Often it would be unresponsive after I had left it for long periods. Initially, I'm sure I power cycled it a few times, but then found that Ctrl+Alt+Del would get it to respond, and selecting Task Manager restored it to full operation. I suspect that this was because I had CPUs at 90% & max RAM at 75% while the PC was idle.

I noticed the computation errors from time to time and initially they just seemed random. While working with support on another software issue that required lots of restarts, I noticed the errors occurred after I rebooted the machine, either by OS restart or hard shutdown. I accidentally powered it down the other day and lost several tasks. That got me looking into the "why" and I found this (https://www.cpdn.org/cpdnboinc/forum_thread.php?id=9213#69349). I haven't seen any computation errors since I started manually exiting BOINC before restarting the machine.
ID: 71297

Forrest
Joined: 19 May 06
Posts: 9
Credit: 4,294,690
RAC: 11,165
Message 71298 - Posted: 17 Aug 2024, 21:09:49 UTC - in response to Message 71296.  

I assumed that, like % CPUs, % CPU time was based on the CPU package itself. If I'm understanding you correctly, % time only applies to the cores used.
I had the non-GPU tasks option on, with 0.1 & 0.5 days of work stored. I turned it off while trying to get to the bottom of the computation errors I was receiving. I have this option back on now. Thanks!
ID: 71298

rob
Joined: 5 Jun 09
Posts: 97
Credit: 3,736,855
RAC: 4,073
Message 71299 - Posted: 17 Aug 2024, 21:22:31 UTC - in response to Message 71298.  

The "% CPU time" setting applies to jobs being run under BOINC's control. A BOINC job will run for 50% of the time, as dictated by either the operating system or the hardware, then will sit idle for the other 50% of the time. This has two effects: first, it doubles the clock time for a job to run, and second, it can place thermal stress on the CPU due to the repeated thermal cycles.
ID: 71299

Richard Haselgrove
Joined: 1 Jan 07
Posts: 1061
Credit: 36,748,059
RAC: 5,647
Message 71300 - Posted: 18 Aug 2024, 7:14:18 UTC - in response to Message 71299.  

BOINC makes its decisions about whether to run or stop the tasks once per second. Your 50% setting will translate into "run for 1 second, stop for 1 second". That's pretty brutal, for a CPDN task.
ID: 71300

Glenn Carver
Joined: 29 Oct 17
Posts: 1051
Credit: 16,656,265
RAC: 10,640
Message 71301 - Posted: 18 Aug 2024, 11:46:16 UTC - in response to Message 71300.  

I disagree. It's not brutal, just depends how you want to manage the machine. Suspending/resuming the tasks is not a problem. Could run 4 tasks at 100% CPU time or 8 at 50%, get the same throughput and enjoy seeing more tasks running.
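The equal-throughput claim is simple arithmetic, sketched below in Python. It assumes each task needs 7 days of actual compute (the midpoint of the 6-8 day figure given earlier in this thread) and that 8 tasks still fit within the available cores and memory. The function names are made up for illustration.

```python
import math


def wall_clock_days(cpu_days, cpu_time_percent):
    """Elapsed days for a task that only runs cpu_time_percent of the time."""
    if not 0 < cpu_time_percent <= 100:
        raise ValueError("cpu_time_percent must be in (0, 100]")
    return cpu_days * 100.0 / cpu_time_percent


def throughput(tasks, cpu_days, cpu_time_percent):
    """Tasks completed per day when `tasks` run side by side."""
    return tasks / wall_clock_days(cpu_days, cpu_time_percent)


full = throughput(4, 7.0, 100)  # 4 tasks, never throttled
half = throughput(8, 7.0, 50)   # 8 tasks, each running half the time

print(wall_clock_days(7.0, 50))   # 14.0
print(math.isclose(full, half))   # True
```

Throughput comes out the same, but each individual task takes twice as long to return, which matters when a deadline is tight.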

As for this idea of shutting the BOINC client down before shutting down the PC, that's a myth. Whether the client gets the quit signal from the user via BOINC Manager or from the PC shutdown, the software goes through the same route. Where it might make a difference is on an overloaded machine, where the OS might terminate slow-to-respond processes, leaving I/O incomplete.
---
CPDN Visiting Scientist
ID: 71301

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 71302 - Posted: 18 Aug 2024, 13:09:28 UTC - in response to Message 71301.  

Glenn Carver wrote: "As for this idea of shutting boinc client down before shutting down the PC, that's a myth."


It is two machines ago that I last tested this, when I had a Core 2 Duo with 4GB RAM. With that machine, whatever the reasons for it, it wasn't a myth then. I got into the habit of closing down BOINC first after stopping computation and it has stuck. With my relatively new Ryzen 9, I am only running a maximum of 12, sometimes only 9 or 10, out of 16 cores at the moment to keep the temperature down. I expect to be able to run more later in the year, but even then will probably stop at 14, so I should never have the issue of the machine being overloaded. My instinct, however irrational it might be today, is that it always makes sense to make sure anything that needs saving is saved before commencing shutdown.
ID: 71302