New Work Announcements 2024

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 70407 - Posted: 16 Feb 2024, 5:50:55 UTC

#1007 EASHA 6400 2024-02-15 WAH2 East Asia 25km 1986-2018

SolarSyonyk
Joined: 7 Sep 16
Posts: 257
Credit: 31,983,481
RAC: 35,141
Message 70415 - Posted: 16 Feb 2024, 16:27:10 UTC - in response to Message 70405.  

v8.29 is much more stable than the old v8.24; for batch 1006 it's showing 7% task fails and only 9 hard fails out of 6044 workunits so far (a 'hard fail' is when all 3 attempted tasks fail). That is considerably less than the identical batch 1001; 121% and 1346 respectively.
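As a quick check on the hard-fail figures quoted above, the counts can be turned into rates. This is only a sketch: it assumes batch 1001 had the same 6044 workunits as batch 1006 (the batches are described as identical, but the 1001 total isn't given), and `hard_fail_rate` is a made-up helper name.

```python
def hard_fail_rate(hard_fails: int, workunits: int) -> float:
    """Return hard fails as a percentage of workunits."""
    return 100.0 * hard_fails / workunits


WORKUNITS = 6044  # batch 1006 so far; assumed to apply to batch 1001 as well

rate_1006 = hard_fail_rate(9, WORKUNITS)
rate_1001 = hard_fail_rate(1346, WORKUNITS)

print(f"batch 1006: {rate_1006:.2f}% hard fails")
print(f"batch 1001: {rate_1001:.2f}% hard fails")
```

Under that assumption the hard-fail rate drops from roughly a fifth of workunits to well under 1%.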


Excellent, those are far better results! Of those hard fails, are they still "code related crashes" (segfaults, failure to resume, etc), or are they things outside your control (AV rejection of the binary, world going impossible, looks-like-bad-hardware)?

The linux version needs verifying against a Windows batch before we can deploy it to production.


I'm always willing and able to throw Linux boxes (mostly AMD right now) at a problem! :)

Glenn Carver
Joined: 29 Oct 17
Posts: 807
Credit: 13,593,584
RAC: 7,495
Message 70433 - Posted: 18 Feb 2024, 21:45:07 UTC - in response to Message 70415.  
Last modified: 18 Feb 2024, 21:47:31 UTC

Of those hard fails, are they still "code related crashes" (segfaults, failure to resume, etc), or are they things outside your control (AV rejection of the binary, world going impossible, looks-like-bad-hardware)?
I'm analysing the failures. CPDN have a process which looks at the output from each failed task and plots a nice histogram of each failure type. If it wasn't such a faff to include an image here I'd show it. About 30% of fails from the new app are due to AV quarantining when it tries to start. About 10-15% are other Windows related errors. Then it's download errors, user aborts etc. But about 40% are 'unclassified' which means we aren't able to easily determine what caused the task to fail judging from the log; could be our code, could be boinc, could be the machine. The 8.29 app is not producing any of the segmentation faults we saw before with the 8.24 app though, which is good. We should get a much more acceptable hard fail rate with the new app.
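The rough shares above leave a remainder for the "download errors, user aborts etc." group. A minimal sketch of that arithmetic, using only the approximate percentages given in this post (the variable names are just for illustration):

```python
# Approximate shares of failed tasks from the 8.29 app, in percent.
av_quarantine = 30
windows_errors = (10, 15)  # quoted as a range
unclassified = 40

# Whatever is left over covers download errors, user aborts and so on.
remainder_high = 100 - av_quarantine - unclassified - windows_errors[0]
remainder_low = 100 - av_quarantine - unclassified - windows_errors[1]

print(f"other causes: roughly {remainder_low}-{remainder_high}% of fails")
```

Since the inputs are all "about" figures, the result is only a ballpark.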

There are at least 3 more EAS25 batches to come in the next couple of weeks. Plenty of time to have a look at its performance.
---
CPDN Visiting Scientist

gutelius
Joined: 11 Jan 22
Posts: 2
Credit: 2,370,167
RAC: 1,692
Message 70494 - Posted: 21 Feb 2024, 2:28:08 UTC
Last modified: 21 Feb 2024, 2:31:53 UTC

Hi, I'm usually a set-and-forget user who has rarely seen Windows tasks, and I was just sent a bunch of the EAS25 ones (I had a few batch 1001 tasks fail early on). The older 1001s seem to have really slowed down in the last couple of days. I'm not sure if this is normal or whether any configuration changes would be a good idea. Happy to see that more is coming out; I just want to check whether there are any suggestions for maximizing performance on this project. Right now I have 16 tasks from this project and 11 threads available for BOINC, so at this moment 11 CPDN tasks are computing now that some urgent WCG tasks have finished.
https://imgur.com/a/LWB3NAh

Computer info:
i7-12700K (8 P-cores active, with hyper-threading)
32 GB RAM
200 GB dedicated SSD space (16 GB in use)
Simultaneously running FAH on two GPUs (using ~1 thread each)

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 70495 - Posted: 21 Feb 2024, 7:14:19 UTC - in response to Message 70494.  

so at this moment 11 CPDN tasks are computing now that some urgent WCG tasks have finished.
My experience is that on my 16-thread Ryzen (8 real cores), running more than 8 tasks concurrently actually reduces overall throughput with CPDN tasks. (There are other projects, however, where going above the 8 real cores does scale in something close to a linear manner.)

gutelius
Joined: 11 Jan 22
Posts: 2
Credit: 2,370,167
RAC: 1,692
Message 70496 - Posted: 21 Feb 2024, 7:22:25 UTC - in response to Message 70495.  
Last modified: 21 Feb 2024, 7:24:36 UTC

Thanks. I also run a lot of WCG which works fine with full thread usage for my computer, so I'd rather not set the overall boinc CPU thread usage at half of what should be available.
Is there a convenient way to set how many cores a project uses per task?

Glenn Carver
Joined: 29 Oct 17
Posts: 807
Credit: 13,593,584
RAC: 7,495
Message 70497 - Posted: 21 Feb 2024, 8:30:30 UTC

Additional workunits for batch 1007 are going out today. They were omitted from the original send due to a misconfiguration.
---
CPDN Visiting Scientist

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 70498 - Posted: 21 Feb 2024, 8:46:31 UTC

Is there a convenient way to set how many cores a project uses per task?
I can't think of an easy way to do it offhand. By the way, if ARP tasks come back with WCG, they also suffer in the same way if you start using virtual cores.

kotenok2000
Joined: 22 Feb 11
Posts: 31
Credit: 226,546
RAC: 4,080
Message 70499 - Posted: 21 Feb 2024, 8:58:39 UTC

Create app_config.xml in the project directory and copy this into it:
<app_config>
  <app>
    <name>wah2</name>
    <max_concurrent>4</max_concurrent>
  </app>
</app_config>
or
<app_config>
  <project_max_concurrent>4</project_max_concurrent>
</app_config>

Glenn Carver
Joined: 29 Oct 17
Posts: 807
Credit: 13,593,584
RAC: 7,495
Message 70500 - Posted: 21 Feb 2024, 10:00:05 UTC - in response to Message 70499.  
Last modified: 21 Feb 2024, 10:00:37 UTC

Not quite: there are two apps for Weather@Home, wah2 and wah2_ri, and all the latest batches are using wah2_ri. You need two separate <app> sections if you are going to use <app>.

Also, you need to tell the client to 'Reread the config files'; otherwise this won't take effect until the next time the client is started.

CPDN models are very floating-point intensive. Since a CPU core only has one set of floating-point units, the two threads sharing a core have to compete for them. That's why your throughput drops. Check out this post https://www.cpdn.org/forum_thread.php?id=9184&postid=68081 on these forums for an illustration and more explanation.

<app_config>
  <app>
    <name>wah2</name>
    <max_concurrent>4</max_concurrent>
  </app>
  <app>
    <name>wah2_ri</name>
    <max_concurrent>4</max_concurrent>
  </app>
</app_config>
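If you'd rather generate the file than type it by hand, something like the following should produce the same structure. It is only a sketch: `build_app_config` is a made-up helper, the limit of 4 is just the value used above, and it assumes Python 3.9 or later (for `ET.indent`).

```python
import xml.etree.ElementTree as ET


def build_app_config(limits: dict[str, int]) -> str:
    """Build app_config.xml text with one <app> section per app name."""
    root = ET.Element("app_config")
    for name, max_concurrent in limits.items():
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = name
        ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    ET.indent(root, space="  ")  # pretty-print; needs Python 3.9+
    return ET.tostring(root, encoding="unicode") + "\n"


# One section for each Weather@Home app named in this thread.
xml_text = build_app_config({"wah2": 4, "wah2_ri": 4})
print(xml_text)
```

Save the output as app_config.xml in the project directory and then have the client reread its config files. I believe `boinccmd --read_cc_config` does the same thing from the command line, but check the BOINC manual before relying on that.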

---
CPDN Visiting Scientist

Dark Angel
Joined: 31 May 18
Posts: 43
Credit: 4,305,745
RAC: 4,146
Message 70501 - Posted: 21 Feb 2024, 10:30:49 UTC

So is there any word on when further new work will drop?

Richard Haselgrove
Joined: 1 Jan 07
Posts: 943
Credit: 34,307,352
RAC: 11,277
Message 70502 - Posted: 21 Feb 2024, 10:37:03 UTC - in response to Message 70500.  

Or, since all CPDN applications will be floating point intensive, and will all suffer from FPU congestion on a hyperthreaded CPU, you could use the single project-level tag instead:

<project_max_concurrent>N</project_max_concurrent>
For a full list of the available options, see the BOINC user manual.
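One way to pick N is to aim for one task per physical core. A rough sketch, assuming every core runs the same number of hardware threads (2 by default), which will not hold on hybrid CPUs with a mix of core types; both function names are invented for this example.

```python
import os


def project_max_concurrent(logical_threads: int, threads_per_core: int = 2) -> int:
    """Estimate physical cores from the logical thread count (never below 1)."""
    return max(1, logical_threads // threads_per_core)


def render_app_config(n: int) -> str:
    """Return app_config.xml text using the single project-level tag."""
    return (
        "<app_config>\n"
        f"  <project_max_concurrent>{n}</project_max_concurrent>\n"
        "</app_config>\n"
    )


# A 16-thread CPU with 2 threads per core, like the Ryzen mentioned earlier.
print(render_app_config(project_max_concurrent(16)))

# For the local machine: os.cpu_count() reports logical CPUs and may return None.
local_threads = os.cpu_count() or 1
print(project_max_concurrent(local_threads))
```

If BOINC is only allowed to use some of the threads, it may make sense to pass that smaller number instead of the full thread count.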

kotenok2000
Joined: 22 Feb 11
Posts: 31
Credit: 226,546
RAC: 4,080
Message 70503 - Posted: 21 Feb 2024, 10:38:25 UTC - in response to Message 70501.  

as of 21 Feb 2024, 10:06:32 UTC there were 1052 unsent wah tasks.

SolarSyonyk
Joined: 7 Sep 16
Posts: 257
Credit: 31,983,481
RAC: 35,141
Message 70504 - Posted: 21 Feb 2024, 16:14:03 UTC - in response to Message 70503.  

as of 21 Feb 2024, 10:06:32 UTC there were 1052 unsent wah tasks.


Yeah, I lit up a new VM to chew on a few of those. I don't think there's more than a day or two before they're drained out, though (and it's a new machine, so it's in the "task quota limit" period - but should get 'em chewed pretty fast with few tasks on a big CPU). There's always resend work for a while after the count goes to zero, though.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 70505 - Posted: 21 Feb 2024, 21:39:54 UTC - in response to Message 70501.  
Last modified: 21 Feb 2024, 21:46:44 UTC

So is there any word on when further new work will drop?
Server status is currently showing 704 tasks ready to send, though doubtless that has dropped a bit since the last server update. I am guessing it may not be till next week that we get another of the misconfigured batches sent out. The person who normally sends batches out is away and I don't know how much time Glenn has free to do this. If he doesn't have time, it will have to wait till the person who normally does it is back.

Edit: 704 was from the newest batch. There were also a few retreads from 1001.

Dark Angel
Joined: 31 May 18
Posts: 43
Credit: 4,305,745
RAC: 4,146
Message 70506 - Posted: 21 Feb 2024, 22:16:02 UTC - in response to Message 70505.  

So is there any word on when further new work will drop?
Server status is currently showing 704 tasks ready to send, though doubtless that has dropped a bit since the last server update. I am guessing it may not be till next week that we get another of the misconfigured batches sent out. The person who normally sends batches out is away and I don't know how much time Glenn has free to do this. If he doesn't have time, it will have to wait till the person who normally does it is back.

Edit: 704 was from the newest batch. There were also a few retreads from 1001.


For some reason it's not letting me have any.
I upped the number of CPU cores and RAM in my VM last night to do more, extended my work cache settings, and freed up disk space, but it's still not giving me any more than the three I currently have.

SolarSyonyk
Joined: 7 Sep 16
Posts: 257
Credit: 31,983,481
RAC: 35,141
Message 70507 - Posted: 21 Feb 2024, 23:14:51 UTC - in response to Message 70506.  

For some reason it's not letting me have any.
I upped the number of CPU cores and RAM in my VM last night to do more, extended my work cache settings, and freed up disk space, but it's still not giving me any more than the three I currently have.


What's your client log say about the reason it's not requesting new work? There's usually some obvious-ish reason listed.

Glenn Carver
Joined: 29 Oct 17
Posts: 807
Credit: 13,593,584
RAC: 7,495
Message 70508 - Posted: 21 Feb 2024, 23:47:13 UTC - in response to Message 70505.  

So is there any word on when further new work will drop?
Server status is currently showing 704 tasks ready to send, though doubtless that has dropped a bit since the last server update. I am guessing it may not be till next week that we get another of the misconfigured batches sent out. The person who normally sends batches out is away and I don't know how much time Glenn has free to do this. If he doesn't have time, it will have to wait till the person who normally does it is back.
I've been sending out the WaH2 EAS25 batches as soon as they are ready. The previous mis-configured batches are still being checked and aren't ready. Linux batches are not far away, again, still under test on the dev site.
---
CPDN Visiting Scientist

Dark Angel
Joined: 31 May 18
Posts: 43
Credit: 4,305,745
RAC: 4,146
Message 70509 - Posted: 22 Feb 2024, 1:02:49 UTC - in response to Message 70507.  

For some reason it's not letting me have any.
I upped the number of CPU cores and RAM in my VM last night to do more, extended my work cache settings, and freed up disk space, but it's still not giving me any more than the three I currently have.


What's your client log say about the reason it's not requesting new work? There's usually some obvious-ish reason listed.


22/02/2024 00:52:38 | climateprediction.net | Sending scheduler request: To fetch work.
22/02/2024 00:52:38 | climateprediction.net | Requesting new tasks for CPU
22/02/2024 00:52:41 | climateprediction.net | Scheduler request completed: got 0 new tasks
22/02/2024 00:52:41 | climateprediction.net | No tasks sent
22/02/2024 00:52:41 | climateprediction.net | Project requested delay of 3636 seconds

That's all I'm getting for now. I'll enable a few more logging options and see if anything new comes up at the next update.
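For anyone wanting to work out when the client can next contact the project, the requested delay can be added to the log timestamp. This is a sketch only: it assumes the day/month/year timestamp layout shown in the log above, and `next_allowed_contact` is a hypothetical helper rather than anything from BOINC.

```python
import re
from datetime import datetime, timedelta

LOG = """\
22/02/2024 00:52:41 | climateprediction.net | No tasks sent
22/02/2024 00:52:41 | climateprediction.net | Project requested delay of 3636 seconds
"""

DELAY_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \| (.*?) \| "
    r"Project requested delay of (\d+) seconds$"
)


def next_allowed_contact(log_text: str):
    """Yield (project, earliest next contact time) for each requested delay."""
    for line in log_text.splitlines():
        match = DELAY_RE.match(line)
        if match:
            stamp, project, seconds = match.groups()
            when = datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")
            yield project, when + timedelta(seconds=int(seconds))


for project, when in next_allowed_contact(LOG):
    print(project, "can be contacted again from", when)
```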

Dark Angel
Joined: 31 May 18
Posts: 43
Credit: 4,305,745
RAC: 4,146
Message 70510 - Posted: 22 Feb 2024, 1:56:46 UTC

Log from the latest work fetch request (I let BOINC do it on its own; I didn't click update, so it would do the full time-out):

22/02/2024 01:54:14 | climateprediction.net | [css] running wah2_eas25_a33x_200512_24_1007_012268885_0 ( )
22/02/2024 01:54:14 | | [cpu_sched_debug] enforce_run_list: end
22/02/2024 01:54:26 | | choose_project(): 1708566866.014561
22/02/2024 01:54:26 | | [work_fetch] ------- start work fetch state -------
22/02/2024 01:54:26 | | [work_fetch] target work buffer: 259200.00 + 259200.00 sec
22/02/2024 01:54:26 | | [work_fetch] --- project states ---
22/02/2024 01:54:26 | climateprediction.net | [work_fetch] REC 721.330 prio -0.699 can't request work: scheduler RPC backoff (3570.09 sec)
22/02/2024 01:54:26 | | [work_fetch] --- state for CPU ---
22/02/2024 01:54:26 | | [work_fetch] shortfall 1031812.16 nidle 0.00 saturated 2431.98 busy 0.00
22/02/2024 01:54:26 | climateprediction.net | [work_fetch] share 0.000 project is backed off (resource backoff: 5007.51, inc 4800.00)
22/02/2024 01:54:26 | | [work_fetch] ------- end work fetch state -------
22/02/2024 01:54:26 | climateprediction.net | choose_project: scanning
22/02/2024 01:54:26 | climateprediction.net | skip: scheduler RPC backoff
22/02/2024 01:54:26 | | [work_fetch] No project chosen for work fetch
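The debug output is easier to read once the key numbers are pulled out. Below is a small sketch that extracts the target work buffer and any "can't request work" reason from lines like the ones above. The regular expressions are written against this log only and are not an official description of the BOINC log format; `summarise` is a made-up name.

```python
import re

LINES = [
    "22/02/2024 01:54:26 | | [work_fetch] target work buffer: 259200.00 + 259200.00 sec",
    "22/02/2024 01:54:26 | climateprediction.net | [work_fetch] REC 721.330 prio -0.699 "
    "can't request work: scheduler RPC backoff (3570.09 sec)",
]

BUFFER_RE = re.compile(r"target work buffer: ([\d.]+) \+ ([\d.]+) sec")
BLOCKED_RE = re.compile(
    r"\| (\S+) \| \[work_fetch\] .*can't request work: (.+?) \(([\d.]+) sec\)"
)


def summarise(lines):
    """Collect the work buffer (in days) and any 'can't request work' reasons."""
    summary = {"buffer_days": None, "blocked": []}
    for line in lines:
        if m := BUFFER_RE.search(line):
            # Convert both buffer values from seconds to days.
            summary["buffer_days"] = tuple(float(x) / 86400 for x in m.groups())
        elif m := BLOCKED_RE.search(line):
            project, reason, seconds = m.groups()
            summary["blocked"].append((project, reason, float(seconds)))
    return summary


result = summarise(LINES)
print(result)
```

Here the two buffer values come out at 3 days each, and the logged reason for not requesting work is a scheduler RPC backoff with about 3570 seconds still to run.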

©2024 climateprediction.net