New Work Announcements 2024

Jean-David Beyer

Joined: 5 Aug 04
Posts: 1063
Credit: 16,546,621
RAC: 2,321
Message 70288 - Posted: 2 Feb 2024, 14:26:46 UTC - in response to Message 70272.  
Last modified: 2 Feb 2024, 14:29:54 UTC

Although you could get a larger case, I always use full towers.


My machine is already a full tower.

https://www.dell.com/support/manuals/en-us/precision-5820-workstation/precision_5820_om_pub/front-view?guid=guid-37c8fd9c-4ee2-4c39-89f9-061167ff006d&lang=en-us

https://www.dell.com/support/manuals/en-us/precision-5820-workstation/precision_5820_om_pub/major-components-of-your-system?guid=guid-3f127ece-ad92-4fd6-bbbc-b6548ebd69c4&lang=en-us
ID: 70288
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 70290 - Posted: 2 Feb 2024, 14:41:52 UTC - in response to Message 70288.  

My machine is already a full tower.
Then I don't understand you not being able to fit a larger fan. I have a 6 inch (15cm) cube cooler in my Ryzen machines. Dual 150mm fans, almost silent.
ID: 70290
wujj123456
Joined: 14 Sep 08
Posts: 87
Credit: 32,981,759
RAC: 14,695
Message 70293 - Posted: 2 Feb 2024, 18:17:57 UTC - in response to Message 70276.  

Setting the defaults to 1-2 and resetting all current preferences initially seems reasonable to me, but I really hope we can honor overrides afterwards. This solves the problem of people who never read the forums, while allowing people who pay attention to use more cores on bigger machines once they have updated app_config.
One caveat is that the setting is global, so it would also negatively affect WAH and HadAM4 even though they don't face the same memory problem. Other than WCG, I haven't seen per-app max jobs settings. I suppose it won't be a trivial change on the server side to implement that, but if we could, that would be the best IMO.
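
For what it's worth, a per-app cap can already be approximated on the client side with app_config.xml. The following is only a minimal sketch, not project guidance: it builds a file using BOINC's `max_concurrent` and `project_max_concurrent` elements. The app names and limits are placeholders (assumptions made for illustration); the real short app names have to be taken from your own client.

```python
import xml.etree.ElementTree as ET


def build_app_config(per_app_limits, project_limit=None):
    """Return app_config.xml text with one <app> block per entry."""
    root = ET.Element("app_config")
    for app_name, limit in per_app_limits.items():
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = app_name
        # Cap on how many tasks of this app run at the same time.
        ET.SubElement(app, "max_concurrent").text = str(limit)
    if project_limit is not None:
        # Optional cap across all apps of the project.
        ET.SubElement(root, "project_max_concurrent").text = str(project_limit)
    if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Hypothetical app names and limits, chosen only for illustration.
EXAMPLE_LIMITS = {"oifs_example": 2, "wah2_example": 8}
xml_text = build_app_config(EXAMPLE_LIMITS, project_limit=8)
print(xml_text)
```

The resulting text would be saved as app_config.xml in the project's folder inside the BOINC data directory and picked up after asking the client to re-read its config files. This only limits what runs locally; it does not change how many tasks the server sends.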
ID: 70293
Bryn Mawr
Joined: 28 Jul 19
Posts: 147
Credit: 12,814,088
RAC: 261,385
Message 70345 - Posted: 9 Feb 2024, 12:43:14 UTC - in response to Message 70102.  

Copied from old thread from Glen.

Forthcoming batches

The following batches are planned for Jan (or early Feb).

a/ Weather@Home (Windows)*

NZ25 - New Zealand 25km grid, natural forcings.
EAS25 - East Asia 25km grid, range of different forcings.


b/ HadAM4 (Linux)
N216 climatological runs producing high frequency northern-hemisphere output.

c/ OpeniFS (Linux)
Low resolution batch to look at variation of model results across different hardware


*We'll also roll out updated versions of the apps for Weather@Home, HadAM4, & HadSM4 to fix issues with the models failing, particularly on restarts. Although we hope to get these out before the Weather@Home batches it may not happen due to time pressure from the projects funding these batches.

Hoping some of these might come sooner rather than later but I have given up holding my breath!


Any further news of these?

I need to return a result to get rid of the spurious RAC figure from the correction that was done months ago (308,000 where my boxes are capable of 80,000 at best).
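
A rough illustration of why a stale RAC figure can linger, and how quickly it should fall once it is recalculated. This is a sketch under two assumptions that are not confirmed in this thread: that RAC is an exponentially decaying average with BOINC's default one-week half-life, and that the displayed value is only recomputed when new credit is granted. The 90-day gap is a made-up stand-in for "months ago".

```python
HALF_LIFE_DAYS = 7.0  # assumption: BOINC's default one-week half-life


def decayed_rac(rac, idle_days, half_life_days=HALF_LIFE_DAYS):
    """Old RAC contribution left after idle_days with no new credit."""
    return rac * 0.5 ** (idle_days / half_life_days)


# Hypothetical gap of 90 days since the correction was applied.
remaining = decayed_rac(308_000, 90)
print(f"Left of the spurious 308,000 after 90 days: {remaining:.1f}")
```

Under these assumptions the old figure would contribute almost nothing once a new result is credited, and RAC would then track the real output of the machines.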
ID: 70345
Glenn Carver
Joined: 29 Oct 17
Posts: 809
Credit: 13,604,352
RAC: 5,068
Message 70346 - Posted: 9 Feb 2024, 13:38:21 UTC - in response to Message 70345.  

The priority is to get the replacement EAS25 batches out (for aborted 1002-1004) once the troublesome files have been corrected and tested. It's highly likely they will be using the new WAH2 app, which has been in development and has been tested to confirm it does not suffer from the excessive failures. It has already been added to the main site as v8.29 of the WAH2 Region Independent app (or wah2-ri for short). If you receive a workunit for wah2-ri 8.29, you're running the new app. George (aka geophi) noted in testing that the new app is ~10% faster than the old one.

It's been done this way so as not to interfere with currently running wah2 workunits. There is also a new Linux version of wah2-ri which is currently in test (not on the main site yet).

The OpenIFS batch is ready & has been tested, but I'm not happy with some of the failures coming from the monitor code (not the model). To be discussed.

The HadAM4 batch, I think, is about ready; it maybe needs a bit more testing.

So, a lot of Windows & Linux work coming soon.

I'll be able to confirm more next week after the usual Monday CPDN meeting.

HTH
---
CPDN Visiting Scientist
ID: 70346
Bryn Mawr
Joined: 28 Jul 19
Posts: 147
Credit: 12,814,088
RAC: 261,385
Message 70347 - Posted: 9 Feb 2024, 19:29:18 UTC - in response to Message 70346.  

Many thanks :-)
ID: 70347
David Berg
Joined: 2 Jul 15
Posts: 18
Credit: 3,921,009
RAC: 1,450
Message 70348 - Posted: 12 Feb 2024, 19:16:44 UTC - in response to Message 70346.  

I received two new tasks for EAS Batch 1006, running 8.29, this morning.
One of the two NZ Batch 1005 tasks (running 8.24) I had running disappeared overnight. The other is still running. Not sure what happened to it or where to look to find out.
ID: 70348
Richard Haselgrove
Joined: 1 Jan 07
Posts: 943
Credit: 34,321,434
RAC: 11,478
Message 70349 - Posted: 12 Feb 2024, 19:25:39 UTC - in response to Message 70348.  

Not sure what happened to it or where to look to find out.
You can look in your computer's task list from your home page on this website.

All tasks for computer 1367467

Unfortunately, in this particular case, not much evidence has been preserved.
ID: 70349
kotenok2000
Joined: 22 Feb 11
Posts: 31
Credit: 226,546
RAC: 4,080
Message 70350 - Posted: 12 Feb 2024, 19:27:38 UTC
Last modified: 12 Feb 2024, 19:31:30 UTC

I see lots of suspends in tasks that failed quickly.
Did you set "Suspend when computer is in use" and define "in use" as "mouse and computer activity in last 0 minutes", making it stop and immediately resume?
ID: 70350
Glenn Carver
Joined: 29 Oct 17
Posts: 809
Credit: 13,604,352
RAC: 5,068
Message 70351 - Posted: 12 Feb 2024, 21:38:04 UTC - in response to Message 70350.  

Also make sure the "Leave non-GPU tasks in memory while suspended" option is selected. This prevents the models from constantly restarting as a new process and reading from the start files. If this is not enabled, the chance of a failure increases.
---
CPDN Visiting Scientist
ID: 70351
David Berg
Joined: 2 Jul 15
Posts: 18
Credit: 3,921,009
RAC: 1,450
Message 70352 - Posted: 12 Feb 2024, 22:29:28 UTC - in response to Message 70349.  

Thank you. I didn't know about this page. I see now how to navigate to it.
I see many "Error[s] while computing." Are those errors manifested by my system over which I have some control, or errors within the model or data?
ID: 70352
David Berg
Joined: 2 Jul 15
Posts: 18
Credit: 3,921,009
RAC: 1,450
Message 70353 - Posted: 12 Feb 2024, 22:32:12 UTC - in response to Message 70351.  

Thank you, Glenn. That option was not checked. I updated it now.
ID: 70353
kotenok2000
Joined: 22 Feb 11
Posts: 31
Credit: 226,546
RAC: 4,080
Message 70354 - Posted: 12 Feb 2024, 22:33:48 UTC

Fortunately the client was sending intermediate results, so progress wasn't lost.
ID: 70354
David Berg
Joined: 2 Jul 15
Posts: 18
Credit: 3,921,009
RAC: 1,450
Message 70355 - Posted: 12 Feb 2024, 22:39:22 UTC - in response to Message 70350.  

Following are my Preferences. I am naive about how these preferences affect my processing of cpdn tasks. I am very open to suggestions to improve my configuration. My computer is no longer actively involved in much other work, so cpdn can rise to the top of my priority computing. Please advise changes/enhancements that you suggest I implement.

When computer is in use
'In use' means mouse/keyboard input in last 3 minutes
Suspend all computing
Suspend GPU computing
Use at most 75 % of the CPUs
Use at most 50 % of CPU time
Suspend when non-BOINC CPU usage is above 30 %
Use at most 38 % of memory

When computer is not in use
Use at most 75 % of the CPUs (requires BOINC 7.20.3+)
Use at most 50 % of CPU time (requires BOINC 7.20.3+)
Suspend when non-BOINC CPU usage is above 30 % (requires BOINC 7.20.3+)
Use at most 75 % of memory
Suspend when no mouse/keyboard input in last --- minutes

General
Suspend when computer is on battery
Switch between tasks every 60 minutes
Request tasks to checkpoint at most every 60 seconds
Leave non-GPU tasks in memory while suspended
Store at least --- days of work
Store up to an additional 0.25 days of work
Compute only between ---

Disk
Use no more than 100 GB
Leave at least 0.001 GB free
Use no more than 50 % of total
Page/swap file: use at most 75 %

Network
Limit download rate to --- KB/second
Limit upload rate to --- KB/second
Limit usage to --- MB every --- days
Transfer files only between ---
Skip data verification for image files
Confirm before connecting to Internet
Disconnect when done
ID: 70355
Alan K
Joined: 22 Feb 06
Posts: 485
Credit: 29,638,939
RAC: 3,372
Message 70356 - Posted: 12 Feb 2024, 23:13:57 UTC - in response to Message 70355.  

You could try setting "Use at most ... % of CPU time" to 100%, both for when the computer is in use and for when it is not in use, to reduce the number of suspends.
ID: 70356
David Berg
Joined: 2 Jul 15
Posts: 18
Credit: 3,921,009
RAC: 1,450
Message 70357 - Posted: 12 Feb 2024, 23:20:41 UTC - in response to Message 70356.  

Thank you. I went ahead and did that. I also updated "Suspend when non-BOINC use exceeds ..." to 50%.
ID: 70357
SolarSyonyk
Joined: 7 Sep 16
Posts: 257
Credit: 32,009,159
RAC: 33,657
Message 70358 - Posted: 12 Feb 2024, 23:43:31 UTC - in response to Message 70352.  
Last modified: 12 Feb 2024, 23:44:33 UTC

Thank you. I didn't know about this page. I see now how to navigate to it.
I see many "Error[s] while computing." Are those errors manifested by my system over which I have some control, or errors within the model or data?


"Yes." :/

In an ideal world, BOINC tasks should be well behaved and not care if they're repeatedly suspended/resumed, reloaded from checkpoints, etc. It may hurt rate of progress on them, but they shouldn't crash or error out or generate different results from if they're run straight through.

CPDN tasks tend not to be that well behaved (I believe a lot of them are simplified versions of supercomputing code), and generally don't like being suspended/resumed (large clusters tend to just run tasks until done). There is work ongoing to fix that (with what sounds like some good progress on that front!), but they're best started once and left to run until either they hit some impossible conditions (the planet has cooled so much the atmosphere is now liquified), or they crash (which shouldn't happen, but does, and is being improved over time).

If you suspend/resume the computer (S3 sleep, typically), it doesn't bother the tasks, and I do that regularly. But try to avoid causing the actual tasks to have to stop computation and resume regularly. That just doesn't work well right now.
ID: 70358
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 70359 - Posted: 13 Feb 2024, 0:22:46 UTC - in response to Message 70355.  

CPDN prefers that you not use hyperthreading: it doesn't speed things up much, and each task takes longer because you run twice as many at once. So if you're only doing CPDN, you're best setting it to use only 50% of the CPUs. If you're doing other projects too, it gets more complicated: you need to use the app config file to tell BOINC that each CPDN task uses 2 cores.
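
The arithmetic behind both options can be sketched as follows. This is a deliberately simplified model, not the real BOINC scheduler, which also weighs memory limits, other projects and deadlines. The 8-core/16-thread machine is a hypothetical example, and `cores_per_task` stands in for the per-task CPU count declared in the app config file.

```python
import math


def usable_threads(logical_threads, cpu_percent):
    """Threads available under 'Use at most N % of the CPUs' (at least 1)."""
    return max(1, math.floor(logical_threads * cpu_percent / 100))


def concurrent_tasks(logical_threads, cpu_percent, cores_per_task=1.0):
    """Tasks that fit if each task is budgeted cores_per_task threads."""
    return math.floor(usable_threads(logical_threads, cpu_percent) / cores_per_task)


# Hypothetical machine: 8 physical cores, 16 logical threads.
print(concurrent_tasks(16, 100))                    # every thread gets a task
print(concurrent_tasks(16, 50))                     # 50% of CPUs: one task per core
print(concurrent_tasks(16, 100, cores_per_task=2))  # budget 2 threads per task
```

In this model both approaches end up with one CPDN task per physical core. The difference is that the percentage setting applies to every project, while the per-task budget only affects CPDN.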
ID: 70359
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4347
Credit: 16,541,921
RAC: 6,087
Message 70360 - Posted: 13 Feb 2024, 6:53:54 UTC

#1006 6048 tasks 2024-02-12 ALL WAH2_ri East Asia 25km

These are from the reworked code, so please do say whether they behave themselves and are less prone to crashes, as has been observed in testing.
ID: 70360
Richard Haselgrove
Joined: 1 Jan 07
Posts: 943
Credit: 34,321,434
RAC: 11,478
Message 70361 - Posted: 13 Feb 2024, 9:35:19 UTC - in response to Message 70360.  

I mentioned some time ago that my travelling laptop crashed a test task with the old app, with a signal 11 at startup. That host is approaching 10% on wah2_eas25_h0k1_201012_24_1006_012259529_0.

I also have a tiny, low power, Celeron box (about the size and shape of a portable CD library) - picked up to test a 64-bit BOINC error on some low power processors, now resolved (host 1548871). That one is also running a task successfully, but has only reached 3% over the same timescale.
ID: 70361
©2024 climateprediction.net