Posts by bernard_ivo

1) Message boards : Number crunching : Why does this task fail ? (Message 67784)
Posted 17 Jan 2023 by bernard_ivo
Post:
Hi,
I have at least two WUs failing right at the end, at 100%, with
Exit status : 9 (0x00000009) Unknown error code

One of the WUs is this one:
https://www.cpdn.org/result.php?resultid=22298445
I'm not sure what the issue is.
2) Message boards : Number crunching : OpenIFS tasks : make sure boinc client option 'Leave non-GPU tasks in memory' is selected! (Message 67697)
Posted 14 Jan 2023 by bernard_ivo
Post:
Thanks Richard,
But there are plenty of oifs_43r3_ps tasks, and I did crunch them on the same machine.
3) Message boards : Number crunching : OpenIFS tasks : make sure boinc client option 'Leave non-GPU tasks in memory' is selected! (Message 67694)
Posted 14 Jan 2023 by bernard_ivo
Post:
I'm new to the app_config bandwagon, but I gave it a try.
<app_config>
   <project_max_concurrent>4</project_max_concurrent>
   <report_results_immediately/>
   <app>
      <name>oifs_43r3</name>
      <max_concurrent>3</max_concurrent>
   </app>
   <app>
      <name>oifs_43r3_ps</name>
      <max_concurrent>3</max_concurrent>
   </app>
   <app>
      <name>oifs_43r3_bl</name>
      <max_concurrent>3</max_concurrent>
   </app>
</app_config>

I got this message in the event log and in the BOINC notices:

Sat 14 Jan 2023 12:38:35 PM EET | climateprediction.net | Your app_config.xml file refers to an unknown application 'oifs_43r3'.  Known applications: 'hadam4', 'hadam4h', 'hadcm3s', 'hadsm4', 'oifs_43r3_ps'
Sat 14 Jan 2023 12:38:35 PM EET | climateprediction.net | Your app_config.xml file refers to an unknown application 'oifs_43r3_bl'.  Known applications: 'hadam4', 'hadam4h', 'hadcm3s', 'hadsm4', 'oifs_43r3_ps'


and I did not get any WUs. I will turn on <file_xfer> and <sched_ops> logging to see what happens... in an hour.
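
For reference, here is a trimmed sketch of the same file keeping only the application name the scheduler reports as known; I'm assuming the other <app> entries can simply be re-added once the oifs_43r3 and oifs_43r3_bl apps actually appear on the server:

<app_config>
   <project_max_concurrent>4</project_max_concurrent>
   <report_results_immediately/>
   <app>
      <name>oifs_43r3_ps</name>
      <max_concurrent>3</max_concurrent>
   </app>
</app_config>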
4) Message boards : News : Request to volunteers to please enable: 'Leave non-GPU tasks in memory' (Message 67611)
Posted 12 Jan 2023 by bernard_ivo
Post:
Finally CPDN has used the BOINC push notification:

climateprediction.net: Request to volunteers to please enable: 'Leave non-GPU tasks in memory'
The OpenIFS model batches require the option 'Leave non-GPU tasks in memory while suspended' to be enabled under boincmgr -> Disk & Memory. This will prevent the task from frequently restarting and reduce the risk of task failure.
Thu 05 Jan 2023 04:54:57 PM EET


It could also have been used for the missing 32-bit libs :) But so far so good.
5) Message boards : Number crunching : Tasks failing on Ubuntu 22 (Message 67450)
Posted 9 Jan 2023 by bernard_ivo
Post:
Thanks Glenn and Andrey,

I usually allow my machines to work 24/7, but I realise this laptop was suspended once or twice on Jan 7th as I was moving around, hence the accumulation of SRF files. As for the switch between tasks, I currently have it set to 60 minutes, but I've never seen it behave like this, though sometimes I have to pause some tasks to make BOINC continue with the ones I want. Moreover, CPDN is my highest priority at 75%, with WCG at 12.5% and WUProp at 12.5%, the latter being non-CPU-intensive. I will nevertheless increase the switch interval as suggested to reduce the switches.

As for the number of concurrent OIFS tasks, both machines work fine(ish) with 2 tasks at the same time (if I actually use the machines, swap kicks in), with 3 WUs on the i7-4790 at the extreme. With 4 the system crashed, and for several days I could not even start it to switch off BOINC. I can no longer crunch as I've hit the upload limit, but I may reduce it to 1 WU as suggested. Let's first clear the upload queue.

So yeah, these OIFS tasks are pushing the limits of my current machines, and more demanding ones are to come :) I'm looking forward to the upgraded batches, which should reduce some of the failures.
6) Message boards : Number crunching : Tasks failing on Ubuntu 22 (Message 67420)
Posted 8 Jan 2023 by bernard_ivo
Post:
Hi there,
I also have one WU that crashed with
<message>
Disk usage limit exceeded</message>
<stderr_txt>


I was not working on the computer; I allow only two OIFS tasks to run (4 HT cores), 16 GB RAM, and swap was around 2.6 GB out of the 8 GB allocated. CPU allowed to work at 100%, 'leave tasks in memory' checked.

For some reason I could not find the srf files via the terminal, so I looked with the Files browser.
In slot 1 I found 4 srf files; 3 of them were 805 MB each.
Slots 0 and 2 have no srf files.

So I wonder what could've gone wrong with this one?
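
(For anyone wanting to check this from the terminal: a minimal sketch, assuming the default Ubuntu package install keeps the BOINC data directory under /var/lib/boinc-client and that the restart files really do have 'srf' in their names:

sudo find /var/lib/boinc-client/slots -iname '*srf*' -exec ls -lh {} +
sudo du -ah /var/lib/boinc-client/slots | sort -rh | head -20

The first command lists any srf files with their sizes; the second shows the largest files in the slot directories regardless of name.)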
7) Message boards : Number crunching : The uploads are stuck (Message 67350)
Posted 5 Jan 2023 by bernard_ivo
Post:
Yesterday I uploaded all the zips and downloaded new WUs; the uploads are stuck again, and one of the finished WUs got a computation error due to a missing upload. I'm not sure whether the zips were lost because of my own machine's failure a week ago or because of upload problems.
8) Message boards : Number crunching : Hardware for new models. (Message 67302)
Posted 4 Jan 2023 by bernard_ivo
Post:
There are some Xeons running these batches. I can have a look through the logs and report back what runtimes they give, if that would help?

I do not want to divert you from your much more important work. Perhaps such stats could be exported on the CPDN site (a bit more useful than https://www.cpdn.org/cpu_list.php), rather than dug out manually from the logs (as most of us do when looking for sec/TS figures). The WUProp@Home project does collect some metrics from BOINC machines and allows comparisons between different hardware, but I haven't checked it recently and am not sure how many CPDN users are on it.

I realise there is no easy answer on what hardware to use for CPDN, so all the contributions so far are very useful. Thanks
9) Message boards : Number crunching : Hardware for new models. (Message 67295)
Posted 4 Jan 2023 by bernard_ivo
Post:
Thanks Glenn,

Much appreciated. I've learnt the hard way to limit the number of OpenIFS WUs running in parallel.

4/ I would say go for the fastest single-core speed and the most, fastest RAM you can afford as a priority. Don't worry too much about core count or cache size; as I said, OIFS moves a lot of data in and out of RAM. It's more about having a balanced machine for a code like OpenIFS.

This seems to rule out the older Xeon v2/v3 chips that are now widely available in second-hand workstations and servers. HadAM4 at N216 resolution is heavy on both RAM and L3, so is L3 a valid bottleneck to consider, or not as much as available RAM and CPU speed?
10) Message boards : Number crunching : OpenIFS tasks : make sure boinc client option 'Leave non-GPU tasks in memory' is selected! (Message 67284)
Posted 4 Jan 2023 by bernard_ivo
Post:
Work might get done faster if you leave things in memory, and it certainly helps if apps don't handle checkpoints well or don't have any.
But projects should work on their apps so they run stably even when not kept in memory, because they want to get their work done.
It's the time-stepping and complex nature of the CPDN weather models that makes them work that way. For every checkpoint, the model needs to dump its working arrays in 64-bit precision so it can do a bit-reproducible restart. That's a lot of I/O and a lot of data, but if you don't want to keep it in memory, it will have to restart from checkpoint more often than we currently allow for. That means much more I/O, file space, wearing out SSDs, etc. I could indeed allow the model to checkpoint more often to cope with being dropped in and out of memory frequently, but you'd pay a price on your drives instead of RAM, and get much slower throughput because of the added I/O.

How often does the OpenIFS model need to checkpoint? Looking at my event log it seems to be every second. Is that normal?

Wed 04 Jan 2023 11:37:46 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:47 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:48 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:49 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:50 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:51 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:52 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:53 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:54 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
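
(In case it is relevant: the client-side preference that is supposed to space checkpoints out lives in global_prefs_override.xml in the BOINC data directory; a minimal sketch, assuming disk_interval is indeed the element behind the 'checkpoint to disk at most every X seconds' setting:

<global_preferences>
   <disk_interval>60.0</disk_interval>
</global_preferences>

That would ask the client to let tasks write checkpoints at most once a minute; whether the [checkpoint] messages above reflect that setting or the model's own behaviour, I don't know.)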
11) Message boards : Number crunching : Hardware for new models. (Message 67279)
Posted 4 Jan 2023 by bernard_ivo
Post:
Looking at the RAM requirements for the next OpenIFS models, I wonder what kind of rig/computer I would need to upgrade to. My best machine is an i7-4790 with 16 GB RAM, which collapsed under 4 OIFS WUs running in parallel (I do not have an app_config.xml but plan to add one) and trashed a few WUs.

Meanwhile, the info on OpenIFS running on a Raspberry Pi got me excited that new ARM chips could be an efficient way to go, but then I remembered CPDN may not work on ARM for the foreseeable future, so I guess it will be some Ryzen box.

Anyway, advice from fellow CPDN users on what to upgrade to would be appreciated. Since I plan to run it 24/7 it should not be a power-hungry beast, i.e. a good balance between output and the electricity bill. :)
12) Message boards : Number crunching : OpenIFS Discussion (Message 66662)
Posted 30 Nov 2022 by bernard_ivo
Post:
Error: " climateprediction.net: Notice from server
OpenIFS 43r3 Perturbed Surface needs 21968.66MB more disk space. You currently have 16178.31 MB available and it needs 38146.97 MB.
Tue 29 Nov 2022 09:56:52 AM CET"
38 GB sounds like a lot, is that normal?


I have the same problem, but it may be because 'Store at least' was set to 10 days and 'Store up to an additional' to 10 days (so at max values). The server might be trying to push more WUs than there is available storage space for. I've changed these to see whether I will get any WUs.
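
(For reference, those two settings can also be set directly in global_prefs_override.xml; a minimal sketch, assuming work_buf_min_days and work_buf_additional_days are the elements behind 'Store at least' and 'Store up to an additional':

<global_preferences>
   <work_buf_min_days>1.0</work_buf_min_days>
   <work_buf_additional_days>0.5</work_buf_additional_days>
</global_preferences>

The 1.0/0.5 values are just an example of a much smaller buffer than the 10+10 days I had.)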
13) Message boards : Number crunching : Hardware requirements for upcoming models (Message 65773)
Posted 6 Aug 2022 by bernard_ivo
Post:
I would also include power consumption under 'others'. It seems more and more important as electricity prices soar while people plan their upgrades.
14) Message boards : Number crunching : Dr Lisa Su says up to 192MB L3 on newer Ryzen -- hope it's true (Message 65010)
Posted 26 Jan 2022 by bernard_ivo
Post:
In my case, simply installing
linux-tools-common
linux-tools-generic
(which should link to the latest kernel tools) did not work: running perf pointed to possibly missing tool libraries. Looking at my current kernel version and the available packages, I went on to add
linux-tools-generic-hwe-20.04, which points to the latest kernel.

Then I ran perf as superuser, and it showed this for my Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz:

Performance counter stats for 'system wide':

    12,511,363,439      cache-references                                            
     6,135,922,943      cache-misses              #   49.043 % of all cache refs    

      73.725181985 seconds time elapsed


I run half of the cores = 4 CPDN WUs; RAM is 16 GB.
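
(For anyone wanting to reproduce this: a sketch of the kind of invocation that produces that system-wide output; the exact duration and event list I used may have differed:

sudo perf stat -a -e cache-references,cache-misses sleep 60

This counts cache references and misses across all CPUs for 60 seconds and then prints a summary like the one above.)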
15) Message boards : climateprediction.net Science : Misconfigured Machine? (Message 64971)
Posted 15 Jan 2022 by bernard_ivo
Post:
This one https://www.cpdn.org/results.php?hostid=1510055 has crashed all ~12k WUs and is continuing to do so.

This one https://www.cpdn.org/results.php?hostid=829775 has been crashing everything since 2020; it had 67 valid results before that.

And this one https://www.cpdn.org/results.php?hostid=1517479 has crashed all ~ 10k WUs and is continuing to do so.

Can this reporting be automated somehow? The level of micromanagement CPDN requires and the reluctance of staff to adjust some basic things is becoming daunting.
16) Message boards : Number crunching : Lost tasks (Message 64969)
Posted 14 Jan 2022 by bernard_ivo
Post:
I just got a WU from batch 891. On its first attempt it spent a year with 'No response' (Jan 2021 - Jan 2022). 5 attempts are allocated to this batch, so in the worst-case scenario one could expect 5 years in vain. On its second attempt it errored in seconds (I guess missing 64-bit libraries). On its 3rd attempt one of my machines got it, and it's been computing fine. I'm not waiting for someone to tell me that this batch is closed and I should abort. While climate change is accelerating, keeping the one-year deadline is a kind of climate change denial. The community here has been asking for a shorter deadline for ages already. How hard is that to change, and why are we ignored even on the most common-sense suggestions?

I even have a ghost task in progress from Jan 2014 with a deadline in Jul 2023 - so I'm close.
17) Message boards : Number crunching : Tasks by application = hoarding (Message 64465)
Posted 14 Sep 2021 by bernard_ivo
Post:
Is the impression of hoarding created by the very high number of tasks shown as in progress for applications that have not issued tasks for some time and where the number of active users shows zero?

This, surely, is a historical issue of failed tasks that have not been crossed off the list of tasks outstanding.

Would it be possible to synchronise the number shown as outstanding with the number of tasks that are still being processed?


Yes, there are ghost tasks. I have two among the 8 WUs shown as in progress for me. One of the ghosts was issued in 2014 and its deadline is 2023, so yeah, I have 'run' it for 7 years. There have been several requests to clean up the ghosts, without much result. Detaching and reattaching to the project sometimes works, but not always.

And yes, a shorter deadline of circa 4-6 months is completely reasonable and would still accommodate older machines that run other projects as well.

Reissuing tasks might be useful for researchers, but numerous times I've crunched batches that were no longer of interest to anyone. Yeah, my machines 'saved' the 3rd or 5th attempt of a WU after it spent a few years idling on someone else's computer. Old batches are not always pulled out.

Sometimes I've had to manually abort WUs so as not to waste resources on work of no interest. A shorter deadline could fix that as well, but hey, it seems too much to ask every time this pops up.
18) Message boards : Number crunching : Completed tasks not showing on server (Message 64082)
Posted 25 Jun 2021 by bernard_ivo
Post:
Hi folks,

I have this one successfully finished but still shown as 'In progress' on the web: https://www.cpdn.org/result.php?resultid=22089752
19) Message boards : Number crunching : batches closed (Message 63588)
Posted 2 Mar 2021 by bernard_ivo
Post:
Hi,
I just got one from batch 837 from 2019 - hadcm3s_qu49_190012_240_837 - on its 2nd attempt. Its 1st attempt's status is 'Didn't need'. Should I crunch it? It is on a machine newly attached to the project, so I might as well monitor whether it behaves well, but still, if the WU is not needed, I should move on to a more recent batch.
20) Questions and Answers : Unix/Linux : Missing options on current versions of BOINC (Message 63447)
Posted 1 Feb 2021 by bernard_ivo
Post:
Have you ticked "Stop running tasks when exiting the BOINC Manager"? If not, the client keeps going in the background. If you don't have these options when exiting the Manager, go to Options > Other options and enable the Manager exit dialogue and client shutdown dialogue.


No intention to divert the thread topic, but under Ubuntu 20.04 and BOINC 7.16.6 this function does not work (I have two machines like that). No matter whether I select the option, the Manager always closes without the dialogue window mentioned.

And thanks to Andy, I will keep posting misconfigured machines.

