Posts by bernard_ivo

21) Message boards : Number crunching : Upload server is out of disk space (Message 69474)
Posted 15 Aug 2023 by bernard_ivo
Post:
I switched off network activity yesterday and switched it back on today. I still can't get the rest of the zips through. Here is the log
22) Message boards : Number crunching : Upload server is out of disk space (Message 69469)
Posted 14 Aug 2023 by bernard_ivo
Post:
It was working for some time and I managed to upload most of the queue. However it is down again and I can't upload. Connection timed out, transient HTTP error.


I have emailed Andy. Sadly, it being Friday afternoon, this may not get looked at till Monday.
This is now back up again.


I managed to upload a few zips and then:

14/08/2023 12:52:38 | climateprediction.net | [http] [ID#5921] Info:  Connection #635 to host upload7.cpdn.org left intact
14/08/2023 12:52:38 |  | Internet access OK - project servers may be temporarily down.
14/08/2023 12:52:38 | climateprediction.net | [error] Error reported by file upload server: [wah2_eas25_a2eu_200511_25_994_012218564_0_r1079552417_18.zip] locked by file_upload_handler PID=29787
14/08/2023 12:52:38 | climateprediction.net | Temporarily failed upload of wah2_eas25_a2eu_200511_25_994_012218564_0_r1079552417_18.zip: transient upload error
14/08/2023 12:52:38 | climateprediction.net | Backing off 00:06:09 on upload of wah2_eas25_a2eu_200511_25_994_012218564_0_r1079552417_18.zip
14/08/2023 12:52:38 | climateprediction.net | [http] [ID#5920] Info:  Connected to upload7.cpdn.org (141.223.16.156) port 80 (#634)


Some percentage of the file was uploaded and then the connection was lost.
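
For anyone wanting to see how long the client waits between retries, here is a minimal Python sketch that pulls the back-off time out of a log line like the one above. The function name `backoff_seconds` is my own, and it assumes the "Backing off" value is always printed as HH:MM:SS.

```python
import re

# One line copied from the event log quoted above.
LINE = ("14/08/2023 12:52:38 | climateprediction.net | Backing off 00:06:09 on upload of "
        "wah2_eas25_a2eu_200511_25_994_012218564_0_r1079552417_18.zip")


def backoff_seconds(line):
    """Return the back-off duration in seconds, or None if the line has none.

    Assumption: the duration is always printed as HH:MM:SS.
    """
    match = re.search(r"Backing off (\d+):(\d{2}):(\d{2})", line)
    if match is None:
        return None
    hours, minutes, seconds = (int(part) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds


print(backoff_seconds(LINE))  # 369
```

So in this case the client would retry the upload after a bit more than six minutes.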
23) Message boards : Number crunching : Upload server is out of disk space (Message 69466)
Posted 11 Aug 2023 by bernard_ivo
Post:
It was working for some time and I managed to upload most of the queue. However it is down again and I can't upload. Connection timed out, transient HTTP error.
24) Message boards : Number crunching : Upload server is out of disk space (Message 69454)
Posted 7 Aug 2023 by bernard_ivo
Post:
Andy is aware of this. I know the OS has recently been upgraded on the server and Andy had this on his list of things to look at waiting on his desk when he came back from leave on Wednesday. What I don't know is whether this is now a problem he can fix or a problem at the place in Korea where the files get sent.


Here is a pastebin of the log
25) Message boards : Number crunching : Upload server is out of disk space (Message 69448)
Posted 7 Aug 2023 by bernard_ivo
Post:
Hi folks,
I do have some WAHs from batch 994 still crunching, but I can't upload zips to UPLOAD7.cpdn.org; the uploads fail with a transient HTTP error. This has been happening since at least 4 Aug.

EDIT: I noticed there is another thread: The uploads are stuck - so perhaps a moderator could move this post there. Apologies for the inconvenience.
26) Message boards : Number crunching : Ghost work units? (Message 69447)
Posted 2 Aug 2023 by bernard_ivo
Post:
My last standing ghost WU is now "over" after 9 years in the "In Progress" queue with a "Timed out - no response" status as of 19.07.2023 (got the WU back on 15.01.2014)

https://www.cpdn.org/result.php?resultid=16272420

Finally my queue is clear :)
27) Message boards : Number crunching : Why does this task fail ? (Message 67784)
Posted 17 Jan 2023 by bernard_ivo
Post:
Hi,
I have at least two WUs failing right at the end, at 100%, with
Exit status : 9 (0x00000009) Unknown error code

One of the WUs is this one
https://www.cpdn.org/result.php?resultid=22298445 I'm not sure what the issue is.
28) Message boards : Number crunching : OpenIFS tasks : make sure boinc client option 'Leave non-GPU tasks in memory' is selected! (Message 67697)
Posted 14 Jan 2023 by bernard_ivo
Post:
Thanks Richard,
But there are plenty of oifs_43r3_ps tasks and I did crunch them on the same machine.
29) Message boards : Number crunching : OpenIFS tasks : make sure boinc client option 'Leave non-GPU tasks in memory' is selected! (Message 67694)
Posted 14 Jan 2023 by bernard_ivo
Post:
I'm new to the app_config wagon, but gave it a try.
<app_config>
   <project_max_concurrent>4</project_max_concurrent>
   <report_results_immediately/>
   <app>
      <name>oifs_43r3</name>
      <max_concurrent>3</max_concurrent>
   </app>
   <app>
      <name>oifs_43r3_ps</name>
      <max_concurrent>3</max_concurrent>
   </app>
   <app>
      <name>oifs_43r3_bl</name>
      <max_concurrent>3</max_concurrent>
   </app>
</app_config>

I got this message in the event log and in BOINC notices:

Sat 14 Jan 2023 12:38:35 PM EET | climateprediction.net | Your app_config.xml file refers to an unknown application 'oifs_43r3'.  Known applications: 'hadam4', 'hadam4h', 'hadcm3s', 'hadsm4', 'oifs_43r3_ps'
Sat 14 Jan 2023 12:38:35 PM EET | climateprediction.net | Your app_config.xml file refers to an unknown application 'oifs_43r3_bl'.  Known applications: 'hadam4', 'hadam4h', 'hadcm3s', 'hadsm4', 'oifs_43r3_ps'


and I did not get any WUs. I will turn on <file_xfer> and <sched_ops> to see what happens... in one hour
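
To double-check which names the client is complaining about, here is a small Python sketch that compares the <name> entries of an app_config.xml with the "Known applications" list from the event log message. It only illustrates the check; the function names are made up, the XML is compacted from the one posted above, and the list of known applications is simply whatever the log line reports at that moment.

```python
import re
import xml.etree.ElementTree as ET

# The app_config.xml from the post above (whitespace compacted).
APP_CONFIG = """<app_config>
   <project_max_concurrent>4</project_max_concurrent>
   <report_results_immediately/>
   <app><name>oifs_43r3</name><max_concurrent>3</max_concurrent></app>
   <app><name>oifs_43r3_ps</name><max_concurrent>3</max_concurrent></app>
   <app><name>oifs_43r3_bl</name><max_concurrent>3</max_concurrent></app>
</app_config>"""

# Message text copied from the event log above.
LOG_LINE = ("Your app_config.xml file refers to an unknown application 'oifs_43r3'.  "
            "Known applications: 'hadam4', 'hadam4h', 'hadcm3s', 'hadsm4', 'oifs_43r3_ps'")


def known_apps(log_line):
    """Extract the quoted names that follow 'Known applications:'."""
    _, _, tail = log_line.partition("Known applications:")
    return set(re.findall(r"'([^']+)'", tail))


def configured_apps(xml_text):
    """Return the <name> of every <app> block, in file order."""
    root = ET.fromstring(xml_text)
    return [app.findtext("name") for app in root.findall("app")]


def unknown_apps(xml_text, log_line):
    """Names in app_config.xml that the client does not list as known."""
    known = known_apps(log_line)
    return [name for name in configured_apps(xml_text) if name not in known]


print(unknown_apps(APP_CONFIG, LOG_LINE))  # ['oifs_43r3', 'oifs_43r3_bl']
```

Only oifs_43r3_ps matches an application the client knows about here, which agrees with the two messages in the log.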
30) Message boards : News : Request to volunteers to please enable: 'Leave non-GPU tasks in memory' (Message 67611)
Posted 12 Jan 2023 by bernard_ivo
Post:
Finally CPDN has used the BOINC push notification:

climateprediction.net: Request to volunteers to please enable: 'Leave non-GPU tasks in memory'
The OpenIFS model batches require the option 'Leave non-GPU in memory while suspended' to be enabled under boincmgr -> Disk&Memory. This will prevent the task from frequently restarting and reduce the risk of task failure.
Thu 05 Jan 2023 04:54:57 PM EET


It could also have been used for the missing 32-bit libs :) But so far so good.
31) Message boards : Number crunching : Tasks failing on Ubuntu 22 (Message 67450)
Posted 9 Jan 2023 by bernard_ivo
Post:
Thanks Glenn and Andrey,

I usually let my machines work 24/7, but I realise this laptop was suspended once or twice on Jan 7th as I was moving around, hence the accumulation of SRF files. As for the switch between tasks, I currently have it at 60 minutes, but I've never seen it behave like this, though sometimes I have to pause some tasks to make BOINC continue with the ones I want. Moreover, CPDN is my highest priority at 75%, with WCG at 12.5% and WUProp at 12.5%, the latter being non-CPU-intensive. However, I will increase the time as suggested to reduce the switches.

As for the number of concurrent OIFS tasks, both machines work fine(ish) with 2 tasks at the same time (if I use the machines, then swap kicks in), with 3 WUs being the extreme for the i7-4790. With 4, the system crashed and I could not start it for several days to switch off BOINC. I can no longer crunch as I hit the upload limit, but I may reduce WUs to 1 as suggested. Let's first clear the upload queue.

So yeah, these OIFS are pushing the limits of my current machines and more demanding ones are to come :) looking forward to the upgraded batches which should reduce some failures.
32) Message boards : Number crunching : Tasks failing on Ubuntu 22 (Message 67420)
Posted 8 Jan 2023 by bernard_ivo
Post:
Hi there,
I also have one WU that crashed with
<message>
Disk usage limit exceeded</message>
<stderr_txt>


I was not working on the computer. I allow only two OIFS tasks to run (4 HT), with 16 GB RAM; swap usage was around 2.6 GB of the 8 GB allocated. CPU is allowed to work at 100% and "leave tasks in memory" is checked.

For some reason I could not find the srf files via the terminal, so I looked with the Files browser.
In slot 1 I found 4 srf files; 3 of them were 805 MB.
I also have slots 0 and 2, with no srf files in them.

So I wonder what could've gone wrong with this one?
33) Message boards : Number crunching : The uploads are stuck (Message 67350)
Posted 5 Jan 2023 by bernard_ivo
Post:
Yesterday I uploaded all the zips and downloaded new WUs, but uploads are stuck again and one of the finished WUs got a computation error due to a missing upload. I'm not sure whether the zips were lost due to the failure of my own machine a week ago or to some upload problem.
34) Message boards : Number crunching : Hardware for new models. (Message 67302)
Posted 4 Jan 2023 by bernard_ivo
Post:
There are some Xeons running these batches. I can have a look through the logs and report back what runtimes they give, if that would help?

I do not want to divert you from your much more important work. Perhaps such stats could be exported on the CPDN site (a bit more useful than https://www.cpdn.org/cpu_list.php), rather than dug out manually from logs (as most of us do when looking for sec/TS figures). There is the WUProp@Home project, which collects some metrics from BOINC machines and allows comparison between different hardware, but I haven't checked it recently and I'm not sure how many CPDN users are on it.

I realise there is no easy answer on what hardware to use for CPDN, so all the contributions so far are very useful. Thanks
35) Message boards : Number crunching : Hardware for new models. (Message 67295)
Posted 4 Jan 2023 by bernard_ivo
Post:
Thanks Glenn,

Much appreciated. I've learnt the hard way to limit the number of OpenIFS WUs running in parallel.

4/ I would say go for fastest single core speed & the most, fastest, RAM you can afford as priority. Don't worry too much about core count or cache size, as I said OIFS moves alot of data in/out of RAM. It's more about having a balanced machine for a code like OpenIFS.

This seems to rule out the older Xeon v2 and v3 CPUs that are now more available in second-hand workstations and servers. HadAM4 at N216 resolution is heavy on both RAM and L3, so is L3 a valid bottleneck to consider, or does it matter less than available RAM and CPU speed?
36) Message boards : Number crunching : OpenIFS tasks : make sure boinc client option 'Leave non-GPU tasks in memory' is selected! (Message 67284)
Posted 4 Jan 2023 by bernard_ivo
Post:
Work might get done faster if you let things in memory, and certainly it helps if apps don't handle checkpoints well or don't have any.
But the projects should work on their apps to run stable even if not kept in memory, because they want to get their work done.
It's the time-stepping & complex nature of the CPDN weather models that they work that way. For every checkpoint, the model needs to dump its working arrays in 64-bit precision so it can do a bit-reproducible restart. That's a lot of I/O and a lot of data, but if you don't want to keep it in memory, it'll have to restart from checkpoint more often than we currently allow for. That means much more I/O, filespace, wearing out SSDs etc. I could indeed allow the model to checkpoint more to cope with being in & out of memory frequently, but you'd pay a price on your drives instead of RAM, and a much slower throughput because of the added I/O.

How often does the OpenIFS model need to checkpoint? Looking at my event log it seems every second? Is that normal?

Wed 04 Jan 2023 11:37:46 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:47 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:48 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:49 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:50 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:51 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:52 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:53 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:54 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
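
To put a number on "every second", here is a rough Python sketch that measures the gaps between consecutive [checkpoint] lines in the event log. It assumes the timestamp format shown above with English day and month names, and it drops the timezone abbreviation before parsing; the helper names are my own, and only the first three lines of the excerpt are used.

```python
from datetime import datetime

# First three lines of the event log excerpt above.
LOG = """\
Wed 04 Jan 2023 11:37:46 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:47 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:48 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
"""


def checkpoint_times(log_text):
    """Parse the timestamp of every line that contains a [checkpoint] message."""
    times = []
    for line in log_text.splitlines():
        if "[checkpoint]" not in line:
            continue
        stamp = line.split("|", 1)[0].split()
        # Drop the trailing timezone abbreviation (EET) before parsing.
        times.append(datetime.strptime(" ".join(stamp[:-1]), "%a %d %b %Y %I:%M:%S %p"))
    return times


def intervals_seconds(times):
    """Seconds between each pair of consecutive checkpoints."""
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


print(intervals_seconds(checkpoint_times(LOG)))  # [1.0, 1.0]
```

The log only has one-second resolution, so this shows how often the message is written, not how long each checkpoint takes.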
37) Message boards : Number crunching : Hardware for new models. (Message 67279)
Posted 4 Jan 2023 by bernard_ivo
Post:
Looking at the RAM requirements for the next OpenIFS models, I wonder what kind of rig/computer I would need to upgrade to. My best machine is an i7-4790 with 16 GB RAM, which collapsed under 4 OIFS WUs running in parallel (I do not have an app_config.xml but plan to have one) and trashed a few WUs.

Meanwhile, the info on OpenIFS running on a Raspberry Pi got me excited that new ARM machines might be an efficient way to go, but then I remembered that CPDN may not work on ARM for the foreseeable future, so I guess it will be some Ryzen box.

Anyway, some advice from fellow CPDN users on what to upgrade to would be appreciated. Since I plan to use it 24/7, it should not be a power-hungry beast, i.e. it should strike a good balance between output and electricity bill. :)
38) Message boards : Number crunching : OpenIFS Discussion (Message 66662)
Posted 30 Nov 2022 by bernard_ivo
Post:
Error: " climateprediction.net: Notice from server
OpenIFS 43r3 Perturbed Surface needs 21968.66MB more disk space. You currently have 16178.31 MB available and it needs 38146.97 MB.
Tue 29 Nov 2022 09:56:52 AM CET"
38 Gb sounds alot, normal?


I have the same problem, but it may be because "Store at least" was set to 10 days and "Store up to an additional" was also set to 10 days (so both at MAX values). So the server might be trying to push more WUs than the available storage space allows. I've changed these to see if I will get any WUs.
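
As a sanity check on the numbers in that server notice, here is a tiny Python sketch of the arithmetic: the shortfall is simply what the task needs minus what is available. The function name is my own; the figures are copied from the quoted message.

```python
def disk_shortfall_mb(needed_mb, available_mb):
    """Extra disk space (in MB) needed before the task can be sent."""
    return round(max(0.0, needed_mb - available_mb), 2)


needed = 38146.97      # "it needs 38146.97 MB"
available = 16178.31   # "You currently have 16178.31 MB available"

print(disk_shortfall_mb(needed, available))  # 21968.66
```

That matches the "needs 21968.66MB more disk space" figure, so the notice is consistent with itself; the open question is only why the requirement is that large.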
39) Message boards : Number crunching : Hardware requirements for upcoming models (Message 65773)
Posted 6 Aug 2022 by bernard_ivo
Post:
I would also include power consumption under others. It seems more and more important as electricity prices soar while people are planning their upgrades.
40) Message boards : Number crunching : Dr Lisa Su says up to 192MB L3 on newer Ryzen -- hope it's true (Message 65010)
Posted 26 Jan 2022 by bernard_ivo
Post:
In my case, simply installing
linux-tools-common
linux-tools-generic
(which should link to the latest kernel tools) did not work.
Running perf pointed to possible missing tool libraries, so after looking at my current kernel number and the available packages
I added
linux-tools-generic-hwe-20.04, which points to the latest kernel.

Then I ran perf as superuser and it showed this for my Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz:

Performance counter stats for 'system wide':

    12,511,363,439      cache-references                                            
     6,135,922,943      cache-misses              #   49.043 % of all cache refs    

      73.725181985 seconds time elapsed


I run CPDN on half of the cores (4 WUs); RAM is 16 GB.
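
For reference, the percentage perf prints is just cache-misses divided by cache-references. A short Python sketch that reproduces it from the two counters above (the helper name `parse_count` is my own):

```python
def parse_count(text):
    """Turn a perf counter such as '12,511,363,439' into an int."""
    return int(text.replace(",", ""))


cache_references = parse_count("12,511,363,439")
cache_misses = parse_count("6,135,922,943")

# Share of all cache references that missed, as a percentage.
miss_pct = 100 * cache_misses / cache_references

print(f"{miss_pct:.3f} % of all cache refs")  # 49.043 % of all cache refs
```

So roughly every second cache reference was a miss during that 73-second sample.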


