Posts by cetus

1) Message boards : Number crunching : Batch 1005 WAH2 NZ region (Message 70201)
Posted 25 Jan 2024 by cetus
Post:
I have two of these running. Both seem to be uploading OK.
2) Message boards : Number crunching : OpenIFS Discussion (Message 68300)
Posted 13 Feb 2023 by cetus
Post:
> I think you may have killed a normal task. How long was it 'hanging' for?

It ran for about 1.5 hours after the model seemed to have finished. It looked like it was in the same state as that 2-minute pause at the end, but it just never finished.
3) Message boards : Number crunching : OpenIFS Discussion (Message 68295)
Posted 13 Feb 2023 by cetus
Post:
Also still seeing the task completing successfully and then hanging in the final call to boinc. It's not clear what's causing this, but it seems to be a known issue with boinc.

I had one task with this issue: https://www.cpdn.org/result.php?resultid=22307151
Before I aborted it, I copied the slot directory that it was running in. If you want to see any of those files, let me know.
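Before aborting a task like this, its kernel state can give a hint about what it is stuck on. A minimal sketch (substitute the stuck science app's PID, as reported by `ps`, for `$$` — the shell's own PID is used here only so the example runs as-is):

```shell
# Sketch: inspect the kernel state of a hung task before aborting it.
# Substitute the stuck process's PID for $$ (the shell's own PID,
# used here only so this runs as written).
pid=$$
# State is R (running), S (sleeping), D (uninterruptible disk wait),
# Z (zombie), T (stopped). A task hung in a final call back to the
# client would typically show S (sleeping, waiting on the client).
awk '/^State:/ { print $2, $3 }' "/proc/$pid/status"
```

A task stuck in D state points at I/O rather than the client handshake, which would change where to look next.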
4) Message boards : Number crunching : The uploads are stuck (Message 68119)
Posted 30 Jan 2023 by cetus
Post:
I have the same problem.
5) Message boards : Number crunching : OpenIFS Discussion (Message 68023)
Posted 24 Jan 2023 by cetus
Post:
It might be related to the 'memory faults' that also occur: on my machine, the 'process still running' error appeared right after I saw a task fail with 'double free or corruption'.

I have also seen and killed several detached model.exe processes that seem to occur after the model fails with a "double free or corruption (out)" error. I've started looking with "ps -efl | grep boinc" whenever I see a task with a computation error. The bad process is pretty easy to find because the parent PID is set to "1", instead of the PID of a controlling process. It also has the same slot number as another process. I suspect that there is a detached process every time the corruption error happens, but I haven't looked consistently enough to be certain.
Do you have any insight into how the intermediate data is used? It's easy to imagine looking at final results of 40,000 runs, and it's easy to imagine looking at the intermediate results of a few runs, but I have a hard time imagining sorting through the massive amount of data that we are generating here.
6) Message boards : Number crunching : New work discussion - 2 (Message 66925)
Posted 15 Dec 2022 by cetus
Post:
Glen,
I looked in syslog, kern.log and the systemd journal, but did not see anything unusual while the job was running or when it ended.

The boinc log messages for when the job failed were:
Dec 14 12:04:05 hal boinc[2320]: 14-Dec-2022 12:04:05 [climateprediction.net] Finished upload of oifs_43r3_bl_a019_2016092300_15_949_12166439_0_r1529103669_9.zip
Dec 14 12:04:07 hal boinc[2320]: 14-Dec-2022 12:04:07 [climateprediction.net] Computation for task oifs_43r3_bl_a019_2016092300_15_949_12166439_0 finished
Dec 14 12:04:07 hal boinc[2320]: 14-Dec-2022 12:04:07 [climateprediction.net] Output file oifs_43r3_bl_a019_2016092300_15_949_12166439_0_r1529103669_10.zip for task oifs_43r3_bl_a019_2016092300_15_949_12166439_0 absent
Dec 14 12:04:07 hal boinc[2320]: 14-Dec-2022 12:04:07 [climateprediction.net] Output file oifs_43r3_bl_a019_2016092300_15_949_12166439_0_r1529103669_11.zip for task oifs_43r3_bl_a019_2016092300_15_949_12166439_0 absent
Dec 14 12:04:07 hal boinc[2320]: 14-Dec-2022 12:04:07 [climateprediction.net] Output file oifs_43r3_bl_a019_2016092300_15_949_12166439_0_r1529103669_12.zip for task oifs_43r3_bl_a019_2016092300_15_949_12166439_0 absent
Dec 14 12:04:07 hal boinc[2320]: 14-Dec-2022 12:04:07 [climateprediction.net] Output file oifs_43r3_bl_a019_2016092300_15_949_12166439_0_r1529103669_13.zip for task oifs_43r3_bl_a019_2016092300_15_949_12166439_0 absent
Dec 14 12:04:07 hal boinc[2320]: 14-Dec-2022 12:04:07 [climateprediction.net] Output file oifs_43r3_bl_a019_2016092300_15_949_12166439_0_r1529103669_14.zip for task oifs_43r3_bl_a019_2016092300_15_949_12166439_0 absent
7) Message boards : Number crunching : New work discussion - 2 (Message 66920)
Posted 15 Dec 2022 by cetus
Post:
I also had a task that failed with that error; however, the model did not finish:

https://www.cpdn.org/result.php?resultid=22250347

Exit status 9 (0x00000009) Unknown error code
...
12:03:35 STEP 973 H= 243:15 +CPU= 10.376
12:03:45 STEP 974 H= 243:30 +CPU= 10.186
12:03:56 STEP 975 H= 243:45 +CPU= 10.185
double free or corruption (out)
12:04:06 STEP 976 H= 244:00 +CPU= 10.574

</stderr_txt>

The same computer successfully completed 11 other jobs from the latest oifs batch. They were run 6 at a time, with around 20 GB of free RAM available.
I'm certainly fine with test jobs being sent to it, if you want to.
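For what it's worth, an exit status of 9 often corresponds to the process dying on signal 9 (SIGKILL) — which the kernel OOM killer sends when memory runs short, a plausible companion to heap-corruption errors. That BOINC is reporting the raw signal number here is an assumption, but the number itself is easy to decode with standard shell tools:

```shell
# Decode a raw signal number to its name. If BOINC's "exit status 9"
# is in fact signal 9 (an assumption; nothing CPDN-specific here):
kill -l 9    # prints the signal name for number 9 (SIGKILL)

# Shells report a signal death as 128 + signal number, so a wait
# status of 137 would also point at SIGKILL:
echo $((137 - 128))
```

If SIGKILL is the culprit, the kernel log (`kern.log` or `journalctl -k`) around the failure time should show an oom-kill entry naming the process.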
8) Message boards : Number crunching : OpenIFS Discussion (Message 66734)
Posted 3 Dec 2022 by cetus
Post:
I noticed an issue that I don't think has been reported yet.
One of my machines has run 46 oifs jobs: 12 ended with computation errors, and the rest appear to have completed successfully. After the boinc client finished all the jobs, three oifs processes were still running (no master.exe processes).

$ ps -flU boinc
F S UID          PID    PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY      TIME CMD
4 S boinc       2449       1  0  99  19 - 76998 -      Nov29 ?    00:16:34 /usr/bin/boinc --gui_rpc_port 31418
0 S boinc    1657533    2449  0  99  19 - 35465 -      Dec01 ?    00:01:18 ../../projects/climateprediction.net/oifs_43r3_ps_1.01_x86_64-pc-linux-gnu 2021050100 hpi1 1734 946 12164823 123 oifs_43r3_ps 1
0 S boinc    2001745    2449  0  99  19 - 35464 -      Nov30 ?    00:02:00 ../../projects/climateprediction.net/oifs_43r3_ps_1.01_x86_64-pc-linux-gnu 2021050100 hpi1 1833 946 12164922 123 oifs_43r3_ps 1
0 S boinc    2147924    2449  0  99  19 - 35465 -      Nov30 ?    00:02:15 ../../projects/climateprediction.net/oifs_43r3_ps_1.01_x86_64-pc-linux-gnu 2021050100 hpi1 1942 946 12165031 123 oifs_43r3_ps 1

Two of the slot directories still have 6 files in them; another has more than 300. The rest of the slots are empty.
In the projects/climateprediction.net directory, there are 9 directories with names like oifs_43r3_ps_12163845 that appear to be job folders that were not deleted after the job finished. I have the BOINC directory archived if the contents are of interest.
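Leftovers like these can be spotted with a quick scan. A sketch, assuming a packaged-install data directory of /var/lib/boinc-client (the path varies by install, so adjust `BOINC_DIR`):

```shell
# Sketch: report non-empty slot directories after all tasks have
# finished; anything left behind suggests a task exited uncleanly.
scan_slots() {
    # $1: path to the BOINC "slots" directory
    for d in "$1"/*/; do
        [ -d "$d" ] || continue
        n=$(find "$d" -type f | wc -l)
        [ "$n" -gt 0 ] && echo "$d: $n files"
    done
    return 0
}

# BOINC_DIR is an assumed default path -- adjust for your install.
scan_slots "${BOINC_DIR:-/var/lib/boinc-client}/slots"
```

The same loop pointed at projects/climateprediction.net would flag the stale oifs_43r3_ps_* job folders.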

The computer is a 12-core 5900X with 64 GB of RAM. The oifs jobs were run 8 at a time. I never noticed less than 15 GB of free RAM while 8 were running, though of course I wasn't watching most of the time.

Here are all the error jobs:
https://www.cpdn.org/result.php?resultid=22248825 double free or corruption
https://www.cpdn.org/result.php?resultid=22248783 this one looks like it finished, but has exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT, job # 12164823
https://www.cpdn.org/result.php?resultid=22248662 double free or corruption
https://www.cpdn.org/result.php?resultid=22246507 double free or corruption
https://www.cpdn.org/result.php?resultid=22248118 double free or corruption
https://www.cpdn.org/result.php?resultid=22246441 double free or corruption
https://www.cpdn.org/result.php?resultid=22246293 double free or corruption
https://www.cpdn.org/result.php?resultid=22247041 this one looks like it finished, but has exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT, job # 12165031
https://www.cpdn.org/result.php?resultid=22246587 double free or corruption
https://www.cpdn.org/result.php?resultid=22248533 this one looks like it finished, but has exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT, job # 12164922
https://www.cpdn.org/result.php?resultid=22248053 double free or corruption
https://www.cpdn.org/result.php?resultid=22246923 double free or corruption

The three jobs marked "aborted" correspond to the same three processes that are still running; I did not abort any of them manually.
9) Questions and Answers : Unix/Linux : crash with code 251 (Message 11462)
Posted 27 Mar 2005 by cetus
Post:
> One more time -- fifth 4.11 crash, this time on Dbox, third of my machines to
> have a "hardware failure" with this version. Increasingly disgusting that
> there is no information about a potential solution.
>
> [Edit: Corrected a typo.]
>
> Is this affecting only the minority of us posting the errors? Or is it
> widespread and unnoticed, ignored, or silently driving people away from CPDN?
>

I appear to have the same problem. I've had three 4.11 crashes in phase one on a P4 Linux box. The same machine is still successfully running a 4.04 model, so a hardware issue seems unlikely.




©2024 climateprediction.net