Posts by Conan

1) Message boards : Number crunching : Batch 1017 Errors (Message 70977)
Posted 4 days ago by Profile Conan
Post:
Sorry the last 7 work units failed, but not due to faulty work units.

I ran out of memory when another programme started up using 1 GB per work unit and launched 22 of them. Normally that is not a problem, but with 2 Climate Prediction WUs running and using 3 to 5 GB each, I had nothing left.
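As a rough illustration of the arithmetic in that paragraph, here is a minimal sketch of the combined RAM demand. The helper name `memory_demand_gb` is made up for this example, and the figures are only the ones quoted above (22 tasks at 1 GB, 2 CPDN tasks at 3 to 5 GB); the machine's total RAM is not stated in the post, so it is left out.

```python
def memory_demand_gb(other_tasks, other_gb_each, cpdn_tasks, cpdn_gb_each):
    """Return the combined RAM demand in GB for both sets of tasks."""
    return other_tasks * other_gb_each + cpdn_tasks * cpdn_gb_each


# Figures from the post: 22 tasks at 1 GB each, 2 CPDN tasks at 3 to 5 GB each.
low = memory_demand_gb(22, 1, 2, 3)
high = memory_demand_gb(22, 1, 2, 5)
print(f"Estimated demand: {low} to {high} GB")
```

On those numbers the two projects together would want somewhere between 28 and 32 GB before counting anything else running on the machine.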

It took a while to get control of the computer back. I then aborted the other project's tasks and set it to No New Work, which should stop it from happening again.

Conan
2) Message boards : Number crunching : Batch 1017 Errors (Message 70976)
Posted 5 days ago by Profile Conan
Post:
The resent tasks are now running correctly and I completed one successfully with a few more running.

Thanks
Conan
3) Message boards : Number crunching : Batch 1017 Errors (Message 70951)
Posted 10 days ago by Profile Conan
Post:
The next 2 failed the same way.

My hosts are visible, so you can see the error messages.
I am running Linux Fedora 37 on a Ryzen 9 7900X and a 5900X. The 5900X has not returned a result yet.

Conan
4) Message boards : Number crunching : Batch 1017 Errors (Message 70949)
Posted 10 days ago by Profile Conan
Post:
Great to get some work after a very long time.

However two completed work units show an error after the 2nd trickle has been uploaded.

I think this happens after the 14th zip file.

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>oifs_43r3_bl_a05v_2016092300_20_1017_12282038_0_r1427327128_15.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>oifs_43r3_bl_a05v_2016092300_20_1017_12282038_0_r1427327128_16.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>oifs_43r3_bl_a05v_2016092300_20_1017_12282038_0_r1427327128_17.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>oifs_43r3_bl_a05v_2016092300_20_1017_12282038_0_r1427327128_18.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>oifs_43r3_bl_a05v_2016092300_20_1017_12282038_0_r1427327128_19.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
</message>

It ran for almost six and a half hours before failing.
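For anyone wanting to pull the missing files out of a longer stderr dump, a minimal sketch follows. It uses a regular expression rather than an XML parser because the pasted fragment is not a complete XML document. The function name `failed_uploads` is invented for this example, and treating the last number before `.zip` as the zip sequence number is an assumption based on the file names above.

```python
import re


def failed_uploads(stderr_text):
    """Return (file_name, error_code) pairs from <file_xfer_error> blocks."""
    pattern = re.compile(
        r"<file_xfer_error>\s*"
        r"<file_name>(.*?)</file_name>\s*"
        r"<error_code>(.*?)</error_code>\s*"
        r"</file_xfer_error>",
        re.DOTALL,
    )
    return pattern.findall(stderr_text)


# Two of the blocks from the log above, copied verbatim.
sample = """
<file_xfer_error>
<file_name>oifs_43r3_bl_a05v_2016092300_20_1017_12282038_0_r1427327128_15.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>oifs_43r3_bl_a05v_2016092300_20_1017_12282038_0_r1427327128_16.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
"""

results = failed_uploads(sample)
for name, code in results:
    # Assumption: the last underscore-separated field is the zip sequence number.
    seq = name.rsplit("_", 1)[1].split(".")[0]
    print(seq, code)
```

In the full log the missing files are numbers 15 to 19, which fits the guess that things go wrong after the 14th zip file.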

Conan
5) Message boards : Number crunching : New work discussion - 2 (Message 69557)
Posted 2 Sep 2023 by Profile Conan
Post:
Until we get more experience with volunteers running these high memory apps I think it makes sense to restrict it to a single task for now. We can change it later in light of experience.

No other projects I know of run tasks with memory requirements this high, so it's not obvious how they will be received. Let's walk first before we run with this.
LHC's ATLAS tasks at 10GB are the biggest I know of. But that's 8 threads, so you don't get people trying to run huge numbers of them. Are yours going to be single threads?


YOYO@home ECM/P2 tasks take at least 11 GB per task, single thread, which is why I stopped running them on my 32 GB machine and limit them to just 3 at a time on my 64 GB machine. They are real memory hogs.

Conan
6) Message boards : Number crunching : New work discussion - 2 (Message 69537)
Posted 28 Aug 2023 by Profile Conan
Post:
Any new work for 64 bit coming along? I noticed a couple of new entries on the server status page

OpenIFS 43r3
OpenIFS 43r3 Baroclinic Lifecycle
OpenIFS 43r3 Perturbed Surface
OpenIFS 43r3 Cubic Octahedral grid tco95 l91
OpenIFS 43r3 Linear grid tl255 l91


Thanks
Conan
7) Message boards : Number crunching : New work discussion - 2 (Message 68914)
Posted 18 Jun 2023 by Profile Conan
Post:
This is not related to new work, but following on from the last couple of posts:
CMDock uses a wrapper, and it shows under Linux.
I believe that YAFU also uses a wrapper, and possibly YOYO, SRBase, TNGrid? and a few others. In some cases it is needed because of the type of programme being used or the language it has been written in.

A few other projects also use a "trickle up" method to keep the server updated with progress (PrimeGrid is one), and some of these projects need a wrapper for this purpose.

Conan
8) Message boards : Number crunching : Server Status page questions (Message 68604)
Posted 19 Mar 2023 by Profile Conan
Post:
I have also wondered about the server page.

UK Met Office Coupled Model Full Resolution Ocean has had 927 tasks "in progress" for many months but I have seen no indication that any have been returned and the number never changes.

Weather At Home 2 (wah2) (region independent) has had 4,731 tasks in progress, again for many months, and again I have not seen any activity with this either (maybe 1 came back 4 months ago, but I can't be sure).

What is happening with these work units?

Conan
9) Message boards : Number crunching : Upload server is out of disk space (Message 67724)
Posted 14 Jan 2023 by Profile Conan
Post:
Hi Kali,

The server they go to is in Hobart, NZ. I should have spotted the NZ in the task name and thought of that. Most likely when Andy gets my message he will email the data centre in Tasmania. This has happened before on a number of occasions.

Dave


Actually Dave, Hobart is in Tasmania, Australia. Not NZ (New Zealand).

Conan
10) Message boards : Number crunching : The uploads are stuck (Message 67538)
Posted 11 Jan 2023 by Profile Conan
Post:
Yes I am still seeing "connect(): failed" messages on all upload tries.

But I still have 4 work units running and I am nowhere near filling up any disks, so no problem here.

Conan


It has changed to "transient HTTP error" now so still not working here yet (Australia).

Server Status has not changed yet, still showing nothing.

Conan

PS: Some files are now moving. Possibly due to the load, some fail and must retry later while others go through, at speeds from as low as 17 kB/s to as high as 1,700 kB/s.
11) Message boards : Number crunching : The uploads are stuck (Message 67525)
Posted 10 Jan 2023 by Profile Conan
Post:
Yes I am still seeing "connect(): failed" messages on all upload tries.

But I still have 4 work units running and I am nowhere near filling up any disks, so no problem here.

Conan
12) Message boards : Number crunching : Tasks failing on Ubuntu 22 (Message 67347)
Posted 5 Jan 2023 by Profile Conan
Post:
If you changed the option to "leave tasks in memory" but did not get BOINC to re-read the preferences file, the change may not take effect until the file is read.
Restarting BOINC would also read the file.

Conan
13) Message boards : Number crunching : Hardware for new models. (Message 67296)
Posted 4 Jan 2023 by Profile Conan
Post:
I saw some test results with the AMD RYZEN 5950X, RYZEN 7950X, INTEL 12900 and INTEL 13900 (I think they were the model names).

When all were under full load for whatever test they were doing:

RYZEN 9 5950X used 130 Watts
RYZEN 9 7950X used 270 Watts (or thereabouts)
INTEL 12900 used 285-290 Watts (or thereabouts)
INTEL 13900 used 315 Watts (or thereabouts)

I can't point you to the tests, but they were on YouTube along with others showing similar results.

So the RYZEN 5950X may not be as powerful as the new models, but for energy efficiency it is hard to beat.

That's of course if you can find them, they are getting harder to find.

I run a RYZEN 9 5900X, which has 12 cores and 24 threads and should use even less power as it has fewer cores than the 5950X.
It has 64 GB of RAM and, along with a full complement of other BOINC projects, easily runs 9 CPDN work units at a time. It only gets to about 42 GB max, depending on what I am running at the time (everything, not just CPDN). It may get higher than 42 GB, but I have the headroom to cover that.

BOINC has not downloaded more than 9 work units at any one time, probably because I am running a lot of other projects at the same time.

Conan
14) Message boards : Number crunching : OpenIFS Discussion (Message 66999)
Posted 22 Dec 2022 by Profile Conan
Post:
All 9 work units that I had running overnight have completed successfully.

Running on an AMD Ryzen 9 5900x, 64GB RAM, all 24 threads used to run BOINC programmes at the same time as the ClimatePrediction models.
All took around 17 hours 10 minutes run time.

Conan
15) Message boards : Number crunching : Late Validation pending (Message 66991)
Posted 21 Dec 2022 by Profile Conan
Post:
Well it seems that these files have finally been validated and I have been awarded credit for them, I think.

I have noticed that a clean-up has taken place and a lot of the old work units that I have done over the years have been removed.
Those 2 pending jobs are among them. I was awarded a small amount of credit this week, when I have not done any work, and now it seems that the database has had a bit of a clean out and fix up. Good to see.

Conan
16) Message boards : Number crunching : OpenIFS Discussion (Message 66990)
Posted 21 Dec 2022 by Profile Conan
Post:
G'Day Glenn,

You may have misread what I wrote, I think.

The 11.3 GB was not a file size but the amount of disk writes made in that first 2 hours (now, after 5 hours, well over 30 GB).
The 2.7 to 4.6 GB were RAM amounts that each work unit was using.

This was all taken from System Monitor.

I did what you asked, and:

% cd slots/26
% du -hs . # note the '.'
1.2G .

This is the same as your example.

% cd projects/climateprediction.net
% du -hs .
1.2G .

This is similar to your example.

du -hs srf*

768 MB srf00370000.0001

So all is running fine; I think it was just a bit of a misunderstanding about data amounts and RAM usage.

Thanks
Conan
17) Message boards : Number crunching : OpenIFS Discussion (Message 66983)
Posted 21 Dec 2022 by Profile Conan
Post:
These Oifs _ps tasks really test your system out.

Running 9 at once, each using from 2.7 to 4.2 GB of RAM, after 2 hours run time they have written 11.3 GB of data to disk each (101.7 GB), which is huge.
Hitting 50 GB of RAM in use out of 64 GB, but I am also running LODA tasks which each use 1 GB of RAM. All 24 threads are running.
12% in and running fine so far.
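A quick check of those figures, as a sketch only. The per-task numbers come from the post; the assumption that the other 15 threads are all running LODA tasks at 1 GB each is mine, since the post does not say exactly what else was running.

```python
CPDN_TASKS = 9
WRITES_PER_TASK_GB = 11.3        # disk writes per task after 2 hours (from the post)
RAM_PER_TASK_GB = (2.7, 4.2)     # observed RAM range per task (from the post)
TOTAL_THREADS = 24
LODA_RAM_GB = 1.0                # RAM per LODA task (from the post)

# Total data written by all 9 tasks.
total_writes = round(CPDN_TASKS * WRITES_PER_TASK_GB, 1)

# Assumption: every thread not running a CPDN task runs one LODA task.
loda_tasks = TOTAL_THREADS - CPDN_TASKS

ram_low = round(CPDN_TASKS * RAM_PER_TASK_GB[0] + loda_tasks * LODA_RAM_GB, 1)
ram_high = round(CPDN_TASKS * RAM_PER_TASK_GB[1] + loda_tasks * LODA_RAM_GB, 1)

print(f"Total disk writes: {total_writes} GB")
print(f"Estimated RAM in use: {ram_low} to {ram_high} GB")
```

Under that assumption the estimate comes out at 39.3 to 52.8 GB, so the 50 GB seen in System Monitor sits inside the expected range.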

Conan
18) Message boards : Number crunching : OpenIFS Discussion (Message 66795)
Posted 6 Dec 2022 by Profile Conan
Post:
My resent task 22249228 has been sent out twice before.

Previous Task 22246540 and Task 22248943

Task 22246540 has no Stderr. It failed with a Run Time of 1 Day 5 Hours and a CPU Time of 31 Minutes. It also had an unusually high Peak Disk Usage of 23,961.87 MB (about 24 GB), way above the norm from what I have seen.

Task 22248943 has the error "Process exited with code 9", but other than that it seemed to have run fine. This one belonged to wateroakley.

I was able to run this WU to completion without error.


Another resent task I have running is Task 22249324

Previous Task 22247025 and Task 22249194

Task 22247025 on computer 1524992 had a Run Time of 42 Minutes with a CPU Time of 20 Seconds and a Peak Disk Usage of just 404.06 MB.
This computer still has work on it but has not completed a successful OpenIFS WU. All failed work units have the same long run times and short CPU times, and they have different error codes as well: codes 1, 5 and 148 all appear on this computer.

Task 22249194 on computer 1504810 has no Stderr, and has a Run Time of 1 Day 1 Hour and a CPU Time of 7 Hours.
This computer has run 9 OpenIFS work units; all have failed with the long Run Time and short CPU Time.
This computer belongs to happywetter.at

So a few different reasons that some work units have failed or thrown an error.

Conan

I completed Task 22249324 successfully in just under 17 1/2 hours.
19) Message boards : Number crunching : OpenIFS Discussion (Message 66793)
Posted 5 Dec 2022 by Profile Conan
Post:
My resent task 22249228 has been sent out twice before.

Previous Task 22246540 and Task 22248943

Task 22246540 has no Stderr. It failed with a Run Time of 1 Day 5 Hours and a CPU Time of 31 Minutes. It also had an unusually high Peak Disk Usage of 23,961.87 MB (about 24 GB), way above the norm from what I have seen.

Task 22248943 has the error "Process exited with code 9", but other than that it seemed to have run fine. This one belonged to wateroakley.

I was able to run this WU to completion without error.


Another resent task I have running is Task 22249324

Previous Task 22247025 and Task 22249194

Task 22247025 on computer 1524992 had a Run Time of 42 Minutes with a CPU Time of 20 Seconds and a Peak Disk Usage of just 404.06 MB.
This computer still has work on it but has not completed a successful OpenIFS WU. All failed work units have the same long run times and short CPU times, and they have different error codes as well: codes 1, 5 and 148 all appear on this computer.

Task 22249194 on computer 1504810 has no Stderr, and has a Run Time of 1 Day 1 Hour and a CPU Time of 7 Hours.
This computer has run 9 OpenIFS work units; all have failed with the long Run Time and short CPU Time.
This computer belongs to happywetter.at

So a few different reasons that some work units have failed or thrown an error.

Conan
20) Message boards : Number crunching : OpenIFS Discussion (Message 66737)
Posted 3 Dec 2022 by Profile Conan
Post:
Just downloaded a resend of a Work Unit that failed due to an error.

This is Task 22245903.

It failed because it kept running for more than 5 minutes after the work unit had finished.

The WU was run by mikey and, apart from the extra run time after finishing, it seemed to have run successfully after more than 2 days of run time.

The run time seems overly long for a Ryzen, but it did complete.

It is now running as Task 22249047 on my Ryzen computer.

I will see how it runs for me.

Conan


Completed successfully after 16 1/2 hours.

Conan



©2024 climateprediction.net