OpenIFS Discussion

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1055
Credit: 16,516,801
RAC: 955
Message 68429 - Posted: 23 Feb 2023, 23:16:21 UTC - in response to Message 68423.  

My guess is it failed on the first machine because of lack of memory. With 11GB RAM and 4 cores, it looks likely that the user is running all four cores at once, which would explain a relatively high failure rate.

Not my problem.
Another 64 GBytes on order.


Order came in and four modules installed, so I am less likely to run out of memory than ever before.
OTOH, there are no CPDN tasks here, so none are running.

My first hard drive, at work, held 40 Megabytes and spun at 2400 rpm.
One or two decades later, I was amazed to have a 2 Gigabyte hard drive that spun at 7200 rpm.
And now I have 125 GBytes of RAM! Amazing the progress from 1965 until now!

Computer 1511241

Created 	14 Nov 2020, 15:37:02 UTC
Total credit 	7,074,017
Average credit 	1.58
CPU type 	GenuineIntel
Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors 	16
Operating System 	Linux Red Hat Enterprise Linux
Red Hat Enterprise Linux 8.7 (Ootpa) [4.18.0-425.13.1.el8_7.x86_64|libc 2.28]
BOINC version 	7.20.2
Memory 	125.34 GB
Cache 	16896 KB

$ free -hw
              total        used        free      shared     buffers       cache   available
Mem:          125Gi       4.7Gi       117Gi        26Mi        13Mi       2.7Gi       119Gi
Swap:          15Gi          0B        15Gi


ID: 68429

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1055
Credit: 16,516,801
RAC: 955
Message 68430 - Posted: 24 Feb 2023, 4:24:39 UTC - in response to Message 68361.  

Send Personal Message to me if interested rather than reply here. If there is sufficient interest, I'll share the files on dropbox. I'll post answers to PM'd questions here.


How do I do that?

If you are still interested, I raised my RAM to 128 GBytes this afternoon.

$ free -hw
              total        used        free      shared     buffers       cache   available
Mem:          125Gi       5.8Gi       114Gi        82Mi        13Mi       5.2Gi       118Gi
Swap:          15Gi          0B        15Gi

ID: 68430

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,471,353
RAC: 3,620
Message 68431 - Posted: 24 Feb 2023, 5:13:37 UTC - in response to Message 68430.  
Last modified: 24 Feb 2023, 5:15:35 UTC

Send Personal Message to me if interested rather than reply here. If there is sufficient interest, I'll share the files on dropbox. I'll post answers to PM'd questions here.


How do I do that?


Click on his name in the Author section of his post. It'll bring up an abbreviated profile page for him; then click on "Send personal message" on the right-hand side of the webpage.

Or, easier, just click on the "Send Message" button under his name in the Author section.
ID: 68431

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1055
Credit: 16,516,801
RAC: 955
Message 68432 - Posted: 24 Feb 2023, 5:27:47 UTC - in response to Message 68431.  

Or, easier, just click on the "Send Message" button under his name in the Author section.

I tried that and I got this:

User Glenn Carver (ID: 1560856) is not accepting private messages from you
ID: 68432

Glenn Carver
Joined: 29 Oct 17
Posts: 802
Credit: 13,560,429
RAC: 6,808
Message 68435 - Posted: 24 Feb 2023, 10:47:12 UTC

Next OpenIFS batch about to go out today. Missing file fixed.
---
CPDN Visiting Scientist
ID: 68435

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4341
Credit: 16,497,933
RAC: 6,477
Message 68437 - Posted: 24 Feb 2023, 11:50:08 UTC - in response to Message 68435.  
Last modified: 24 Feb 2023, 12:53:34 UTC

Next OpenIFS batch about to go out today. Missing file fixed.
#993 is now out there. My first one is about 6 minutes in so well past the time when the problem occurred.
Edit: third zip has now been sent.
ID: 68437

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4341
Credit: 16,497,933
RAC: 6,477
Message 68438 - Posted: 24 Feb 2023, 12:55:08 UTC

I wonder if there is a problem with the tasks being uploaded. I have only got one; a request an hour later said, "project has no tasks available." That seems a bit quick for them all to go for Linux tasks. Will check again after next request.
ID: 68438

Richard Haselgrove
Joined: 1 Jan 07
Posts: 942
Credit: 34,160,630
RAC: 5,164
Message 68439 - Posted: 24 Feb 2023, 13:10:26 UTC

There were 188 unsent on the server status page at 12:45, and I got one of them at 13:01. It's running, but still in the early stages - I'll watch how it runs for a while, before switching into full multi-fetch mode.
ID: 68439

Yeti
Joined: 5 Aug 04
Posts: 171
Credit: 9,982,362
RAC: 20,516
Message 68440 - Posted: 24 Feb 2023, 13:15:11 UTC

Mine seem to run fine: 40 minutes running without a failure, and zip files 0 to 5 have been uploaded.


Supporting BOINC, a great concept !
ID: 68440

Richard Haselgrove
Joined: 1 Jan 07
Posts: 942
Credit: 34,160,630
RAC: 5,164
Message 68441 - Posted: 24 Feb 2023, 13:21:46 UTC

Yikes - there are 123 upload files in all, and the first one was over 15 MB. Your band is going to get very bored, Dave!
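
As a rough back-of-envelope check on that upload volume, the sketch below multiplies the file count by the size of the first file. It assumes every upload is about as large as the first one (15 MB), which the thread does not confirm, and the 1 Mbit/s uplink is a purely hypothetical figure chosen for illustration.

```python
# Rough upload estimate for one OpenIFS task.
# Assumption: all 123 files are about the size of the first one (~15 MB).
# Assumption: the uplink speed used below is hypothetical, not from the thread.


def total_upload_mb(n_files: int, mb_per_file: float) -> float:
    """Total upload volume in megabytes."""
    return n_files * mb_per_file


def upload_hours(total_mb: float, uplink_mbit_per_s: float) -> float:
    """Hours needed to push total_mb through the given uplink."""
    seconds = total_mb * 8 / uplink_mbit_per_s  # 8 bits per byte
    return seconds / 3600


total = total_upload_mb(123, 15.0)
print(f"{total:.0f} MB per task")
print(f"{upload_hours(total, 1.0):.1f} hours at 1 Mbit/s")
```

Under those assumptions a single task uploads a little under 2 GB, so a slow connection could plausibly become the limit on how many tasks run at once.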
ID: 68441

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4341
Credit: 16,497,933
RAC: 6,477
Message 68442 - Posted: 24 Feb 2023, 13:46:28 UTC - in response to Message 68438.  
Last modified: 24 Feb 2023, 14:22:13 UTC

I wonder if there is a problem with the tasks being uploaded. I have only got one; a request an hour later said, "project has no tasks available." That seems a bit quick for them all to go for Linux tasks. Will check again after next request.
Maybe just getting there slowly. I have a second one downloading now, and two is my limit without overloading my connection. (I could let five or six go via my phone, which would be faster and still leave enough headroom for other usage.)

Edit: Estimate based on percentage completed rather than BOINC's guess is just over 10 hours on my Ryzen 7.
ID: 68442

Glenn Carver
Joined: 29 Oct 17
Posts: 802
Credit: 13,560,429
RAC: 6,808
Message 68443 - Posted: 24 Feb 2023, 14:20:00 UTC - in response to Message 68438.  

I wonder if there is a problem with the tasks being uploaded. I have only got one; a request an hour later said, "project has no tasks available." That seems a bit quick for them all to go for Linux tasks. Will check again after next request.

It's working fine. It just takes time to process all the tasks to be uploaded. They get taken up very quickly.
---
CPDN Visiting Scientist
ID: 68443

Richard Haselgrove
Joined: 1 Jan 07
Posts: 942
Credit: 34,160,630
RAC: 5,164
Message 68444 - Posted: 24 Feb 2023, 14:46:16 UTC - in response to Message 68443.  

The server status page has finally updated, and both 'Unsent' and 'In progress' have gone up substantially. Looks like the workunit generator is running at about twice the speed of our demand load, which is fine.
ID: 68444

alanb1951
Joined: 31 Aug 04
Posts: 32
Credit: 9,526,696
RAC: 109,831
Message 68445 - Posted: 24 Feb 2023, 14:46:20 UTC

Someone can have the retry for my first one of this batch: it got a "double free or corruption"...

The system in use is a Ryzen 3700X with 32GB RAM, and it is only using about half that under the current load (including a second OIFS task it got when reporting this one). I only run one CPDN task at a time, and none of the other BOINC stuff I'm currently running on that system (a maximum of 9 other processes) will get up to a single GB of RAM!

I'll keep an eye on both this system and the other one that also has a single CPDN task in its BOINC mix (a Ryzen 5600H with 32GB RAM, currently showing about 24GB free...)...

Cheers - Al.
ID: 68445

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1055
Credit: 16,516,801
RAC: 955
Message 68446 - Posted: 24 Feb 2023, 14:59:09 UTC - in response to Message 68438.  

I have only got one; a request an hour later said, "project has no tasks available." That seems a bit quick for them all to go for Linux tasks.


I have one running.
Since I just doubled my RAM size to 128 GBytes, I diddled app_config.xml to run two at a time.
I then got this:

Fri 24 Feb 2023 09:23:34 AM EST |  | Re-reading cc_config.xml
Fri 24 Feb 2023 09:23:34 AM EST | climateprediction.net | Config: excluded GPU.  Type: all.  App: all.  Device: all
Fri 24 Feb 2023 09:23:34 AM EST |  | Config: event log limit 5000 lines
Fri 24 Feb 2023 09:23:34 AM EST |  | log flags: file_xfer, sched_ops, task
Fri 24 Feb 2023 09:23:34 AM EST | climateprediction.net | Found app_config.xml
Fri 24 Feb 2023 09:23:59 AM EST | climateprediction.net | Sending scheduler request: To send trickle-up message.
Fri 24 Feb 2023 09:23:59 AM EST | climateprediction.net | Requesting new tasks for CPU
Fri 24 Feb 2023 09:24:01 AM EST | climateprediction.net | Scheduler request completed: got 0 new tasks
Fri 24 Feb 2023 09:24:01 AM EST | climateprediction.net | No tasks sent
Fri 24 Feb 2023 09:24:01 AM EST | climateprediction.net | This computer has finished a daily quota of 1 tasks


This is true enough, since I have never completed one of those before.

OpenIFS 43r3 1.21 x86_64-pc-linux-gnu
Number of tasks completed 	0
Max tasks per day 	1
Number of tasks today 	1
Consecutive valid tasks 	0
Average turnaround time 	0.00 days

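The daily-quota message in the event log above can be picked out programmatically. This is only a minimal sketch: it assumes log lines keep the "timestamp | project | message" layout shown in the excerpt, and the function names are made up for illustration rather than taken from any BOINC tool.

```python
# Minimal parser for BOINC event-log lines of the form
#   "<timestamp> | <project> | <message>"
# as they appear in the log excerpt above. Helper names are hypothetical.
import re
from typing import Iterable, NamedTuple, Optional


class LogLine(NamedTuple):
    timestamp: str
    project: str  # empty string for client-wide messages
    message: str


def parse_line(line: str) -> Optional[LogLine]:
    """Split one event-log line into its three fields, or return None."""
    parts = [p.strip() for p in line.split("|", 2)]
    if len(parts) != 3:
        return None
    return LogLine(*parts)


def daily_quota(lines: Iterable[str]) -> Optional[int]:
    """Return N from a 'finished a daily quota of N tasks' message, if any."""
    pattern = re.compile(r"finished a daily quota of (\d+) tasks")
    for raw in lines:
        parsed = parse_line(raw)
        if parsed is None:
            continue
        match = pattern.search(parsed.message)
        if match:
            return int(match.group(1))
    return None


sample = [
    "Fri 24 Feb 2023 09:24:01 AM EST | climateprediction.net | No tasks sent",
    "Fri 24 Feb 2023 09:24:01 AM EST | climateprediction.net | "
    "This computer has finished a daily quota of 1 tasks",
]
print(daily_quota(sample))
```

Splitting on at most two separators keeps any further "|" characters inside the message field intact.
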

The boincmgr thinks this task has about 2 1/2 days to go before it finishes.

top - 09:38:45 up 15:48,  1 user,  load average: 11.36, 11.45, 11.38
Tasks: 456 total,  12 running, 444 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  0.2 sy, 68.1 ni, 31.2 id,  0.0 wa,  0.1 hi,  0.0 si,  0.0 st
MiB Mem : 128345.2 total,  72454.7 free,  10400.3 used,  45490.2 buff/cache
MiB Swap:  15992.0 total,  15992.0 free,      0.0 used. 116670.2 avail Mem 

    PID    PPID USER      PR  NI S    RES  %MEM  %CPU  P     TIME+ COMMAND                                                                   
  56011   56007 boinc     39  19 R   4.5g   3.6  98.9  2 131:48.65 /var/lib/boinc/slots/9/oifs_43r3_model.exe                                

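For anyone wanting to work out how many tasks fit in memory before editing app_config.xml, here is a hedged sketch. It takes the 4.5 GiB resident size and 116670.2 MiB available figure from the top output above, and treats the resident size at that moment as typical, which may understate the peak. The 25% headroom is an arbitrary assumption, and the app name "oifs_43r3" is a guess based on the executable name, so check what your own client reports.

```python
# How many OpenIFS tasks fit in RAM, plus a matching app_config.xml.
# Figures from the top output above: ~4.5 GiB resident per task,
# 116670.2 MiB available. The 25% headroom and the app name
# "oifs_43r3" are assumptions, not values confirmed in the thread.
from xml.sax.saxutils import escape


def max_tasks_by_memory(avail_mib: float, task_gib: float,
                        headroom: float = 0.25) -> int:
    """Tasks that fit after reserving a fraction of memory as headroom."""
    usable_mib = avail_mib * (1.0 - headroom)
    return int(usable_mib // (task_gib * 1024))


def app_config_xml(app_name: str, max_concurrent: int) -> str:
    """Build an app_config.xml that caps concurrent tasks for one app."""
    return (
        "<app_config>\n"
        "  <app>\n"
        f"    <name>{escape(app_name)}</name>\n"
        f"    <max_concurrent>{int(max_concurrent)}</max_concurrent>\n"
        "  </app>\n"
        "</app_config>\n"
    )


print(max_tasks_by_memory(116670.2, 4.5))   # the 128 GB machine above
print(max_tasks_by_memory(11 * 1024, 4.5))  # an 11 GB machine
print(app_config_xml("oifs_43r3", 2))
```

On these numbers memory is no longer the constraint on the 128 GB machine, while an 11 GB machine has room for only one task at a time, which fits the earlier guess about why a four-core, 11GB host was failing.
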
ID: 68446

Glenn Carver
Joined: 29 Oct 17
Posts: 802
Credit: 13,560,429
RAC: 6,808
Message 68447 - Posted: 24 Feb 2023, 15:00:11 UTC - in response to Message 68445.  

Someone can have the retry for my first one of this batch: it got a "double free or corruption"...
This bug is bloody annoying. We've got a development version in testing that makes yet more changes to the memory handling in the code, which may or may not fix it, but it'll clean the code up anyway. We didn't have time to use it for this batch; there will probably be some test batches going out after this one.

The biggest problem is that I can't reproduce it in standalone testing, and catching one of these on my machine so I can see exactly where it went wrong is difficult.

We are getting there; the error rate has dropped substantially. It's a priority to solve it.
---
CPDN Visiting Scientist
ID: 68447

Yeti
Joined: 5 Aug 04
Posts: 171
Credit: 9,982,362
RAC: 20,516
Message 68448 - Posted: 24 Feb 2023, 16:39:12 UTC

:-(

My fastest and best-working cruncher recently got the definitely dead tasks, which all errored out, and now it has a daily quota of 1. :-(


Supporting BOINC, a great concept !
ID: 68448

Glenn Carver
Joined: 29 Oct 17
Posts: 802
Credit: 13,560,429
RAC: 6,808
Message 68449 - Posted: 24 Feb 2023, 16:51:02 UTC - in response to Message 68448.  

:-(

My fastest and best-working cruncher recently got the definitely dead tasks, which all errored out, and now it has a daily quota of 1. :-(
I had that. I created a new BOINC client on the same machine, on a different port, and attached it to CPDN. Works.
---
CPDN Visiting Scientist
ID: 68449

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1055
Credit: 16,516,801
RAC: 955
Message 68450 - Posted: 24 Feb 2023, 17:53:32 UTC - in response to Message 68449.  

My fastest and best-working cruncher recently got the definitely dead tasks, which all errored out, and now it has a daily quota of 1. :-(

I had that. I created a new BOINC client on the same machine, on a different port, and attached it to CPDN. Works.


I am not going to do that.
In the last 5 1/2 hours I have completed 38% of the work. So I will just wait it out. For one thing, it will be another day and, if there are any tasks left, I would get one. And if this one completes before that, they may raise my limit to 4 or 5 (I forget just what they do, but it is something like that).
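
The "38% in 5 1/2 hours" figure gives a simple way to estimate the finish time from progress so far, rather than relying on the manager's guess. The sketch below assumes the task advances at a constant rate, which is only an approximation for a model run.

```python
# Linear extrapolation of remaining run time from progress so far.
# Assumption: the task advances at a constant rate, which may not hold.
from typing import Tuple


def estimate_hours(elapsed_hours: float,
                   fraction_done: float) -> Tuple[float, float]:
    """Return (total, remaining) hours, extrapolating linearly."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    total = elapsed_hours / fraction_done
    return total, total - elapsed_hours


total, remaining = estimate_hours(5.5, 0.38)
print(f"total {total:.1f} h, remaining {remaining:.1f} h")
```

On that basis the task would need roughly 14.5 hours in all, with about 9 hours left, far less than the 2 1/2 days boincmgr reported earlier.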
ID: 68450

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4341
Credit: 16,497,933
RAC: 6,477
Message 68458 - Posted: 25 Feb 2023, 8:05:13 UTC

This task failed with a divide by zero error. Presumably this is one of those cases where the physics of the model get out of control?
ID: 68458