OpenIFS Discussion

Message boards : Number crunching : OpenIFS Discussion

Jean-David Beyer

Joined: 5 Aug 04
Posts: 1063
Credit: 16,546,621
RAC: 2,321
Message 68333 - Posted: 15 Feb 2023, 13:45:06 UTC - in response to Message 68332.  

Could the push notifications be used to ask users to limit the number of tasks they run concurrently?


How would that be worded? I am running 12 BOINC tasks at a time, of which 5 right now are _bl, and they all run just fine. There appears to be no memory shortage. Of all the OpenIFS tasks I have run (280), only one crashed (due to a double-free problem). I do not see how limiting the number of concurrent tasks would have helped with that.

$ date; free -hw
Wed Feb 15 08:33:58 EST 2023
              total        used        free      shared     buffers       cache   available
Mem:           62Gi        21Gi       4.8Gi       118Mi       150Mi        36Gi        40Gi
Swap:          15Gi       1.2Gi        14Gi



    PID    PPID USER      PR  NI S    RES  %MEM  %CPU  P     TIME+ COMMAND                                                                  
 766565  766546 boinc     39  19 R   4.0g   6.3  98.3  9 225:29.05 /var/lib/boinc/slots/11/oifs_43r3_model.exe                              
 766988  766985 boinc     39  19 R   3.9g   6.2  98.5 10 218:06.08 /var/lib/boinc/slots/7/oifs_43r3_model.exe                               
 768349  768346 boinc     39  19 R   3.5g   5.6  97.8  0 190:55.99 /var/lib/boinc/slots/5/oifs_43r3_model.exe                               
 762499  762494 boinc     39  19 R   3.4g   5.5  98.4  6 305:36.67 /var/lib/boinc/slots/6/oifs_43r3_model.exe                               
 768103  768098 boinc     39  19 R   3.4g   5.5  98.0  8 194:09.25 /var/lib/boinc/slots/9/oifs_43r3_model.exe        

ID: 68333
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4347
Credit: 16,541,921
RAC: 6,087
Message 68334 - Posted: 15 Feb 2023, 14:25:35 UTC


How would that be worded? I am running 12 BOINC tasks at a time, of which 5 right now are _bl, and they all run just fine. There appears to be no memory shortage. Of all the OpenIFS tasks I have run (280), only one crashed (due to a double-free problem). I do not see how limiting the number of concurrent tasks would have helped with that.


I agree the wording is important. There are, however, quite a few computers out there trying to run four or even more tasks at once with only 16 GB of memory or less. Some of these are failing every task they get.
ID: 68334
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1063
Credit: 16,546,621
RAC: 2,321
Message 68335 - Posted: 15 Feb 2023, 14:32:17 UTC - in response to Message 68334.  

I agree the wording is important. There are, however, quite a few computers out there trying to run four or even more tasks at once with only 16 GB of memory or less. Some of these are failing every task they get.


So would the wording be something like "divide RAM size by 7 GB to compute the maximum number to run at a time, and be sure to subtract the sizes of the non-OpenIFS tasks from the RAM size first"?
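
As a rough illustration of that arithmetic only (a sketch, not project guidance: the 7 GB figure is the per-task upper bound being discussed here, and the awk field position assumes the usual procps "free" layout):

# sketch: how many ~7 GB OpenIFS tasks would fit in the RAM currently available
avail_gb=$(free -g | awk '/^Mem:/ {print $7}')    # "available" column, in GiB
echo "max concurrent OpenIFS tasks: $(( avail_gb / 7 ))"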
ID: 68335
Yeti

Joined: 5 Aug 04
Posts: 171
Credit: 10,332,940
RAC: 24,942
Message 68336 - Posted: 15 Feb 2023, 14:47:35 UTC - in response to Message 68335.  
Last modified: 15 Feb 2023, 14:48:18 UTC

... something like "divide RAM size by 7 GB to compute the maximum number to run at a time, and be sure to subtract the sizes of the non-OpenIFS tasks from the RAM size first"?
I have seen several figures for how much RAM OpenIFS PS tasks need (7 GB, 6 GB), but in practice I have never seen one use more than 4.5 GB per task.

I'm running 3 OpenIFS tasks in a 16 GB RAM environment together with a Squid instance, and so far this box has run 32 WUs successfully without any errors.


Supporting BOINC, a great concept !
ID: 68336
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1063
Credit: 16,546,621
RAC: 2,321
Message 68337 - Posted: 15 Feb 2023, 14:55:43 UTC - in response to Message 68334.  

My machine has many more BOINC tasks sitting around waiting to run than are actually running.
For example, there are perhaps 150 Universe tasks ready to go, and app_config allows up to three at a time to run. Notice that none are running.
Similarly, there are lots of MilkyWay tasks ready to run, and app_config allows up to three at a time to run. Notice that only two are running.
In Preferences, I tell the boinc-client to use at most 75% of memory when the machine is in use and 85% when it is not, and to use at most 75% of the CPUs. Don't these restrictions keep the boinc-client from using too much memory?

top - 09:39:30 up 8 days, 19:35,  1 user,  load average: 12.65, 12.67, 12.58
Tasks: 471 total,  14 running, 457 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.4 us, 11.5 sy, 62.9 ni, 24.9 id,  0.0 wa,  0.2 hi,  0.1 si,  0.0 st
MiB Mem :  63897.3 total,   5678.4 free,  21547.5 used,  36671.4 buff/cache
MiB Swap:  15992.0 total,  14590.7 free,   1401.2 used.  41495.3 avail Mem 

    PID    PPID USER      PR  NI S    RES  %MEM  %CPU  P     TIME+ COMMAND                                                                   
 768103  768098 boinc     39  19 R   4.2g   6.7  98.7 11 262:25.35 /var/lib/boinc/slots/9/oifs_43r3_model.exe                                
 768349  768346 boinc     39  19 R   4.0g   6.5  98.6 14 259:11.76 /var/lib/boinc/slots/5/oifs_43r3_model.exe                                
 766565  766546 boinc     39  19 R   3.5g   5.6  99.0  6 293:44.84 /var/lib/boinc/slots/11/oifs_43r3_model.exe                               
 762499  762494 boinc     39  19 R   2.8g   4.4  98.7  0 373:52.14 /var/lib/boinc/slots/6/oifs_43r3_model.exe                                
 766988  766985 boinc     39  19 R   2.3g   3.6  98.8  3 286:22.15 /var/lib/boinc/slots/7/oifs_43r3_model.exe                                
 784022    2211 boinc     39  19 R 765368   1.2  98.8  1   0:16.00 ../../projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_x86_64-pc-linux-+ 
 775376    2211 boinc     39  19 R  88936   0.1  98.8  2 120:28.72 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ 
 779023    2211 boinc     39  19 R  77148   0.1  99.1 12  59:57.71 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ 
 779253    2211 boinc     39  19 R  76776   0.1  98.9  7  56:57.61 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ 
 781925    2211 boinc     39  19 R  71760   0.1  99.2  4  28:00.93 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ 
   2211       1 boinc     30  10 S  45480   0.1   0.4  9 141262:07 /usr/bin/boinc                                                            
 781003    2211 boinc     39  19 R   7180   0.0  98.6  5  38:09.60 ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_1.46_x86_64-pc-linu+ 
 780794    2211 boinc     39  19 R   7156   0.0  99.0  8  40:14.77 ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_1.46_x86_64-pc-linu+ 
 768346    2211 boinc     39  19 S   4824   0.0   0.0 10   1:15.75 ../../projects/climateprediction.net/oifs_43r3_bl_1.11_x86_64-pc-linux-g+ 
 766985    2211 boinc     39  19 S   4820   0.0   0.0 11   1:27.40 ../../projects/climateprediction.net/oifs_43r3_bl_1.11_x86_64-pc-linux-g+ 
 768098    2211 boinc     39  19 S   4816   0.0   0.1 11   1:34.92 ../../projects/climateprediction.net/oifs_43r3_bl_1.11_x86_64-pc-linux-g+ 
 762494    2211 boinc     39  19 S   3368   0.0   0.0 13   2:14.72 ../../projects/climateprediction.net/oifs_43r3_bl_1.11_x86_64-pc-linux-g+ 
 766546    2211 boinc     39  19 S   3292   0.0   0.0 10   1:39.86 ../../projects/climateprediction.net/oifs_43r3_bl_1.11_x86_64-pc-linux-g+ 
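
For reference, the limits the client has actually recorded can be checked with something like this (a sketch: the path and the <ram_max_used_busy_pct>/<ram_max_used_idle_pct>/<max_ncpus_pct> tag names are assumptions based on a standard Linux install where BOINC Manager has written a global_prefs_override.xml):

# sketch: show the memory and CPU limits currently in force for the client
grep -E 'ram_max_used_(busy|idle)_pct|max_ncpus_pct' /var/lib/boinc/global_prefs_override.xml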

ID: 68337
Glenn Carver

Joined: 29 Oct 17
Posts: 810
Credit: 13,617,606
RAC: 5,900
Message 68338 - Posted: 15 Feb 2023, 17:30:49 UTC - in response to Message 68336.  

... something like "divide RAM size by 7 GB to compute the maximum number to run at a time, and be sure to subtract the sizes of the non-OpenIFS tasks from the RAM size first"?
I have seen several figures for how much RAM OpenIFS PS tasks need (7 GB, 6 GB), but in practice I have never seen one use more than 4.5 GB per task.

I'm running 3 OpenIFS tasks in a 16 GB RAM environment together with a Squid instance, and so far this box has run 32 WUs successfully without any errors.
It depends which process(es) you are looking at, as there are several associated with the task as it runs; the limit is set to their sum. But yes, the BL model itself takes less memory than the PS version because it's a simpler configuration. However, because we know there are significant memory leaks in the BOINC client code for zipping files, we have to add a buffer to account for leaks accumulating during the run, to make sure the task isn't killed.

The problem with informing users is (a) getting them to read the notices/forums and putting it in a way they understand (look on the News forum now and you'll see what I mean); (b) whether they will really care; and (c) it's not really their problem, nor CPDN's - the issue is with the client itself.

Don't assume that just because one OpenIFS task uses less than 7 GB of memory the others will. Respect the rsc_memory_bound in the task. It's set to those numbers for good reasons.
ID: 68338
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1063
Credit: 16,546,621
RAC: 2,321
Message 68339 - Posted: 15 Feb 2023, 18:09:27 UTC - in response to Message 68338.  

Don't assume that just because one OpenIFS task uses less than 7 GB of memory the others will. Respect the rsc_memory_bound in the task. It's set to those numbers for good reasons.


Respect the rsc_memory_bound in the task.

OK: I found one of those, but how do I read it?

<rsc_memory_bound>8804000000.000000</rsc_memory_bound>

Is that 8804000000.000000 Bytes?

It is not using that much at the moment, but that proves little.

   PID    PPID USER      PR  NI S    RES  %MEM  %CPU  P     TIME+ COMMAND
788056  788051 boinc     39  19 R   3.9g   6.2  98.9  4 137:26.93 /var/lib/boinc/slots/6/oifs_43r3_model.exe 
788051    2211 boinc     39  19 S   4736   0.0   0.0  9   0:31.71 ../../projects/climateprediction.net/oifs_43r3_bl_1.11_x86_64-pc-linux-g+
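
Assuming the value is in bytes, which is how rsc_memory_bound is conventionally expressed, a quick conversion sketch:

# sketch: convert the rsc_memory_bound value above into GB and GiB
awk 'BEGIN { b = 8804000000; printf "%.1f GB (%.1f GiB)\n", b / 1e9, b / (1024^3) }'
# prints: 8.8 GB (8.2 GiB)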

ID: 68339
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,327,232
RAC: 11,319
Message 68340 - Posted: 15 Feb 2023, 18:24:28 UTC

Another forrtl: error (72): floating overflow on WU 12206864 - two attempts so far, and both failed with the same error at the exact same stage in the run (after step 156).
ID: 68340
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1063
Credit: 16,546,621
RAC: 2,321
Message 68341 - Posted: 15 Feb 2023, 18:59:08 UTC - in response to Message 68340.  

Another forrtl: error (72): floating overflow on WU 12206864 - two attempts so far, and both failed with the same error at the exact same stage in the run (after step 156).


It is clearly data-dependent. All my OpenIFS _bl tasks worked except for the one that died from the double-free problem.
ID: 68341
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,327,232
RAC: 11,319
Message 68342 - Posted: 15 Feb 2023, 19:17:02 UTC

And this one is a real oddity - same machine as the last one, but I don't think this one was data dependent. Task 22313532

This ran as normal and without incident until the very end:

  17:15:10 STEP 1440 H= 360:00 +CPU= 19.627
It then stopped, with the worker missing from the PID list, but with the wrapper still present. It was locked solid - using no discernible CPU time, and not responding to 'kill' commands. But it hadn't written the finish file, so BOINC let it run.

And so did I, while other tasks finished. When things were quiet, I stopped and restarted the BOINC client. That's been observed to remove locked PIDs from memory, and did so on this occasion too. You can see the restart process in the stderr, culminating with a re-alignment at

  18:47:29 STEP 1440 H= 360:00 +CPU= 11.699
It then prepared and uploaded the final zip and trickle, and has been accepted as valid.
ID: 68342
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1063
Credit: 16,546,621
RAC: 2,321
Message 68343 - Posted: 15 Feb 2023, 19:30:47 UTC - in response to Message 68342.  

This ran as normal and without incident until the very end:

17:15:10 STEP 1440 H= 360:00 +CPU= 19.627

It then stopped, with the worker missing from the PID list, but with the wrapper still present. It was locked solid - using no discernible CPU time, and not responding to 'kill' commands. But it hadn't written the finish file, so BOINC let it run.

And so did I, while other tasks finished. When things were quiet, I stopped and restarted the BOINC client. That's been observed to remove locked PIDs from memory, and did so on this occasion too. You can see the restart process in the stderr, culminating with a re-alignment at

18:47:29 STEP 1440 H= 360:00 +CPU= 11.699

It then prepared and uploaded the final zip and trickle, and has been accepted as valid.



... and here is how my most recent one of these ended normally:

  12:29:46 STEP 1438 H= 359:30 +CPU= 17.113
  12:30:04 STEP 1439 H= 359:45 +CPU= 17.382
  12:30:27 STEP 1440 H= 360:00 +CPU= 22.512
..The child process terminated with status: 0

>>> Printing last 70 lines from file: NODE.001_01

I   VARIABLE   I      Initial value         I        Current value        I
I-------------------------------------------------------------------------I
I  OMEGA (P/S) I     -0.1227910857E-20      I       0.2519104860E-05      I
ICloud fractionI      0.0000000000E+00      I       0.1919869766E-01      I
I  Relat. Hum. I      0.3184217114E+00      I       0.3629554164E+00      I
I  PASS 01     I      0.0000000000E+00      I       0.3514421542E+03      I
DDH------------------------------------------------------------------------
  
  MAXGPFV : MAX. VALUE =   0.000000000000000E+000
  MAXGPFV : MAX. VALUE =   0.000000000000000E+000
 NSTEP =  1440 SCAN2M_HPOS  P
 IO-STREAM SETUP - IOTYPE =           2  NUMIOPROCS =           1  CPATH =
 ICMGGh7zg+001440 MODE=a
 IO-STREAM CLOSED - ICMGGh7zg+001440
 MPI-TASK:   1 -    295645393 BYTES IN      7 RECORDS TRANSFERRED IN    0.0002 SECONDS******** Mbytes/s, TOTAL TIME=    0.0079( 2.5%)
 12:30:27 STEP 1440 H= 360:00 +CPU= 22.512
 END CNT3
 NSTEP =  1440 CNT0   000000000  16.726  16.726  16.974   0.985               0               0               0               0               0
IO-STREAM STATISTICS, TOTAL NO OF REC/BYTES READ            0/           0 WRITTEN        42886/************
 -TASK-OPENED-OPEN-RECS IN -KBYTE IN    -RECS OUT-KBYTE OUT-WALL   -WALL IN-WALL
  OU-TOT IN -TOT OUT
   1    758    0        0            0    42886 ************    34.5     0.0    34.5     0.0    83.2
===-=== START OF TIMING STATISTICS ===-===
 
STATS FOR ALL TASKS
 NUM ROUTINE                                     CALLS  MEAN(ms)   MAX(ms)   FRAC(%)  UNBAL(%)
   0 CNT0     - COMPLETE EXECUTION                   1 ********* *********    100.00      0.00
   1 CNT4     - FORWARD INTEGRATION                  1 ********* *********     99.99      0.00
   8 SCAN2M - GRID-POINT DYNAMICS                 1562    2726.9    2726.9     16.29      0.00
   9 SPCM     - SPECTRAL COMP.                    1440     573.0     573.0      3.15      0.00
  10 SCAN2M - PHYSICS                             1441    8670.9    8670.9     47.78      0.00
  11 IOPACK   - OUTPUT P.P. RESULTS                121     423.0     423.0      0.20      0.00
  12 SPNORM   - SPECTRAL NORM COMP.                 63      32.3      32.3      0.01      0.00
  14 SUINIF                                          1    1945.1    1945.1      0.01      0.00
  17 GRIDFPOS IN CNT4                              121      19.6      19.6      0.01      0.00
  18 SUSPECG                                         1    1000.1    1000.1      0.00      0.00
  19 SUSPEC                                          1    1005.5    1005.5      0.00      0.00
  24 SUGRIDU                                         1     446.4     446.4      0.00      0.00
  25 SPECRT                                          1     126.3     126.3      0.00      0.00
  26 SUGRIDF                                         1     366.8     366.8      0.00      0.00
  27 RESTART FILES - WRITING                        30    1706.2    1706.2      0.20      0.00
  28 RESTART FILES - READING                         1       0.1       0.1      0.00      0.00
  29 SU4FPOS IN CNT4                               121       0.0       0.0      0.00      0.00
  30 DYNFPOS IN CNT4                               121    3800.5    3800.5      1.76      0.00
  31 POSDDH IN STEPO                                 7       5.0       5.0      0.00      0.00
  37 CPGLAG   - SL COMPUTATIONS                   1441    3827.4    3827.4     21.09      0.00
  39 SU0YOMB                                         1     132.2     132.2      0.00      0.00
  51 SCAN2M   - SL COMM. PART 1                   1441      13.3      13.3      0.07      0.00
  54 SPCM     - M TO S/S TO M TRANSP.             1440     260.2     260.2      1.43      0.00
  55 SPCIMPF  - S TO M/M TO S TRANSP.             1440      38.1      38.1      0.21      0.00
  56 SPNORM   - SPECTRAL NORM COMM.                 63       0.2       0.2      0.00      0.00
 102 LTINV_CTL   - INVERSE LEGENDRE TRANSFORM     3125     288.5     288.5      3.45      0.00
 103 LTDIR_CTL   - DIRECT LEGENDRE TRANSFORM      3006     168.8     168.8      1.94      0.00
 106 FTDIR_CTL   - DIRECT FOURIER TRANSFORM       3006      53.0      53.0      0.61      0.00
 107 FTINV_CTL   - INVERSE FOURIER TRANSFORM      3125      91.6      91.6      1.09      0.00
 140 SULEG       - COMP. OF LEGENDRE POL.            1      21.9      21.9      0.00      0.00
 152 LTINV_CTL   - M TO L TRANSPOSITION           3125      20.9      20.9      0.25      0.00
 153 LTDIR_CTL   - L TO M TRANSPOSITION           3006      12.1      12.1      0.14      0.00
 157 FTINV_CTL   - L TO G TRANSPOSITION           3125      54.8      54.8      0.66      0.00
 158 FTDIR_CTL   - G TO L TRANSPOSITION           3006      13.9      13.9      0.16      0.00
 400 GSTATS                                     167374       0.0       0.0      0.00      0.00
 401 GSTATS HOOK                                155706       0.0       0.0      0.00      0.00
TOTAL MEASURED IMBALANCE =       0.0 SECONDS,  0.0 PERCENT
TOTAL WALLCLOCK TIME    26151.9 CPU TIME   25832.3 VECTOR TIME    25832.3


===-=== END   OF TIMING STATISTICS ===-===
FORECAST DAYS PER DAY       49.6
  *** END CNT0 *** 
------------------------------------------------


Moving to projects directory: /var/lib/boinc/slots/9/ICMGGh7zg+001440
Moving to projects directory: /var/lib/boinc/slots/9/ICMSHh7zg+001440
Moving to projects directory: /var/lib/boinc/slots/9/ICMUAh7zg+001440
Adding to the zip: /var/lib/boinc/slots/9/NODE.001_01
Adding to the zip: /var/lib/boinc/slots/9/ifs.stat
Adding to the zip: /var/lib/boinc/projects/climateprediction.net/oifs_43r3_bl_12209190/ICMGGh7zg+001344
[snip]
Adding to the zip: /var/lib/boinc/projects/climateprediction.net/oifs_43r3_bl_12209190/ICMUAh7zg+001440
Zipping up the final file: /var/lib/boinc/projects/climateprediction.net/oifs_43r3_bl_a1ur_2016092300_15_991_12209190_0_r208224497_14.zip
Uploading the final file: upload_file_14.zip
Uploading trickle at timestep: 1295100
12:33:54 (768098): called boinc_finish(0)

</stderr_txt>
]]>

ID: 68343
Glenn Carver

Joined: 29 Oct 17
Posts: 810
Credit: 13,617,606
RAC: 5,900
Message 68344 - Posted: 15 Feb 2023, 19:57:11 UTC - in response to Message 68342.  

And this one is a real oddity - same machine as the last one, but I don't think this one was data dependent. Task 22313532
This ran as normal and without incident until the very end:
  17:15:10 STEP 1440 H= 360:00 +CPU= 19.627
It then stopped, with the worker missing from the PID list, but with the wrapper still present. It was locked solid - using no discernible CPU time, and not responding to 'kill' commands. But it hadn't written the finish file, so BOINC let it run.
This looks like the file-locking problem we've seen before; it could also be a manifestation of the memory corruption, which has corrupted file pointers. It's also possible that we are in the boinc_finish() call but the text 'boinc_finish' hasn't been flushed to stderr yet.

In cases like this, ideally I need you to attach the debugger and generate a traceback so we can see where it is in the calling tree. In deadlocks like this, the program pointer will be right at the problem:
# get the process id <pid> of the oifs_43r3_bl_1.11_x86_64-pc-linux-gnu wrapper
ps -ef | grep '_bl_'
gdb -p <pid>
bt full
detach
quit
and then PM the output to me.
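
If an interactive gdb session is awkward, the same steps can be scripted non-interactively (a sketch; <pid> is the wrapper's process id found above, and attaching normally needs root or suitable ptrace permission):

# sketch: capture the same traceback in one go and write it to a file
sudo gdb -p <pid> -batch -ex 'bt full' -ex detach > oifs_traceback.txt 2>&1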
ID: 68344
Glenn Carver

Joined: 29 Oct 17
Posts: 810
Credit: 13,617,606
RAC: 5,900
Message 68345 - Posted: 15 Feb 2023, 20:08:04 UTC

If anyone's interested, this is another task failure, different from the turbulence failure mentioned earlier:
https://www.cpdn.org/result.php?resultid=22313995

In this task, scroll up the stderr output until just before the long traceback and you'll see lines like this:

  15:18:49 STEP  920 H= 230:00 +CPU= 11.509
  MAX U WIND=   250.807710093462     
  15:19:00 STEP  921 H= 230:15 +CPU= 11.031
  MAX U WIND=   258.524841745508     
  MAX V WIND=   253.308999272636     
  15:19:11 STEP  922 H= 230:30 +CPU= 10.936
  MAX U WIND=   256.013634963429     
  MAX V WIND=   253.579772348885     
  15:19:22 STEP  923 H= 230:45 +CPU= 10.949
  MAX U WIND=   250.801318000386     
  MAX V WIND=   250.423648892277     
  15:19:36 STEP  924 H= 231:00 +CPU= 14.168
The model has caught that the maximum wind speed (U is the E-W component, V is the N-S component; total wind speed is sqrt(U*U + V*V)) is greater than 250 m/s (~560 mph). Usually the maximum E-W wind is ~75 m/s. I've just looked it up, and the maximum wind speed ever recorded was 231 mph. So this storm got pretty powerful!
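
For anyone checking the numbers, a quick sketch (illustrative arithmetic only; the U and V maxima need not occur at the same grid point, so combining them gives an upper bound rather than an actual speed):

# sketch: combined speed implied by the step-921 maxima above, and the 250 m/s limit in mph
awk 'BEGIN { u = 258.5; v = 253.3;
             printf "combined upper bound: %.0f m/s\n", sqrt(u*u + v*v);
             printf "limit: 250 m/s = %.0f mph\n", 250 * 2.23694 }'
# prints: combined upper bound: 362 m/s
#         limit: 250 m/s = 559 mph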
ID: 68345
wateroakley

Joined: 6 Aug 04
Posts: 186
Credit: 27,123,458
RAC: 3,218
Message 68346 - Posted: 15 Feb 2023, 20:39:14 UTC - in response to Message 68345.  

If anyone's interested, this is another task fail which is different to the turbulence failure mentioned earlier:
https://www.cpdn.org/result.php?resultid=22313995

So this storm got pretty powerful!
Its earlier run of the work unit reported the same results for wind speed at the same step: https://www.cpdn.org/result.php?resultid=22307272
It's very reassuring that the science/physics/programming/perturbations give consistent results.
ID: 68346
AndreyOR

Joined: 12 Apr 21
Posts: 247
Credit: 12,041,675
RAC: 20,255
Message 68347 - Posted: 15 Feb 2023, 20:49:00 UTC - in response to Message 68336.  

I'm running 3 OpenIFS tasks in a 16 GB RAM environment together with a Squid instance, and so far this box has run 32 WUs successfully without any errors.

That's impressive; this PC is also almost certainly in the minority of PCs that can do that. I'm assuming you don't use the machine for anything else? Does it have ECC RAM? Is Squid there from when you used it for LHC?
ID: 68347
Yeti

Joined: 5 Aug 04
Posts: 171
Credit: 10,332,940
RAC: 24,942
Message 68348 - Posted: 15 Feb 2023, 21:12:40 UTC - in response to Message 68347.  
Last modified: 15 Feb 2023, 21:14:28 UTC

I'm assuming you don't use the machine for anything else?
It's sitting on an older ESX server and runs only LHC ATLAS, Squid (for all my LHC machines) and now CPDN. For CPDN I have stopped LHC/ATLAS at the moment.

Does it have ECC RAM?
Sure, it's running on a server.

Is Squid there from when you used it for LHC?
I moved Squid from my former Windows VM to this Linux VM a month ago, and yes, Squid is fully active for my LHC/ATLAS machine park.


Supporting BOINC, a great concept !
ID: 68348
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4347
Credit: 16,541,921
RAC: 6,087
Message 68349 - Posted: 15 Feb 2023, 21:25:40 UTC

I am doing a second attempt of one that failed after the last timestep, with this almost at the beginning of stderr:

exceeded elapsed time limit 154748.62 (1920000.00G/9.28G)</message>


Looks like the same problem Richard described. I shall see what happens on my machine.
ID: 68349
Alan K

Joined: 22 Feb 06
Posts: 485
Credit: 29,638,939
RAC: 3,372
Message 68350 - Posted: 15 Feb 2023, 23:26:21 UTC

Is this related to the floating-point issue?

Task 22307753

08:48:48 STEP 61 H= 15:15 +CPU= 23.513
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [signal_drhook@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:1538] Received signal#8 (SIGFPE) :: 4362MB (heap), 5076MB (maxrss), 0MB (maxstack), 0 (paging), nsigs = 1
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [signal_drhook@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:1542] Also activating Harakiri-alarm (SIGALRM=14) to expire after 500s elapsed to prevent hangs, nsigs = 1
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [signal_drhook@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:1544] Harakiri signal handler 'signal_harakiri' for signal#14 (SIGALRM) installed at 0x81f0c0 (old at (nil))
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [signal_drhook@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:1617] Signal#8 was caused by floating-point overflow [memaddr=0x1cc4a8f], nsigs = 1
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [signal_drhook@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:1686] Starting DrHook backtrace for signal#8, nsigs = 1
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [c_drhook_print_@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:3843] 4362 MB (maxheap), 5076 MB (maxrss), 0 MB (maxstack)
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [c_drhook_print_@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:3897] : MASTER
ID: 68350
mikey

Joined: 18 Nov 18
Posts: 21
Credit: 5,630,175
RAC: 7,040
Message 68351 - Posted: 16 Feb 2023, 1:11:26 UTC - in response to Message 68348.  

Yeti wrote:
Is Squid there from when you used it for LHC?
I moved Squid from my former Windows VM to this Linux VM a month ago, and yes, Squid is fully active for my LHC/ATLAS machine park.

Are you using it for other BOINC projects as well, or just LHC?
ID: 68351
Glenn Carver

Joined: 29 Oct 17
Posts: 810
Credit: 13,617,606
RAC: 5,900
Message 68353 - Posted: 16 Feb 2023, 9:38:48 UTC - in response to Message 68350.  

Alan K:
Yes, that task is another example of a model forecast that failed due to a too-strong perturbation. In that traceback, near the middle, there's:
Signal#8 was caused by floating-point overflow

which indicates that model field(s) have got too large (probably the winds, as I mentioned before).

Is this related to the floating point issue:

Task 22307753

08:48:48 STEP 61 H= 15:15 +CPU= 23.513
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [signal_drhook@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:1538] Received signal#8 (SIGFPE) :: 4362MB (heap), 5076MB (maxrss), 0MB (maxstack), 0 (paging), nsigs = 1
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [signal_drhook@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:1542] Also activating Harakiri-alarm (SIGALRM=14) to expire after 500s elapsed to prevent hangs, nsigs = 1
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [signal_drhook@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:1544] Harakiri signal handler 'signal_harakiri' for signal#14 (SIGALRM) installed at 0x81f0c0 (old at (nil))
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [signal_drhook@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:1617] Signal#8 was caused by floating-point overflow [memaddr=0x1cc4a8f], nsigs = 1
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [signal_drhook@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:1686] Starting DrHook backtrace for signal#8, nsigs = 1
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [c_drhook_print_@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:3843] 4362 MB (maxheap), 5076 MB (maxrss), 0 MB (maxstack)
[EC_DRHOOK:ubuntu:1:1:20673:20673] [20230214:084903:1676364543:1561.172] [c_drhook_print_@/home/abowery/Desktop/OpenIFS/oifs_43r3_bl/gc_oifs43r3-feature-cpdn/src/ifsaux/support/drhook.c:3897] : MASTER
ID: 68353