New work discussion - 2

Glenn Carver
Joined: 29 Oct 17
Posts: 774
Credit: 13,433,329
RAC: 7,110
Message 66608 - Posted: 28 Nov 2022, 21:11:15 UTC - in response to Message 66607.  

Another name is /var/lib/boinc/slots/11/./master.exe which is pretty funny because my Linux machine will not really run .exe files.
'.exe' is just a convention we adopted to indicate an executable file; it's not related to a Windows .exe. The normal Linux convention is to have no suffix, but for CPDN we preferred to have one.

My two take 2.5 and 3.5 GBytes working set but the amounts jump around a lot.
The model uses dynamic memory a lot. The high-water mark for memory is when it goes into the radiation code. This involves recomputing look-up tables & matrix computations. It's the most expensive timestep too.

Predicted 2 days 18 hours to go, having done about 1 hour 18 minutes each.
Ignore the predicted time; it's rather useless for these models. It depends on the client seeing this app enough times to work out a figure, and it's not under the control of the app. The problem is that OpenIFS apps can be run for varying lengths of time, so the boinc client will never get a good estimate of time remaining. The fraction done is accurate; use that to work out the time to completion. I turn off the display of 'time remaining'.
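
As a rough illustration of working it out from the fraction done: the sketch below scales the elapsed time by the remaining fraction. The function name eta_seconds is made up for this example, and it assumes the model progresses at a roughly constant rate, which is only approximately true given the more expensive radiation timesteps.

```shell
# eta_seconds ELAPSED_SECONDS FRACTION_DONE
# Prints the estimated seconds remaining, assuming a constant rate of progress.
# FRACTION_DONE is a number between 0 and 1 (e.g. 0.25 for 25% done).
eta_seconds() {
    awk -v elapsed="$1" -v frac="$2" 'BEGIN {
        if (frac <= 0 || frac > 1) {
            print "fraction must be in (0, 1]" > "/dev/stderr"
            exit 1
        }
        # remaining = elapsed * (1 - f) / f, rounded to the nearest second
        printf "%d\n", elapsed * (1 - frac) / frac + 0.5
    }'
}
```

For example, a task that has run for 1 hour 18 minutes (4680 seconds) and shows a hypothetical fraction done of 0.25 gives `eta_seconds 4680 0.25`, which prints 14040, i.e. about 3.9 hours to go.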
ID: 66608
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 66610 - Posted: 28 Nov 2022, 21:49:44 UTC - in response to Message 66606.  

Typical runtime is ~7-10hrs depending on your CPU. Memory should be ~6Gb.
The two I have should finish in a fraction under 10 hours.
ID: 66610
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 66615 - Posted: 28 Nov 2022, 23:39:59 UTC

It's been so long since there was any work that I've forgotten how to get it to start.
ID: 66615
Mr. P Hucker
Joined: 9 Oct 20
Posts: 669
Credit: 4,391,754
RAC: 6,918
Message 66616 - Posted: 29 Nov 2022, 0:21:39 UTC - in response to Message 66608.  

Another name is /var/lib/boinc/slots/11/./master.exe which is pretty funny because my Linux machine will not really run .exe files.
'.exe' is just a convention we adopted to indicate an executable file, it's not related to a Windows .exe. Normal linux convention is not to have any suffix but for CPDN we preferred to have one.
How do Linux users know what type of file they're looking at?
ID: 66616
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 66617 - Posted: 29 Nov 2022, 1:02:54 UTC - in response to Message 66616.  

How do Linux users know what type of file they're looking at?


Linux follows the conventions of the original UNIX operating systems. Filenames are just names, and the dot is just another character, except that . and .. on their own have special meanings in a directory (but that is another story). So they could have called that file master.jpeg instead of master.exe for all the difference it would make. The easiest way to find out what a Linux file is, is to apply the file command to it.

For example
[/var/lib/boinc/slots/10]$ file master.exe 
master.exe: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, BuildID[sha1]=c3f8ea54db10edfe769adb8096efc92f023410a8, for GNU/Linux 3.2.0, stripped


You can identify many file types by looking at the first few bytes of the file. For example

[/var/lib/boinc/slots/10]$ od master.exe | head -n 2
0000000 042577 043114 000402 001401 000000 000000 000000 000000
0000020 000002 000076 000001 000000 020100 000100 000000 000000


IIRC, the leading 042577 043114 is a magic number that tells you it is an ELF file; magic numbers like this go back to the old days (early 1970s). Many files are not like this, however, so the file program must apply other heuristics to those files. It can even identify Windows executables correctly (this one is TaxAct for my Windows machine).

file ta22stpremier.exe 
ta22stpremier.exe: PE32 executable (GUI) Intel 80386, for MS Windows
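
For reference, the octal words 042577 043114 in the od output above are the four bytes 7f 45 4c 46 (the character 0x7f followed by "ELF"), printed as little-endian 16-bit words. A minimal sketch of checking for those bytes directly; the function name is_elf is made up for this example and it is no substitute for the file command, which knows far more formats:

```shell
# is_elf FILE
# Succeeds (exit status 0) if FILE starts with the ELF magic bytes 7f 45 4c 46.
is_elf() {
    # Dump the first four bytes as hex, without offsets, spaces or newlines.
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "7f454c46" ]
}
```

Usage would be something like `is_elf master.exe && echo "ELF file"`.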

ID: 66617
Glenn Carver
Joined: 29 Oct 17
Posts: 774
Credit: 13,433,329
RAC: 7,110
Message 66618 - Posted: 29 Nov 2022, 1:11:13 UTC - in response to Message 66617.  
Last modified: 29 Nov 2022, 1:12:30 UTC

So they could have called that file master.exe master.jpeg for all the difference it would make.
True, but these days with desktop environments like Cinnamon or even macOS, the file suffix is used to identify files and assign default applications to handle them. If I had called the file master.jpg, and double clicked on it in the file explorer app in Mint, it would have tried to open an image browser.
ID: 66618
Mr. P Hucker
Joined: 9 Oct 20
Posts: 669
Credit: 4,391,754
RAC: 6,918
Message 66620 - Posted: 29 Nov 2022, 1:36:41 UTC - in response to Message 66618.  

So they could have called that file master.exe master.jpeg for all the difference it would make.
True, but these days with desktop environments like Cinnamon or even macOS, the file suffix is used to identify files and assign default applications to handle them. If I had called the file master.jpg, and double clicked on it in the file explorer app in Mint, it would have tried to open an image browser.
Yes, thankfully even Linux and Mac have realised an extension is so much easier for the user to see at a glance. Unfortunately the three operating systems often copy the worst aspects of each other. Windows 95 made a wonderful thing called the start menu, and also the taskbar. Apple copied this and screwed it up. Then Windows copied it back and I have to use a third party utility to make it work as it used to. One click and I see x recently used apps. The taskbar shows nothing but the start button, the running apps (in words, not a silly unintelligible icon), and the clock. Not a mixture of links/aliases/shortcuts to apps as well as running ones, almost indistinguishable from each other. If I had a penny for every minute of my time wasted getting basic interfaces to do what I want them to....
ID: 66620
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 66621 - Posted: 29 Nov 2022, 2:49:49 UTC

Looks like about 13 hours running two at a time on an i7-4790K and about 9 hours running two at a time on my Ryzen 5 5600X.
ID: 66621
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 66622 - Posted: 29 Nov 2022, 4:08:13 UTC - in response to Message 66615.  
Last modified: 29 Nov 2022, 4:09:04 UTC

It's been so long since there was any work that I've forgotten how to get it to start.


If you are not joking, I did not have to do anything to get my (now 3) OIFS tasks to start, other than waiting for some other task to finish.
ID: 66622
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 66623 - Posted: 29 Nov 2022, 7:17:19 UTC - in response to Message 66621.  

Looks like about 13 hours running two at a time on an i7-4790K and about 9 hours running two at a time on my Ryzen 5 5600X.
10 hrs 47 min on my Ryzen 7.
ID: 66623
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 66624 - Posted: 29 Nov 2022, 8:44:12 UTC

The first of the Perturbed Surface variant of OpenIFS are going out now. App name: oifs_43r3_ps

3000 in this batch. Many more to follow.
There only seem to be 1,000 in this batch. I presume the other 2,000 will arrive soon, seeing as the first lot are all gone. Seven are showing as completed on the batch statistics page, which is a bit out of date.
ID: 66624
Glenn Carver
Joined: 29 Oct 17
Posts: 774
Credit: 13,433,329
RAC: 7,110
Message 66625 - Posted: 29 Nov 2022, 8:48:45 UTC - in response to Message 66623.  
Last modified: 29 Nov 2022, 8:51:34 UTC

Those times depend on what %cpu boinc is allowed to use. 100%? Perhaps add that info. Machine load affects wall clock time too.

I'm more interested in any failures. If you get one, let me know. Thx.

There's another 2000 ready to go as soon as Andy gets to it. And then there are plenty more after that; the scientist needs to run a minimum of ~42000.
ID: 66625
Glenn Carver
Joined: 29 Oct 17
Posts: 774
Credit: 13,433,329
RAC: 7,110
Message 66628 - Posted: 29 Nov 2022, 12:05:19 UTC

Restarts for oifs_43r3_ps work ok.

As a test, I shut down my Ubuntu/WSL last night to make sure the task would restart. Before shutting the machine down I suspended the task(s) in boincmgr, made sure 'master.exe' had disappeared from the output of 'ps' (or top), and then shut down the machine. This morning I restarted the boinc client and resumed the task, and the model happily restarted from its last checkpoint.
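
The step of making sure 'master.exe' has disappeared can be scripted rather than checked by eye. A minimal sketch, assuming a Linux system with /proc mounted and tasks that have already been suspended in boincmgr; the function name wait_until_gone and the default of 60 checks are made up for this example:

```shell
# wait_until_gone NAME [TRIES]
# Polls once a second until no process named NAME is running.
# Returns 0 once it is gone, 1 if it is still present after TRIES checks.
wait_until_gone() {
    name=$1
    tries=${2:-60}
    while [ "$tries" -gt 0 ]; do
        # /proc/<pid>/comm holds each process's name (truncated to 15 chars).
        if ! cat /proc/[0-9]*/comm 2>/dev/null | grep -qxF -- "$name"; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

Something like `wait_until_gone master.exe && echo "safe to shut down"` would then confirm that the model processes have exited before the machine is powered off.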
ID: 66628
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 66629 - Posted: 29 Nov 2022, 12:10:17 UTC - in response to Message 66628.  

Restarts for openifs_43r3_ps work ok.

As a test, I shutdown my Ubuntu/WSL last night to make sure the task would restart. Before shutting the machine down I suspended the task(s) in boincmgr, made sure the 'master.exe' had disappeared from output from 'ps' (or top), and then shutdown the machine. This morning, restarted boinc client, resumed the task and the model happily restarted from its last checkpoint.


If you only did it the once, I wouldn't guarantee it. Last four restarts with hadam4s tasks didn't lose any for me but I still occasionally lose one or more. I was going to wait till I had gotten a few tasks under my belt so to speak before trying.
ID: 66629
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 66630 - Posted: 29 Nov 2022, 12:21:56 UTC - in response to Message 66621.  

Looks like about 13 hours running two at a time on an i7-4790K and about 9 hours running two at a time on my Ryzen 5 5600X.


For me, my very first OIFS task ran like this:

Task 22245034
Name 	oifs_43r3_ps_0002_2021050100_123_945_12163091_0
Workunit 	12163091
Created 	28 Nov 2022, 19:12:00 UTC
Sent 	28 Nov 2022, 19:24:39 UTC
Report deadline 	28 Dec 2022, 19:24:39 UTC
Received 	29 Nov 2022, 11:23:20 UTC
Server state 	Over
Outcome 	Success
Client state 	Done
Exit status 	0 (0x00000000)
Computer ID 	1511241
Run time 	15 hours 33 min 50 sec <---<<<
CPU time 	15 hours 22 min 37 sec <---<<<
Validate state 	Valid
Credit 	0.00
Device peak FLOPS 	6.13 GFLOPS <---<<<
Application version 	OpenIFS 43r3 Perturbed Surface v1.01
x86_64-pc-linux-gnu
Peak working set size 	4,619.30 MB <---<<<


My next one,
Task 22245062
Name oifs_43r3_ps_0030_2021050100_123_945_12163119_0
was about the same.
ID: 66630
Glenn Carver
Joined: 29 Oct 17
Posts: 774
Credit: 13,433,329
RAC: 7,110
Message 66631 - Posted: 29 Nov 2022, 12:29:09 UTC - in response to Message 66629.  

This morning, restarted boinc client, resumed the task and the model happily restarted from its last checkpoint.

If you only did it the once, I wouldn't guarantee it. Last four restarts with hadam4s tasks didn't lose any for me but I still occasionally lose one or more. I was going to wait till I had gotten a few tasks under my belt so to speak before trying.
OpenIFS is a completely different model with a new controlling code. Experience with HadSM4 doesn't apply. I did a lot of testing to make sure it works. Yes there are always edge cases but 99% should be ok.
ID: 66631
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 66632 - Posted: 29 Nov 2022, 16:42:40 UTC - in response to Message 66625.  

Those times depend on what %cpu boinc is allowed to use. 100%? Perhaps add that info. Machine load affects wall clock time too.

I'm more interested in any failures. If you get one, let me know. Thx.

There's another 2000 ready to go as soon as Andy gets to it. And then there's plenty more after, the scientist needs to run a minimum of ~42000.

One of the two on my i7-4790K crashed at the end with exit status of "194 (0x000000C2) EXIT_ABORTED_BY_CLIENT". In stderr, it has "Process still present 5 min after writing finish file; aborting".
https://www.cpdn.org/result.php?resultid=22245298
Both the successful task and the errored task ran through step 2592.

Both tasks on my Ryzen 5600X completed successfully in just under 9 hours CPU and wall clock time.
ID: 66632
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 66633 - Posted: 29 Nov 2022, 17:15:21 UTC - in response to Message 66625.  

Those times depend on what %cpu boinc is allowed to use. 100%? Perhaps add that info. Machine load affects wall clock time too.


Well, the %cpu figures are pretty much 99+%. These are just the Boinc processes on my 16-core machine. I only allow 12 cores for Boinc, so everything else gets the other four cores.

    
 PID       PPID USER      PR  NI S    RES  %MEM  %CPU  P     TIME+ COMMAND                                        
 619112  619107 boinc     39  19 R   4.6g   7.3  99.0 13 359:48.53 /var/lib/boinc/slots/11/./master.exe           
 596096  596091 boinc     39  19 R   3.9g   6.2  99.0  9 793:52.71 /var/lib/boinc/slots/9/./master.exe            
 618959  618951 boinc     39  19 R   2.8g   4.5  99.0  0 362:41.64 /var/lib/boinc/slots/10/./master.exe           
 621104    2146 boinc     39  19 R 213108   0.3  99.1 12 322:43.61 ../../projects/einstein.phys.uwm.edu/einstein+ 
 633168    2146 boinc     39  19 R 212988   0.3  99.0  2 132:49.51 ../../projects/einstein.phys.uwm.edu/einstein+ 
 640995    2146 boinc     39  19 R 133692   0.2  99.0  7  24:47.84 ../../projects/www.worldcommunitygrid.org/wcg+ 
 636512    2146 boinc     39  19 R  72996   0.1  99.3  4  92:53.62 ../../projects/www.worldcommunitygrid.org/wcg+ 
 641748    2146 boinc     39  19 R  63884   0.1  99.2  8  18:23.02 ../../projects/www.worldcommunitygrid.org/wcg+ 
   2146       1 boinc     30  10 S  46352   0.1   0.3 15 118009:13 /usr/bin/boinc   <---<<< This is the Boinc client                            
 640361    2146 boinc     39  19 R   7172   0.0  99.0  5  33:15.25 ../../projects/milkyway.cs.rpi.edu_milkyway/m+ 
 642260    2146 boinc     39  19 R   5924   0.0  99.0  6   9:05.17 ../../projects/milkyway.cs.rpi.edu_milkyway/m+ 
 596091    2146 boinc     39  19 S   5360   0.0   0.1 10   1:43.66 ../../projects/climateprediction.net/oifs_43r+ 
 638162    2146 boinc     39  19 R   5008   0.0  99.1  3  74:30.71 ../../projects/universeathome.pl_universe/BHs+ 
 638038    2146 boinc     39  19 R   4932   0.0  99.2 14  75:41.48 ../../projects/universeathome.pl_universe/BHs+ 
 619107    2146 boinc     39  19 S   4912   0.0   0.1  3   0:46.86 ../../projects/climateprediction.net/oifs_43r+ 
 618951    2146 boinc     39  19 S   4868   0.0   0.1 14   0:46.95 ../../projects/climateprediction.net/oifs_43r+ 

ID: 66633
Glenn Carver
Joined: 29 Oct 17
Posts: 774
Credit: 13,433,329
RAC: 7,110
Message 66660 - Posted: 30 Nov 2022, 11:00:12 UTC
Last modified: 30 Nov 2022, 11:04:39 UTC

OpenIFS batch problems

First, thanks to those who have reported problems either in threads or to me via private messages. It has been very useful to track what's going wrong. And apologies to those volunteers having difficulties running these tasks. These errors were not found in testing and it's apparent a larger scale test should have been done for this first use of OpenIFS.

Task failures
There are examples of the tasks failing, either midway through or at the very end. It only seems to be happening on some machines and it's related to memory issues in the code (it's not related to hardware as far as I can tell).

The process that starts with the name 'oifs_43r3_1.....' dies for some reason. As this controls the model, it leaves the model process called 'master.exe' still running in the same slot directory (it shouldn't do this but it does). If the client then restarts the task (in the same slot directory), it not only generates more output (filling the slot dir), it also corrupts the model files, confusing the client.

There should always be the same number of 'master.exe' and 'oifs_43r3_1...' processes running. If you have more master.exe processes, one of them is the rogue one. Suspend all your oifs tasks and kill the one that's still running. Or use the 'ps' command to check the parent of each master.exe process.
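
The parent check can be sketched as a small filter over ps output. This is only an illustration: the function name find_rogue_models is made up, and it assumes the controlling process name starts with 'oifs_43r3' and the model process is named exactly 'master.exe', as in the process listings posted in this thread.

```shell
# find_rogue_models
# Reads "PID PPID COMMAND" lines on stdin and prints the PID of every
# master.exe whose parent is not an oifs_43r3* controlling process.
find_rogue_models() {
    awk '
        { pid[NR] = $1; ppid[NR] = $2; cmd[NR] = $3 }
        $3 ~ /^oifs_43r3/ { wrapper[$1] = 1 }   # remember controller PIDs
        END {
            for (i = 1; i <= NR; i++)
                if (cmd[i] == "master.exe" && !(ppid[i] in wrapper))
                    print pid[i]
        }
    '
}
```

It would be fed from ps, for example `ps -e -o pid= -o ppid= -o comm= | find_rogue_models`. Any PID it prints is a master.exe that has lost its controlling process; suspend the OpenIFS tasks before killing it.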

I think the boinc client will eventually kill off any rogue processes, though you may need to manually clean the slot directory.

Error code 9 : Some users have reported seeing 'task exited with error code 9'. This is an indication of lack of system memory. Reduce the number of OpenIFS tasks you have running.
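
One way to cap the number of OpenIFS tasks is BOINC's app_config.xml mechanism, using max_concurrent. The sketch below prints a suggested file for a given amount of memory. The function name suggest_app_config is made up, the per-task memory is something you supply (roughly 6 GB was quoted earlier in this thread), and the app name oifs_43r3_ps is the one announced for this batch.

```shell
# suggest_app_config MEM_GB PER_TASK_GB
# Prints an app_config.xml that caps concurrent oifs_43r3_ps tasks so their
# combined memory stays within MEM_GB. Both arguments are whole numbers of GB.
suggest_app_config() {
    n=$(( $1 / $2 ))
    # Clamp to 1: in BOINC a max_concurrent of 0 means "no limit".
    [ "$n" -ge 1 ] || n=1
    cat <<EOF
<app_config>
  <app>
    <name>oifs_43r3_ps</name>
    <max_concurrent>$n</max_concurrent>
  </app>
</app_config>
EOF
}
```

For example, `suggest_app_config 16 6` allows two tasks at a time. The output would go into app_config.xml in the project directory (on the Linux installs shown in this thread that appears to be /var/lib/boinc/projects/climateprediction.net), after which the client needs to re-read its config files or be restarted.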

If anyone has problems/questions with this, send me a Private Message and I'll help.

Data volumes
Volunteers on slower internet lines (ADSL) have reported problems with transferring the model output. That's something we can deal with in subsequent batches.
Remedy : reduce the number of OpenIFS tasks

There have also been messages that the boinc client reports climateprediction.net needs very large storage (38 GB was mentioned in one post). This is a consequence of both the task failures leaving data behind and the data volumes.

I suggest setting 'no new tasks', letting the openifs tasks finish and then manually delete any files in the open slot directories.
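
Before deleting anything it helps to see which slot directories are actually holding the space. A minimal sketch, assuming the BOINC data directory is /var/lib/boinc as on the Linux installs shown in this thread; the function name slot_usage is made up for this example:

```shell
# slot_usage BOINC_DIR
# Prints the size in kilobytes of each slot directory, largest first.
slot_usage() {
    du -sk "$1"/slots/*/ 2>/dev/null | sort -rn
}
```

Running `slot_usage /var/lib/boinc` (probably as root or the boinc user) lists the slots by size. Only clean a slot once no task is using it. 'No new tasks' can also be set from the command line with `boinccmd --project URL nomorework`, where URL is whatever project URL your client is attached with.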

Last, if anyone who has more experience with boinc than me wants to add anything useful, please feel free.
ID: 66660
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 66669 - Posted: 30 Nov 2022, 18:17:53 UTC

Interesting. I have 2 processes running despite all tasks being suspended while I wait for uploads to clear.

  52816 boinc     39  19  141824   1164    308 S   0.3   0.0   2:24.43 oifs_43r3_ps_1.                                        
  59704 boinc     39  19   10752    896    312 S   0.3   0.0   0:47.65 oifs_43r3_ps_1. 


I currently have 4 successes and two of the crashes right at the end.
ID: 66669

©2024 climateprediction.net