Posts by Jean-David Beyer

61) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69775)
Posted 12 Oct 2023 by Jean-David Beyer
Post:
Well, all three of my tasks crashed after uploading 10 trickles each. My machine got another task and it crashed after uploading a single trickle.
I cannot tell what really went wrong with any of them.

My machine is
Computer ID 1512658, and the tasks were:

22340449
22339081
22339022
22346116
62) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69712)
Posted 9 Oct 2023 by Jean-David Beyer
Post:
FWIW, I have 7 zips that cannot upload. "transient HTTP error"


I have three tasks running and have had no trouble uploading zip files. Each has uploaded seven .zip files.

Here is one of them:

Task 22340449
Name 	wah2_eas25_a3fh_200712_24_996_012227993_0
Workunit 	12227993
Created 	5 Oct 2023, 16:02:19 UTC
Sent 	5 Oct 2023, 16:38:36 UTC
Report deadline 	16 Oct 2024, 21:58:36 UTC
Received 	---
Server state 	In progress
Outcome 	---
Client state 	New
Exit status 	0 (0x00000000)
Computer ID 	1512658
Run time 	
CPU time 	
Validate state 	Initial
Credit 	5,819.81
Device peak FLOPS 	4.23 GFLOPS
Application version 	Weather At Home 2 (wah2) v8.24 windows_intelx86
Stderr 	

--

Latest Trickles Received
Time Sent (UTC)       Host ID  Result ID  Result Name                                Timestep  CPU Time (sec)  Average (sec/TS)
09 Oct 2023 06:04:24  1512658  22340449   wah2_eas25_a3fh_200712_24_996_012227993_0  80,939    306,949         3.7923
08 Oct 2023 17:21:55  1512658  22340449   wah2_eas25_a3fh_200712_24_996_012227993_0  69,419    261,246         3.7633
08 Oct 2023 05:04:43  1512658  22340449   wah2_eas25_a3fh_200712_24_996_012227993_0  57,899    217,100         3.7496
07 Oct 2023 16:52:41  1512658  22340449   wah2_eas25_a3fh_200712_24_996_012227993_0  46,379    173,219         3.7349
07 Oct 2023 04:53:35  1512658  22340449   wah2_eas25_a3fh_200712_24_996_012227993_0  34,859    130,196         3.7349
06 Oct 2023 16:55:50  1512658  22340449   wah2_eas25_a3fh_200712_24_996_012227993_0  23,339    87,178          3.7353
06 Oct 2023 05:00:33  1512658  22340449   wah2_eas25_a3fh_200712_24_996_012227993_0  11,819    44,353          3.7527
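
The last column of the trickle table can be checked from the other two: the average appears to be cumulative CPU time divided by the timestep reached. Here is a minimal Python sketch of that check, with the numbers copied from the table above. The names `trickles` and `seconds_per_timestep` are made up for illustration and are not anything CPDN provides.

```python
# (timestep, cumulative CPU seconds, reported average sec/TS), oldest first,
# copied by hand from the trickle table for task 22340449.
trickles = [
    (11_819, 44_353, 3.7527),
    (23_339, 87_178, 3.7353),
    (34_859, 130_196, 3.7349),
    (46_379, 173_219, 3.7349),
    (57_899, 217_100, 3.7496),
    (69_419, 261_246, 3.7633),
    (80_939, 306_949, 3.7923),
]


def seconds_per_timestep(timestep, cpu_seconds):
    """Average CPU seconds spent per model timestep so far."""
    return cpu_seconds / timestep


for timestep, cpu_seconds, reported in trickles:
    computed = seconds_per_timestep(timestep, cpu_seconds)
    print(f"{timestep:>6}  computed {computed:.4f}  reported {reported:.4f}")

# Number of timesteps between consecutive trickles.
steps_between = [later[0] - earlier[0]
                 for earlier, later in zip(trickles, trickles[1:])]
print(steps_between)
```

The computed averages agree with the reported column to within rounding, and the gap between trickles is the same number of timesteps each time.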
63) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69697)
Posted 8 Oct 2023 by Jean-David Beyer
Post:
Left the PC on overnight.
The task that was running last night is still running this morning.
However, a new task arrived and promptly crashed (less than 2 minutes running) with a segment violation error:

https://www.cpdn.org/result.php?resultid=22344887

So it looks as if that problem, while reduced, is still around.


I wonder what your problem is.
My three tasks are still running and have now uploaded 5 trickles.
Oldest one is:
22340449 	12227993 	5 Oct 2023, 16:38:36 UTC 	16 Oct 2024, 21:58:36 UTC 	In progress
64) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69695)
Posted 7 Oct 2023 by Jean-David Beyer
Post:
All but one of the failures was after shutting down for the night. It's somewhat reassuring that it's not my computer that's got an issue, but it's a bit disappointing that the stop/restart issue hasn't been fully cured yet.


That may be why I seem to get fewer crashes than others. I let my machines run 24/7 and reboot them only when installing updates.
65) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69692)
Posted 7 Oct 2023 by Jean-David Beyer
Post:
Don't want to dual boot as this is my main machine and non BOINC work all happens in Linux.


Good Idea.

I, too, hate dual booting because I run almost everything in Linux. I need Windows only to run TaxAct each year to do my income taxes (Federal and my state). And four times a year to keep my Garmin GPS unit up to date.

I could get a Windows license to run Windows on this machine, but a few years ago I got sick of that, so I got a little desktop machine (it looks just like a monitor, but the computer is inside the monitor). That little computer runs Windows 10 and has nothing else to do, so I downloaded Boinc onto it.

I signed it up for CPDN, WCG, DENIS, Rosetta, Einstein, and Universe.

My main machine is ID: 1511241 and has lots of RAM and processor cache. My pipsqueak machine is ID: 1512658 and has much less RAM and a slower processor with only 8 cores. My Linux machine has a pretty fast processor with 16 cores.
66) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69690)
Posted 7 Oct 2023 by Jean-David Beyer
Post:
Had hoped that the signal11 failures would be a lot lower with this batch but it seems this might not be the case. This is to do with the batch and not your computer. Just hoping there are enough good tasks between this and the last lot for the researcher to get what she needs.


I have only one computer running Windows and I do not run WINE on the other (Linux machine). How do you distinguish between failures due to the machine from those due to the batch? I assume mine are all from the same batch and they show no signs of failure yet. I guess you see results from many other machines so you have more data from which to draw conclusions.

My three work units have about two days of work done on each. Each has uploaded 3 zip files. No failures yet. These are on my Windows 10 machine.
Computer 1512658

22339022 	12226566 	5 Oct 2023, 18:39:55 UTC 	16 Oct 2024, 23:59:55 UTC 	In progress 	--- 	--- 	2,506.49 	Weather At Home 2 (wah2) v8.24 windows_intelx86
22339081 	12226625 	5 Oct 2023, 17:39:17 UTC 	16 Oct 2024, 22:59:17 UTC 	In progress 	--- 	--- 	2,506.49 	Weather At Home 2 (wah2) v8.24 windows_intelx86
22340449 	12227993 	5 Oct 2023, 16:38:36 UTC 	16 Oct 2024, 21:58:36 UTC 	In progress 	--- 	--- 	2,506.49 	Weather At Home 2 (wah2) v8.24 windows_intelx86


Task 22339022
Name 	wah2_eas25_a2bu_200012_24_996_012226566_0
Workunit 	12226566

Computer ID 	1512658

Credit 	2,506.49
Device peak FLOPS 	4.23 GFLOPS
Application version 	Weather At Home 2 (wah2) v8.24 windows_intelx86
67) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69675)
Posted 6 Oct 2023 by Jean-David Beyer
Post:
I don't trust WINE for running the model correctly. We discovered during testing that WINE implementations do not fail the model when it suffers a memory fault unlike on bare metal Windows. I think there is some memory protection in place for WINE. That implies the results from incorrect memory addresses (e.g. maybe zero) are being used by the model, potentially corrupting the results.


I think memory faults, WINE or not, are an indication of an incorrect program or a hardware fault. If WINE has some memory protection in it in addition to the hardware, perhaps this is just more proof of my theory. It seems to hide the memory faults.

When I first used Windows (Windows 95) it had so many faults that it crashed several times a day even if it was not doing anything. I did not run BOINC then (I do not remember if it existed at that time). Since then Windows has improved some. IIRC Windows 7 was pretty good and I am now running Windows 10 on my other machine.

The three current tasks on my Windows machine now have about 18 hours on them with about 9 days to go.
68) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69665)
Posted 5 Oct 2023 by Jean-David Beyer
Post:
The task on my Ryzen running Windows natively failed at the usual point with a signal 11 (segmentation fault) during the first model day. Tasks running under Wine appear to be progressing nicely.


I have three of those tasks running on my Windows 10 machine. They started at about one-hour intervals and have about 1.7, 2.7, and 3.7 hours completed.
About 9.5 days for them to complete. 8-core machine running on 7 of the cores. Machine not doing anything else (except 4 other Boinc tasks).
69) Message boards : Number crunching : New work discussion - 2 (Message 69662)
Posted 5 Oct 2023 by Jean-David Beyer
Post:
Got one on my pipsqueak Windows 10 machine and it has over 15 minutes on it so far. Predicting 9 days 18 hours to go.
Task 22340449
Computer 1512658
70) Message boards : Number crunching : New work discussion - 2 (Message 69639)
Posted 17 Sep 2023 by Jean-David Beyer
Post:
Wow, that's pure insanity, but being Linux it doesn't surprise me. Swapping would be more sensible.


I would not blame Linux. And when things get so bad as to run the system out of memory, swapping may not be possible: buffers would be required to do the swap, and there is probably no space for the needed buffers.

As I said earlier, in over 20 years of running Linux, this has never happened to me.
71) Message boards : Number crunching : New work discussion - 2 (Message 69637)
Posted 17 Sep 2023 by Jean-David Beyer
Post:
I know Linux can do that, but I have never had it happen and I have been running Linux since about 1998 (Red Hat Linux 5, not Enterprise Linux, to begin with).
I am currently running Red Hat Enterprise Linux release 8.8 (Ootpa).

I would hope an OS would never do that. The application could be important and have unsaved work. Or did you mean swap to disk?


I do not mean swap to disk.

https://neo4j.com/developer/kb/linux-out-of-memory-killer/

This one is probably better:

https://rakeshjain-devops.medium.com/linux-out-of-memory-killer-31e477a45759
72) Message boards : Number crunching : New work discussion - 2 (Message 69635)
Posted 17 Sep 2023 by Jean-David Beyer
Post:
Einstein doesn't freeze my computers. Boinc removes tasks if the memory is too full.


Einstein does not freeze my computers either.
I do not know if Boinc removes tasks if memory is too full, whatever that means. I know Linux can do that, but I have never had it happen and I have been running Linux since about 1998 (Red Hat Linux 5, not Enterprise Linux, to begin with).
I am currently running Red Hat Enterprise Linux release 8.8 (Ootpa).
73) Message boards : Number crunching : Credit handed out weekly? (Message 69619)
Posted 15 Sep 2023 by Jean-David Beyer
Post:
I'm hoping to get more Linux tasks haha!


Me too.

Remember, though: Hope is just deferred disappointment.
74) Message boards : Number crunching : Credit handed out weekly? (Message 69591)
Posted 6 Sep 2023 by Jean-David Beyer
Post:
Check the stats sites again as looks like something changed today. I just picked up a ton of credit. Kudos Andy if that was your work!


Me too. I just picked up 6 megabits of credit. All in one swell foop. They were issued some months ago.
75) Message boards : Number crunching : New work discussion - 2 (Message 69583)
Posted 4 Sep 2023 by Jean-David Beyer
Post:
If you accidentally go over the RAM limit and go into the pagefile, a rust spinner grinds the computer to a halt, so much so you can't even use the interface to stop the problem.


For sure. But my machine has 128 GBytes of RAM and 16 cores, of which 12 are allowed for Boinc. Furthermore, I set app_config files to limit how many of each type of task are allowed to run. So I do not remember ever using the pagefile for much of anything. Running 24/7 for a little over three days, I seem to be using only one megabyte of pagefile. And that pagefile is on the reasonably fast NVMe drive.
top - 10:59:10 up 3 days, 17:18,  2 users,  load average: 12.46, 12.70, 12.54
Tasks: 471 total,  11 running, 460 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.8 us,  5.4 sy, 68.9 ni, 24.7 id,  0.0 wa,  0.1 hi,  0.0 si,  0.0 st
MiB Mem : 128086.0 total,   1183.3 free,   7824.3 used, 119078.5 buff/cache
MiB Swap:  15992.0 total,  15991.0 free,      1.0 used. 118833.2 avail Mem 
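
The swap figure can be read straight off the last two lines of that top output. Below is a small Python sketch that parses them and works out how much swap is in use. It is only an illustration: the parsing helper `parse_top_line` is hypothetical, and the regular expression assumes the exact layout shown above, which other versions of top may not share.

```python
import re

# The two memory summary lines, copied from the top output above.
top_lines = """\
MiB Mem : 128086.0 total,   1183.3 free,   7824.3 used, 119078.5 buff/cache
MiB Swap:  15992.0 total,  15991.0 free,      1.0 used. 118833.2 avail Mem
"""


def parse_top_line(line):
    """Return {label: MiB} for each 'number label' pair on one summary line."""
    # Drop the 'MiB Mem :' or 'MiB Swap:' prefix, keep what follows the colon.
    _, _, rest = line.partition(":")
    pairs = re.findall(r"([\d.]+)\s+([A-Za-z/]+(?: Mem)?)", rest)
    return {label: float(value) for value, label in pairs}


mem, swap = (parse_top_line(line) for line in top_lines.splitlines())

swap_used_fraction = swap["used"] / swap["total"]
cache_fraction = mem["buff/cache"] / mem["total"]

print(f"swap in use: {swap['used']} MiB of {swap['total']} MiB "
      f"({swap_used_fraction:.4%})")
print(f"RAM held as buffers/cache: {cache_fraction:.1%}")
print(f"available without swapping: {swap['avail Mem']} MiB")
```

The small "free" figure is not a sign of memory pressure here: most of the RAM is buffers and cache, and the "avail Mem" estimate is the number that matters for whether new work would push the machine into swap.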
76) Message boards : Number crunching : New work discussion - 2 (Message 69575)
Posted 3 Sep 2023 by Jean-David Beyer
Post:
I think most of us have SSDs by now. I gave up on rust spinners for anything but backups, security cameras, and TV/Films years ago.


I am guessing that without a reasonably fast NVMe drive, some users will notice the slowdown.


Well, I do have an NVMe drive on my machine, but the partition for Boinc is on a SATA hard drive. OTOH, the other two partitions on that drive store videos and sound files that I seldom use; I surely go at least 8 hours a day without using them at all, and my machine runs 24/7 except for occasional system updates. So writing checkpoint files will, at least, not be doing a lot of seeking on that drive.

The other concern is disk I/O. The hi-mem OIFS models will be writing larger checkpoint files (aka restart files) to disk. We need time to tune the model I/O so as not to cause problems.


IIRC when the Oifs tasks were being sent out early this year, I was running 3 or 4 of those at a time with no problems with computation or even trickle uploads. I do have a 75 megabit/sec Internet connection.
77) Message boards : Number crunching : New work discussion - 2 (Message 69565)
Posted 2 Sep 2023 by Jean-David Beyer
Post:
Until we get more experience with volunteers running these high memory apps I think it makes sense to restrict it to a single task for now. We can change it later in light of experience.


One way to get more experience with volunteers running these high memory apps would be to send more of them to us volunteers.
78) Message boards : Number crunching : New work discussion - 2 (Message 69543)
Posted 30 Aug 2023 by Jean-David Beyer
Post:
I can tell you we have successfully run higher resolution configurations of OpenIFS on the dev-site that use 11Gb & 20Gb RAM and CPDN has okayed testing an even higher one that uses ~28Gb RAM. I don't think we will go beyond that yet, as these models also produce more output that might cause issues when uploading (plus more I/O to disk). Because of the memory size, CPDN will also limit the no. of 'in progress' tasks a user can have, so the 64Gb you already have is fine.


My main machine runs Red Hat Enterprise Linux release 8.8 (Ootpa) and is like this:

Computer 1511241

CPU type 	GenuineIntel
Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors 	16
Operating System 	Linux Red Hat Enterprise Linux
Red Hat Enterprise Linux 8.8 (Ootpa) [4.18.0-477.15.1.el8_8.x86_64|libc 2.28]
BOINC version 	7.20.2
Memory 	125.08 GB
Cache 	16896 KB
Swap space 	15.62 GB
Total disk space 	488.04 GB
Free Disk Space 	480.57 GB
Measured floating point speed 	6.02 billion ops/sec
Measured integer speed 	25.36 billion ops/sec
Average upload rate 	139.34 KB/sec
Average download rate 	22391.7 KB/sec


I normally have it run 12 Boinc tasks at a time. My Internet connection is Verizon FiOS, guaranteed to run at 75 Megabits/second. It actually gets results like those in the speed-test table below. CPDN reports slower upload speeds than download speeds. I do not know why the speeds should be so different. I do not believe the download speeds are as fast as CPDN says. Those speeds could be true if they were in Kilobits per second, but KBytes per second is not really possible.
When I was getting oifs jobs, the trickles went up quite fast as long as the upload servers were running. I guess I could run one 28 GByte Oifs task at a time as well as some smaller tasks at the same time.
Timestamp 	  Download   Upload 	Latency Jitter Quality Score Test Server
8/30/2023 17:9:28 76.65 Mbps 89.02 Mbps 4 ms    2 ms   Excellent     speedtest1.nyc1.nitelusa.net.prod.hosts.ooklaserver.net
6/7/2023 20:3:31  78.13 Mbps 63.66 Mbps 6 ms    1 ms   Excellent     ny2.speedtest.gslnetworks.com.prod.hosts.ooklaserver.net
5/5/2023 11:23:28 76.26 Mbps 89.16 Mbps 6 ms    1 ms   Excellent     speedtest.nyc.rr.com
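
The units argument above is easy to check with a little arithmetic. Here is a short Python sketch, assuming 1 KB means 1000 bytes (using 1024 makes the download figure even larger). The function name is made up for illustration; the rates are the ones CPDN reports for computer 1511241.

```python
def kbytes_per_sec_to_mbits_per_sec(kb_per_sec, bytes_per_kb=1000):
    """Convert a rate in KBytes/second to Megabits/second."""
    return kb_per_sec * bytes_per_kb * 8 / 1_000_000


reported_download_kb = 22391.7   # "Average download rate" from the host page
reported_upload_kb = 139.34      # "Average upload rate" from the host page
line_rate_mbps = 75.0            # nominal speed of the Internet connection

download_mbps = kbytes_per_sec_to_mbits_per_sec(reported_download_kb)
upload_mbps = kbytes_per_sec_to_mbits_per_sec(reported_upload_kb)

# The same download number read as kilobits/second instead of KBytes/second.
download_if_kilobits_mbps = reported_download_kb / 1000

print(f"download as KBytes/s: {download_mbps:.1f} Mbit/s")
print(f"download as kbit/s:   {download_if_kilobits_mbps:.1f} Mbit/s")
print(f"upload as KBytes/s:   {upload_mbps:.2f} Mbit/s")
print("download exceeds line rate:", download_mbps > line_rate_mbps)
```

Read as KBytes per second, the download rate comes out well above what a 75 Megabit/second line can carry, while the kilobit reading fits comfortably within it.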
79) Message boards : Cafe CPDN : The Climate Machine (Message 69472)
Posted 14 Aug 2023 by Jean-David Beyer
Post:
In OpenIFS, the low level computer interface (debugging/tracing/hardware) is handled in C, the number crunching in Fortran (because Fortran compilers still produce the fastest code in general) and the upper level control code was written in C++.


While I have no intent to contradict you, I wonder if claims like this are actually useful, even if true. At one time, I was working as part of a two-man team to write an assembly-level optimizer for the C compiler. We were given a bunch of benchmarks to optimize, and we got some truly impressive speed-ups. For the famous Whetstone benchmark, supposedly a test of floating point computation, for example, we got over 10,000:1 speedup. This took several parts. Whetstone had several modules and one was thought to be a test of floating point computation because it was called 10,000 times and it did a bunch of floating point operations. That module was actually there to test function and subroutine calling overhead. And we defeated that by expanding the routine in-line. The loop-invariant code motion optimization moved all those floating point operations outside the loop, causing an enormous speed up. Then live-dead analysis noticed the results were never used, so it eliminated the instructions (including the loop overhead) altogether. Marketing was pleased because we could do that benchmark so much better than Motorola (who made a better processor than we did).
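
To make the three steps concrete, here is a toy Python sketch of the same sequence, with each transformation applied by hand. It is not the Whetstone code and not the optimizer described above, and Python itself performs none of these optimizations; the function names and the counter are invented for the illustration.

```python
import math

fp_evaluations = 0  # counts how often the floating point work really runs


def expensive(x):
    """Stand-in for the module full of floating point operations."""
    global fp_evaluations
    fp_evaluations += 1
    return math.sin(x) * math.cos(x) + math.sqrt(x)


def benchmark_original(x, n=10_000):
    # Called n times with the same argument; the result is never read.
    for _ in range(n):
        unused = expensive(x)


def benchmark_hoisted(x, n=10_000):
    # After in-lining, the work is visibly loop-invariant, so it moves
    # out of the loop and runs once.
    unused = expensive(x)
    for _ in range(n):
        pass


def benchmark_eliminated(x, n=10_000):
    # Live-dead analysis: the value is never used, so the computation
    # and the now-empty loop are both removed.
    pass


def count_evaluations(benchmark):
    """Run one variant and report how many times the work executed."""
    global fp_evaluations
    fp_evaluations = 0
    benchmark(2.0)
    return fp_evaluations


for variant in (benchmark_original, benchmark_hoisted, benchmark_eliminated):
    print(variant.__name__, count_evaluations(variant))
```

The point of the sketch is the same as the anecdote: the huge speedup comes from the benchmark doing work nobody uses, which says little about how real programs would fare.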

We had a huge IBM 370 machine that was running UNIX and they gathered a lot of data, so we had them tell us how many processes were run per day and what the programs were that took the most time. nroff/troff (text processor) was the biggest so we ran that through our optimizer and it sped up a little bit (IIRC 10%), but not 10,000:1 or anywhere near.

IMAO, it does not matter much how good a compiler is (unless it is really awful), or how good the programming language is. What matters is what the algorithms are and how well the system is programmed. And fixing those is what really matters these days. So the language best used is probably the one with which the programmers are most familiar, and IMAO, FORTRAN is not it.

For purely numeric calculation, I preferred Algol-60, but would hesitate to recommend it for CPDN since my guess is that most programmers never even heard of it, and I do not know any compilers for it either. I got to be pretty good at C and C++, but have not written anything in over 20 years, so I am probably nowhere near as good a programmer as I used to be before I retired.
80) Message boards : Number crunching : Website certificate problem (Message 69453)
Posted 7 Aug 2023 by Jean-David Beyer
Post:
Seems fixed now.

