61)
Message boards :
Number crunching :
Batch 996 Weather@Home2 East Asia25
(Message 69775)
Posted 12 Oct 2023 by Jean-David Beyer Post: Well, all three of my tasks crashed after uploading 10 trickles each. My machine got another task, and it crashed after uploading a single trickle. I cannot tell what really went wrong with any of them. My machine is Computer ID 1512658, and the tasks were: 22340449, 22339081, 22339022, 22346116.
62)
Message boards :
Number crunching :
Batch 996 Weather@Home2 East Asia25
(Message 69712)
Posted 9 Oct 2023 by Jean-David Beyer Post:
Quote: FWIW, I have 7 zips that cannot upload. "transient HTTP error"
I have three tasks running and have had no trouble uploading zip files. Each has uploaded seven .zip files. Here is one of them:
Task 22340449
Name: wah2_eas25_a3fh_200712_24_996_012227993_0
Workunit: 12227993
Created: 5 Oct 2023, 16:02:19 UTC
Sent: 5 Oct 2023, 16:38:36 UTC
Report deadline: 16 Oct 2024, 21:58:36 UTC
Received: ---
Server state: In progress
Outcome: ---
Client state: New
Exit status: 0 (0x00000000)
Computer ID: 1512658
Run time:
CPU time:
Validate state: Initial
Credit: 5,819.81
Device peak FLOPS: 4.23 GFLOPS
Application version: Weather At Home 2 (wah2) v8.24 windows_intelx86
Stderr: --
Latest Trickles Received:
Time Sent (UTC) | Host ID | Result ID | Result Name | Timestep | CPU Time (sec) | Average (sec/TS)
09 Oct 2023 06:04:24 | 1512658 | 22340449 | wah2_eas25_a3fh_200712_24_996_012227993_0 | 80,939 | 306,949 | 3.7923
08 Oct 2023 17:21:55 | 1512658 | 22340449 | wah2_eas25_a3fh_200712_24_996_012227993_0 | 69,419 | 261,246 | 3.7633
08 Oct 2023 05:04:43 | 1512658 | 22340449 | wah2_eas25_a3fh_200712_24_996_012227993_0 | 57,899 | 217,100 | 3.7496
07 Oct 2023 16:52:41 | 1512658 | 22340449 | wah2_eas25_a3fh_200712_24_996_012227993_0 | 46,379 | 173,219 | 3.7349
07 Oct 2023 04:53:35 | 1512658 | 22340449 | wah2_eas25_a3fh_200712_24_996_012227993_0 | 34,859 | 130,196 | 3.7349
06 Oct 2023 16:55:50 | 1512658 | 22340449 | wah2_eas25_a3fh_200712_24_996_012227993_0 | 23,339 | 87,178 | 3.7353
06 Oct 2023 05:00:33 | 1512658 | 22340449 | wah2_eas25_a3fh_200712_24_996_012227993_0 | 11,819 | 44,353 | 3.7527
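In the trickle table above, the Average (sec/TS) column looks like cumulative CPU time divided by cumulative timestep (e.g. 306,949 / 80,939 ≈ 3.7923), and each trickle seems to arrive 11,520 timesteps after the previous one. A minimal Python sketch that checks both readings against the seven rows; the interpretation of the columns is an assumption, not documented CPDN behavior, and the helper names are hypothetical:

```python
# Trickle rows for task 22340449, copied from the table above:
# (time sent UTC, cumulative timestep, cumulative CPU time in seconds)
TRICKLES = [
    ("06 Oct 2023 05:00:33", 11_819, 44_353),
    ("06 Oct 2023 16:55:50", 23_339, 87_178),
    ("07 Oct 2023 04:53:35", 34_859, 130_196),
    ("07 Oct 2023 16:52:41", 46_379, 173_219),
    ("08 Oct 2023 05:04:43", 57_899, 217_100),
    ("08 Oct 2023 17:21:55", 69_419, 261_246),
    ("09 Oct 2023 06:04:24", 80_939, 306_949),
]


def average_sec_per_timestep(timestep: int, cpu_seconds: int) -> float:
    """Assumed meaning of the 'Average (sec/TS)' column: CPU seconds per timestep."""
    return cpu_seconds / timestep


def timesteps_between_trickles(rows) -> list:
    """Timestep increment between consecutive trickles."""
    return [later[1] - earlier[1] for earlier, later in zip(rows, rows[1:])]


if __name__ == "__main__":
    for sent, timestep, cpu in TRICKLES:
        print(f"{sent}  {average_sec_per_timestep(timestep, cpu):.4f} s/TS")
    print(timesteps_between_trickles(TRICKLES))
```

Every recomputed average matches the table to four decimals, and every increment is 11,520, which supports (but does not prove) that reading of the columns.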
63)
Message boards :
Number crunching :
Batch 996 Weather@Home2 East Asia25
(Message 69697)
Posted 8 Oct 2023 by Jean-David Beyer Post: Left the PC on overnight. I wonder what your problem is. My three tasks are still running and have now uploaded 5 trickles. Oldest one is: Task 22340449, Workunit 12227993, Sent 5 Oct 2023, 16:38:36 UTC, Report deadline 16 Oct 2024, 21:58:36 UTC, In progress.
64)
Message boards :
Number crunching :
Batch 996 Weather@Home2 East Asia25
(Message 69695)
Posted 7 Oct 2023 by Jean-David Beyer Post:
Quote: All but one of the failures were after shutting down for the night. It's somewhat reassuring that it's not my computer that's got an issue, but it's a bit disappointing that the stop/restart issue hasn't been fully cured yet.
That may be why I seem to get fewer crashes than others. I let my machines run 24/7 and reboot them only when installing updates.
65)
Message boards :
Number crunching :
Batch 996 Weather@Home2 East Asia25
(Message 69692)
Posted 7 Oct 2023 by Jean-David Beyer Post:
Quote: Don't want to dual boot as this is my main machine and non-BOINC work all happens in Linux.
Good idea. I, too, hate dual booting because I run almost everything in Linux. I need Windows only to run TaxAct each year to do my income taxes (Federal and my state), and four times a year to keep my Garmin GPS unit up to date. I could get a Windows license to run Windows on this machine, but a few years ago I got sick of that, so I got a little desktop machine (it looks just like a monitor, but the computer is inside the monitor). That little computer runs Windows 10 and has nothing else to do, so I installed BOINC on it. I signed it up for CPDN, WCG, DENIS, Rosetta, Einstein, and Universe. My main machine is ID 1511241 and has lots of RAM and processor cache. My pipsqueak machine is ID 1512658 and has much less RAM and a slower processor with only 8 cores. My Linux machine has a pretty fast processor with 16 cores.
66)
Message boards :
Number crunching :
Batch 996 Weather@Home2 East Asia25
(Message 69690)
Posted 7 Oct 2023 by Jean-David Beyer Post:
Quote: Had hoped that the signal 11 failures would be a lot lower with this batch but it seems this might not be the case. This is to do with the batch and not your computer. Just hoping there are enough good tasks between this and the last lot for the researcher to get what she needs.
I have only one computer running Windows, and I do not run WINE on the other (Linux) machine. How do you distinguish failures due to the machine from those due to the batch? I assume mine are all from the same batch, and they show no signs of failure yet. I guess you see results from many other machines, so you have more data from which to draw conclusions. My three work units have about two days of work done on each. Each has uploaded 3 zip files. No failures yet. These are on my Windows 10 machine.
Computer 1512658:
Task | Work unit | Sent | Report deadline | Status | Run time | CPU time | Credit | Application
22339022 | 12226566 | 5 Oct 2023, 18:39:55 UTC | 16 Oct 2024, 23:59:55 UTC | In progress | --- | --- | 2,506.49 | Weather At Home 2 (wah2) v8.24 windows_intelx86
22339081 | 12226625 | 5 Oct 2023, 17:39:17 UTC | 16 Oct 2024, 22:59:17 UTC | In progress | --- | --- | 2,506.49 | Weather At Home 2 (wah2) v8.24 windows_intelx86
22340449 | 12227993 | 5 Oct 2023, 16:38:36 UTC | 16 Oct 2024, 21:58:36 UTC | In progress | --- | --- | 2,506.49 | Weather At Home 2 (wah2) v8.24 windows_intelx86
Task 22339022
Name: wah2_eas25_a2bu_200012_24_996_012226566_0
Workunit: 12226566
Computer ID: 1512658
Credit: 2,506.49
Device peak FLOPS: 4.23 GFLOPS
Application version: Weather At Home 2 (wah2) v8.24 windows_intelx86
67)
Message boards :
Number crunching :
Batch 996 Weather@Home2 East Asia25
(Message 69675)
Posted 6 Oct 2023 by Jean-David Beyer Post:
Quote: I don't trust WINE for running the model correctly. We discovered during testing that WINE implementations do not fail the model when it suffers a memory fault, unlike on bare-metal Windows. I think there is some memory protection in place for WINE. That implies the results from incorrect memory addresses (e.g. maybe zero) are being used by the model, potentially corrupting the results.
I think memory faults, WINE or not, indicate either an incorrect program or a hardware fault. If WINE adds some memory protection on top of what the hardware provides, perhaps this is just more evidence for my theory: it seems to hide the memory faults. When I first used Windows (Windows 95), it had so many faults that it crashed several times a day even when it was not doing anything. I did not run BOINC then (I do not remember whether it existed at that time). Windows has improved somewhat since then. IIRC, Windows 7 was pretty good, and I am now running Windows 10 on my other machine. The three current tasks on my Windows machine now have about 18 hours on them, with about 9 days to go.
68)
Message boards :
Number crunching :
Batch 996 Weather@Home2 East Asia25
(Message 69665)
Posted 5 Oct 2023 by Jean-David Beyer Post:
Quote: The task on my Ryzen running Windows natively failed at the usual point with a signal 11 (segmentation fault) during the first model day. Tasks running under Wine appear to be progressing nicely.
I have three of those tasks running on my Windows 10 machine. They started at about one-hour intervals and have about 1.7, 2.7, and 3.7 hours completed, with about 9.5 days to go before they complete. It is an 8-core machine running BOINC on 7 of the cores, and it is not doing anything else (except 4 other BOINC tasks).
69)
Message boards :
Number crunching :
New work discussion - 2
(Message 69662)
Posted 5 Oct 2023 by Jean-David Beyer Post: Got one on my pipsqueak Windows 10 machine, and it has over 15 minutes on it so far. Predicting 9 days 18 hours to go. Task 22340449, Computer 1512658.
70)
Message boards :
Number crunching :
New work discussion - 2
(Message 69639)
Posted 17 Sep 2023 by Jean-David Beyer Post:
Quote: Wow, that's pure insanity, but being Linux it doesn't surprise me. Swapping would be more sensible.
I would not blame Linux. When things get so bad that the system runs out of memory, swapping may not be possible: buffers would be required to do the swap, and there is probably no space for the needed buffers. As I said earlier, in over 20 years of running Linux, this has never happened to me.
71)
Message boards :
Number crunching :
New work discussion - 2
(Message 69637)
Posted 17 Sep 2023 by Jean-David Beyer Post: I know Linux can do that, but I have never had it happen, and I have been running Linux since about 1998 (Red Hat Linux 5, not Red Hat Enterprise Linux, to begin with). I do not mean swap to disk. https://neo4j.com/developer/kb/linux-out-of-memory-killer/ This one is probably better: https://rakeshjain-devops.medium.com/linux-out-of-memory-killer-31e477a45759
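For anyone who wants to see which process the Linux out-of-memory killer described in those links would favor, the kernel exposes a per-process badness score in /proc/<pid>/oom_score (higher means more likely to be chosen). A minimal Python sketch, assuming a Linux /proc layout; oom_ranking is a hypothetical helper, and its proc_root parameter exists only so it can be tried on a fake directory tree:

```python
from pathlib import Path
from typing import List, Tuple


def oom_ranking(proc_root: str = "/proc", top: int = 5) -> List[Tuple[int, int, str]]:
    """Return (oom_score, pid, command) for the highest-scoring processes.

    A higher /proc/<pid>/oom_score means the kernel's OOM killer is more
    likely to choose that process when the system runs out of memory.
    """
    ranked = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():  # skip /proc/self, /proc/meminfo, ...
            continue
        try:
            score = int((entry / "oom_score").read_text().strip())
            command = (entry / "comm").read_text().strip()
        except (OSError, ValueError):
            continue  # process exited meanwhile, or file unreadable
        ranked.append((score, int(entry.name), command))
    ranked.sort(reverse=True)
    return ranked[:top]


if __name__ == "__main__" and Path("/proc/self/oom_score").exists():
    for score, pid, command in oom_ranking():
        print(f"{score:6d} {pid:8d}  {command}")
```

On Linux, calling oom_ranking() with the defaults lists the five highest-scoring processes; it only reads scores and does not change how the kernel behaves.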
72)
Message boards :
Number crunching :
New work discussion - 2
(Message 69635)
Posted 17 Sep 2023 by Jean-David Beyer Post:
Quote: Einstein doesn't freeze my computers. Boinc removes tasks if the memory is too full.
Einstein does not freeze my computers either. I do not know whether BOINC removes tasks if memory is too full, whatever that means. I know Linux can do that, but I have never had it happen, and I have been running Linux since about 1998 (Red Hat Linux 5, not Red Hat Enterprise Linux, to begin with). I am currently running Red Hat Enterprise Linux release 8.8 (Ootpa).
73)
Message boards :
Number crunching :
Credit handed out weekly?
(Message 69619)
Posted 15 Sep 2023 by Jean-David Beyer Post:
Quote: I'm hoping to get more Linux tasks haha!
Me too. Remember, though: hope is just deferred disappointment.
74)
Message boards :
Number crunching :
Credit handed out weekly?
(Message 69591)
Posted 6 Sep 2023 by Jean-David Beyer Post:
Quote: Check the stats sites again as looks like something changed today. I just picked up a ton of credit. Kudos Andy if that was your work!
Me too. I just picked up 6 megabits of credit, all in one swell foop. They were issued some months ago.
75)
Message boards :
Number crunching :
New work discussion - 2
(Message 69583)
Posted 4 Sep 2023 by Jean-David Beyer Post:
Quote: If you accidentally go over the RAM limit and go into the pagefile, a rust spinner grinds the computer to a halt, so much so you can't even use the interface to stop the problem.
For sure. But my machine has 128 GBytes of RAM and 16 cores, of which 12 are allowed for BOINC. Furthermore, I set app_config files to limit how many of each type of task is allowed to run, so I do not remember ever using the pagefile for much of anything. Running 24/7 for a little over three days, I seem to be using only one megabyte of pagefile, and that pagefile is on the reasonably fast NVMe drive.
top - 10:59:10 up 3 days, 17:18, 2 users, load average: 12.46, 12.70, 12.54
Tasks: 471 total, 11 running, 460 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.8 us, 5.4 sy, 68.9 ni, 24.7 id, 0.0 wa, 0.1 hi, 0.0 si, 0.0 st
MiB Mem : 128086.0 total, 1183.3 free, 7824.3 used, 119078.5 buff/cache
MiB Swap: 15992.0 total, 15991.0 free, 1.0 used. 118833.2 avail Mem
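BOINC's per-project app_config.xml is the usual place for limits like the ones mentioned above: a max_concurrent element inside an app element caps how many tasks of that application run at once, and project_max_concurrent caps the whole project. A minimal Python sketch that writes such a file; build_app_config is a hypothetical helper, and the app names in the example are placeholders (the real ones appear as the name of each app in the client's client_state.xml):

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def build_app_config(per_app_limits: Dict[str, int],
                     project_limit: Optional[int] = None) -> str:
    """Return app_config.xml text limiting concurrent tasks per application.

    per_app_limits maps a BOINC application name to its max_concurrent value;
    project_limit, if given, becomes project_max_concurrent.
    """
    root = ET.Element("app_config")
    for app_name, limit in per_app_limits.items():
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = app_name
        ET.SubElement(app, "max_concurrent").text = str(limit)
    if project_limit is not None:
        ET.SubElement(root, "project_max_concurrent").text = str(project_limit)
    ET.indent(root)  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder app names: replace with the names from client_state.xml.
    print(build_app_config({"example_big_app": 1, "example_small_app": 4},
                           project_limit=5))
```

The file goes in that project's folder under the BOINC data directory; the client reads it after Options > Read config files in the BOINC Manager, or after a client restart.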
76)
Message boards :
Number crunching :
New work discussion - 2
(Message 69575)
Posted 3 Sep 2023 by Jean-David Beyer Post:
Quote: I think most of us have SSDs by now. I gave up on rust spinners for anything but backups, security cameras, and TV/Films years ago. I am guessing that without a reasonably fast NVMe drive, some users will notice the slowdown.
Well, I do have an NVMe drive on my machine, but the partition for BOINC is on a SATA hard drive. OTOH, the other two partitions on that drive store videos and sound files that I seldom use; I surely go at least 8 hours a day without using them at all, and my machine runs 24/7 except for occasional system updates. So writing checkpoint files will, at least, not cause a lot of seeking on that drive.
Quote: The other concern is disk I/O. The hi-mem OIFS models will be writing larger checkpoint files (aka restart files) to disk. We need time to tune the model I/O so as not to cause problems.
IIRC, when the OIFS tasks were being sent out early this year, I was running 3 or 4 of those at a time with no problems with computation or even trickle uploads. I do have a 75 megabit/sec Internet connection.
77)
Message boards :
Number crunching :
New work discussion - 2
(Message 69565)
Posted 2 Sep 2023 by Jean-David Beyer Post:
Quote: Until we get more experience with volunteers running these high memory apps I think it makes sense to restrict it to a single task for now. We can change it later in light of experience.
One way to get more experience with volunteers running these high-memory apps would be to send more of them to us volunteers.
78)
Message boards :
Number crunching :
New work discussion - 2
(Message 69543)
Posted 30 Aug 2023 by Jean-David Beyer Post:
Quote: I can tell you we have successfully run higher resolution configurations of OpenIFS on the dev-site that use 11GB & 20GB RAM and CPDN has okayed testing an even higher one that uses ~28GB RAM. I don't think we will go beyond that yet, as these models also produce more output that might cause issues when uploading (plus more I/O to disk). Because of the memory size, CPDN will also limit the no. of 'in progress' tasks a user can have so 64GB you have already is fine.
My main machine runs Red Hat Enterprise Linux release 8.8 (Ootpa) and is like this:
Computer 1511241
CPU type: GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors: 16
Operating System: Linux Red Hat Enterprise Linux Red Hat Enterprise Linux 8.8 (Ootpa) [4.18.0-477.15.1.el8_8.x86_64|libc 2.28]
BOINC version: 7.20.2
Memory: 125.08 GB
Cache: 16896 KB
Swap space: 15.62 GB
Total disk space: 488.04 GB
Free Disk Space: 480.57 GB
Measured floating point speed: 6.02 billion ops/sec
Measured integer speed: 25.36 billion ops/sec
Average upload rate: 139.34 KB/sec
Average download rate: 22391.7 KB/sec
I normally have it run 12 BOINC tasks at a time. My Internet connection is Verizon FiOS, guaranteed to run at 75 megabits/second; speed tests show what it actually gets (results below). CPDN reports slower upload speeds than download speeds, and I do not know why the speeds should be so different. I do not believe the download speeds are as fast as CPDN says. Those speeds could be true if they were in kilobits per second, but kilobytes per second is not really possible. When I was getting OIFS jobs, the trickles went up quite fast as long as the upload servers were running. I guess I could run one 28 GByte OIFS task at a time as well as some smaller tasks.
Timestamp | Download | Upload | Latency | Jitter | Quality Score | Test Server
8/30/2023 17:09:28 | 76.65 Mbps | 89.02 Mbps | 4 ms | 2 ms | Excellent | speedtest1.nyc1.nitelusa.net.prod.hosts.ooklaserver.net
6/7/2023 20:03:31 | 78.13 Mbps | 63.66 Mbps | 6 ms | 1 ms | Excellent | ny2.speedtest.gslnetworks.com.prod.hosts.ooklaserver.net
5/5/2023 11:23:28 | 76.26 Mbps | 89.16 Mbps | 6 ms | 1 ms | Excellent | speedtest.nyc.rr.com
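A quick arithmetic check of the unit argument above, assuming BOINC's "KB/sec" means 1,000 (or 1,024) bytes per second: 22,391.7 KB/sec would be about 179 Mbit/s (about 183 Mbit/s with 1,024-byte kilobytes), more than double both the 75 Mbit/s plan and the roughly 76 to 78 Mbit/s measured downloads, whereas 22,391.7 kilobits/s would be about 22.4 Mbit/s. A minimal Python sketch with hypothetical helper names:

```python
def kilobytes_to_megabits(rate_kb_per_s: float, bytes_per_kb: int = 1000) -> float:
    """Kilobytes/second -> megabits/second (1 Mbit = 1,000,000 bits)."""
    return rate_kb_per_s * bytes_per_kb * 8 / 1_000_000


def kilobits_to_megabits(rate_kbit_per_s: float) -> float:
    """Kilobits/second -> megabits/second."""
    return rate_kbit_per_s / 1000


PLAN_MBIT = 75.0             # Verizon FiOS plan from the post
REPORTED_DOWNLOAD = 22391.7  # BOINC "Average download rate", labelled KB/sec
REPORTED_UPLOAD = 139.34     # BOINC "Average upload rate", labelled KB/sec

if __name__ == "__main__":
    for label, rate in (("download", REPORTED_DOWNLOAD), ("upload", REPORTED_UPLOAD)):
        print(f"{label}: {kilobytes_to_megabits(rate):.1f} Mbit/s if KB = 1000 bytes, "
              f"{kilobytes_to_megabits(rate, 1024):.1f} if KB = 1024 bytes, "
              f"{kilobits_to_megabits(rate):.1f} if the unit is really kbit/s "
              f"(plan: {PLAN_MBIT:.0f} Mbit/s)")
```

Only the kilobits reading fits under the 75 Mbit/s line, which is consistent with the suspicion that the label is wrong, though the sketch cannot say which reading BOINC actually uses.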
79)
Message boards :
Cafe CPDN :
The Climate Machine
(Message 69472)
Posted 14 Aug 2023 by Jean-David Beyer Post:
Quote: In OpenIFS, the low level computer interface (debugging/tracing/hardware) is handled in C, the number crunching in Fortran (because Fortran compilers still produce the fastest code in general) and the upper level control code was written in C++.
While I have no intention of contradicting you, I wonder whether claims like this are actually useful, even if true. At one time, I was part of a two-man team writing an assembly-level optimizer for the C compiler. We were given a bunch of benchmarks to optimize, and we got some truly impressive speed-ups. For the famous Whetstone benchmark, supposedly a test of floating-point computation, we got over a 10,000:1 speedup. This took several steps. Whetstone had several modules, and one was thought to be a test of floating-point computation because it was called 10,000 times and did a bunch of floating-point operations. That module was actually there to test function and subroutine calling overhead, and we defeated that by expanding the routine in-line. Loop-invariant code motion then moved all those floating-point operations outside the loop, causing an enormous speedup. Then live-dead analysis noticed the results were never used, so it eliminated the instructions (including the loop overhead) altogether. Marketing was pleased because we could do that benchmark so much better than Motorola (who made a better processor than we did). We had a huge IBM 370 machine running UNIX, and the people running it gathered a lot of data, so we had them tell us how many processes were run per day and which programs took the most time. nroff/troff (a text processor) was the biggest, so we ran it through our optimizer, and it sped up only a little (IIRC 10%), nowhere near 10,000:1. IMAO, it does not matter much how good a compiler is (unless it is really awful), or how good the programming language is.
What matters is what the algorithms are and how well the system is programmed, and fixing those is what really matters these days. So the best language to use is probably the one with which the programmers are most familiar, and IMAO, FORTRAN is not it. For purely numeric calculation I preferred Algol-60, but I would hesitate to recommend it for CPDN, since my guess is that most programmers have never even heard of it, and I do not know of any compilers for it either. I got to be pretty good at C and C++, but I have not written anything in over 20 years, so I am probably nowhere near as good a programmer as I was before I retired.
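The three optimizer steps in the Whetstone story (in-line expansion, loop-invariant code motion, then dead-code elimination from live-dead analysis) can be written out by hand as a toy Python illustration. Python does not perform these transformations itself; each function is the same loop after one more step. The procedure p3, the constant T and the flop counter are hypothetical, only loosely modeled on a Whetstone-style procedure, and are not taken from the benchmark or from the optimizer in the story:

```python
from dataclasses import dataclass
from typing import Tuple

T = 0.499975  # Whetstone-style constant; value chosen for illustration only


@dataclass
class FlopCounter:
    """Counts floating-point operations so the stages can be compared."""
    flops: int = 0


def p3(x: float, y: float, c: FlopCounter) -> float:
    # Hypothetical called procedure: 6 floating-point operations per call.
    c.flops += 6
    x1 = T * (x + y)
    y1 = T * (x1 + y)
    return (x1 + y1) / 2.0


def stage0_original(n: int, x: float, y: float, c: FlopCounter) -> Tuple[float, float]:
    for _ in range(n):
        z = p3(x, y, c)  # z is never read afterwards: a dead value
    return x, y


def stage1_inlined_and_hoisted(n: int, x: float, y: float, c: FlopCounter) -> Tuple[float, float]:
    # After in-line expansion every operand (x, y, T) is loop-invariant, so the
    # body is moved out of the loop; the guard keeps n == 0 behaving as before.
    if n > 0:
        c.flops += 6
        x1 = T * (x + y)
        y1 = T * (x1 + y)
        z = (x1 + y1) / 2.0  # still never read
    for _ in range(n):
        pass  # the loop body is now empty
    return x, y


def stage2_dead_code_eliminated(n: int, x: float, y: float, c: FlopCounter) -> Tuple[float, float]:
    # z is never used, so its computation is deleted, and the empty loop with it.
    return x, y


if __name__ == "__main__":
    for stage in (stage0_original, stage1_inlined_and_hoisted, stage2_dead_code_eliminated):
        counter = FlopCounter()
        result = stage(10_000, 1.0, -1.0, counter)
        print(stage.__name__, result, counter.flops)
```

With n = 10,000 the counters come out at 60,000, 6 and 0 flops: hoisting alone accounts for a 10,000:1 ratio, and dead-code elimination removes the rest, while the observable result never changes.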
80)
Message boards :
Number crunching :
Website certificate problem
(Message 69453)
Posted 7 Aug 2023 by Jean-David Beyer Post: Seems fixed now. |
©2024 climateprediction.net