Thread 'New Work Announcements 2024'

Author	Message
Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4572 Credit: 19,039,635 RAC: 18,944	Message 70146 - Posted: 18 Jan 2024, 11:29:55 UTC The NZ batch had a missing file so the submission failed. It should be resubmitted soon. ID: 70146 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1074 Credit: 17,020,946 RAC: 5,160	Message 70147 - Posted: 18 Jan 2024, 11:35:01 UTC - in response to Message 70136. Last modified: 18 Jan 2024, 11:37:18 UTC The server side rules for this need to be modified. Other projects don't use these same impossible rules. I disagree - the rule protects the server from wasting time sending out tasks to a machine likely to break the next one. It doesn't matter if it's the task's fault, the point is it's better off sending the next task to another machine. I disagree with your disagreement. There are still 16k tasks just waiting to be sent. There are no "other machines" at this point. Edit: And there is no harm sending a task to a bad machine. It just gets resent to the next. This is a feature, not a fault. The server is doing what we want it to do. There are only 3 retries allowed and as these batches have high failure rates, it makes sense to target machines returning completed tasks. The aim is to get the tasks to complete successfully so the server should not continually push tasks to machines that have a high failure rate (for whatever reason). So, yes, there is 'harm' in sending tasks that are known to likely fail on machines. We end up with more hard fails. CPDN had really hoped to get the code working before sending out these new batches. There was discussion about using the Linux version since that works but the feeling was it was better to keep the Windows version for the time being. Unfortunately I wasn't able to fix all the bugs before the batches had to go out. --- CPDN Visiting Scientist ID: 70147 · Reply Quote
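The argument about hard fails can be made concrete with a little arithmetic. The sketch below assumes that "3 retries" means three attempts in total and that attempts fail independently of each other; both are assumptions made here, not confirmed details of the CPDN server rules, and the failure rates are illustrative numbers rather than measured ones.

```python
def hard_fail_probability(failure_rates):
    """Probability that every attempt fails, assuming independent attempts.

    failure_rates holds one per-attempt failure probability for each host
    the workunit is sent to (hypothetical values, not CPDN statistics).
    """
    probability = 1.0
    for rate in failure_rates:
        probability *= rate
    return probability


# Illustrative numbers only: three attempts on hosts that usually succeed,
# versus three attempts on hosts that usually fail.
reliable = hard_fail_probability([0.2, 0.2, 0.2])
unreliable = hard_fail_probability([0.9, 0.9, 0.9])

print(f"all attempts on reliable hosts:   {reliable:.3f}")
print(f"all attempts on unreliable hosts: {unreliable:.3f}")
```

Under these assumptions, spending the limited attempts on hosts with a high failure rate makes it far more likely that a workunit runs out of attempts and becomes a hard fail.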

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1074 Credit: 17,020,946 RAC: 5,160	Message 70148 - Posted: 18 Jan 2024, 11:35:39 UTC - in response to Message 70143. Oops, sorry Dave, Just seen this! I am going to open a new thread for the East Asia batches 1001-4. To free this thread for new work announcements rather than discussion. It would be good if anyone starting discussions for subsequent batches such as the NZ ones that should appear tomorrow could do the same. Thank you. ID: 70148 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1066 Credit: 36,887,369 RAC: 1,533	Message 70149 - Posted: 18 Jan 2024, 12:30:32 UTC - in response to Message 70147. So, yes, there is 'harm' in sending tasks that are known to likely fail on machines. We end up with more hard fails. Not to mention that internet bandwidth is not a zero-cost resource, in either climate or financial terms. ID: 70149 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1121 Credit: 17,202,915 RAC: 2,154	Message 70150 - Posted: 18 Jan 2024, 14:17:58 UTC - in response to Message 70149. Not to mention that internet bandwidth is not a zero-cost resource, in either climate or financial terms. The incremental bandwidth for me took a big step-up recently when Verizon replaced my FiOS hardware. My old hardware was installed in about 2004 and they did not want to support it any more. The new is about 10x faster than the old.
Timestamp	Download	Upload	Latency	Jitter	Quality Score	Test Server
1/18/2024 8:54:7	840.78 Mbps	906.51 Mbps	7 ms	1 ms	Excellent	newyork02.speedtest.windstream.net
12/1/2023 10:26:27	750.33 Mbps	926.59 Mbps	5 ms	1 ms	Excellent	speedtest1.nyc1.nitelusa.net.prod.hosts.ooklaserver.net
11/30/2023 21:38:48	836.55 Mbps	846.46 Mbps	5 ms	4 ms	Excellent	newyork02.speedtest.windstream.net
ID: 70150 · Reply Quote

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 691 Credit: 4,391,754 RAC: 6,918	Message 70151 - Posted: 18 Jan 2024, 14:51:29 UTC - in response to Message 70150. Not to mention that internet bandwidth is not a zero-cost resource, in either climate or financial terms. The incremental bandwidth for me took a big step-up recently when Verizon replaced my FiOS hardware. My old hardware was installed in about 2004 and they did not want to support it any more. The new is about 10x faster than the old. I assume he meant for the university sending them out. For us, it costs most of us no more to use more bandwidth, as it's flat rate. ID: 70151 · Reply Quote

SolarSyonyk Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463	Message 70152 - Posted: 18 Jan 2024, 18:33:38 UTC - in response to Message 70145. Don't waste your time looking into these segmentation failures. I know exactly where the problem is in the code, I've been working on this for weeks. The same code works fine under Linux but fails on Windows (same compiler too). Am trying to find a workaround that doesn't involve rewriting the code too much. As I know you're technically minded it relates to the old way in which Fortran was coded for low memory machines years ago, where arrays were "misused" and shared between data of different types. A very large REAL array is being equivalenced to both an integer and logical array. It should work (and does on Linux) but we get a bad memory address under Windows (which only serves to reinforce my dislike of Windows :P) I don't know if it helps, but in years past, I found Windows was considerably less tolerant of "out of bounds array accesses" than Linux. In general, if you read an entry or two beyond the end of an array, Linux is unlikely to segfault. Windows lays things out differently, and I have absolutely seen "things that work fine under Linux and segfault under Windows" turn out to be off-by-one errors in end of array access. Are you familiar with Valgrind? It's a memory correctness testing tool, and will flag stuff like this. If you can build a small reproduction case, point Valgrind at it, and it'll pop out exactly what and where you're doing something wrong with your memory accesses. ID: 70152 · Reply Quote
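For readers unfamiliar with the Fortran practice described above, the following is a rough Python analogue of equivalencing a REAL array to an INTEGER array: two differently typed views onto the same block of storage. It is only a sketch of the general idea, not the model's code, and it assumes the usual 4-byte sizes for the `"f"` and `"i"` type codes.

```python
from array import array

# One block of storage, declared as four 32-bit reals
# (playing the role of the Fortran REAL array).
reals = array("f", [0.0, 0.0, 0.0, 0.0])

# Reinterpret the same bytes as 32-bit integers; no data is copied.
# memoryview.cast() has to go through a byte format first.
raw_bytes = memoryview(reals).cast("B")
ints = raw_bytes.cast("i")

# Writing through the integer view changes what the real view sees.
# 0x3F800000 is the IEEE 754 single-precision bit pattern for 1.0.
ints[0] = 0x3F800000

print(reals[0])    # the real view of the first element
print(len(ints))   # same number of elements as the real array
```

Both views start at the same address, which is why a fault on the very first element points at the storage itself rather than at an out-of-bounds index.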

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1074 Credit: 17,020,946 RAC: 5,160	Message 70153 - Posted: 18 Jan 2024, 21:15:07 UTC - in response to Message 70152. Are you familiar with Valgrind? It's a memory correctness testing tool, and will flag stuff like this. If you can build a small reproduction case, point Valgrind at it, and it'll pop out exactly what and where you're doing something wrong with your memory accesses. Yes, I've used valgrind. It wouldn't help for this case though as the code fails accessing the first array element. Also fails to create a pointer to same element. I'm not sure what's causing it, stack problem is my current theory. --- CPDN Visiting Scientist ID: 70153 · Reply Quote

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 691 Credit: 4,391,754 RAC: 6,918	Message 70172 - Posted: 21 Jan 2024, 7:17:19 UTC - in response to Message 70153. Yes, I've used valgrind. It wouldn't help for this case though as the code fails accessing the first array element. Also fails to create a pointer to same element. I'm not sure what's causing it, stack problem is my current theory. I can't find it now, but I think somewhere (you?) said they're more likely to fail on newer machines. I'm seeing something different and inexplicable. I have two newer machines, Ryzen 9 3900X and Ryzen 9 3900XT, and they're fine. I have two older machines, both dual Xeon X5650, and one of them fails every task. But the other is fine! The only difference is the motherboard: the one with the older R410 board crashes, the newer R510 board is ok. There may also be a minor difference in RAM - one of them has better matched RAM sticks and is running triple channel and the other isn't. I have various other old machines and they're also fine. Bad Xeon: https://www.cpdn.org/show_host_detail.php?hostid=1509742 (Ignore the GPU, that was added a couple of days ago and didn't change the crashability of CPDN). Good Xeon: https://www.cpdn.org/show_host_detail.php?hostid=1544690 Good Ryzens: https://www.cpdn.org/show_host_detail.php?hostid=1509739 and https://www.cpdn.org/show_host_detail.php?hostid=1535126 I hope something in there helps. ID: 70172 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4572 Credit: 19,039,635 RAC: 18,944	Message 70188 - Posted: 23 Jan 2024, 12:31:10 UTC Batch 1005: 4650 WAH2 tasks for the NZ region have joined the East Asia ones still waiting to be snapped up. If the testing site ones are anything to go by then on my box they will take two or three days less to complete. ID: 70188 · Reply Quote

rob Send message Joined: 5 Jun 09 Posts: 99 Credit: 3,776,658 RAC: 1,196	Message 70191 - Posted: 24 Jan 2024, 11:56:43 UTC Just landed (well about 4 hours ago) a wah2_nz25 task. Let's see how this one does..... ID: 70191 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1074 Credit: 17,020,946 RAC: 5,160	Message 70193 - Posted: 24 Jan 2024, 12:34:22 UTC - in response to Message 70191. The NZ batch uses a smaller domain than the EAS ones and is much less likely to fail. Just landed (well about 4 hours ago) a wah2_nz25 task. Let's see how this one does..... --- CPDN Visiting Scientist ID: 70193 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1074 Credit: 17,020,946 RAC: 5,160	Message 70241 - Posted: 30 Jan 2024, 20:40:14 UTC Last modified: 30 Jan 2024, 20:40:32 UTC OpenIFS Linux batch An OpenIFS Linux batch will be released in the next 7 days. It's about to go into testing. This batch is based on the earlier batch 993 but with reduced model output (and hence smaller upload files). All of the forecasts in this batch will be exactly the same. The aim is to see how much variation we get in running multiple identical forecasts across all the Linux machines attached to CPDN, and whether we get the same result from the exact same forecasts from each host (which is not a given). The objective is to compare the perturbations from running across different hosts to the perturbations previously applied to batch 993's initial conditions. --- CPDN Visiting Scientist ID: 70241 · Reply Quote
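One simple first check for "do identical forecasts give identical results" is to group the returned outputs by checksum and count how many distinct results appear. The sketch below shows that idea only; the host names, the byte strings standing in for output files, and the function name are all hypothetical, and it is not a description of how CPDN actually analyses the batch.

```python
import hashlib
from collections import Counter


def count_distinct_results(results):
    """Map each distinct output (by SHA-256 digest) to the number of hosts that produced it."""
    return Counter(hashlib.sha256(data).hexdigest() for data in results.values())


# Hypothetical host names and output bytes, for illustration only.
results = {
    "host_a": b"\x00\x01\x02\x03",
    "host_b": b"\x00\x01\x02\x03",
    "host_c": b"\x00\x01\x02\x04",
}

groups = count_distinct_results(results)
print(f"{len(groups)} distinct results from {len(results)} hosts")
```

A checksum only says whether results are bit-identical. Judging how large any differences are, compared with the perturbations applied to batch 993, would need a numerical comparison of the model fields themselves.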

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1121 Credit: 17,202,915 RAC: 2,154	Message 70244 - Posted: 31 Jan 2024, 3:03:16 UTC - in response to Message 70241. An OpenIFS Linux batch will be released in the next 7 days. It's about to go into testing. This batch is based on the earlier batch 993 but with reduced model output (and hence smaller upload files). My most recent one of those was this one. It worked, so perhaps this new batch should work too. Right? I must have those compatibility libraries in there although, IIRC, these OIFS programs do not need them.
Task	22318024
Name	oifs_43r3_0187_2019110100_123_993_12215029_2
Workunit	12215029
Created	25 Apr 2023, 18:24:32 UTC
Sent	25 Apr 2023, 18:24:40 UTC
Report deadline	24 Jun 2023, 18:24:40 UTC
Received	26 Apr 2023, 10:24:47 UTC
Server state	Over
Outcome	Success
Client state	Done
Exit status	0 (0x00000000)
Computer ID	1511241
Run time	15 hours 25 min 7 sec
CPU time	15 hours 14 min 11 sec
Validate state	Valid
Credit	14,873.04
Device peak FLOPS	6.06 GFLOPS
Application version	OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu
Peak working set size	4,780.11 MB
Peak swap size	4,974.23 MB
Peak disk usage	1,267.49 MB
ID: 70244 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4572 Credit: 19,039,635 RAC: 18,944	Message 70245 - Posted: 31 Jan 2024, 8:14:34 UTC although, IIRC, these OIFS programs do not need them. Correct. OIFS is 64-bit. Which leads me to ask, is the recompiling of the WAH2 tasks 32 or 64 bit? ID: 70245 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1074 Credit: 17,020,946 RAC: 5,160	Message 70248 - Posted: 31 Jan 2024, 10:20:38 UTC - in response to Message 70245. Which leads me to ask, is the recompiling of the WAH2 tasks 32 or 64 bit? 32-bit. Going to 64-bit is on the todo list, not least because BOINC stopped supporting 32-bit libs a year ago, but it's not trivial. Let's get bugs out of the Hadley models first. --- CPDN Visiting Scientist ID: 70248 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4572 Credit: 19,039,635 RAC: 18,944	Message 70252 - Posted: 31 Jan 2024, 16:10:40 UTC - in response to Message 70248. Which leads me to ask, is the recompiling of the WAH2 tasks 32 or 64 bit? 32-bit. Going to 64-bit is on the todo list, not least because BOINC stopped supporting 32-bit libs a year ago, but it's not trivial. Let's get bugs out of the Hadley models first. I suspected as much but thought I would check. Thanks Glenn. ID: 70252 · Reply Quote

SolarSyonyk Send message Joined: 7 Sep 16 Posts: 262 Credit: 34,915,412 RAC: 16,463	Message 70253 - Posted: 31 Jan 2024, 20:00:01 UTC - in response to Message 70241. Last modified: 31 Jan 2024, 20:00:24 UTC The aim is to see how much variation we get in running multiple identical forecasts across all the Linux machines attached to CPDN, and whether we get the same result from the exact same forecasts from each host (which is not a given). Interesting. There... shouldn't be any variation in results for the same code on the same host with the same initial conditions. If there is, look for uninitialized memory reads somewhere, I guess? I know floating point is messy, but it should at least be consistently messy. I've no shortage of starved machines I can point at stuff when it shows up! ID: 70253 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4572 Credit: 19,039,635 RAC: 18,944	Message 70254 - Posted: 31 Jan 2024, 20:50:38 UTC Interesting. There... shouldn't be any variation in results for the same code on the same host with the same initial conditions. If so, look for uninitialized memory reads somewhere, I guess? I know floating point is messy, but it should at least be consistently messy. My understanding from work on the Hadley models a long time ago is that, with that model, some variation between hosts is possible because FP rounding differs between operating systems and CPU manufacturers. In those days all model types went out on all platforms. To me, it makes sense to actually check this. My coding experience is with different languages and is also very, very rusty, but there may be things that could be done in the code to mitigate this if there is significant variance with the OIFS models. ID: 70254 · Reply Quote
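One general reason floating-point results can differ between builds or machines is that floating-point addition is not associative, so anything that changes the order of operations can change the last bits of a result. The short Python sketch below shows only that general property; it makes no claim about what actually happens inside the Hadley or OpenIFS models.

```python
# The same three numbers, added in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# The two sums differ in the last bit of the double-precision result.
print(left_to_right == right_to_left)
print(abs(left_to_right - right_to_left))
```

In a long model integration a last-bit difference like this can grow over time, which is why checking the spread across hosts against the deliberately applied perturbations is a meaningful test.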

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 691 Credit: 4,391,754 RAC: 6,918	Message 70256 - Posted: 31 Jan 2024, 22:03:51 UTC - in response to Message 70253. Last modified: 31 Jan 2024, 22:06:17 UTC I've no shortage of starved machines I can point at stuff when it shows up! I have a Ryzen 9 3900X and a Ryzen 9 3900XT running Linux in an Oracle VirtualBox. Will these be useful? I'm guessing you want to check they're ok on virtual machines too, although I don't know if you can tell they're virtual machines from your end: https://www.cpdn.org/show_host_detail.php?hostid=1542648 https://www.cpdn.org/show_host_detail.php?hostid=1539015 ID: 70256 · Reply Quote