Posts by geophi


21) Message boards : Number crunching : New work discussion - 2 (Message 68948)
Posted 24 Jun 2023 by Profile geophi
Post:
To my knowledge, the wah2 application was not compiled with an AVX optimization switch. The wah2 executables were last compiled in November 2016. The last I knew, SSE2 was the highest optimization level used for compiling these models.

If it was the AVX thing, then every Windows batch since late 2016 would be displaying similar behavior with older PCs throwing errors. However, that has not occurred.

In this batch, of the 3 tasks I have been running on my Ryzen 5600 for 17 hours, 2 belong to work units with 2 previous SEGV (signal 11) errors and 1 belongs to a work unit with 1 error of that type. All of the PCs with failed tasks in those work units have AVX capability, as does my Ryzen, which hasn't failed any of those 3 tasks so far.

If it was an ancillary file error, I would think all tasks in the work unit should have failed.

This is very frustrating and it is not at all obvious what the problem might be. Hopefully Sarah's analysis of the errors can find some correlation that explains why this is happening.

Also frustrating: while my trickles are going up fine, the first zip file is failing to upload to upload7, saying it can't connect.
22) Message boards : Number crunching : East Asia testing. (Message 68927)
Posted 23 Jun 2023 by Profile geophi
Post:
Looks like the regional model portion of the task takes 400+ MB of resident memory while the global portion of each task takes about 200 MB, so 600 to 700 MB total resident memory for each task.

My Ryzen 5600 is running 3 at a time and it works out to about 8.5 days to complete them.
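As a rough illustration of the memory figures above, here is a minimal sketch of the total for several concurrent tasks. The function name is made up for the example, and the 500 MB upper value for the regional model is an assumption inferred from the "600 to 700 MB" total, not a measured number.

```python
def resident_memory_mb(tasks, regional_mb, global_mb):
    """Total resident memory in MB for `tasks` tasks running at once."""
    return tasks * (regional_mb + global_mb)

# Figures from the post: regional part 400+ MB, global part about 200 MB.
# The 500 MB upper value is an assumption (700 MB total minus 200 MB).
low = resident_memory_mb(3, 400, 200)
high = resident_memory_mb(3, 500, 200)
print(f"3 tasks: roughly {low} to {high} MB resident")
```

So running 3 at a time should need somewhere around 2 GB of resident memory on top of whatever else the machine is doing.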
23) Message boards : Number crunching : Is a queue of > 5 million workunits waiting for assimilation a bad thing? (Message 68838)
Posted 4 Jun 2023 by Profile geophi
Post:
Should we be concerned?

No. CPDN doesn't use boinc's validation function.

Workunits waiting for validation
Workunits waiting for assimilation
Workunits waiting for file deletion

aren't used by cpdn, but they are in the boinc server software. Those stats are left on the server status page (which can be confusing) since they are not removed every time a server upgrade occurs.
24) Message boards : Number crunching : Download issues (Message 68689)
Posted 21 Apr 2023 by Profile geophi
Post:
Notified Andy of the continuing certificate/download problems with a link to the recent posts in this thread.
25) Questions and Answers : Wish list : Merge computers despite different OS (Message 68645)
Posted 11 Apr 2023 by Profile geophi
Post:
I'd like to be able to delete the old, obsolete computers in my account, some of which were scrapped over ten years ago.

This may not necessarily be a good idea as if I'm not mistaken, credit would be lost if a computer with credit is deleted. The only computers that would be safe to delete are those that for one reason or another have produced 0 total credit.

AndreyOR is close. You can delete your computer if it has never downloaded a task. But if it has downloaded one or more, even if the credit is 0 for it, you cannot delete that host/computer.

If a computer on your host list is just a newly identified version of the same PC, you can try the Merge function for the host, which will merge the BOINC-identified computers into one.
26) Message boards : Number crunching : East Asia testing. (Message 68598)
Posted 17 Mar 2023 by Profile geophi
Post:
If these Windows work units are long running as you suggest, I hope there will be a mechanism in place on the server to ensure that everyone who wants some will get them, shared fairly and equally, instead of by greedy, selfish users who download dozens of work units, or more, and then can't complete them by the deadline.
I have no idea on that. My hope is that those in charge have learned from the effectiveness of the shorter deadlines given to OIFS tasks. Slower machines will, I would guess, take over three months to complete these tasks, so setting the deadline at 3 months would stop some users from downloading them at all because they wouldn't finish in time.

If they do it the way they did for the nz25 batches, the development site spinups ran for 113 model months and took about 20 days on my i7-4790K. When they later sent out stash/ancil test nz25 batches to the dev site, they were for 25 model months and took less than a quarter of the time that the spinups did. The nz25 batches sent to the main cpdn site were also 25 model months. Just guessing but I don't think 119 model month batches will come to the main site.
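For what it's worth, the "less than a quarter of the time" figure is about what simple scaling predicts. The sketch below assumes run time is proportional to the number of model months, which is only an approximation (it ignores startup and upload overhead), and the function name is just for illustration.

```python
def estimated_days(model_months, ref_months=113, ref_days=20.0):
    """Scale a reference run time linearly by the number of model months.

    Reference values are from the post: 113 model months took about
    20 days on an i7-4790K. Linear scaling is an assumption.
    """
    return ref_days * model_months / ref_months

days_25 = estimated_days(25)
print(f"25 model months: about {days_25:.1f} days")
print("Less than a quarter of the spinup time:", days_25 < 20.0 / 4)
```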
27) Message boards : Number crunching : no credit awarded? (Message 68493)
Posted 26 Feb 2023 by Profile geophi
Post:
Another strange thing: my event log has an entry for

26-Feb-2023 00:51:38 [climateprediction.net] [sched_op] handle_scheduler_reply(): got ack for task oifs_43r3_01i7_2019110100_123_993_12215389_0
That's task 22316800, which the server says is still in progress. The event log timing (also UTC) suggests that it was reported right in the middle of the period when I'm suggesting the credit script was running. Could that have interfered with the status update?

Absolutely. I started seeing that behavior occasionally early in the hadam4h era. Over the last 3 or 4 years, several of my completed tasks never got their status updated because they were reported during the credit run. It doesn't always happen, but it does occasionally.

There are others who posted about this situation as well but those posts are probably scattered among several threads.
28) Message boards : Number crunching : OpenIFS Discussion (Message 68431)
Posted 24 Feb 2023 by Profile geophi
Post:
Send Personal Message to me if interested rather than reply here. If there is sufficient interest, I'll share the files on dropbox. I'll post answers to PM'd questions here.


How do I do that?


Click on his name in the Author section for his post. It'll bring up an abbreviated profile page for him and then click on "Send personal message" on the right hand side of the webpage.

Or, easier, just click on the "Send Message" button under his name in the Author section.
29) Message boards : Number crunching : OpenIFS Discussion (Message 68279)
Posted 12 Feb 2023 by Profile geophi
Post:
Now running one that has failed once on an intel machine and once on AMD. The AMD is a double corruption and the Intel is
free(): invalid next size (fast)

Dave, I think you mixed those up.

The Intel machine had the double corruption and the AMD was the invalid next size.
30) Message boards : Number crunching : Weather at Home still running? Can't send back files. (Message 68202)
Posted 4 Feb 2023 by Profile geophi
Post:
I see some Windows tasks have completed in the last couple days. Has anyone in this thread reporting upload problems for WAH2 NZ tasks had their tasks upload?
31) Message boards : Number crunching : Weather at Home still running? Can't send back files. (Message 68112)
Posted 30 Jan 2023 by Profile geophi
Post:
I also have 3 WAH WU's running on a Windows computer and I haven't seen it upload anything in days. Is this the same issue that is affecting OpenIFS? I seem to be having at least some luck with that.


Those are uploaded to a different server in Hobart, Tasmania. Over the years, it has been periodically unreliable. I e-mailed Andy, so hopefully he can communicate with them and get this resolved.
32) Questions and Answers : Preferences : CP takes over (Message 68073)
Posted 27 Jan 2023 by Profile geophi
Post:
No climate work for months ... is there a checklist to confirm my settings?
Or am I holding my mouth wrong ...

There hasn't been any Windows work since Aug/Sep of last year. The project depends on climate researchers from various institutions around the world submitting work requests. Recently all of the requests have come for the models that run in Linux. I'm not sure when there might be more work for the models that run in Windows.
33) Questions and Answers : Unix/Linux : *** Running 32bit CPDN from 64bit Linux - Discussion *** (Message 67912)
Posted 19 Jan 2023 by Profile geophi
Post:
For the "had" models, the .so file depends on libnsl.so. The command line that installs the needed 32-bit libraries for CPDN includes lib32ncurses6 to satisfy that libnsl.so requirement.

ncurses doesn't directly provide libnsl. Here's what it provides: https://packages.ubuntu.com/jammy/amd64/lib32ncurses6/filelist. One of ncurses' dependencies is libc6-i386, and that's the package that provides libnsl: https://packages.ubuntu.com/jammy/amd64/libc6-i386/filelist. That's one of the small dependencies (for lib32stdc++6) I was referring to. As a matter of fact, libc6-i386 provides 7 of the 9 required libraries, lib32stdc++6 provides 1 of 9 and lib32gcc-s1 provides 1 of 9. But because of the dependency hierarchy among those packages, installing lib32stdc++6 installs everything one needs. lib32stdc++-9-dev also installs all of the required libraries, all via dependency packages. But it also installs a bunch of other unnecessary stuff, so that's why I don't think it should be recommended. Either the contents of the packages changed or the original recommendation was provided somewhat haphazardly.

As for the lib32z1, that was included for the hadcm3s models when they ran on linux.

I'm familiar with this and while I realize that we don't know for sure that it'll remain for mac only I do believe that it very likely will. So if I was to recommend the simplest, install only what's needed way to get 32-bit libraries, for currently available Linux Hadley models, I'd omit it. If it does come back to Linux at some point, I'd update the recommendation.

@AndreyOR
Thanks for explaining the minimum package needed for 32-bit compatibility with the current "had" Linux apps in Ubuntu. Dave updated the instructions for Ubuntu 18.04 and later versions to use lib32stdc++6. We left lib32z1 in the install command line, but with a disclaimer that it would only be needed if the hadcm3s model comes back to Linux.
34) Questions and Answers : Unix/Linux : *** Running 32bit CPDN from 64bit Linux - Discussion *** (Message 67820)
Posted 17 Jan 2023 by Profile geophi
Post:
sudo apt install lib32ncurses6 lib32z1 lib32stdc++-9-dev

This one is listed for Ubuntu 21.
I also successfully used it on Debian 11 after adding i386 through dpkg.

I suspect it wasn't necessary. Although I can't be sure since I don't have Debian (I use Ubuntu), once you added the i386 architecture, the required libraries (for CPDN) probably got installed as part of that.

The 3 packages listed are not needed for any of the available Linux models. The last one installs all the required libraries but it also installs a lot of unnecessary stuff that you don't need and eventually gets outdated actually. I did a detailed investigation into this and some testing (for Ubuntu and Arch) and made some posts about it. Basically the only package one needs for Debian, Ubuntu (and very likely any derivatives) is lib32stdc++6 . There are a couple of small dependencies but they get installed automatically. This way you install only what's needed without anything unnecessary, which can take up a lot of space.

Additionally I don't think you even need to add the i386 architecture (at least in Ubuntu), just install the one package.

If you're interested in testing, I'd be interested in results as I don't have Debian. Once your tasks complete and report, uninstall the 3 packages and remove the i386 architecture. Install just the lib32stdc++6 package (2 small dependencies should install automatically). Go to the CPDN directory and run the ldd command on each executable and see if any libraries are missing. If they're not, try to get some tasks.

For the "had" models, the .so file depends on libnsl.so. The command line that installs the needed 32-bit libraries for CPDN includes lib32ncurses6 to satisfy that libnsl.so requirement. Without it, the intermediate and final model results zip files are not created and uploaded. libnsl may well be installed more efficiently through some other command, but that's what was suggested long ago as a method to install it.

As for the lib32z1, that was included for the hadcm3s models when they ran on linux. It was a requirement of the .so file for that model so the zip files get created and transmitted. That model type is now Mac only and may remain so. However we do not know that for sure.
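For anyone wanting to do the ldd check suggested above without reading the output by eye, here is a minimal sketch. It relies on ldd marking an unresolved library as "=> not found". The function names and the sample output are made up for illustration, and the file path you pass in is up to you.

```python
import subprocess

def missing_libraries(ldd_output):
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing

def check_executable(path):
    """Run ldd on one file and return its missing libraries."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return missing_libraries(result.stdout)

# Hypothetical ldd output, for illustration only.
sample = (
    "\tlinux-gate.so.1 (0xf7f2c000)\n"
    "\tlibnsl.so.1 => not found\n"
    "\tlibc.so.6 => /lib32/libc.so.6 (0xf7c00000)\n"
)
print(missing_libraries(sample))
```

An empty list for every executable and .so file in the CPDN project directory would suggest that the 32-bit libraries are all in place.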
35) Message boards : Number crunching : Upload server is out of disk space (Message 67714)
Posted 14 Jan 2023 by Profile geophi
Post:
Thank You Dave,

There is that in the xml file :

<file>
    <name>wah2_nz25_a0d2_198905_25_936_012150232_0_r951897616_18.zip</name>
    <nbytes>90031062.000000</nbytes>
    <max_nbytes>150000000.000000</max_nbytes>
    <md5_cksum>e20a8b248529e2d3f15e277a2a530f41</md5_cksum>
    <status>1</status>
    <upload_url>http://upload4.cpdn.org/cgi-bin/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>56</num_retries>
        <first_request_time>1671650199.948561</first_request_time>
        <next_request_time>1673693268.434832</next_request_time>
        <time_so_far>46278.530403</time_so_far>
        <last_bytes_xferred>0.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>


Kali.

upload4 is the Hobart server in Tasmania, which periodically has issues. I've alerted Andy with a link to your post. Hopefully the server will be back up in the not too distant future.

Edit...looks like Dave might have beat me to it.
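To make the quoted transfer record easier to read, here is a minimal sketch that pulls out the retry count and converts the request times to UTC. The tag names come from the XML above (trimmed to the fields used). Treating the time fields as Unix epoch seconds is an assumption based on their magnitude, and the function name is just for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Trimmed copy of the <file> record quoted in the post.
FILE_XML = """<file>
    <name>wah2_nz25_a0d2_198905_25_936_012150232_0_r951897616_18.zip</name>
    <nbytes>90031062.000000</nbytes>
    <upload_url>http://upload4.cpdn.org/cgi-bin/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>56</num_retries>
        <first_request_time>1671650199.948561</first_request_time>
        <next_request_time>1673693268.434832</next_request_time>
        <last_bytes_xferred>0.000000</last_bytes_xferred>
    </persistent_file_xfer>
</file>"""

def summarize_upload(xml_text):
    """Summarize one pending upload from its <file> record."""
    root = ET.fromstring(xml_text)
    xfer = root.find("persistent_file_xfer")

    def utc(tag):
        # Assumption: the value is seconds since the Unix epoch.
        return datetime.fromtimestamp(float(xfer.findtext(tag)), tz=timezone.utc)

    return {
        "name": root.findtext("name"),
        "size_mb": float(root.findtext("nbytes")) / 1e6,
        "retries": int(xfer.findtext("num_retries")),
        "first_request": utc("first_request_time"),
        "next_request": utc("next_request_time"),
        "last_bytes_sent": float(xfer.findtext("last_bytes_xferred")),
    }

for key, value in summarize_upload(FILE_XML).items():
    print(f"{key}: {value}")
```

A high retry count together with zero bytes in the last transfer fits a server that cannot be reached at all, rather than a problem with the file itself.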
36) Message boards : News : Request to volunteers to please enable: 'Leave non-GPU tasks in memory' (Message 67625)
Posted 13 Jan 2023 by Profile geophi
Post:
Finally CPDN has used the BOINC push notification:

Finally is correct.

It could've been also used for the missing 32 bit libs as well

Couldn't agree more.

I thought it was for the third blurb on the front page back in 2019.
https://www.cpdn.org/cpdnboinc/index.php
37) Message boards : Number crunching : OpenIFS Discussion (Message 66668)
Posted 30 Nov 2022 by Profile geophi
Post:
I've got a few of these new units. So far two completed ok and two with errors.
The first error log ends with:
Uploading trickle at timestep: 1900800
00:22:36 STEP 530 H= 530:00 +CPU= 15.541
double free or corruption (out)

The other:
18:58:37 STEP 482 H= 482:00 +CPU= 10.168
free(): invalid next size (fast)
Ah! Excellent. I've been trying to understand why some tasks are apparently stopping with nothing in the stderr.txt returned to the server to explain why it stopped.

@DarkAngel - can you tell me which resultids those were so I can look them up?
Also, what machine & OS are you using these on?

This kind of error message indicates a memory problem, often caused by a bug in the code but I've also seen it caused by certain versions of compilers/system libraries. I've never seen it with the model itself but then I've never run the model on such a wide range of systems like this. Could also be the wrapper code we use.

Quick question. When the tasks are running, if you do 'ps -ef' you should see the same number of 'master.exe' processes as 'oifs_43r3_ps_1.01_x86_64-pc-linux-gnu'. The latter is the 'controller' for the model itself (master.exe). Do you have the same number of each? I ask because we know of one issue that can kill the 'oifs_43r3....' process running but still leave the model 'master.exe' running.

Thanks for your help.

I received a "double free or corruption (out)" error on this task https://www.cpdn.org/cpdnboinc/result.php?resultid=22247251 around step 1539.

Another problem has occurred on the same PC. This time, apparently the task ran to the end (it got to step 2952, as listed in stderr.txt and ifs.stat), but never completed/reported. The "master.exe" associated with this process is labeled as defunct in ps -ef master, and the task in boinc manager has a progress of 3.256% (stuck) with CPU time continuing to increase. Task: https://www.cpdn.org/cpdnboinc/result.php?resultid=22247938 I'm going to suspend this task since it is blocking others from running. If you need anything from the slots directory, let me know.

Four other tasks have run successfully to completion on this same PC.
Ryzen 5 5600 with 32 GB of DDR4 3200 running fully updated Ubuntu 20.04 LTS.
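The 'ps -ef' comparison asked for above can be sketched as a small count over the process list. This is only an illustration: the function name, the sample lines and the paths in them are hypothetical, and it assumes a defunct process shows up with "<defunct>" on its line.

```python
def count_processes(ps_output, controller_prefix="oifs_43r3", model_name="master.exe"):
    """Count controller and model processes in 'ps -ef' style output.

    Returns (controllers, models, defunct_models).
    """
    controllers = models = defunct = 0
    for line in ps_output.splitlines():
        if model_name in line:
            models += 1
            if "<defunct>" in line:
                defunct += 1
        elif controller_prefix in line:
            controllers += 1
    return controllers, models, defunct

# Hypothetical process list, for illustration only.
sample = """\
boinc 2101 1850  0 10:00 ? 00:00:01 ../../projects/cpdn/oifs_43r3_ps_1.01_x86_64-pc-linux-gnu
boinc 2110 2101 99 10:00 ? 08:55:00 ./master.exe
boinc 2201 1850  0 10:05 ? 00:00:01 ../../projects/cpdn/oifs_43r3_ps_1.01_x86_64-pc-linux-gnu
boinc 2210 2201  0 10:05 ? 00:00:00 [master.exe] <defunct>
"""
controllers, models, defunct = count_processes(sample)
print(f"controllers={controllers} models={models} defunct={defunct}")
```

Equal counts with no defunct entries is the healthy case. A mismatch, or a defunct master.exe like the one described above, points at the task worth reporting.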
38) Message boards : Number crunching : New work discussion - 2 (Message 66632)
Posted 29 Nov 2022 by Profile geophi
Post:
Those times depend on what %cpu boinc is allowed to use. 100%? Perhaps add that info. Machine load affects wall clock time too.

I'm more interested in any failures. If you get one, let me know. Thx.

There's another 2000 ready to go as soon as Andy gets to it. And then there's plenty more after, the scientist needs to run a minimum of ~42000.

One of the two on my i7-4790K crashed at the end with exit status of "194 (0x000000C2) EXIT_ABORTED_BY_CLIENT". In stderr, it has "Process still present 5 min after writing finish file; aborting".
https://www.cpdn.org/result.php?resultid=22245298
Both the successful task and the errored task ran through step 2592.

Both tasks on my Ryzen 5600X completed successfully in just under 9 hours CPU and wall clock time.
39) Message boards : Number crunching : New work discussion - 2 (Message 66621)
Posted 29 Nov 2022 by Profile geophi
Post:
Looks like about 13 hours running two at a time on an i7-4790K and about 9 hours running two at a time on my Ryzen 5 5600X.
40) Message boards : Number crunching : OpenIFS Frequently Asked Questions (Message 66572)
Posted 25 Nov 2022 by Profile geophi
Post:
Have moved the discussion that was in this thread to the "OpenIFS Discussion" thread. Please discuss OpenIFS in that thread and not in this FAQ.



©2024 climateprediction.net