climateprediction.net home page
Posts by AndreyOR

21) Message boards : Number crunching : no credit awarded? (Message 68562)
Posted 3 Mar 2023 by AndreyOR
Post:
I have a trickle pending from a task of the latest OIFS run, so here's some current data on what Richard was talking about.

The entire contents of trickle_up_oifs_43r3_001t_2019110100_123_993_12213503_0_1677879929.xml file:
<variety>orig</variety>
<wu>oifs_43r3_001t_2019110100_123_993_12213503</wu>
<result>oifs_43r3_001t_2019110100_123_993_12213503_0_r1949673894</result>
<ph></ph>
<ts>10623600</ts>
<cp>84718</cp>
<vr></vr>

What I think is the relevant section of the sched_request_climateprediction.net.xml file:
<msg_from_host>
      <result_name>oifs_43r3_001t_2019110100_123_993_12213503_0</result_name>
      <time>1677877810</time>
<variety>orig</variety>
<wu>oifs_43r3_001t_2019110100_123_993_12213503</wu>
<result>oifs_43r3_001t_2019110100_123_993_12213503_0_r1949673894</result>
<ph></ph>
<ts>10368000</ts>
<cp>82660</cp>
<vr></vr>

  </msg_from_host>
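For anyone wanting to compare the two snippets above, here is a minimal Python sketch that reads the trickle fields and shows how far the pending trickle is ahead of the one in the scheduler request. The function name parse_trickle is made up for illustration, and the meaning of the ts and cp fields is not stated in this thread, so the sketch treats them as plain counters. The file has no single root element, so the sketch wraps it in one before parsing.

```python
import xml.etree.ElementTree as ET


def parse_trickle(fragment: str) -> dict:
    """Parse a trickle payload (a run of sibling tags) into a dict.

    The payload has no single root element, so it is wrapped in a
    throwaway <trickle> root first (the root name is an assumption).
    """
    root = ET.fromstring(f"<trickle>{fragment}</trickle>")
    return {child.tag: (child.text or "").strip() for child in root}


# Contents of the pending trickle_up file, copied from the post.
pending = parse_trickle("""
<variety>orig</variety>
<wu>oifs_43r3_001t_2019110100_123_993_12213503</wu>
<result>oifs_43r3_001t_2019110100_123_993_12213503_0_r1949673894</result>
<ph></ph>
<ts>10623600</ts>
<cp>84718</cp>
<vr></vr>
""")

# Trickle fields from the <msg_from_host> section of the scheduler request.
reported = parse_trickle("""
<variety>orig</variety>
<wu>oifs_43r3_001t_2019110100_123_993_12213503</wu>
<result>oifs_43r3_001t_2019110100_123_993_12213503_0_r1949673894</result>
<ph></ph>
<ts>10368000</ts>
<cp>82660</cp>
<vr></vr>
""")

# Both snippets belong to the same result; only the counters differ.
same_result = pending["result"] == reported["result"]
ts_delta = int(pending["ts"]) - int(reported["ts"])
cp_delta = int(pending["cp"]) - int(reported["cp"])

print(same_result, ts_delta, cp_delta)  # True 255600 2058
```

So the pending trickle is one step further along (ts up by 255600, cp up by 2058) than the one queued in the scheduler request.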
22) Message boards : Number crunching : no credit awarded? (Message 68561)
Posted 3 Mar 2023 by AndreyOR
Post:
OpenIFS first appeared on the production CPDN site in 2020. There is a paper in the scientific literature based on the results from those batches. Then there was a long pause when the model was updated but small batches were released prior to the very big batches we saw end of last year. There has been no change to the way trickles are handled from the task/client side since 2020. I think the issues are at the CPDN server end.

Ok. That's before my time here so that's why I didn't know about it. I believe you didn't show up on the forums until last year too so to me it seemed like OIFS just started at CPDN last year, although I did see evidence on the website that its arrival has been in the works at least.

I've always assumed that the problem is at the CPDN server end. With my comments, I was thinking that possibly the arrival of OIFS disrupted some things with trickles and credit handling by CPDN servers, not that there's an issue with the model or BOINC client.
23) Message boards : Number crunching : no credit awarded? (Message 68544)
Posted 3 Mar 2023 by AndreyOR
Post:
There have been times in the past when credit hasn't shown despite zips being on the website but most of them have been when there are problems with the credit script having fallen over or not been restarted after an event of some kind. There have also been times when the credits have appeared despite zips not showing on the task pages, presumably because the problem occurs after the processes to display them and the ones to go into the credit script separate.

I think I have seen the former but not the latter. If the latter is possible, that means there's another complication as to where one could look to investigate the current problem.
24) Message boards : Number crunching : no credit awarded? (Message 68540)
Posted 2 Mar 2023 by AndreyOR
Post:
There's something that hasn't been mentioned before, that I've seen, that I think may be related and worth looking into at least. The problem seems to have started around the same time OIFS got released. OIFS is credited as all or nothing, and trickles never show up on the website. Hadley is credited per trickle and they (should) show up on the website. These seem to be 2 different credit mechanisms and perhaps the Hadley one is not running but OIFS one is? If it's not exactly that perhaps there's some other connection, interaction, friction between how the 2 models get credited that's causing the issue?
It's not related to when OpenIFS was released. OpenIFS tasks first went out years ago. The models know nothing about each other, the controlling code is completely different between the two (though that might be the cause of the problem).

I'm not sure what you mean as I don't know what went on behind the scenes, but I got my first OIFS tasks on 28 November of last year, and my first un-credited Hadley tasks were reported as completed on 30 November. So what I see is that the OIFS production release happened around the same time as Hadley models stopped getting credit. It just seems like there might be a connection there somehow.

And I'm still puzzled because the last conversation I had with Andy was that only trickles (for OpenIFS) are awarded credit, not completion. But I see what you mean.

Yeah that is puzzling, I've never seen partially credited OIFS tasks, it seems like it's all or nothing and not based on trickles. Maybe there's some kind of redundant process that happens with OIFS (but not Hadley) that if a task is successfully completed but has no credit (even though it should because trickles are credited) then full credit is awarded at completion? This would explain why OIFS is credited and Hadley isn't even though both are credited by trickles but neither has trickles show up on the website so neither should be credited. It seems like the question to investigate is why starting 30 November trickles stopped showing up on the website (for Hadley). It seems like a narrow and specific enough question. Reviewing the process of how trickles show up on the website might be a good starting point. It might also reveal why OIFS trickles don't show up.
25) Message boards : Number crunching : no credit awarded? (Message 68536)
Posted 1 Mar 2023 by AndreyOR
Post:
I think it's already been relatively well established in this thread as a strong hypothesis, if I can put it that way, that the credit problem affects all Hadley models, that it started at the end of November, and that the reason there's no credit is that no trickles are showing up on the website and credit is awarded per trickle. It seems like we're rediscovering things again; hopefully this time it won't be forgotten.

There's something that hasn't been mentioned before, that I've seen, that I think may be related and worth looking into at least. The problem seems to have started around the same time OIFS got released. OIFS is credited as all or nothing, and trickles never show up on the website. Hadley is credited per trickle and they (should) show up on the website. These seem to be 2 different credit mechanisms and perhaps the Hadley one is not running but OIFS one is? If it's not exactly that perhaps there's some other connection, interaction, friction between how the 2 models get credited that's causing the issue?

Another idea is perhaps asking around at the BOINC workshop for ideas as to where to look for a problem like this?
26) Message boards : News : New study going out to volunteer's machines (Message 68509)
Posted 28 Feb 2023 by AndreyOR
Post:
Glenn, Mr. P Hucker,

Just to clarify, I was not asking to remove notifications, just to modify them. I agree that they're helpful for CPDN as work is not constant so it's helpful to know that work appeared and one can make adjustments to one's client. It's also helpful to know some technical details, such as RAM and # of CPUs (in future presumably it'll vary) so one can make adjustments. I follow the boards closely and did not get the impression that users want lengthy notifications that one has to sometimes scroll to read on a standard sized Manager. The modification I'm proposing is a minimal notification with essential info (name of study, RAM, # of CPUs, Run Time) which is actionable. There's always a "More" link at the end of notifications that users who want to read additional, non-essential info can click, and they'll be taken directly to the post or article; they don't have to search the boards. I think World Community Grid does notifications well: one-liners with essential info and a link. The very lengthy ones of other projects are not something to aspire to, I'd propose.

As for how long notifications stick around - it is up to the project, as far as I can tell. I don't think one can just delete contents of the 'notices' folder because they get repopulated every time the client contacts the project. Also, CPDN has it set that the client will poll for notifications every 3 days regardless. Last 10 notices per project, I believe, will show in the feed so things can get cluttered very quickly if multiple projects let things sit around. I suspect that as the amount of notices grows, other users will start saying something, I'm just trying to foresee and prevent that too. I'm just asking that notices are removed no later than when the tasks are all snapped up because at that point it's pretty much obsolete. Perhaps there's a way to automate their removal, Richard may know, in which case I'd suggest 3 days. By then all tasks are usually snapped up.

As for LHC server limit control - it doesn't seem to be working as I've never seen any evidence of these kinds of limits. Unless I control things with app_config, BOINC will download many tasks and will fill all available threads with tasks, even with ATLAS, and it'll really slow my PC down.

I don't have 4 machines, only 2 with a WSL2 instance on each, so it looks like 4. BOINC Manager can control multiple clients, just not from the same screen. Switching between clients is simple and isn't a big deal with my setup. I wouldn't be surprised that a majority of users use BOINC Manager as it's the standard manager and it works well enough for most cases.
27) Message boards : News : New study going out to volunteer's machines (Message 68497)
Posted 27 Feb 2023 by AndreyOR
Post:
Could we please have the client notifications removed? The last 2 have been pretty big comparatively, and both are still showing up despite the fact that all tasks have been snapped up and the majority have already been completed.

In the future, I'd propose that they should be removed no later than when all of the tasks have been snapped up. I'd also propose that they be made shorter: Name of study, RAM requirement, # of CPUs, and Run Time. The rest of the info, like study description and more technical data be provided in the forum post with a link in the client notification.
28) Message boards : Number crunching : no credit awarded? (Message 68496)
Posted 27 Feb 2023 by AndreyOR
Post:
That's why I'm suggesting that the time has come (subject to other constraints, which come first) for a thorough re-examination of the current situation.

It's most definitely time. It's been 3 months since Hadley models stopped getting credit. From what I've been able to gather, the problem started at the end of November. There's also the RAC problem, which has been ongoing for weeks now. CPDN has a relatively small and patient user base. I'd be willing to bet that almost everyone likes credit and to a small or large degree cares about getting it. It seems a bit neglectful of the user base to let credit problems be anything but short-term problems. It's the only tangible/visible thing users get out of volunteering. I know that CPDN runs on minimal resources, but at the same time, when do we as users become high enough priority?
29) Message boards : Number crunching : no credit awarded? (Message 68495)
Posted 27 Feb 2023 by AndreyOR
Post:
Will this work unit ever get validated, or does it need an admin to intervene?
It may need intervention to get the credit when Andy gets a chance but validation isn't used by CPDN. Credit is based on the trickle up files that generally go at the same time as the zips are uploaded.

Even though there's no cross-task validation as in other projects, validation does seem to happen. I've seen tasks that were just reported show up as Validation Pending for a very brief period of time, under 30 sec. Perhaps some internal checks get done to make sure the result is valid and hasn't been tampered with or corrupted in some way. That task may not have been checked for some reason, or the check wasn't registered?
30) Message boards : Number crunching : OpenIFS Discussion (Message 68403)
Posted 21 Feb 2023 by AndreyOR
Post:
I can easily remove the (l255 & l319) parts from my app_config.xml file, if those warnings get to be too annoying.
(I do not know how to put comments into an xml file.)

When/if those apps come out, we can ask Glenn to post the names ahead of time so as to be prepared. The app names will also be found in client_state.xml.
The way to have a section of an xml file be skipped is by surrounding it with <!-- -->. It doesn't have to be line by line, one set of those can surround many consecutive lines of code.
<!--code to be skipped
 or comment-->
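As a rough illustration of the commenting trick above, here is a small Python sketch that parses an app_config-style file and lists which apps are still active. The app names (example_app, example_app_l255, example_app_l319) are placeholders, not real CPDN app names, and this only shows how a standard XML parser treats the commented section; it is not how the BOINC client itself reads the file. One thing to watch: the text inside an XML comment can't contain a double hyphen.

```python
import xml.etree.ElementTree as ET

# Hypothetical app_config.xml contents. The app names are placeholders;
# the real names would come from the project or client_state.xml.
APP_CONFIG = """<app_config>
  <app>
    <name>example_app</name>
    <max_concurrent>2</max_concurrent>
  </app>
  <!-- skipped until these apps exist
  <app>
    <name>example_app_l255</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app>
    <name>example_app_l319</name>
    <max_concurrent>1</max_concurrent>
  </app>
  -->
</app_config>"""

# The default ElementTree parser drops comments, so everything between
# <!-- and --> is invisible to the code reading the file.
root = ET.fromstring(APP_CONFIG)
active = [app.findtext("name") for app in root.findall("app")]

print(active)  # ['example_app']
```

One pair of comment markers is enough to skip both extra app sections, and removing the two marker lines later brings them back unchanged.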
31) Message boards : News : New study going out to volunteer's machines (Message 68402)
Posted 21 Feb 2023 by AndreyOR
Post:
Mr. P Hucker,

So it looks like the Shared Clipboard feature requires installation of something called Guest Additions. It's installed from within the guest OS and is available for most common OSs (including Ubuntu 20.04). It adds drivers and system apps to make the guest OS work better. I believe the process to install it is similar to installing a guest OS, but it's done from within the guest OS and it's much quicker. It's been a while since I've done it.

OIFS tasks are most prone to failure if you push the RAM too much. With the amount of RAM you have on your VM, I'd probably say no more than 9 concurrent currently available OIFS tasks. The desired failure rate is <5% so if you see a higher rate, try reducing the number of concurrent tasks.
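To make the rule of thumb above a bit more concrete, here is a small Python sketch of the arithmetic. All the numbers in it (32 GB total, 5 GB per task, 2 GB kept back for the OS) are made-up assumptions for illustration, not CPDN figures or the figures for the VM discussed here; only the 5% failure threshold comes from the post. The function names are hypothetical too.

```python
import math


def max_concurrent_tasks(total_ram_gb: float,
                         ram_per_task_gb: float,
                         reserved_gb: float = 2.0) -> int:
    """Estimate how many tasks fit in RAM after holding some back.

    reserved_gb is an assumed allowance for the OS and other programs.
    """
    usable = total_ram_gb - reserved_gb
    if usable <= 0 or ram_per_task_gb <= 0:
        return 0
    return math.floor(usable / ram_per_task_gb)


def failure_rate_too_high(failed: int, total: int,
                          threshold: float = 0.05) -> bool:
    """Return True when the observed failure rate exceeds the threshold."""
    if total == 0:
        return False
    return failed / total > threshold


# Illustrative numbers only: 32 GB machine, 5 GB per task, 2 GB reserved.
limit = max_concurrent_tasks(32, 5, 2)
print(limit)  # 6

# 2 failures out of 32 tasks is 6.25%, which is over the 5% target.
print(failure_rate_too_high(2, 32))  # True
print(failure_rate_too_high(1, 32))  # False
```

If the failure check comes back True, the suggestion above applies: lower the number of concurrent tasks and watch the rate again.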

For long term usage I'd still suggest taking the time at some point and getting BOINC on WSL2 set up. Resource usage of WSL2 is much less compared to regular VMs. Once it's running, you can close the PowerShell window and it'll run in the background, you won't even see it. It recently became Generally Available and Microsoft simplified the installation significantly.
32) Message boards : News : New study going out to volunteer's machines (Message 68398)
Posted 20 Feb 2023 by AndreyOR
Post:
Mr. P Hucker,

That link is from a BOINC website so they'd have to update it, not any specific project.

The 8 OpenIFS failures you had are unrelated to the 32-bit libraries as OIFS is a 64-bit app and doesn't require them. I think so far everyone's attempts with this latest batch have failed. The reason(s) for the failures are being investigated, but it's the first time that specific app is being used in production and it seems like some issues weren't discovered in testing.

I don't think you can share a clipboard between a VM and host, especially with different OSs but I'm not certain of this.

Threads get passed to the VM (assuming HT is on in BIOS). VBox warns if more than 50% of threads and RAM are being assigned to it (not 100% on those numbers) but you can still do it, probably without issues most of the time.

I haven't used a regular Linux VM in a while but to start BOINC I believe there should be a BOINC icon on the desktop or in the Start menu like in Windows. To shut it down I think you have to use the command line, unless there's a way I'm not aware of.

In Windows I wouldn't use VBox as Windows has its own built in hypervisor, Hyper-V, which is a Type 1 as opposed to Type 2 (VBox) so resource usage is better. Better yet, I'd use WSL2 (Windows Subsystem for Linux) which is also part of Windows and uses a lot less resources than a full-bore Virtual Machine (VBox, Hyper-V, etc.). I used to use Hyper-V but once I learned of WSL2 I switched to it for Linux BOINC work. Several different Linux distributions are available on it but the default one that gets installed is Ubuntu. WSL2 also already comes with all of the necessary 32-bit libraries preinstalled for the older Hadley models. The setup is a bit different than a regular VM, I think it's a bit simpler and quicker. As for BOINC, you'd just install the BOINC client (not manager) and set things up to control it from Windows BOINC manager (as I do), and probably BoincTasks.
33) Message boards : Number crunching : OpenIFS Discussion (Message 68395)
Posted 20 Feb 2023 by AndreyOR
Post:
Is that the name of the tasks I use in app_config.xml?

Yes, that's the name of the app for this latest OIFS batch that's used in app_config. However, the last 2 (l255 & l319) are not valid app names. Those are the ones Glenn would use for a test run, if there's enough interest, but it'll be outside of BOINC so app_config won't be read. Whether those apps will ever come to BOINC or with those names we don't know yet. It could be that the current 3 apps will be able to run both lower and higher resolution models. If you leave them there, not commented out, BOINC will be prompting you with error messages as it does whenever there's something invalid with the app_config file.
34) Message boards : Number crunching : OpenIFS Discussion (Message 68382)
Posted 20 Feb 2023 by AndreyOR
Post:
Usually there's a feature in the Boinc Server software to send beta-tasks to users via the client if those users have opted-in to receive those tasks.
I would have done so btw. if the CPDN preferences had supported this. ;-)

That's right and a number of projects have it, you also get some credit for it. Since CPDN already has a development site, it's unlikely they're going to change their system when it comes to beta testing.
35) Message boards : Number crunching : OpenIFS Discussion (Message 68362)
Posted 16 Feb 2023 by AndreyOR
Post:
We're not sure yet whether boinc is a suitable framework for these higher resolution tasks (will need > 20Gb RAM, more output etc).
If you would enable Application-Selection for the User you could make it a different Application and only Users that have Opted-In should get them
I agree. Application selection for the user used to be available. I think it was stopped around the same time that model types were only available for single platforms but the way OIFS is developing to my mind at least makes its reintroduction a good idea.
Yes, that's how we're doing it. Each high res configuration will be done as a separate app so it can be controlled differently on the server & user's app_config.xml, but also selected on/off on the Project Preferences page. I suspect by default we'll have these apps deselected so it's an 'opt-in'. It's an identical OpenIFS binary, the model gets its configuration from the input files. This will be setup on the dev site first.

I also agree that this should be reinstated as well as be an 'opt-in'. I think the only models that should be automatically opted-in are Windows and current, lowest resolution OIFS. All others require or will require either special configurations (32-bit libraries) or hardware (RAM, older Mac).
36) Message boards : Number crunching : OpenIFS Discussion (Message 68347)
Posted 15 Feb 2023 by AndreyOR
Post:
I'm running 3 OpenIFS in a 16 GB RAM-Environment together with a Squid-Instance and so far this box has run 32 WUs successfull without any errors.

That's impressive, also this PC is almost certainly in the minority of PCs that can do that. I'm assuming you don't use the machine for anything else? Does it have ECC RAM? Is Squid there from when you used it for LHC?
37) Message boards : Number crunching : OpenIFS Discussion (Message 68324)
Posted 14 Feb 2023 by AndreyOR
Post:
Model crashes. There will also be tasks that fail in the model because this experiment is designed to push the limits of what the model can do. If you have a failed task with a very long code traceback in the output, that'll be why. Credit is based on trickles, so credit will be given for work done.

Is this one of them or is it something else? https://www.cpdn.org/result.php?resultid=22311226. I'm wondering as it was easy to tell with Hadley models when failures were due to this reason, not sure what it looks like with OIFS yet.
38) Message boards : Number crunching : OpenIFS Discussion (Message 68323)
Posted 14 Feb 2023 by AndreyOR
Post:
My first ever double free or corruption

Pretty cool run, it's like your PC is custom built for CPDN, sprinkled with a bit of luck and good resource usage.

You're also the only person that I've seen who uses RHEL. I wonder if Glenn has seen any correlations between failure rates and Linux distros?
39) Message boards : Number crunching : OpenIFS Discussion (Message 68309)
Posted 14 Feb 2023 by AndreyOR
Post:
Had a failure on a newer Ryzen PC with the following code, but there's nothing in the error log, at least that I could pick up on, indicating a problem. Task: https://www.cpdn.org/result.php?resultid=22312560
process exited with code 9 (0x9, -247)
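For reading that line, the three numbers look like the same exit code shown three ways. The Python sketch below reproduces them under the assumption that the last figure is the exit code with all the higher bits set (that is, -256 OR'd with the code); this matches the numbers in the message but is a guess about how the client formats it, and the function name is made up.

```python
def exit_code_forms(exit_code: int) -> tuple:
    """Return the decimal, hex, and 'high bits set' forms of an exit code.

    The third form is an assumption that happens to reproduce the
    figures in the message: -256 | 9 == 9 - 256 == -247.
    """
    return exit_code, hex(exit_code), -256 | exit_code


print(exit_code_forms(9))  # (9, '0x9', -247)
```

So (0x9, -247) doesn't add any new information beyond "code 9"; the cause still has to come from the task's stderr output.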
40) Message boards : Number crunching : OpenIFS Discussion (Message 68253)
Posted 11 Feb 2023 by AndreyOR
Post:
Unfortunately, I have no estimate of how long they were to take.

Those 2 tasks are from a BL test batch (949) from a couple of months ago using the old app version (1.07). I'm not sure that I'd use them for any significant info or comparison as they were just part of the initial test runs in preparation for the OIFS release. Production runs are likely to be different and will use the latest app version (1.11 or newer).



©2024 climateprediction.net