OpenIFS Discussion

Message boards : Number crunching : OpenIFS Discussion

Jean-David Beyer

Joined: 5 Aug 04
Posts: 1067
Credit: 16,546,621
RAC: 2,321
Message 66896 - Posted: 14 Dec 2022, 10:30:46 UTC - in response to Message 66894.  

A new version (1.05) of the OpenIFS 43r3 Perturbed Surface application was distributed on Dec. 12th. The last 4 resends I received used the new app version. Three completed successfully and one is in progress.


Mine, too. One has completed successfully, and two are about 2/3 done, though the client grossly exaggerates the time remaining.

The one that completed successfully is interesting because the previous machine it ran on used version 1.01 and I ran version 1.05; i.e., the task does not seem to care which version it uses. Notice it now shows this instead of the old task name:

/var/lib/boinc/slots/9/oifs_43r3_model.exe
/var/lib/boinc/slots/10/oifs_43r3_model.exe

I always wonder why the failing tasks use more time than the ones I get that work correctly. What are they doing with all the wasted time? His machine is a little bit faster than mine, so that is not the reason.

Workunit 12164019
name 	oifs_43r3_ps_0930_2021050100_123_945_12164019
application 	OpenIFS 43r3 Perturbed Surface
created 	28 Nov 2022, 21:33:48 UTC
canonical result 	22250176
granted credit 	0.00
minimum quorum 	1
initial replication 	1
max # of error/total/success tasks 	3, 3, 1
Task       Computer   Sent                       Time reported              Status                  Run time (sec)   CPU time (sec)   Credit   Application
22250176   1511241    13 Dec 2022, 8:25:20 UTC   13 Dec 2022, 23:24:41 UTC  Completed               52,607.22        52,023.88        0.00     OpenIFS 43r3 Perturbed Surface v1.05 (x86_64-pc-linux-gnu)
22245964   1518460    29 Nov 2022, 4:15:46 UTC   13 Dec 2022, 8:22:42 UTC   Error while computing   98,458.51        91,682.23        ---      OpenIFS 43r3 Perturbed Surface v1.01 (x86_64-pc-linux-gnu)

ID: 66896
Glenn Carver

Joined: 29 Oct 17
Posts: 823
Credit: 13,729,538
RAC: 7,232
Message 66897 - Posted: 14 Dec 2022, 10:35:13 UTC - in response to Message 66895.  

And another oddity: two of the resends, from batch 947, went up to 100% and then dropped back to 99.990% as I was watching. When time remaining dropped to 0 they kept showing as running despite negligible CPU usage in top. I have suspended them in case getting any information from the slot files might be scuppered by letting them continue, though it may be that I have to kill the processes to stop them showing as running. I will wait till the third task from 945 finishes, just to ensure I don't kill the wrong process.
Dave, this 100% -> 99.9% is just an oddity of the way the BOINC client computes the time remaining. Ignore it; there's nothing wrong with the task. I can probably tweak the fraction-done computation, but I can never eliminate it completely.

What's happening is that the model itself proceeds at, say, 1% every 10 minutes. When the model finishes, there is a bit more work to do to zip up the remaining files, do housekeeping, etc. That doesn't proceed at 1% every 10 minutes; it's slower. But the client seems to use the previous rate of 1% every 10 minutes and therefore thinks the task will finish sooner. Then the task gets to a point in the code where it sends a message to the client saying it's actually only at 99.99%, and you see the fraction done change.
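
As a rough illustration (this is not the actual client code, just a sketch of the idea), a straight-line extrapolation from the model-phase rate runs out before the slower zip/housekeeping phase does:

#include <cstdio>

int main() {
    // Model phase: roughly 1% of the task every 600 seconds.
    double rate = 0.01 / 600.0;        // fraction done per second
    double fraction_done = 0.99;       // the model has just finished its last step

    // A client-style linear extrapolation of the time remaining:
    double est_remaining = (1.0 - fraction_done) / rate;   // = 600 s
    std::printf("estimated time remaining: %.0f s\n", est_remaining);

    // The zip/housekeeping phase is much slower than 1% per 10 minutes, so the
    // estimate reaches zero (and the display flashes 100%) before the task next
    // reports its true fraction done, at which point the display drops back.
    double reported_by_task = 0.9999;
    std::printf("fraction done reported by task: %.4f\n", reported_by_task);
    return 0;
}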

This is why I turn off the time remaining display; it's not accurate, only an estimate.

The tasks know what they are doing; there is no error here, it's just the client trying to be clever.

The proper way of working out where the task is would be to go into the slot directory and look at the 'stderr.txt' output.
ID: 66897
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,372,005
RAC: 4,704
Message 66898 - Posted: 14 Dec 2022, 10:38:06 UTC - in response to Message 66895.  

And another oddity: two of the resends, from batch 947, went up to 100% and then dropped back to 99.990% as I was watching. When time remaining dropped to 0 they kept showing as running despite negligible CPU usage in top.
I saw that during the test runs too. Those finished normally around 4 minutes after the 100% flash - it doesn't seem to be detrimental to the run.
ID: 66898
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4353
Credit: 16,598,247
RAC: 6,156
Message 66899 - Posted: 14 Dec 2022, 11:02:29 UTC - in response to Message 66898.  
Last modified: 14 Dec 2022, 11:05:47 UTC

And another oddity: two of the resends, from batch 947, went up to 100% and then dropped back to 99.990% as I was watching. When time remaining dropped to 0 they kept showing as running despite negligible CPU usage in top.
I saw that during the test runs too. Those finished normally around 4 minutes after the 100% flash - it doesn't seem to be detrimental to the run.


Thanks for that, Richard and Glenn. They obviously keep doing the zipping up after being suspended, as one has now completed and the other is now uploading. Clearly I haven't had my eyes glued to the screen at that point with these tasks before!
ID: 66899
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4353
Credit: 16,598,247
RAC: 6,156
Message 66900 - Posted: 14 Dec 2022, 11:37:05 UTC
Last modified: 14 Dec 2022, 13:47:14 UTC

And the last of my resends has survived a reboot, albeit after suspending computation, waiting, and exiting BOINC first. If the success rate proves to be substantially higher with the next batch, I might try the hard-reboot test.

Edit: three backed-up zips remaining, all getting a transient upload error, but new zips from the still-running task are going through. The same happens whether I use the router and broadband or 4G via my phone, which is 4 times faster on a bad day and 10 times faster on a good day, but with a 15 GB data limit I am only going to upload a few tasks a month that way!

Edit2: Started working again after half an hour. Probably still some congestion on the system?
ID: 66900
Glenn Carver

Joined: 29 Oct 17
Posts: 823
Credit: 13,729,538
RAC: 7,232
Message 66901 - Posted: 14 Dec 2022, 13:24:32 UTC - in response to Message 66900.  
Last modified: 14 Dec 2022, 13:31:52 UTC

And the last of my resends has survived a reboot, albeit after suspending computation, waiting, and exiting BOINC first. If the success rate proves to be substantially higher with the next batch, I might try the hard-reboot test.
A power on/off reboot works fine now with OpenIFS and the updated control code. From a programming point of view, it's the same as restarting the client (without powering off).

Just remember to tick/check the 'keep non-GPU apps in memory' option, otherwise if the openifs exe gets swapped out it'll have to do a restart. This is why people are reporting seeing tasks running for a long time: on each restart the model repeats a few timesteps, and if it's restarting frequently it could potentially never finish. This is why we've upped the memory bounds to keep it away from 8 GB RAM machines. If it stays in memory while suspended, that's fine.

CPDN have moved onto a newer managed cloud upload server so that too should be faster & more stable.
ID: 66901
Glenn Carver

Joined: 29 Oct 17
Posts: 823
Credit: 13,729,538
RAC: 7,232
Message 66902 - Posted: 14 Dec 2022, 13:34:18 UTC

I understand there will be some more OpenIFS work with the BL & PS apps appearing in the next week, now that the recent issues have been resolved.
ID: 66902
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1067
Credit: 16,546,621
RAC: 2,321
Message 66903 - Posted: 14 Dec 2022, 14:01:19 UTC - in response to Message 66897.  

This is why I turn off the time remaining display; it's not accurate, only an estimate.

The tasks know what they are doing; there is no error here, it's just the client trying to be clever.


With the "1.01" tasks, I noticed after I had completed a few, my Boinc client learned pretty quickly to make much better estimates.
I expect the same will apply to the "1.05" tasks.
ID: 66903
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1067
Credit: 16,546,621
RAC: 2,321
Message 66904 - Posted: 14 Dec 2022, 14:11:01 UTC - in response to Message 66900.  

Edit2: Started working again after half an hour. Probably still some congestion on the system?


I have not noticed any congestion. I expected some when it started working the other day, but it just took about two uploads every second or so until all 900 had gone. The upload speed is fabulous on my 75 megabit/second fiber-optic Internet connection, which has not changed, so I attribute the speedup to the server at the other (CPDN) end.

This is what typically happens now; I have noticed no exceptions.

Wed 14 Dec 2022 08:54:12 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_0622_2021050100_123_945_12163711_1_r1059561091_118.zip
Wed 14 Dec 2022 08:54:17 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0622_2021050100_123_945_12163711_1_r1059561091_118.zip
Wed 14 Dec 2022 08:54:34 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_1602_2021050100_123_946_12164691_1_r1605124282_118.zip
Wed 14 Dec 2022 08:54:39 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_1602_2021050100_123_946_12164691_1_r1605124282_118.zip
Wed 14 Dec 2022 09:01:14 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_0622_2021050100_123_945_12163711_1_r1059561091_119.zip
Wed 14 Dec 2022 09:01:19 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_0622_2021050100_123_945_12163711_1_r1059561091_119.zip
Wed 14 Dec 2022 09:01:45 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_1602_2021050100_123_946_12164691_1_r1605124282_119.zip
Wed 14 Dec 2022 09:01:49 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_1602_2021050100_123_946_12164691_1_r1605124282_119.zip

ID: 66904
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4353
Credit: 16,598,247
RAC: 6,156
Message 66937 - Posted: 17 Dec 2022, 7:08:13 UTC

Good to see #949 is now at 84% success as of 0400 UTC (GMT in old money). That is a significant increase on the three previous batches, which have been out for longer. Not sure what happened to #948?
ID: 66937
PDW
Joined: 29 Nov 17
Posts: 55
Credit: 6,523,307
RAC: 1,707
Message 66938 - Posted: 17 Dec 2022, 14:59:09 UTC - in response to Message 66937.  

In case it hasn't been mentioned previously: if an oifs task has bugged out part way through, a directory for that task may be left in the project folder, which you may want to delete manually to reclaim the disk space.

Another task failed with error 9 today. The host had previously run 7 of the last batch successfully and 1 more since without a problem. The host has been running the same projects continuously with 8-11 GB of free memory according to top, and roughly 70 GB of disk space available.

I believe a wrapper is being used to run the oifs tasks; is that correct?
Regarding the entries that are put into the stderr.txt file while the 'task' is running, are they written just by the actual application, just by the wrapper, or by both?
These entries for example:
  04:51:49 STEP 1440 H= 360:00 +CPU= 25.486
The child process terminated with status: 0
Moving to projects directory: /var/lib/boinc-client/slots/0/ICMGGh7zg+001440
Moving to projects directory: /var/lib/boinc-client/slots/0/ICMSHh7zg+001440
Moving to projects directory: /var/lib/boinc-client/slots/0/ICMUAh7zg+001440
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMSHh7zg+001392
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMUAh7zg+001416
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMUAh7zg+001404
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMSHh7zg+001416
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMSHh7zg+001344
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMGGh7zg+001356
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMGGh7zg+001344
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMGGh7zg+001428
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMGGh7zg+001380
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMSHh7zg+001368
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMGGh7zg+001404
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMUAh7zg+001392
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMSHh7zg+001440
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMUAh7zg+001368
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMUAh7zg+001440
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMUAh7zg+001344
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMSHh7zg+001356
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMUAh7zg+001356
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMGGh7zg+001416
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMSHh7zg+001428
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMGGh7zg+001368
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMUAh7zg+001428
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMGGh7zg+001440
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMGGh7zg+001392
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMUAh7zg+001380
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMSHh7zg+001380
Adding to the zip: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_12166594/ICMSHh7zg+001404
Zipping up the final file: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_bl_a05k_2016092300_15_949_12166594_0_r812719879_14.zip
Uploading the final file: upload_file_14.zip
Uploading trickle at timestep: 1295100
04:55:23 (1414522): called boinc_finish(0)



The code in the client's app_control.cpp that handles the error 9 reporting is this section:
if (WIFEXITED(stat)) {
            result->exit_status = WEXITSTATUS(stat);

            double x;
            char buf[256];
            bool is_notice;
            int e;
            if (temporary_exit_file_present(x, buf, is_notice)) {
                handle_temporary_exit(will_restart, x, buf, is_notice);
            } else {
                if (log_flags.task_debug) {
                    msg_printf(result->project, MSG_INFO,
                        "[task] process exited with status %d\n",
                        result->exit_status
                    );
                }
                if (result->exit_status) {
                    set_task_state(PROCESS_EXITED, "handle_exited_app");
                    snprintf(err_msg, sizeof(err_msg),
                        "process exited with code %d (0x%x, %d)",
                        result->exit_status, result->exit_status,
                        (~0xff)|result->exit_status
                    );
                    gstate.report_result_error(*result, err_msg);
                } else {
                    if (finish_file_present(e)) {
                        set_task_state(PROCESS_EXITED, "handle_exited_app");
                    } else {
                        handle_premature_exit(will_restart);
                    }
                }
            }
        } else if (WIFSIGNALED(stat)) {
            int got_signal = WTERMSIG(stat);

            if (log_flags.task_debug) {
                msg_printf(result->project, MSG_INFO,
                    "[task] process got signal %d", got_signal
                );
            }


if (WIFEXITED(stat)) { <=== This evaluates to TRUE, the exit was NORMAL

result->exit_status = WEXITSTATUS(stat); <==== This sets the result's exit_status flag to be that for the child, in this case 9

if (result->exit_status) { <==== Because non-zero we do this section of code
set_task_state(PROCESS_EXITED, "handle_exited_app");
snprintf(err_msg, sizeof(err_msg),
"process exited with code %d (0x%x, %d)",
result->exit_status, result->exit_status,
(~0xff)|result->exit_status
);
gstate.report_result_error(*result, err_msg);


So that is why I'm asking whether a wrapper is being used and whether it is the wrapper code that I need to go and look at. Is the wrapper a standard BOINC flavour, or have you made your own version that is not available to the public?

It would seem to me that the application/model doing all the work has done a splendid job, got all the right answers and sent them to the server, but at the very last moment the wrapper has indicated a problem of its own that marks all that work, and the task, as invalid.

I've seen it mentioned several times here that error 9 is signal 9, but it isn't. There is a separate bit of code, at the end of the section posted above, that handles signals, and it produces a different message in the log. I'm looking solely at those tasks that get an error 9 after completing the work correctly, as best I can tell. You could check that by requiring a quorum to validate against, but then you are duplicating every task rather than just having the failures repeated by a second or third host.
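
To see the difference the client code above is testing for, here is a small standalone demo (not CPDN code, just the standard waitpid() macros): a child that exits with status 9 takes the WIFEXITED branch, while a child killed by signal 9 takes the WIFSIGNALED branch and would produce the "process got signal" message instead.

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <csignal>
#include <cstdio>

static void report(pid_t pid) {
    int stat = 0;
    waitpid(pid, &stat, 0);
    if (WIFEXITED(stat)) {
        std::printf("pid %d exited normally, exit status %d\n", (int)pid, WEXITSTATUS(stat));
    } else if (WIFSIGNALED(stat)) {
        std::printf("pid %d terminated by signal %d\n", (int)pid, WTERMSIG(stat));
    }
}

int main() {
    pid_t a = fork();
    if (a == 0) _exit(9);                      // exit code 9 -> WIFEXITED branch
    report(a);

    pid_t b = fork();
    if (b == 0) { raise(SIGKILL); _exit(0); }  // killed by signal 9 -> WIFSIGNALED branch
    report(b);
    return 0;
}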

Linux says error 9 is EBADF, i.e. a bad file descriptor.
I have been looking for any remnants in the slot or project directories that would indicate something went amiss, but have found no such evidence. I enabled slot_debug, which showed approximately 1,950 files were cleared out of the slot on termination without a problem. The temporary task directory under the projects folder is gone. The task job files (jf_*) have gone.

Getting any further with the cause of this problem will require a better understanding of what is happening right at the end.
If it is the wrapper code that is returning an exit code of 9 right at the end, then that is where we need to look to determine the cause.
ID: 66938
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1067
Credit: 16,546,621
RAC: 2,321
Message 66943 - Posted: 18 Dec 2022, 4:22:41 UTC

I just got 8 oifs_43r3_ps tasks, and three of them are running with about 1.5 hours on each. My guess is they are not going to crash.
All are re-runs; the earlier attempts seem to have crashed for no apparent reason, and each one is different. Many left no stderr, but some did.

IIRC, the ones that leave stderr seem to have the model crash, which seems to confuse the wrapper.
Just my impression.
ID: 66943
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4353
Credit: 16,598,247
RAC: 6,156
Message 66944 - Posted: 18 Dec 2022, 11:12:47 UTC

#949 is now up to 88% success with only 2 hard fails.
ID: 66944
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1067
Credit: 16,546,621
RAC: 2,321
Message 66945 - Posted: 18 Dec 2022, 13:57:27 UTC - in response to Message 66944.  
Last modified: 18 Dec 2022, 14:02:20 UTC

#949 is now up to 88% success with only 2 hard fails.


My ten are all re-runs that failed for the previous users. Most are from #946 and #947; none are from #949.

The three that are running now are a little over 90% complete, so my guess is they will complete correctly.
So far, all oifs_43r3_ps tasks have run successfully for me.
ID: 66945
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4353
Credit: 16,598,247
RAC: 6,156
Message 66946 - Posted: 18 Dec 2022, 14:14:46 UTC - in response to Message 66944.  

The three that are running now are a little over 90% complete, so my guess is they will complete correctly.
So far, all oifs_43r3_ps tasks have run successfully for me.
#945, #946 and #947 were 79%, 82% and 82% complete when I looked earlier, but #945 is at over 70% failures. I think that figure is artificially high, though, as I suspect a task that fails three times counts as three failures; I don't know enough about the BOINC server code to check that. In contrast, #949 is at only 11% fails with only 2 hard fails.
ID: 66946
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1067
Credit: 16,546,621
RAC: 2,321
Message 66947 - Posted: 18 Dec 2022, 15:13:49 UTC - in response to Message 66945.  

The three that are running now are a little over 90% complete, so my guess is they will complete correctly.
So far, all oifs_43r3_ps tasks have run successfully for me.


The three re-runs just completed successfully.
ID: 66947
Glenn Carver

Joined: 29 Oct 17
Posts: 823
Credit: 13,729,538
RAC: 7,232
Message 66948 - Posted: 18 Dec 2022, 17:08:13 UTC

OpenIFS model fails due to inappropriate perturbations.
Some model tasks (but very few) are failing due to inappropriate perturbations pushing the model too far. Think of this as equivalent to the 'theta level' type error that the Hadley models give.

For the BL OpenIFS app, you will see a long stack-trace at the bottom of the task's report on the CPDN website, which mentions the function 'vdfmain'. This is the model's vertical diffusion scheme and indicates vertical air velocity is too high somewhere (typically near the surface).
For the PS OpenIFS app, you will see a similar stack-trace which refers to 'gp_norm'. It's a similar kind of thing: values at or near the surface have got too large.

The scientists in both these projects are perturbing some of the model's fixed parameters across quite a range of values so some fails are expected - though, to date, surprisingly few.
ID: 66948
Glenn Carver

Joined: 29 Oct 17
Posts: 823
Credit: 13,729,538
RAC: 7,232
Message 66949 - Posted: 18 Dec 2022, 17:58:41 UTC - in response to Message 66938.  

Hi, I can answer some of this. As it's getting technical maybe the moderators might want to move it to a separate thread?

In case it hasn't been mentioned previously: if an oifs task has bugged out part way through, a directory for that task may be left in the project folder, which you may want to delete manually to reclaim the disk space.
Yes, this is a known issue that's being looked into.

Another task failed with error 9 today. The host had previously run 7 of the last batch successfully and 1 more since without a problem.
Tasks are failing with error codes 1 & 9. Some seem to report more information: 'double free corruption' or 'free()... fast' errors appear at the end of the stderr but most don't. My guess is that the error is not being flushed to the log before the process dies.

With the help of others here, we know that 'double free' corruption seems to be symptomatic of the model itself failing. The fail with error code 9 happens after the model has finished, as you say, and sometimes comes with the 'free()...' message as well.

What I find interesting is that both of these only seem to happen on AMD hardware. I didn't do an exhaustive trawl through the logs, but I could not find a single Intel machine with these fails. My suspicion is that both of these errors are memory related: 'double free' corruption obviously is, and the error code 9 with the 'free()..' error could also refer to a memory-resident file, though quite what I am not sure. Both codes were compiled on Intel with the latest Intel compiler; whether additional compiler options are required I don't know. I have not been able to reproduce these errors on my little AMD box.

It's possible AMD chips are triggering memory bugs in the code depending on what else happens to be in memory at the same time (hence the seemingly random nature of the fail). It's hard to say exactly at the moment, but it could also be something system/hardware related specific to Ryzens. I have never seen the model fail like this before on the processors I've worked with in the past (none of which were AMD, unfortunately). I am tempted to turn down the optimization and see what happens...

I believe a wrapper is being used to run the oifs tasks; is that correct?
Yes, and it's code created by CPDN, which I've also recently worked on; I don't know the history of where it came from. It's undergoing some refactoring and bug-fixing. To answer your other question, the code is on GitHub here: https://github.com/CPDN-git/openifs - just be aware Andy and I both have private forks which are ahead of this.

Regarding the entries that are put into the stderr.txt file while the 'task' is running, are they written just by the actual application, just by the wrapper, or by both?
Both. The lines with 'STEP' come from the model's output to stderr; everything else comes from the separate controlling wrapper process.

It would seem to me that the application/model doing all the work has done a splendid job, got all the right answers and sent them to the server, but at the very last moment the wrapper has indicated a problem of its own that marks all that work, and the task, as invalid.
Agreed. I've asked CPDN if there is a way of getting the server to check that the upload was received OK and reclassify this as a success. It may not be easy, as the uploads go to a cloud server first. Not my area of expertise.

Linux says error 9 is EBADF, i.e. a bad file descriptor.
I have been looking for any remnants in the slot or project directories that would indicate something went amiss, but have found no such evidence. I enabled slot_debug, which showed approximately 1,950 files were cleared out of the slot on termination without a problem. The temporary task directory under the projects folder is gone. The task job files (jf_*) have gone.
There are some possible issues at the end of the code that I've already noted with Andy; I've been working on the more urgent fixes up to now. As I mentioned above, this could refer to a memory-resident file.

Comments, suggestions, help, advice - all welcome.
ID: 66949
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,372,005
RAC: 4,704
Message 66950 - Posted: 18 Dec 2022, 18:33:48 UTC - in response to Message 66949.  

It would seem to me that the application/model doing all the work has done a splendid job, got all the right answers and sent them to the server, but at the very last moment the wrapper has indicated a problem of its own that marks all that work, and the task, as invalid.
Agreed. I've asked CPDN if there is a way of getting the server to check that the upload was received OK and reclassify this as a success. It may not be easy, as the uploads go to a cloud server first. Not my area of expertise.
I have a theory on this, which may suggest where to look.

BOINC clients communicate with projects on two completely separate levels. There are very simple-minded file-copying processes, both download and upload, which simply move complete files from one place to another, with no information on or interest in what those files might contain; those files belong to the project's scientists. Separately, BOINC clients communicate with the project's "scheduler" about the administration and inner workings of BOINC itself.

CPDN is unusual in producing intermediate returns of both types: 'uploads' go to the scientists, and 'trickles' go to the administrators. Most projects only communicate once, when the task has completely finished, and BOINC is careful to wait until the data transfer has completed before finalising the administration.

My suspicion is that the completion of intermediate upload file transfers isn't checked so carefully by the client. Some volunteers with powerful computers but slower data lines may still have intermediate files chugging along when the rest of the task has completed, and I'm worried that the final administrative 'all done' message may be sent too early in those cases.
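
If that is what's happening, a minimal sketch of a fix on the application/wrapper side might look like the following. I'm assuming the BOINC API's intermediate-upload calls (boinc_upload_file()/boinc_upload_status() from boinc_api.h) are what CPDN uses, and the return-value convention below is my assumption rather than something checked against the wrapper source; the point is only that boinc_finish() shouldn't be called until every intermediate upload has been confirmed.

#include <string>
#include <vector>
#include <unistd.h>
#include "boinc_api.h"   // boinc_upload_status(), boinc_finish() - assumed API

// Hypothetical helper: wait until every intermediate upload started with
// boinc_upload_file() has completed before telling the client we are done.
// Assumed here: boinc_upload_status() returns 0 once the upload has finished.
void finish_after_uploads(std::vector<std::string>& upload_names) {
    bool pending = true;
    while (pending) {
        pending = false;
        for (auto& name : upload_names) {
            if (boinc_upload_status(name) != 0) {
                pending = true;                 // this upload isn't confirmed yet
            }
        }
        if (pending) sleep(10);                 // poll again in 10 seconds
    }
    boinc_finish(0);                            // only now report the task as complete
}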

Next time I get some tasks, I'll try holding some files back and studying the logs, to see what order the various 'finished' processes are handled in.
ID: 66950
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 66951 - Posted: 18 Dec 2022, 18:43:14 UTC
Last modified: 18 Dec 2022, 18:47:30 UTC

I take it all these problems mean you haven't even considered starting on VirtualBox to get Windows machines running it? I've got 126 cores sat waiting...

On desktops and laptops, Linux = 1.46%, Windows = 82.56%.

Add Windows and you'd get over 57 times as many users.
ID: 66951