Must set rsc_memory_bound correctly

Jacob Klein

Joined: 28 Mar 13
Posts: 16
Credit: 5,383,625
RAC: 0
Message 48651 - Posted: 1 Apr 2014, 1:14:53 UTC
Last modified: 1 Apr 2014, 1:18:55 UTC

ClimatePrediction Team:

You need to change your work unit parameters to set <rsc_memory_bound> correctly. BOINC 7.3.14 alpha (and potentially future versions) will read that value, compare it to the working set size, and auto-abort the work unit if the working set size exceeds the bound.

As of right now, I am getting errors due to your incorrect settings.

For example:
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=16297167
Exit status 198 (0xc6) (EXIT_MEM_LIMIT_EXCEEDED)
<core_client_version>7.3.14</core_client_version>
<![CDATA[
<message>
working set size > workunit.rsc_memory_bound: 167.57MB > 118.26MB
</message>
<stderr_txt>
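
The comparison behind this error can be sketched roughly as follows. This is a minimal illustration, not the actual 7.3.14 client code: the function names are hypothetical, the byte values are chosen only to reproduce the figures in the message above, and it assumes that sizes are reported with 1 MB = 1048576 bytes.

```cpp
#include <cstdio>
#include <string>

// Assumption: sizes are reported with 1 MB = 1048576 bytes.
constexpr double kBytesPerMB = 1048576.0;

// Hypothetical helper: true when the measured working set size (bytes)
// is larger than the work unit's rsc_memory_bound (bytes).
bool exceeds_memory_bound(double working_set_size, double rsc_memory_bound) {
    return working_set_size > rsc_memory_bound;
}

// Hypothetical helper: builds a message in the same shape as the one
// shown on the result page.
std::string memory_bound_message(double working_set_size, double rsc_memory_bound) {
    char buf[128];
    std::snprintf(buf, sizeof(buf),
        "working set size > workunit.rsc_memory_bound: %.2fMB > %.2fMB",
        working_set_size / kBytesPerMB, rsc_memory_bound / kBytesPerMB);
    return std::string(buf);
}
```

With a working set of 175709880 bytes and a bound of 124004598 bytes, the check is true and the message reads "167.57MB > 118.26MB", matching the report.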

Could you please promptly fix this?

Regards,
Jacob Klein
ID: 48651
Jacob Klein

Joined: 28 Mar 13
Posts: 16
Credit: 5,383,625
RAC: 0
Message 48652 - Posted: 1 Apr 2014, 2:05:50 UTC
Last modified: 1 Apr 2014, 2:08:11 UTC

It looks like this change is being reverted for now, per David's email below.
So, there is no longer an immediate need to correct the value...
But please consider setting it correctly at some point, in case it gets used by the client in the future.


> Date: Mon, 31 Mar 2014 18:53:33 -0700
> From: d...a@ssl.berkeley.edu
> To: b...c_alpha@ssl.berkeley.edu
> Subject: Re: [boinc_alpha] 7.3.14 - Heads up - Memory bound enforcement
>
> On further thought, I'm going to change things back to the way they were, namely
>
> 1) workunit.rsc_memory_bound is used only by the server;
> it won't send a job if rsc_memory_bound > host's available RAM
> 2) the client aborts a job if working set size > host's available RAM
> 3) the client will run a set of jobs only if the sum of their WSSs
> fits in available RAM
> (i.e. if a job's WSS is close to all available RAM,
> it would run that job and nothing else)
>
> The reason for not aborting jobs when WSS > rsc_memory_bound is that
> it requires projects to come up with very accurate estimates of RAM usage,
> which I don't think is feasible in general.
> Also, it will lead to lots of aborted jobs, which is bad for volunteer morale.
>
> -- David
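
The three rules in David's email can be sketched as below. This is only an illustration under stated assumptions: the `Job` struct and all function names are hypothetical, all sizes are in bytes, and the greedy in-order selection used for rule 3 is a guess at one possible policy, not the client's actual scheduler.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a job; field names are illustrative only.
struct Job {
    double rsc_memory_bound;   // project's estimate, in bytes
    double working_set_size;   // measured working set size (WSS), in bytes
};

// Rule 1 (server): don't send a job whose bound exceeds the host's available RAM.
bool server_may_send(const Job& job, double available_ram) {
    return job.rsc_memory_bound <= available_ram;
}

// Rule 2 (client): abort a job whose WSS exceeds the host's available RAM.
bool client_should_abort(const Job& job, double available_ram) {
    return job.working_set_size > available_ram;
}

// Rule 3 (client): walk the jobs in the given order and run each one only if
// the running total of WSS still fits in available RAM (assumed greedy policy).
std::vector<std::size_t> pick_runnable(const std::vector<Job>& jobs, double available_ram) {
    std::vector<std::size_t> chosen;
    double used = 0.0;
    for (std::size_t i = 0; i < jobs.size(); ++i) {
        if (used + jobs[i].working_set_size <= available_ram) {
            used += jobs[i].working_set_size;
            chosen.push_back(i);
        }
    }
    return chosen;
}
```

Note that none of the client-side checks compares WSS against rsc_memory_bound, which is the point of the revert: a job whose WSS is close to all available RAM simply runs alone.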
ID: 48652
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 48653 - Posted: 1 Apr 2014, 3:07:01 UTC

I'll make sure that Andy is aware of this, but cpdn doesn't cater for people using alpha versions of BOINC.
Some changes will require re-testing of the models, which can take months.

ID: 48653
Jacob Klein

Joined: 28 Mar 13
Posts: 16
Credit: 5,383,625
RAC: 0
Message 48654 - Posted: 1 Apr 2014, 3:43:59 UTC - in response to Message 48653.  

As an Alpha tester, it is my responsibility to report problems as soon as I see them. In this case, I saw a problem (over half of my tasks were instantly aborted across various projects) that was caused by incorrect rsc_memory_bound settings, and I reported it to various projects, including yours, so that you would have as much time as possible to take the necessary action. At the time I reported the problem, we were going to keep the change, but as the 2nd post indicates, the change will be reverted.

I wasn't asking you to cater for me or for BOINC Alpha; I was trying to prevent a problem for your project's general user base, as we ramp up towards our public BOINC release.

I'd like to think you'd be less pessimistic about this. Perhaps I read your response wrong. It's been a long day.

Regards,
Jacob
ID: 48654
Ananas
Volunteer moderator

Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 48655 - Posted: 1 Apr 2014, 6:43:41 UTC - in response to Message 48652.  
Last modified: 1 Apr 2014, 7:07:22 UTC

...
> 1) workunit.rsc_memory_bound is used only by the server;
> ...
> -- David

That might be partially wrong. From BOINC/client/client_state.cpp (not the current version):

// alert user if any jobs need more RAM than available
//
static void check_too_large_jobs() {
    double m = gstate.max_available_ram();
    bool found = false;
    for (unsigned int i=0; i<gstate.results.size(); i++) {
        RESULT* rp = gstate.results[i];
        if (rp->wup->rsc_memory_bound > m) {
            found = true;
            break;
        }
    }
    if (found) {
        msg_printf(0, MSG_USER_ALERT,
            _("Some tasks need more memory than allowed by your preferences.  Please check the preferences.")
        );
    }
}


And from a much older source version (usually commented out, so they knew it might cause trouble):

// if an app has exceeded its maximum allowed memory, abort it
//
bool ACTIVE_TASK::check_max_mem_exceeded() {
    // TODO: calculate working set size elsewhere
    if (working_set_size > max_mem_usage || working_set_size/1048576 > gstate.global_prefs.max_memory_mbytes) {
        msg_printf(
            result->project, MSG_INFO,
            "Aborting result %s: exceeded memory limit %f\n",
            result->name,
            min(max_mem_usage, gstate.global_prefs.max_memory_mbytes*1048576)
        );
        abort_task(ERR_RSC_LIMIT_EXCEEDED, "Maximum memory usage exceeded");
        return true;
    }
    return false;
}


where max_mem_usage is derived from the workunit's value "rsc_memory_bound"

So it depends on your core client version whether it will ignore the value or not. And it is clearly _not_ only a server-side value.
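
As a rough sketch of what the older check amounts to: the task is aborted once its working set exceeds the smaller of two limits, the work unit's bound and the user's memory preference. The function names below are hypothetical, and the sketch assumes max_mem_usage simply equals rsc_memory_bound in bytes, which the quoted source does not confirm.

```cpp
#include <algorithm>

// Effective per-task limit in bytes: the smaller of the work unit's bound
// (assumed here to be what max_mem_usage holds) and the user's preference,
// which is given in megabytes and converted with 1 MB = 1048576 bytes.
double effective_memory_limit(double rsc_memory_bound_bytes, double max_memory_mbytes) {
    return std::min(rsc_memory_bound_bytes, max_memory_mbytes * 1048576.0);
}

// True when the working set size (bytes) exceeds the effective limit,
// which is the condition under which the older client would abort the task.
bool over_effective_limit(double working_set_size, double rsc_memory_bound_bytes,
                          double max_memory_mbytes) {
    return working_set_size > effective_memory_limit(rsc_memory_bound_bytes, max_memory_mbytes);
}
```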
ID: 48655
mo.v
Volunteer moderator

Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 48656 - Posted: 1 Apr 2014, 7:14:53 UTC

Some of us saw your email to the boinc_alpha list last night, Jacob. Thyme Lawn said at UK bedtime that he'd email Andy about it this morning. The settings will clearly need to be modified, but fortunately not now in a rush.

If this had happened a few hours later you'd have wondered whether it was an ill-conceived April Fools' joke. Pity those two well-advanced models crashed, but you can't alpha-test without the occasional casualty.


Cpdn news
ID: 48656
Jacob Klein

Joined: 28 Mar 13
Posts: 16
Credit: 5,383,625
RAC: 0
Message 48663 - Posted: 1 Apr 2014, 11:04:07 UTC - in response to Message 48656.  

Thanks for understanding. I was a bit miffed to see most of my tasks get aborted, too, but as you said, it comes with the territory of being a tester. I'm glad you agree that it'd be wise to correct the work unit parameters.
ID: 48663
Ananas
Volunteer moderator

Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 48664 - Posted: 1 Apr 2014, 11:51:23 UTC

Just an idea for one of the next core client betas : if the core client would insert a hint about the maximum memory usage it found for a workunit, it would help the project developers adjust their limits, i.e. something like :

<core_client_version>7.3.20</core_client_version>
<max_mem_usage_found>168570139</max_mem_usage_found>
<![CDATA[

...

I might be wrong but a tag outside of the CDATA value should not confuse the server side.
ID: 48664
Jacob Klein

Joined: 28 Mar 13
Posts: 16
Credit: 5,383,625
RAC: 0
Message 48665 - Posted: 1 Apr 2014, 12:04:51 UTC - in response to Message 48664.  
Last modified: 1 Apr 2014, 12:05:07 UTC

That is not a bad idea. I have passed along the info to the dev team, via the email below.


From: j...@msn.com
To: b..._alpha@ssl.berkeley.edu
Subject: Request - Report "maximum usage" variables when reporting tasks
Date: Tue, 1 Apr 2014 08:03:49 -0400

Below you will see an idea that came from the memory discussion on Climate Prediction's forum.

If I understand correctly, it would be nice to have the client keep track of the "maximum working set used" and maybe also the "maximum virtual memory used", and then report those values back to the server when reporting results. While I don't agree about them being displayed in stderr.txt, I do think it's a valid idea, and one that could give feedback to the projects. It would be especially useful to users to display those variables on the task's result page. Oh, and my idea is to do it for "maximum disk usage" too :-p Thoughts on adding these 3 variables?




http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=7802#48664
------------------------------------------
Just an idea for one of the next core client betas : if the core client would insert a hint about the maximum memory usage it found for a workunit, it would help the project developers adjust their limits, i.e. something like :

<core_client_version>7.3.20</core_client_version>
<max_mem_usage_found>168570139</max_mem_usage_found>
<![CDATA[

...

I might be wrong but a tag outside of the CDATA value should not confuse the server side.
------------------------------------------
ID: 48665
rbpeake

Joined: 27 Feb 08
Posts: 41
Credit: 1,402,356
RAC: 0
Message 48667 - Posted: 1 Apr 2014, 16:07:29 UTC

FYI, I found this message on another project:
David Anderson wrote:
On further thought, I'm going to change things back to the way they were, namely

1) workunit.rsc_memory_bound is used only by the server;
it won't send a job if rsc_memory_bound > host's available RAM
2) the client aborts a job if working set size > host's available RAM
3) the client will run a set of jobs only if the sum of their WSSs
fits in available RAM (i.e. if a job's WSS is close to all available RAM, it would run that job and nothing else)

The reason for not aborting jobs when WSS > rsc_memory_bound is that
it requires projects to come up with very accurate estimates of RAM usage,
which I don't think is feasible in general.
Also, it will lead to lots of aborted jobs, which is bad for volunteer morale.

-- David


7.3.15 will again have normal values, and will not be using the immediate check of memory used on tasks.
Regards,
Bob P.
ID: 48667
Jacob Klein

Joined: 28 Mar 13
Posts: 16
Credit: 5,383,625
RAC: 0
Message 48669 - Posted: 1 Apr 2014, 16:11:04 UTC - in response to Message 48667.  

Correct. See post 2.
ID: 48669
Jacob Klein

Joined: 28 Mar 13
Posts: 16
Credit: 5,383,625
RAC: 0
Message 48679 - Posted: 2 Apr 2014, 13:30:53 UTC - in response to Message 48665.  

That is not a bad idea. I have passed along the info to the dev team


It turns out David liked the idea. He has implemented it too, so BOINC will probably start sending that data with the next release (7.3.16+).

It looks like it'll be saved in the state file as:
<peak_working_set_size>
<peak_swap_size>
<peak_disk_usage>

.. and will be sent to the server as:
<final_peak_working_set_size>
<final_peak_swap_size>
<final_peak_disk_usage>
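
Peak tracking of this kind can be sketched as below. This is a hedged illustration rather than the code in the linked commit: the `PeakUsage` struct and the function names are hypothetical, the tag names are the ones listed above, and writing the values as whole numbers of bytes is an assumption.

```cpp
#include <algorithm>
#include <cstdio>
#include <string>

// Hypothetical holder for the three peaks, all in bytes.
struct PeakUsage {
    double peak_working_set_size = 0.0;
    double peak_swap_size = 0.0;
    double peak_disk_usage = 0.0;

    // Call on every poll of the running task: each peak only ever grows.
    void update(double wss, double swap, double disk) {
        peak_working_set_size = std::max(peak_working_set_size, wss);
        peak_swap_size = std::max(peak_swap_size, swap);
        peak_disk_usage = std::max(peak_disk_usage, disk);
    }
};

// Formats the peaks with the tag names used when reporting to the server.
// The numeric format (no decimals) is an assumption.
std::string final_peaks_xml(const PeakUsage& p) {
    char buf[512];
    std::snprintf(buf, sizeof(buf),
        "<final_peak_working_set_size>%.0f</final_peak_working_set_size>\n"
        "<final_peak_swap_size>%.0f</final_peak_swap_size>\n"
        "<final_peak_disk_usage>%.0f</final_peak_disk_usage>\n",
        p.peak_working_set_size, p.peak_swap_size, p.peak_disk_usage);
    return std::string(buf);
}
```

A project could then compare the reported peak working set against its rsc_memory_bound to see how far off the estimate is.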

Again, great idea!

http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=b1a6fa39fc365b050141f5a89bf0d71a2a70303e

Client: keep track of job's peak WSS, swap size, and disk usage; send to server

Also fixed a bug where, if a job was aborted while not running,
its final CPU and elapsed time weren't copied from ACTIVE_TASK to RESULT,
hence not sent to scheduler
ID: 48679
Richard Haselgrove

Joined: 1 Jan 07
Posts: 921
Credit: 34,100,818
RAC: 11,270
Message 48681 - Posted: 2 Apr 2014, 14:21:01 UTC - in response to Message 48679.  

As yet, I've not seen any changes to the back-end server software which would allow the returned data to be stored and queried. No doubt a patch update will be available for servers running the current BOINC server software, in the course of the next few days.

But I don't think it will be easy to retro-fit it to the somewhat elderly server version here. That may have to wait until the work to upgrade and migrate the CPDN BOINC server to the latest version is complete.
ID: 48681
