Posts by Ananas

1) Message boards : Number crunching : Hosts burning results - adjustment in sched_send.c (Message 49080)
Posted 12 May 2014 by Profile Ananas
Post:
Assuming that CPDN still uses an older version of the server-side scheduler, one that is satisfied with a success rate of only 2%, the relevant code would be this:

void SCHEDULER_REPLY::got_good_result() {
    host.max_results_day *= 2;
    if (host.max_results_day > config.daily_result_quota) {
        host.max_results_day = config.daily_result_quota;
    }
}

void SCHEDULER_REPLY::got_bad_result() {
    host.max_results_day -= 1;
    if (host.max_results_day < 1) {
        host.max_results_day = 1;
    }
}


The 2% is a worst-case scenario. Example:

Host limit is 100 results/day; after "burning" 50 results it is reduced to 50/day. A single successful result is sufficient to bring the limit back to the full 100/day, so the cycle can start over. That is 1 success in 51 results, i.e. roughly 2%.
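To make the arithmetic of that example easy to check, here is a minimal standalone sketch. It only mirrors the two quoted functions on a plain struct; HostQuota, old_good_result, old_bad_result and limit_after_cycle are hypothetical names for illustration, not part of the BOINC sources.

```cpp
#include <algorithm>

// Hypothetical stand-in for the scheduler's per-host quota state.
struct HostQuota {
    int max_results_day;     // current per-day limit for this host
    int daily_result_quota;  // project-wide ceiling (config.daily_result_quota)
};

// Same arithmetic as the quoted got_good_result(): double, then cap.
void old_good_result(HostQuota& h) {
    h.max_results_day = std::min(h.max_results_day * 2, h.daily_result_quota);
}

// Same arithmetic as the quoted got_bad_result(): subtract one, never below 1.
void old_bad_result(HostQuota& h) {
    h.max_results_day = std::max(h.max_results_day - 1, 1);
}

// Burn `failures` results, then report one success; return the resulting limit.
int limit_after_cycle(int quota, int failures) {
    HostQuota h{quota, quota};
    for (int i = 0; i < failures; ++i) {
        old_bad_result(h);
    }
    old_good_result(h);
    return h.max_results_day;
}
```

With a quota of 100, limit_after_cycle(100, 50) is back at 100, which is the cycle described above.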

This could be changed to the following, which penalizes bad hosts more harshly:

void SCHEDULER_REPLY::got_good_result() {
    host.max_results_day += 1;
    if (host.max_results_day > config.daily_result_quota) {
        host.max_results_day = config.daily_result_quota;
    }
}

void SCHEDULER_REPLY::got_bad_result() {
    host.max_results_day /= 2;
    if (host.max_results_day < 1) {
        host.max_results_day = 1;
    }
}
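As a rough check of how much harder the proposed variant is on a bad host, this sketch counts how many successes are needed to climb back to the full quota after a run of failures. It assumes nothing beyond the arithmetic of the two functions above; successes_to_recover is a hypothetical helper, not BOINC code.

```cpp
#include <algorithm>

// Proposed policy: a failure halves the limit (never below 1),
// a success adds one (never above the quota).
// Returns how many successes are needed to get back to `quota`
// after `failures` consecutive bad results.
int successes_to_recover(int quota, int failures) {
    int limit = quota;
    for (int i = 0; i < failures; ++i) {
        limit = std::max(limit / 2, 1);
    }
    int successes = 0;
    while (limit < quota) {
        limit = std::min(limit + 1, quota);
        ++successes;
    }
    return successes;
}
```

With a quota of 100, a single failure already costs 50 successes to repair, and a host that burns 50 results drops to 1/day and needs 99 successes to recover, compared with a single success under the old code.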


In later server-side BOINC versions the code in sched_result.cpp is somewhat different, but it basically still handles good and bad results in the same way, e.g.:

good :
    int n = havp->max_jobs_per_day*2;
    if (n > config.daily_result_quota) {
        n = config.daily_result_quota;
    }


bad :
    n -= 1;
    if (n < 1) {
        n = 1;
    }


which could of course easily be adjusted in the same way.
2) Message boards : Number crunching : RAPIT tasks failing after few seconds (Message 49072)
Posted 8 May 2014 by Profile Ananas
Post:
Have a look at this thread, I checked your active ones and didn't find that message so yours are still needed.
3) Message boards : Number crunching : ANZ model upload problems. (Message 49048)
Posted 5 May 2014 by Profile Ananas
Post:
I'm not totally sure about the effect it has ... but host 1321940 is "nulled" (0 workunits per day).

When I had "-1/day" in the beta project, it rejected my trickle reports (but coupled with a message "not accepting requests from this host") - could it be a similar problem here?

Otoh. ... how would the upload server know about it? So I guess Richard is right, it must be a timeout on server side caused by the slow connection that the reference access complains about too - especially as the host details show "Average upload rate : Unknown"
4) Message boards : Number crunching : Credit updates? (Message 49047)
Posted 5 May 2014 by Profile Ananas
Post:
... Are the stats there but nobody knows where they are? ...

I don't think that this is the problem, the stats export files are where they usually are (default location when the BOINC server is installed) - but the (unix) timestamp in tables.xml currently is 1396540744 which converts to a human readable Thu, 03 Apr 2014 15:59:04 GMT

The stats sites compare this timestamp to the latest one they imported and if it is not higher, they will skip the import of this project.
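A small sketch of that check, plus the timestamp conversion mentioned above. This is only an illustration of the comparison as described; format_gmt and should_import are hypothetical names, and the real stats sites may well do it differently.

```cpp
#include <ctime>
#include <string>

// Format a Unix timestamp as a human-readable GMT string.
std::string format_gmt(std::time_t ts) {
    std::tm* utc = std::gmtime(&ts);
    if (!utc) {
        return std::string();
    }
    char buf[64];
    std::size_t len = std::strftime(buf, sizeof buf,
                                    "%a, %d %b %Y %H:%M:%S GMT", utc);
    return std::string(buf, len);
}

// Assumed import rule: only import when the export is strictly newer
// than the last one that was imported for this project.
bool should_import(std::time_t export_ts, std::time_t last_imported_ts) {
    return export_ts > last_imported_ts;
}
```

Under this rule a tables.xml that keeps the value 1396540744 is skipped on every run after the first import, no matter how often the stats site polls.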

I cannot check whether it is just tables.xml that is outdated or whether all the export files are somewhat oldish - the directory is not indexed (not readable from the web) - but usually tables.xml is updated on each export run so the timestamp should be accurate.


p.s.: My patience level is still quite high as we already had major stats delays in the past and they all got fixed so far.

p.p.s.: There are even two locations with identical tables.xml, one at /cpdnboinc/stats/... (default under the project root) and one at /stats/... (directly in the server root), which is a copy, not a separate export, as the timestamp cannot be identical for two separate export runs

I even found a third one on climateprediction.net/stats/ but that's dated 17 Mar 2014 so it is older than the two others.
5) Message boards : Number crunching : CONVERTING TO LINUX (Message 48950)
Posted 28 Apr 2014 by Profile Ananas
Post:
... there are no BOINC applications on Solaris ...

Didn't someone create a Solaris binary for SIMAP? Currently I cannot check it, SIMAP seems to have a downtime.

Dotsch provides BOINC for Solaris plus a SETI project client plus DotschUX, a preBOINCified Linux distribution
6) Message boards : Number crunching : hadam3p eu WU segfault (Message 48918)
Posted 27 Apr 2014 by Profile Ananas
Post:
Hmmm, I recently had the impression that two models influenced each other in this thread. That was a Windows machine with less detailed error output, but with a somewhat similar effect.

I guess it might be useful to have a look at the activity of models running concurrently on the same machine when such a thing happens.
7) Questions and Answers : Macintosh : Not accepting Requests from this host (Message 48908)
Posted 26 Apr 2014 by Profile Ananas
Post:
...
Perhaps the project staff can see the BM version in requests for new work or by some other means. In any case, I've passed the message on to reenable that machine.

Yes, they can, the request contains major, minor and release number so they could teach the scheduler to reject certain versions - this would have the advantage that they could even send a specific and detailed message to the client.
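Just to illustrate the idea, a version check of that kind could look roughly like the sketch below. Everything in it is an assumption for illustration: ClientVersion, same_version, rejection_message and the message text are made-up names, not the actual scheduler code.

```cpp
#include <string>
#include <vector>

// Hypothetical holder for the three numbers sent with a scheduler request.
struct ClientVersion {
    int major_version;
    int minor_version;
    int release;
};

bool same_version(const ClientVersion& a, const ClientVersion& b) {
    return a.major_version == b.major_version &&
           a.minor_version == b.minor_version &&
           a.release == b.release;
}

// Returns an empty string if the client is accepted, otherwise a
// specific message that could be sent back to the client.
std::string rejection_message(const ClientVersion& client,
                              const std::vector<ClientVersion>& blocked) {
    for (const ClientVersion& b : blocked) {
        if (same_version(client, b)) {
            return "Client version " + std::to_string(client.major_version) +
                   "." + std::to_string(client.minor_version) +
                   "." + std::to_string(client.release) +
                   " is not accepted by this project, please upgrade.";
        }
    }
    return std::string();
}
```

The point of returning a message instead of a plain yes/no is the advantage mentioned above: the user sees why the host was refused.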
8) Questions and Answers : Windows : Email from climateprediction.net team (Message 48851)
Posted 19 Apr 2014 by Profile Ananas
Post:
There is one more thing that needs to happen in order to have proper combined stats.

Each user has one or several unique cross project IDs (CPID).

This ID is not really "cross project" when there is no connection between the projects through the hosts.

If a user runs 4 projects all on different hosts (only one per host), the cross project IDs will never sync and the stats sites will list this one person as 4 separate users.

If host 1 is connected to projects A and B, and host 2 is connected to projects C and D, project A will sync with B and project C will sync with D, but the two pairs of projects (A+B and C+D) will not sync with each other.

Only if all projects are somehow connected, if there is an uninterrupted "path" between the projects, they all will sync sooner or later, which allows the stats sites to combine all user projects under the same CPID.

In the example above, one more host that connects B and C would do that job, but connecting host 1 to project C or D would have the same effect. Both would allow the A and B CPID to sync with the CPID of projects C and D.
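The "path" idea is really just connected components: projects that share a host end up in one group, and each group ends up with one CPID. Here is a minimal sketch of that reasoning with a simple union-find. Host, find_root and count_cpid_groups are hypothetical names, and this models only the grouping, not how BOINC actually propagates the IDs.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// One host = the list of projects it is attached to.
using Host = std::vector<std::string>;

// Find the representative project of the group that `p` belongs to.
std::string find_root(std::map<std::string, std::string>& parent,
                      const std::string& p) {
    if (!parent.count(p)) {
        parent[p] = p;
    }
    std::string root = p;
    while (parent[root] != root) {
        root = parent[root];
    }
    return root;
}

// Number of separate CPID groups the stats sites would see.
int count_cpid_groups(const std::vector<Host>& hosts) {
    std::map<std::string, std::string> parent;
    for (const Host& h : hosts) {
        for (const std::string& project : h) {
            // Merge every project on this host with the host's first project.
            std::string a = find_root(parent, h.front());
            std::string b = find_root(parent, project);
            parent[b] = a;
        }
    }
    std::set<std::string> roots;
    for (const auto& entry : parent) {
        roots.insert(find_root(parent, entry.first));
    }
    return static_cast<int>(roots.size());
}
```

Hosts {A,B} and {C,D} give two groups; adding a host {B,C} (or attaching host 1 to C) brings it down to one, and four hosts with one project each give four groups.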


Afaik. it isn't even necessary that the user name is the same across all projects in order to have combined stats - the email needs to be identical though.

Taking Les' example : If all projects of user Fred/Tom/Peter use the same email and they are "somehow" connected as described, finding Tom on a stats site will give you access to his CPID and through the CPID you'll find the combined stats. On BOINCstats you can see the CPID on each project's detail page for each user. If two projects of the same user have a different value there, they cannot be combined. All stats sites work like this and the project list below your account page in each project uses the CPID as well.


edit : I was right, the user name plays no role. One of my team mates has the same CPID on the BBC experiment and on SETI, even though his user name contains the project name in both cases. BoincStats lists all his projects under the same CPID, no matter which name he has chosen.


@SMURLEY : You don't have to change anything on your account, as your account page does show your other projects properly. That means that your CPID is in sync between those projects. If it doesn't get any better, you should rather try to detach from the project and then attach to it again.
9) Questions and Answers : Wish list : Trickle timestamps in stderr (Message 48787)
Posted 13 Apr 2014 by Profile Ananas
Post:
If stderr had two extra lines for each data trickle (the big uploads), it would be easier to figure out which heartbeat or restart events are critical. One line when it starts to collect the data for the upload, one when the upload files are ready for upload.

For example I have received 5 ANZ models within quite a short time on one box that is very likely to produce heartbeat problems, when a new CPDN model starts.

The first model that started most likely will have 4 heartbeat messages, the second one 3 ... and the last one no heartbeat message at all.

After returning 3 of the models, the box downloaded one more ANZ but - quite unexpected - the remaining two "old" ANZ models were not hit by heartbeat problems when the new one started, just a few other projects were affected this time. Lucky me, perfect timing.

But it could as well have happened that the initialisation of the new model caused problems for the two older ones and those interruptions might even have caused a crash.

Unfortunately BOINC stderr messages have only useless timestamps, but trickle messages in stderr would still help to see in which order events occurred.

My guess would be that a heartbeat error while it prepares the upload data is absolutely destructive, whereas there is a good chance to survive it in the calculation phase.

The reason why this is especially interesting is that the project client might be able to set a "doing critical work" condition, that is not interrupted by heartbeat checks. I think I have seen something like that in the project API sources (long ago).
10) Message boards : Number crunching : ANZ model upload problems. (Message 48761)
Posted 10 Apr 2014 by Profile Ananas
Post:
@Albert :

If you haven't restarted the BOINC core client for a long time, it might have cached an old IP.

The core client will renew the IP cache and retrieve the correct IP of the upload server only when it is restarted.

A PC restart is usually not necessary, restarting the BOINC client should do.

I hope it helps :-)
11) Questions and Answers : Wish list : Using GPUs for number crunching (Message 48743)
Posted 9 Apr 2014 by Profile Ananas
Post:
The developer of Infernal (used by RNA-World) tested a GPU version and of course the plain calculation was faster - but more time got lost loading the data into / retrieving the results from the GPU memory than it saved crunching them.

Even if there was someone who took on this job for CPDN, my guess would be that CPDN would have an even worse ratio of savings (GPU) to expense (bus).

Btw., I don't think Fortran itself would be the obstacle; Fortran can certainly call functions in .so or .dll files compiled from C sources.
12) Questions and Answers : Windows : no work from project (Message 48742)
Posted 9 Apr 2014 by Profile Ananas
Post:
The crashes you had now are not the same type as the one you had before.

CPDN is sometimes a bit picky when a virus scanner scans the files it is currently working with, so once you have your computer clean, you could probably exclude the directories containing the CPDN files from being scanned in the future.

The crashes now look somewhat like one that Ritterm reported here on his Win7 x64 machine with the same BOINC version, maybe we could somehow figure out how he solved his problem.
13) Message boards : Number crunching : Negative Credit? (Message 48730)
Posted 7 Apr 2014 by Profile Ananas
Post:
My guess : You have two accounts here, the one you posted with has 323,458 credits, the second one has 32,103 credits, and you checked the wrong one.

The connection between your accounts across projects is created through the email address that you entered when you created the project account, not through the user name. So your CPDN account "Gerry D. Mann" is connected to "Smimo" on Einstein and SETI, and there are a bunch of other "Smimo" accounts that are each connected to other projects.

Search BOINCstats for Smimo and you'll see a list of separate cross-project accounts.

p.s.: This BOINC wiki entry might help to understand what happened.
14) Questions and Answers : Windows : no work from project (Message 48726)
Posted 7 Apr 2014 by Profile Ananas
Post:
I assumed that "Could not launch model process. Last Error=216" must be the Windows errno from a system call the CPDN control process uses to start one of its sub-tasks, that's why I checked the 216.

Of course the important part is "might" in might be infected - the exit code can have several other reasons too.
15) Message boards : Number crunching : Computation Errors (Message 48725)
Posted 7 Apr 2014 by Profile Ananas
Post:
The only thing I can see is that the models crashed when 16395660 must have been just about to trickle / compose the first upload. (trickle times within the same model type are usually quite constant, the other one returned the first upload after 60,837, the crash happened after 59,819.36 of this one - close enough to assume there's a connection between trickle and crash)

Do you allow enough HDD space in your global settings? And have you excluded all CPDN stuff from virus scans (scanners can scan inside ZIP files, which might disturb the ZIP process)?
16) Questions and Answers : Windows : no work from project (Message 48721)
Posted 6 Apr 2014 by Profile Ananas
Post:
@Phoebe : Microsoft's explanation for error 216 is that the PC might be infected :

runtime error 216

I see the page in German, but I guess they send English text when your browser setting tells them to.
17) Message boards : Number crunching : Must set rsc_memory_bound correctly (Message 48664)
Posted 1 Apr 2014 by Profile Ananas
Post:
Just an idea for one of the next core client betas : if the core client would insert a hint about the maximum memory usage it found for a workunit, it would help the project developers adjust their limits, i.e. something like :

<core_client_version>7.3.20</core_client_version>
<max_mem_usage_found>168570139</max_mem_usage_found>
<![CDATA[

...

I might be wrong but a tag outside of the CDATA value should not confuse the server side.
18) Message boards : Number crunching : Must set rsc_memory_bound correctly (Message 48655)
Posted 1 Apr 2014 by Profile Ananas
Post:
...
> 1) workunit.rsc_memory_bound is used only by the server;
> ...
> -- David

Might be partially wrong, BOINC/client/client_state.cpp (not the current version) :

// alert user if any jobs need more RAM than available
//
static void check_too_large_jobs() {
    double m = gstate.max_available_ram();
    bool found = false;
    for (unsigned int i=0; i<gstate.results.size(); i++) {
        RESULT* rp = gstate.results[i];
        if (rp->wup->rsc_memory_bound > m) {
            found = true;
            break;
        }
    }
    if (found) {
        msg_printf(0, MSG_USER_ALERT,
            _("Some tasks need more memory than allowed by your preferences.  Please check the preferences.")
        );
    }
}


and - from a much older source version (usually commented out so they knew it might cause trouble) :

// if an app has exceeded its maximum allowed memory, abort it
//
bool ACTIVE_TASK::check_max_mem_exceeded() {
    // TODO: calculate working set size elsewhere
    if (working_set_size > max_mem_usage || working_set_size/1048576 > gstate.global_prefs.max_memory_mbytes) {
        msg_printf(
            result->project, MSG_INFO,
            "Aborting result %s: exceeded memory limit %f\n",
            result->name,
            min(max_mem_usage, gstate.global_prefs.max_memory_mbytes*1048576)
        );
        abort_task(ERR_RSC_LIMIT_EXCEEDED, "Maximum memory usage exceeded");
        return true;
    }
    return false;
}


where max_mem_usage is derived from the workunit's value "rsc_memory_bound"

So it depends on your core client version whether it will ignore the value or not. And it is clearly _not_ only a server-side value.
19) Message boards : Number crunching : NZ Application "not in DB" (Message 48647)
Posted 31 Mar 2014 by Profile Ananas
Post:
Btw., the WU details show the correct full name, imo. it must come from the same source so the table must still be intact.

"UK Met Office HADAM3P" (an older model type) has lost the full name in the list too.
20) Message boards : Number crunching : NZ Application "not in DB" (Message 48641)
Posted 31 Mar 2014 by Profile Ananas
Post:
Index problem, statistics not up to date or table "app" damaged/incomplete (appid lookup not satisfied) ...

Host stats show "not in DB" where "UK Met Office HADAM3P Australia New Zealand" should be.


©2020 climateprediction.net