Changes between Initial Version and Version 1 of RpcPolicy


Timestamp:
Apr 25, 2007, 2:17:33 PM
Author:
Nicolas
Comment:

Converted by an automatic script

= Scheduler RPC timing and retry policies =

Each scheduler RPC reports results, gets work, or both. The client's '''scheduler RPC policy''' has several components: when to make a scheduler RPC, which project to contact, which scheduling server to use for that project, how much work to ask for, and what to do if the RPC fails.

The scheduler RPC policy has the following goals:

 * Make as few scheduler RPCs as possible.
 * Use random exponential backoff if a project's scheduling servers are down (i.e. delay by a random number times 2^N, where N is the number of unsuccessful attempts). This avoids an RPC storm when the servers come back up.
 * Eventually re-read a project's master URL file in case its set of schedulers changes.
 * Report results before or soon after their deadlines.

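The random exponential backoff goal can be illustrated with a minimal sketch. The base delay, the cap, and the range of the random factor are assumptions chosen for illustration; the page only specifies "a random number times 2^N".

```python
import random

# Hypothetical constants: the policy text does not give actual values.
BASE_DELAY_SECONDS = 60.0
MAX_DELAY_SECONDS = 4 * 3600.0


def exponential_backoff(n_failures, rng=random):
    """Return a delay in seconds: a random factor times 2^N, capped."""
    # Cap the exponent so repeated failures cannot produce huge numbers.
    exponent = min(n_failures, 20)
    # The "random number" of the policy; the [0.5, 1.0] range is an assumption.
    factor = rng.uniform(0.5, 1.0)
    return min(MAX_DELAY_SECONDS, factor * BASE_DELAY_SECONDS * 2 ** exponent)
```

Because each client draws its own random factor, clients that failed at the same moment retry at different times, which is what spreads out the load when the servers return.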
=== Resource debt ===
The client maintains an exponentially averaged sum of the CPU time it has devoted to each project. The constant EXP_DECAY_RATE determines the decay rate (currently a factor of e every week).

Each project is assigned a '''resource debt''', computed as

resource_debt = resource_share / exp_avg_cpu

where 'exp_avg_cpu' is the CPU time used recently by the project (exponentially averaged). Resource debt is a measure of how much work the client owes the project, and in general the project with the greatest resource debt is the one from which work should be requested.

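As a rough sketch of the bookkeeping described above, the following Python computes the decayed CPU sum and the resulting debt. The function names, the dictionary layout, and the treatment of a project with no recorded CPU time (infinite debt) are assumptions, not taken from the client source.

```python
import math

SECONDS_PER_WEEK = 7 * 24 * 3600.0
# Decay by a factor of e per week, as stated in the text.
EXP_DECAY_RATE = 1.0 / SECONDS_PER_WEEK


def decay_and_add(exp_avg_cpu, elapsed_seconds, cpu_seconds):
    """Decay the old sum over the elapsed time, then add new CPU time."""
    return exp_avg_cpu * math.exp(-EXP_DECAY_RATE * elapsed_seconds) + cpu_seconds


def resource_debt(resource_share, exp_avg_cpu):
    """resource_debt = resource_share / exp_avg_cpu."""
    if exp_avg_cpu <= 0:
        # Assumption: a project that has received no CPU time is owed the most.
        return math.inf
    return resource_share / exp_avg_cpu


def project_with_greatest_debt(projects):
    """Pick the project from which work should be requested next."""
    return max(
        projects,
        key=lambda p: resource_debt(p["resource_share"], p["exp_avg_cpu"]),
    )
```

For example, a project with share 100 and 50 recent CPU seconds has debt 2.0, while one with share 100 and 200 recent CPU seconds has debt 0.5, so the first is asked for work first.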
=== Minimum RPC time ===
The client maintains a '''minimum RPC time''' for each project. This is the earliest time at which a scheduling RPC should be done to that project (if zero, an RPC can be done immediately). The minimum RPC time can be set for various reasons:

 * Because of a request from the project, i.e. a <request_delay> element in a scheduler reply message.
 * Because RPCs to all of the project's schedulers have failed. An exponential backoff policy is used.
 * Because one of the project's computations has failed (the application crashed, or a file upload or download failed). An exponential backoff policy is used to prevent a cycle of rapid failures.

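A minimal sketch of how a per-project minimum RPC time might be tracked follows. The `Project` class and the helper names are hypothetical, and the rule that a new delay never moves an existing deadline earlier is an assumption made for this example.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    min_rpc_time: float = 0.0  # 0 means an RPC can be done immediately


def can_contact(project, now):
    """True once the earliest allowed RPC time has been reached."""
    return project.min_rpc_time <= now


def defer(project, now, delay_seconds):
    """Postpone RPCs, e.g. for a <request_delay> or after a backoff.

    Assumption: keep whichever deadline is later, so a short delay
    cannot cancel a longer one that is already in force.
    """
    project.min_rpc_time = max(project.min_rpc_time, now + delay_seconds)
```

All three causes listed above can go through the same `defer` call; they differ only in how the delay is chosen.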
=== Scheduler RPC sessions ===
Communication with schedulers is organized into '''sessions''', each of which may involve many RPCs. There are two types of sessions:

 * '''Get-work''' sessions, whose goal is to get a certain amount of work. Results may be reported as a side effect.
 * '''Report-result''' sessions, whose goal is to report results. Work may be fetched as a side effect.

The internal logic of scheduler sessions is encapsulated in the class SCHEDULER_OP. It is implemented as a state machine, but its logic, expressed as a sequential process, might look like:
{{{
get_work_session() {
    // keep requesting work until the buffer reaches the high-water mark
    while estimated work < high-water mark
        P = project with greatest debt and min_rpc_time < now
        // try each of P's schedulers in turn; stop at the first success
        for each scheduler URL of P
            attempt an RPC to that URL
            if no error break
        if some RPC succeeded
            P.nrpc_failures = 0
        else
            P.nrpc_failures++
            P.min_rpc_time = exponential_backoff(P.nrpc_failures)
            // after repeated failures, schedule a re-read of the master file
            if P.nrpc_failures mod MASTER_FETCH_PERIOD = 0
                P.fetch_master_flag = true
    for each project P with P.fetch_master_flag set
        read and parse master file
        if error
            P.nrpc_failures++
            P.min_rpc_time = exponential_backoff(P.nrpc_failures)
        if got any new scheduler urls
            // new schedulers are available: clear the backoff state
            P.nrpc_failures = 0
            P.min_rpc_time = 0
}

report_result_session(project P) {
    for each scheduler URL of P
        attempt an RPC to that URL
        if no error break
    if some RPC succeeded
        P.nrpc_failures = 0
    else
        P.nrpc_failures++
        P.min_rpc_time = exponential_backoff(P.nrpc_failures)
}
}}}
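The failure handling shared by both session types can be sketched in Python as below. This is an illustration, not the SCHEDULER_OP implementation: `attempt_rpc` and `backoff` are caller-supplied stand-ins, the value of `MASTER_FETCH_PERIOD` is hypothetical, and treating the backoff result as a delay added to the current time is an assumption. Setting the master-fetch flag follows the get-work pseudocode; the report-result pseudocode omits that step.

```python
from dataclasses import dataclass

MASTER_FETCH_PERIOD = 10  # hypothetical value; not given in the text


@dataclass
class Project:
    scheduler_urls: list
    nrpc_failures: int = 0
    min_rpc_time: float = 0.0
    fetch_master_flag: bool = False


def try_schedulers(project, attempt_rpc, now, backoff):
    """Try each scheduler URL in order; return True if one RPC succeeded."""
    # any() stops at the first URL for which attempt_rpc returns True.
    succeeded = any(attempt_rpc(url) for url in project.scheduler_urls)
    if succeeded:
        project.nrpc_failures = 0
    else:
        project.nrpc_failures += 1
        project.min_rpc_time = now + backoff(project.nrpc_failures)
        # Every MASTER_FETCH_PERIOD failures, request a master file re-read.
        if project.nrpc_failures % MASTER_FETCH_PERIOD == 0:
            project.fetch_master_flag = True
    return succeeded
```

Passing the RPC and backoff functions as parameters keeps the sketch testable without a network; a real client would drive the same transitions from its state machine.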
The logic for initiating scheduler sessions is embodied in the [ClientLogic scheduler_rpcs->poll()] function.
{{{
if a scheduler RPC session is not active
    if estimated work is less than low-water mark
        start a get-work session
    else if some project P has overdue results
        start a report-result session for P
        if P is the project with greatest resource debt,
            the RPC request should ask for enough work to bring us up
            to the high-water mark
}}}
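The decision made on each poll can be written as a small pure function. The function name, its parameters, and the tuple return convention are assumptions for this sketch; the real logic lives in the client's poll function.

```python
def choose_session(session_active, estimated_work, low_water_mark, overdue_projects):
    """Return ("get_work", None), ("report_result", P), or None.

    overdue_projects is a list of projects that have overdue results.
    """
    if session_active:
        return None  # only one scheduler RPC session at a time
    if estimated_work < low_water_mark:
        return ("get_work", None)
    if overdue_projects:
        # Assumption: report for the first overdue project found.
        return ("report_result", overdue_projects[0])
    return None
```

Getting work takes priority over reporting here because the get-work branch is tested first, mirroring the order of the pseudocode.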