Changes between Version 32 and Version 33 of CreditNew


Timestamp:
Mar 26, 2010, 3:23:57 PM (14 years ago)
Author:
davea
Comment:

--

  • CreditNew

    v32 v33  
    2020Notes:
    2121 * For our purposes, the peak FLOPS of a device
    22    uses single or double precision, whichever is higher.
     22   is based on single or double precision, whichever is higher.
    2323
    2424== Credit system goals ==
     
    113113  They aren't cheat-proof, and we don't use them.
    114114
    115 == Peak FLOP Count (PFC) ==
     115== Peak FLOP Count ==
    116116
    117117This system uses the Peak-FLOPS-based approach,
     
    127127 PFC(J) = T * peak_flops(J)
    128128
     129The credit for a job J is typically proportional to PFC(J),
     130but is limited and normalized in various ways.
     131
    129132Notes:
    130133
    131  * PFC(J) is not cheat-proof;
     134 * PFC(J) is not reliable;
    132135   cheaters can falsify elapsed time or device attributes.
    133136 * We use elapsed time instead of actual device time (e.g., CPU time).
     
    147150   in the trickle message.
    148151
    149 By default, the credit for a job J is proportional to PFC(J),
    150 but is limited and normalized in the following ways:
    151 
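As a rough illustration of the PFC definition above, the following C++ sketch computes PFC(J) for a job that uses a single device. The function names and the separate single/double precision inputs are assumptions made for this example; they are not taken from the BOINC sources.

```cpp
#include <algorithm>

// Peak FLOPS of a device: the higher of its single- and double-precision
// peak rates, as stated in the Notes near the top of the page.
// (Hypothetical helper; both inputs are in FLOPS.)
double device_peak_flops(double peak_flops_sp, double peak_flops_dp) {
    return std::max(peak_flops_sp, peak_flops_dp);
}

// PFC(J) = T * peak_flops(J), where T is the job's elapsed time in seconds.
// This sketch assumes the job uses exactly one device.
double peak_flop_count(double elapsed_secs, double peak_flops) {
    return elapsed_secs * peak_flops;
}
```

For example, a job that runs for 100 seconds on a device with peak rates of 2e9 (single) and 1e9 (double) FLOPS would have a PFC of 2e11 under these assumptions.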
    152152== Computing averages ==
    153153
     
    160160   and we need to track this.
    161161   This is done as follows: for the first N samples
    162    (N = ~100 for app versions, ~10 for hosts)
    163162   we take the straight average.
    164    After that we use an exponentially-weighted average
    165    (with appropriate parameter for app version and host)
    166  * A given sample may be wildly off,
    167    and we can't let this mess up the average.
    168    Samples after the first are capped at 10 times the current average.
     163   After that we use an exponentially-weighted average with parameter A.
     164   The choice of N and A depends on the entity involved;
     165   for app versions (which typically get thousands of jobs per day)
     166   we might use N=100 and A=.001.
     167   For hosts (which typically get a few jobs per day)
     168   we might use N=10 and A=.01.
     169 * To reduce the effect of erroneously huge samples,
     170   samples after the first are capped at X times the current average.
     171   X depends on the entity:
     172   maybe 10 for hosts, 100 for app versions.
    169173 * We keep track of the number of samples,
    170174   and use an average only if its number of samples
     
    175179We maintain the following estimates:
    176180
    177  app.min_avg_pfc:: an estimate of the average actual FLOPS for an app
     181 app.min_avg_pfc:: an estimate of the average actual FLOPS for the app
    178182   (normalized by wu.fpops_est)
    179183 app_version.pfc_avg:: the average of PFC(J)/wu.fpops_est for an app version.
     184 app_version.pfc_scale:: a PFC scale factor for the app version
    180185 host_app_version.pfc_avg:: for each app version V and host H,
    181186   the average of PFC(J)/wu.fpops_est for jobs completed by H using V.
     187 host_app_version.scale_probation::
     188   if set, the host is suspected of cherry-picking (see below)
     189   and we don't use host normalization
    182190
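The averaging rules from the "Computing averages" section (straight average for the first N samples, exponentially-weighted average with parameter A afterwards, and capping later samples at X times the current average) might look roughly like the sketch below. The struct name, member names, and the usable() threshold test are assumptions for illustration and are not the AVERAGE type from the BOINC sources.

```cpp
// Minimal sketch of the averaging scheme; names are hypothetical.
struct Average {
    double n = 0;    // number of samples seen so far
    double avg = 0;  // current average

    // n_threshold is N, weight is A, cap is X in the text above.
    void update(double sample, double n_threshold, double weight, double cap) {
        // Samples after the first are capped at X times the current average.
        if (n > 0 && avg > 0 && sample > cap * avg) {
            sample = cap * avg;
        }
        n += 1;
        if (n <= n_threshold) {
            avg += (sample - avg) / n;       // straight average
        } else {
            avg += weight * (sample - avg);  // exponentially-weighted average
        }
    }

    // Assumption: an average is used only once it has enough samples.
    bool usable(double min_samples) const { return n >= min_samples; }
};
```

With the values suggested in the text, an app version would call update(sample, 100, .001, 100) and a host would call update(sample, 10, .01, 10).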
    183191== Sanity check ==
    184192
    185193If PFC(J) is infinite or is > wu.fpops_bound,
    186 J is assigned a "default PFC" and other processing is skipped.
    187 Default PFC is determined as follows:
     194J is assigned a "default PFC" D and other processing is skipped.
     195D is determined as follows:
    188196
    189197 * If app.min_avg_pfc is defined then
     
    194202
    195203   D = wu.fpops_est
     204
     205We also set host_app_version.scale_probation to true
     206(ensuring that the host scale factor isn't used for a while)
     207and host_app_version.error_rate to an initial value
     208(ensuring that jobs sent to this host are replicated for a while).
    196209
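A minimal sketch of the sanity check, assuming the default PFC D follows the Summary pseudocode (app.min_avg_pfc * wu.fpops_est when app.min_avg_pfc is defined, wu.fpops_est otherwise). The struct, the function name, and the convention that a non-positive min_avg_pfc means "not defined" are assumptions for this example.

```cpp
#include <cmath>

// Result of the check; hypothetical type for illustration.
struct SanityResult {
    bool failed;  // true if the default PFC was substituted
    double pfc;   // PFC to use for further processing
};

// min_avg_pfc <= 0 is treated as "app.min_avg_pfc is not defined" (assumption).
SanityResult sanity_check(double pfc, double fpops_est, double fpops_bound,
                          double min_avg_pfc) {
    if (std::isinf(pfc) || pfc > fpops_bound) {
        double d = (min_avg_pfc > 0) ? min_avg_pfc * fpops_est : fpops_est;
        return {true, d};
    }
    return {false, pfc};
}
```

When failed is true, the caller would also set host_app_version.scale_probation and reset host_app_version.error_rate, as described above; that bookkeeping is left out of the sketch.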
    197210== Cross-version normalization ==
     
    200213(e.g., CPU, multi-thread, and GPU versions).
    201214If jobs are distributed uniformly to versions,
    202 all versions should get the same average credit.
    203 We adjust the credit per job
    204 so that the average is the same for each version.
     215all versions should get the same average granted credit.
     216To make this so, we scale PFC as follows.
    205217
    206218For each app, we periodically compute cpu_pfc
     
    228240
    229241 app.min_avg_pfc = app_version.pfc_avg
     242 app_version.pfc_scale = 1
    230243
    231244Notes:
     
    246259   then this mechanism doesn't work as intended.
    247260   One solution is to create separate apps for separate types of jobs.
    248  * Cheating or erroneous hosts can influence PFC^mean^(V) to some extent.
     261 * Cheating or erroneous hosts can influence app_version.pfc_avg to some extent.
    249262   This is limited by the Sanity Check mechanism,
    250263   and by the fact that only validated jobs are used.
    251264   The effect on credit will be negated by host normalization
    252265   (see below).
    253    There may be an effect on cross-version normalization.
    254    This could be eliminated by computing PFC^mean^(V)
    255    as the sample-median value of PFC^mean^(H, V) (see below).
     266   There may be an adverse effect on cross-version normalization.
     267   This could be eliminated by computing app_version.pfc_avg
     268   as the sample-median value of host_app_version.pfc_avg
    256269
    257270== Host normalization ==
     
    261274Then the average credit per job should be the same for all hosts.
    262275
    263 We scale PFC by the factor
     276To achieve this, we scale PFC by the factor
    264277
    265278 app_version.pfc_avg / host_app_version.pfc_avg
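The host scale factor could be computed along the following lines. The cap of 10 and the probation test come from the Summary pseudocode; the function name, the boolean parameters, and the choice to return 1.0 for "no host scaling" are assumptions made for this sketch.

```cpp
#include <algorithm>

// Returns the factor by which PFC is multiplied for host normalization.
// host_avg_usable stands for "host_app_version.pfc_avg is above sample
// threshold" (hypothetical parameter).
double host_scale(double app_version_pfc_avg, double host_app_version_pfc_avg,
                  bool host_avg_usable, bool scale_probation) {
    if (!host_avg_usable || scale_probation || host_app_version_pfc_avg <= 0) {
        return 1.0;  // no host normalization in this sketch
    }
    return std::min(10.0, app_version_pfc_avg / host_app_version_pfc_avg);
}
```

A host whose average PFC is twice the app version's average would get a factor of 0.5, so its claims are halved.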
     
    271284   jobs to GPUs with more processors.
    272285
    273 The normalization by wu.fpops_est handles this.
     286The normalization by wu.fpops_est handles this
     287(assuming that it's set correctly).
    274288
    275289Notes:
    276290 * For apps with large variance of job sizes,
    277    the host normalization mechanism is prone to
     291   the host normalization mechanism is vulnerable to
    278292   a type of cheating called "cherry picking".
    279293   A mechanism for defeating this is described below.
     
    290304and it keeps track of PFC and elapsed time statistics there.
    291305There are separate records per resource type.
    292 The app_version_id encodes the app ID and the resource type
     306The record's app_version_id encodes the app ID and the resource type
    293307(-2 for CPU, -3 for NVIDIA GPU, -4 for ATI).
    294308
    295 If app.min_avg_pfc is defined and
     309If app.min_avg_pfc is defined,
    296310host_app_version.pfc_avg is above sample threshold,
     311and host_app_version.scale_probation is not set,
    297312we normalize PFC by the factor
    298313
     
    309324Notes:
    310325
    311  * We don't assume that anonymous platform apps on
    312    different hosts but with the same platform and resource type
    313    are comparable.
     326 * In the current design, anonymous platform jobs don't
     327   contribute to app.min_avg_pfc,
     328   but it may be used to determine their credit.
     329   This may cause problems:
     330   e.g., suppose a project offers an inefficient version
     331   and volunteers make a much more efficient version
     332   and run it as anonymous platform.
     333   They'd get an unfair amount of credit.
     334   This could be fixed by creating app_version records
     335   representing all anonymous platform apps of a given
     336   platform and resource type.
    314337
    315338== Summary ==
     
    327350 approx = true;
    328351 if pfc > wu.fpops_bound
     352   host_app_version.scale_probation = true
     353   host_app_version.error_rate = initial value  // replicate for a while
    329354   if app.min_avg_pfc is defined
    330355     F = app.min_avg_pfc * wu.fpops_est
     
    333358 else
    334359   if job is anonymous platform
    335         if app.min_avg_pfc is defined
     360    if app.min_avg_pfc is defined
    336361       if host_app_version.pfc_avg is above sample threshold
    337              approx = false
    338              F = app.min_avg_pfc / host_app_version.pfc_avg
    339            else
    340              F = app.min_avg_pfc * wu.fpops_est
     362            and not host_app_version.scale_probation
     363         F = app.min_avg_pfc / host_app_version.pfc_avg
     364         approx = false
     365       else
     366         F = app.min_avg_pfc * wu.fpops_est
    341367     else
    342            F = wu.fpops_est
     368       F = wu.fpops_est
    343369   else
    344370     F = pfc;
    345      if Scale(V) is defined
    346            F *= Scale(V)
    347          if Scale(H, V) is defined and (H,V) is not on scale probation
    348        F *= Scale(H, V)
     371     host_scale = 0
     372     if host_app_version.pfc_avg is above sample threshold
     373          and not host_app_version.scale_probation
     374       host_scale = min(10, app_version.pfc_avg / host_app_version.pfc_avg)
     375     if app_version.pfc_scale is defined
     376       F *= app_version.pfc_scale
     377       if host_scale
     378         F *= host_scale
     379         approx = false
     380     else
     381       if host_scale
     382         F *= host_scale
     383         app_version.pfc_avg.update(F)
     384         host_app_version.pfc_avg.update(F)
    349385}}}
    350386
     
    353389The claimed credit of a job (in Cobblestones) is
    354390
    355  C = F * 200/86400e9
    356 
     391 C = F * cobblestone_scale
     392
     393where cobblestone_scale is 200/86400e9.
    357394If replication is not used, this is the granted credit.
    358395
     
    364401{{{
    365402 if app.min_avg_pfc is defined
    366    C = app.min_avg_pfc*wu.fpops_est
     403   C = app.min_avg_pfc*wu.fpops_est*cobblestone_scale
    367404 else
    368    C = wu.fpops_est * 200/86400e9
     405   C = wu.fpops_est * cobblestone_scale
    369406}}}
    370407
     
    421458by claiming excessive credit
    422459(i.e., by falsifying benchmark scores or elapsed time).
    423 An exaggerated claim will increase PFC^mean^(H,A),
     460An exaggerated claim will increase host_app_version.pfc_avg,
    424461causing subsequent credit to be scaled down proportionately.
    425462
     
    434471For example, claiming a PFC of 1e304.
    435472
    436 If PFC(J) exceeds some multiple (say, 20) of PFC^mean^(V),
    437 the host's error rate is set to the initial value,
    438 so it won't do single replication for a while,
    439 and scale_probation (see below) is set to true.
    440 
    441 == Cherry picking ==
     473This is handled by the sanity check mechanism,
     474which grants a default amount of credit
     475and treats the host with suspicion for a while.
     476
     477=== Cherry picking ===
    442478
    443479Suppose an application has a mix of long and short jobs.
     
    471507   and now < host_scale_time, don't use the host scale factor
    472508
    473 The idea is to apply the host scaling factor
     509The idea is to use the host scaling factor
    474510only if there's solid evidence that the host is NOT cherry picking.
    475511
     
    534570{{{
    535571int    host_id;
    536 int    app_version_id;          // generalized for anon platform
     572int    app_version_id;        // generalized for anon platform
    537573AVERAGE pfc;
    538 AVERAGE_VAR et;                         // elapsed time / wu.fpops_est
     574AVERAGE_VAR et;                // elapsed time / wu.fpops_est
    539575double host_scale_time;
    540576bool scale_probation;
     
    556592{{{
    557593double min_avg_pfc;
    558 bool host_scale_check;          // whether to do scale probation
     594bool host_scale_check;        // whether to do scale probation
    559595int max_jobs_in_progress;
    560596int max_gpu_jobs_in_progress;