Changes between Version 29 and Version 30 of CreditNew


Timestamp: Mar 26, 2010, 12:01:12 PM
Author: davea

Legend:

 - Removed (v29)
 + Added (v30)
 … Unchanged lines elided
  • CreditNew

-= New credit system design =
+= A new system for runtime estimation and credit =
 
-== Definitions ==
+== Terminology ==
 
 BOINC estimates the '''peak FLOPS''' of each processor.
…
 For GPUs, it's given by a manufacturer-supplied formula.
 
-Other factors,
-such as the speed of a host's memory system,
+Other factors, such as the speed of a host's memory system,
 affect application performance.
 So a given job might take the same amount of CPU time
-and a 1 GFLOPS host as on a 10 GFLOPS host.
+on 1 GFLOPS and 10 GFLOPS hosts.
 The '''efficiency''' of an application running on a given host
 is the ratio of actual FLOPS to peak FLOPS.
 
-GPUs typically have a much higher (50-100X) peak FLOPS than CPUs.
+GPUs typically have a higher (10-100X) peak FLOPS than CPUs.
 However, application efficiency is typically lower
 (very roughly, 10% for GPUs, 50% for CPUs).
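To illustrate the ''efficiency'' definition above, a small sketch; the FLOPS values are hypothetical, chosen only to match the rough 50% (CPU) and 10% (GPU) figures:

```python
# Efficiency = actual FLOPS / peak FLOPS (hypothetical numbers).
def efficiency(actual_flops: float, peak_flops: float) -> float:
    return actual_flops / peak_flops

cpu_eff = efficiency(5e9, 10e9)     # app does 5 GFLOPS on a 10 GFLOPS CPU
gpu_eff = efficiency(50e9, 500e9)   # app does 50 GFLOPS on a 500 GFLOPS GPU
```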
…
   about the same amount of credit per host, averaged over all hosts.
 
- * Cheat-proof: there should be a bound (say, 1.1)
-   on the ratio of credit granted to credit deserved per user account,
-   regardless of what the user does.
+ * Gaming-resistance: there should be a bound on the
+   impact of faulty or malicious hosts.
 
 == The first credit system ==
 
 In the first iteration of BOINC's credit system,
-"claimed credit" was defined as
-
- C1 = H.whetstone * J.cpu_time
+"claimed credit" C of job J on host H was defined as
+
+ C = H.whetstone * J.cpu_time
 
 There were then various schemes for taking the
…
 it's based on the CPU's peak performance.
 
-The problem with this system is that, for a given app version,
-efficiency can vary widely between hosts.
-In the above example,
-the 10 GFLOPS host would claim 10X as much credit,
+The problem with this system is that,
+for a given app version, efficiency can vary widely between hosts.
+In the above example, the 10 GFLOPS host would claim 10X as much credit,
 and its owner would be upset when it was granted only a tenth of that.
 
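The disparity described in this hunk can be shown numerically with the v29 formula C = H.whetstone * J.cpu_time; the benchmark values below are made up:

```python
# v29 claimed credit: C = H.whetstone * J.cpu_time
# (scaling constants omitted; benchmark values are hypothetical).
def claimed_credit_v29(whetstone_gflops: float, cpu_time_s: float) -> float:
    return whetstone_gflops * cpu_time_s

# A memory-bound job takes the same 1000 s of CPU time on both hosts:
slow_claim = claimed_credit_v29(1.0, 1000.0)    # 1 GFLOPS peak host
fast_claim = claimed_credit_v29(10.0, 1000.0)   # 10 GFLOPS peak host
# The faster host claims 10X the credit for identical work.
```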
…
 
 We then switched to the philosophy that
-credit should be proportional to number of FLOPs actually performed
+credit should be proportional to the FLOPs actually performed
 by the application.
 We added API calls to let applications report this.
…
  * Projects that can't count FLOPs still have device neutrality problems.
  * It doesn't prevent credit cheating when single replication is used.
-
 
 == Goals of the new (third) credit system ==
…
   grant more credit than projects with inefficient apps.  That's OK).
 
-== ''A priori'' job size estimates ==
-
-If we have an ''a priori'' estimate of job size,
-we can normalize by this to reduce the variance
-of various distributions (see below).
-This makes estimates of the means converge more quickly.
-
-We'll use workunit.rsc_fpops_est as this a priori estimate,
-and denote it E(J).
-
-(''A posteriori'' estimates of job size may exist also,
-e.g., an iteration count reported by the app,
-but aren't cheat-proof; we don't use them.)
+== ''A priori'' job size estimates and bounds ==
+
+Projects supply estimates of the FLOPs used by a job
+(wu.rsc_fpops_est)
+and a limit on FLOPs, after which the job will be aborted
+(wu.rsc_fpops_bound).
+
+Previously, inaccuracy of rsc_fpops_est caused problems.
+The new system still uses rsc_fpops_est,
+but its primary purpose is now to indicate the relative size of jobs.
+Averages of job sizes are normalized by rsc_fpops_est,
+and if rsc_fpops_est is correlated with actual size,
+these averages will converge more quickly.
+
+We'll denote workunit.rsc_fpops_est as E(J).
+
+Notes:
+
+ * ''A posteriori'' estimates of job size may exist also,
+   e.g., an iteration count reported by the app.
+   They aren't cheat-proof, and we don't use them.
 
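The convergence claim in the v30 text can be sketched: dividing each job's actual FLOPs by its a-priori estimate E(J) yields a size-independent quantity whose mean stabilizes quickly. The sample data below is made up:

```python
# Normalizing job statistics by E(J) = wu.rsc_fpops_est.
# Each pair is (actual FLOPs performed, rsc_fpops_est); data is hypothetical.
jobs = [(2.0e12, 1.0e12), (4.4e12, 2.0e12), (8.8e12, 4.0e12)]

# Raw job sizes span more than 4X ...
raw = [actual for actual, _ in jobs]

# ... but dividing by the correlated a-priori estimate removes most of
# that spread, so the mean of the normalized samples converges quickly:
normalized = [actual / est for actual, est in jobs]
norm_mean = sum(normalized) / len(normalized)
```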
 == Peak FLOP Count (PFC) ==
…
 == Cross-version normalization ==
 
-A given application may have multiple versions (e.g., CPU and GPU versions).
+A given application may have multiple versions
+(e.g., CPU, multi-thread, and GPU versions).
 If jobs are distributed uniformly to versions,
 all versions should get the same average credit.
…
   threshold, let X be the min of the averages.
 
+If X is defined, then for each version V we set
+
+ Scale(V) = (X/PFC^mean^(V))
+
+An app version V's jobs are scaled by this factor.
+
+For each app, we maintain min_avg_pfc(A),
+the average PFC for the most efficient version of A.
+This is an estimate of the app's average actual FLOPS.
+
 If X is defined, then we set
 
  min_avg_pfc(A) = X
 
-This is an estimate of the app's average actual FLOPS.
-
-We also set
-
- Scale(V) = (X/PFC^mean^(V))
-
-An app version V's jobs are scaled by this factor.
+Otherwise, if a version V is above sample threshold, we set
+
+ min_avg_pfc(A) = PFC^mean^(V)
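The scaling rule added in v30 can be sketched with hypothetical per-version PFC averages; X is the minimum over versions and Scale(V) = X/PFC^mean^(V):

```python
# Cross-version normalization sketch (hypothetical averages).
# The GPU version wastes more of its peak FLOPS, so its average PFC is higher.
pfc_mean = {"cpu": 2.0e12, "gpu": 8.0e12}

X = min(pfc_mean.values())                       # most efficient version
scale = {v: X / m for v, m in pfc_mean.items()}  # Scale(V) = X / PFC_mean(V)

# Scaling each version's jobs by Scale(V) equalizes average claimed PFC:
scaled_mean = {v: scale[v] * pfc_mean[v] for v in pfc_mean}
```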
 
 Notes:
…
   then this mechanism doesn't work as intended.
   One solution is to create separate apps for separate types of jobs.
- * Cheating or erroneous hosts can influence PFC^mean^(V) to
-   some extent.
+ * Cheating or erroneous hosts can influence PFC^mean^(V) to some extent.
   This is limited by the Sanity Check mechanism,
   and by the fact that only validated jobs are used.
…
 == Anonymous platform ==
 
-For anonymous platform apps,
-since we don't reliably know anything about the devices involved,
-we don't try to estimate PFC.
-
-For each app, we maintain min_avg_pfc(A),
-the average PFC for the most efficient version of A.
-
-The claimed credit for anonymous platform jobs is
-
- claimed_credit^mean^(A)*E(J)
-
-The server maintains host_app_version records for anonymous platform,
-and it keeps track of elapsed time statistics there.
-These have app_version_id = -2 for CPU, -3 for NVIDIA GPU, -4 for ATI.
+For jobs done by anonymous platform apps,
+the server knows the devices involved and can estimate PFC.
+It maintains host_app_version records for anonymous platform,
+and it keeps track of PFC and elapsed time statistics there.
+There are separate records per resource type.
+The app_version_id encodes the app ID and the resource type
+(-2 for CPU, -3 for NVIDIA GPU, -4 for ATI).
+
+If min_avg_pfc(A) is defined and
+PFC^mean^(H, V) is above a sample threshold,
+we normalize PFC by the factor
+
+ min_avg_pfc(A)/PFC^mean^(H, V)
+
+Otherwise the claimed PFC is
+
+ min_avg_pfc(A)*E(J)
+
+If min_avg_pfc(A) is not defined, the claimed PFC is
+
+ wu.rsc_fpops_est
+
+== Summary ==
+
+Given a validated job J, we compute
+
+ * the "claimed PFC" F
+ * a flag "approx" that is true if F
+   is an approximation and may not be comparable
+   with other instances of the job
+
+The algorithm:
+
+ pfc = peak FLOP count(J)
+ approx = true
+ if pfc > wu.rsc_fpops_bound
+   if min_avg_pfc(A) is defined
+     F = min_avg_pfc(A) * E(J)
+   else
+     F = wu.rsc_fpops_est
+ else
+   if job is anonymous platform
+     hav = host_app_version record
+     if min_avg_pfc(A) is defined
+       if hav.pfc.n > threshold
+         approx = false
+         F = pfc * (min_avg_pfc(A) / hav.pfc.avg)
+       else
+         F = min_avg_pfc(A) * E(J)
+     else
+       F = wu.rsc_fpops_est
+   else
+     F = pfc
+     if Scale(V) is defined
+       F *= Scale(V)
+     if Scale(H, V) is defined and (H, V) is not on scale probation
+       F *= Scale(H, V)
 
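The summary pseudocode added in v30 can be transcribed into a runnable sketch. The field names (hav.pfc.n, hav.pfc.avg, Scale(V)) follow the page; the threshold value and all sample numbers are made up, and the anonymous-platform branch applies min_avg_pfc(A)/hav.pfc.avg as a factor on pfc, as the prose on normalization says:

```python
from dataclasses import dataclass
from typing import Optional

THRESHOLD = 10  # hypothetical sample threshold; the page gives no value

@dataclass
class Job:
    pfc: float                        # peak FLOP count of this instance
    est: float                        # E(J) = wu.rsc_fpops_est
    bound: float                      # wu.rsc_fpops_bound
    anonymous: bool                   # anonymous platform app?
    hav_pfc_n: int = 0                # host_app_version PFC sample count
    hav_pfc_avg: float = 0.0          # host_app_version PFC average
    scale_v: Optional[float] = None   # Scale(V), if defined
    scale_hv: Optional[float] = None  # Scale(H, V), if defined, not on probation

def claimed_pfc(j: Job, min_avg_pfc: Optional[float] = None):
    """Return (F, approx) following the summary algorithm."""
    approx = True
    if j.pfc > j.bound:
        # Runaway job: fall back to the a-priori estimate.
        F = min_avg_pfc * j.est if min_avg_pfc is not None else j.est
    elif j.anonymous:
        if min_avg_pfc is not None:
            if j.hav_pfc_n > THRESHOLD:
                approx = False
                F = j.pfc * (min_avg_pfc / j.hav_pfc_avg)
            else:
                F = min_avg_pfc * j.est
        else:
            F = j.est
    else:
        F = j.pfc
        if j.scale_v is not None:
            F *= j.scale_v
        if j.scale_hv is not None:
            F *= j.scale_hv
    return F, approx
```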
 == Claimed and granted credit ==
 
-The '''claimed FLOPS''' for a given job J is
-
- F = PFC(J) * S(V) * S(H)
-
-and the claimed credit (in Cobblestones) is
-
- C = F*100/86400e9
-
-When replication is used,
-We take the set of hosts that
-are not anon platform and not on scale probation (see below).
+The claimed credit of a job (in Cobblestones) is
+
+ C = F * 200/86400e9
+
+If replication is not used, this is the granted credit.
+
+If replication is used,
+we take the set of instances for which approx is false.
 If this set is nonempty, we grant the average of their claimed credit.
-Otherwise we grant
-
- claimed_credit^mean^(A)*E(J)
+Otherwise:
+
+ if min_avg_pfc(A) is defined
+   C = min_avg_pfc(A)*E(J)
+ else
+   C = wu.rsc_fpops_est * 200/86400e9
 
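The v30 conversion constant implies that one day of computing at 1 GFLOPS earns 200 Cobblestones; a quick check (the second example value is made up):

```python
# Claimed credit in Cobblestones: C = F * 200 / 86400e9,
# where 86400e9 is the FLOP count of one day at 1 GFLOPS.
def cobblestones(claimed_flops: float) -> float:
    return claimed_flops * 200 / 86400e9

# One full day at 1 GFLOPS is worth 200 Cobblestones:
day_at_1gflops = cobblestones(86400e9)
# One hour at 100 GFLOPS (hypothetical job):
hour_at_100gflops = cobblestones(3600 * 100e9)
```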
 == Cross-project version normalization ==
 
 If an application has both CPU and GPU versions,
-the version normalization mechanism uses the CPU
-version as a "sanity check" to limit the credit granted to GPU jobs
-(or vice versa).
-
-Suppose a project has an app with only a GPU version,
-so there's no CPU version to act as a sanity check.
+the version normalization mechanism figures out
+which version is most efficient and uses that to reduce
+the credit granted to less-efficient versions.
+
+If a project has an app with only a GPU version,
+there's no CPU version for comparison.
 If we grant credit based only on GPU peak speed,
 the project will grant much more credit per GPU hour than other projects,