Changes between Version 28 and Version 29 of CreditNew


Ignore:
Timestamp:
Mar 25, 2010, 4:49:20 PM (14 years ago)
Author:
davea
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • CreditNew

    v28 v29  
    2626== Credit system goals ==
    2727
    28 Some possible goals in designing a credit system:
     28Some goals in designing a credit system:
    2929
    3030 * Device neutrality: similar jobs should get similar credit
     
    3434   about the same amount of credit per host, averaged over all hosts.
    3535
     36 * Cheat-proof: there should be a bound (say, 1.1)
     37   on the ratio of credit granted to credit deserved per user account,
     38   regardless of what the user does.
     39
    3640== The first credit system ==
    3741
    3842In the first iteration of BOINC's credit system,
    3943"claimed credit" was defined as
    40 {{{
    41 C1 = H.whetstone * J.cpu_time
    42 }}}
     44
     45 C1 = H.whetstone * J.cpu_time
     46
    4347There were then various schemes for taking the
    4448average or min claimed credit of the replicas of a job,
     
    7781but multiplied it by a scaling factor to match SETI@home's average.
    7882
    79 This system had several problems:
    80 
    81  * It didn't address GPUs.
    82  * Project that couldn't count FLOPs still had device neutrality problems.
    83  * It didn't prevent credit cheating when single replication was used.
     83This system has several problems:
     84
     85 * It doesn't address GPUs properly; projects using GPUs
     86   have to write custom code.
     87 * Project that can't count FLOPs still have device neutrality problems.
     88 * It doesn't prevent credit cheating when single replication is used.
    8489
    8590
     
    97102   grant more credit than projects with inefficient apps.  That's OK).
    98103
     104== ''A priori'' job size estimates ==
     105
     106If we have an ''a priori'' estimate of job size,
     107we can normalize by this to reduce the variance
     108of various distributions (see below).
     109This makes estimates of the means converge more quickly.
     110
     111We'll use workunit.rsc_fpops_est as this a priori estimate,
     112and denote it E(J).
     113
     114(''A posteriori'' estimates of job size may exist also,
     115e.g., an iteration count reported by the app,
     116but aren't cheat-proof; we don't use them.)
     117
    99118== Peak FLOP Count (PFC) ==
    100119
     
    108127If the job is finished in elapsed time T,
    109128we define peak_flop_count(J), or PFC(J) as
    110 {{{
    111 PFC(J) = T * peak_flops(J)
    112 }}}
     129
     130 PFC(J) = T * peak_flops(J)
    113131
    114132Notes:
     
    131149   in the trickle message.
    132150
    133 The credit for a job J is proportional to PFC(J),
    134 but is normalized in the following ways:
    135 
    136 == ''A priori'' job size estimates ==
    137 
    138 If we have an ''a priori'' estimate of job size,
    139 we can normalize by this to reduce the variance
    140 of various distributions (see below).
    141 This makes estimates of the means converge more quickly.
    142 
    143 We'll use workunit.rsc_fpops_est as this a priori estimate,
    144 and we'll denote it E(J).
    145 
    146 (''A posteriori'' estimates of job size may exist also,
    147 e.g., an iteration count reported by the app,
    148 but aren't cheat-proof; we don't use them.)
     151By default, the credit for a job J is proportional to PFC(J),
     152but is limited and normalized in the following ways:
     153
     154== Sanity check ==
     155
     156If PFC(J) is infinite or is > wu.rsc_fpops_bound,
     157J is assigned a "default PFC" and other processing is skipped.
     158Default PFC is determined as follows:
     159
     160 * If min_avg_pfc(A) is defined (see below) then
     161
     162 D = min_avg_pfc(A) * E(J)
     163
     164 * Otherwise
     165
     166 D = wu.rsc_fpops_est
    149167
    150168== Cross-version normalization ==
     
    158176We maintain the average PFC^mean^(V) of PFC(J)/E(J) for each app version V.
    159177We periodically compute PFC^mean^(CPU) and PFC^mean^(GPU),
    160 and let X be the min of these.
    161 An app version V's jobs are then scaled by the factor
     178and compute X as follows:
     179
     180 * If there are only CPU or only GPU versions,
     181   and at least 2 versions are above a sample threshold,
     182   X is the average.
     183
     184 * If there are both, and at least 1 of each is above a sample
     185   threshold, let X be the min of the averages.
     186
     187If X is defined, then we set
     188
     189 min_avg_pfc(A) = X
     190
     191This is an estimate of the app's average actual FLOPS.
     192
     193We also set
    162194
    163195 Scale(V) = (X/PFC^mean^(V))
    164196
     197An app version V's jobs are scaled by this factor.
     198
    165199Notes:
     200 * Version normalization is only applied if at least two
     201   versions are above sample threshold.
    166202 * Version normalization addresses the common situation
    167203   where an app's GPU version is much less efficient than the CPU version
     
    176212   then this mechanism doesn't work as intended.
    177213   One solution is to create separate apps for separate types of jobs.
    178 
    179 == Cross-project normalization ==
     214 * Cheating or erroneous hosts can influence PFC^mean^(V) to
     215   some extent.
     216   This is limited by the Sanity Check mechanism,
     217   and by the fact that only validated jobs are used.
     218   The effect on credit will be negated by host normalization
     219   (see below).
     220   There may be an effect on cross-version normalization.
     221   This could be eliminated by computing PFC^mean^(V)
     222   as the sample-median value of PFC^mean^(H, V) (see below).
     223
     224== Host normalization ==
     225
     226The second normalization is across hosts.
     227Assume jobs for a given app are distributed uniformly among hosts.
     228Then the average credit per job should be the same for all hosts.
     229To ensure this, for each app version V and host H
     230we maintain PFC^mean^(H, A),
     231the average of PFC(J)/E(J) for jobs completed by H using A.
     232
     233This yields the host scaling factor
     234
     235 Scale(H) = (PFC^mean^(V)/PFC^mean^(H, A))
     236
     237There are some cases where hosts are not sent jobs uniformly:
     238
     239 * job-size matching (smaller jobs sent to slower hosts)
     240 * GPUGrid.net's scheme for sending some (presumably larger)
     241   jobs to GPUs with more processors.
     242
     243The normalization by E(J) handles this
     244(assuming that wu.fpops_est is set appropriately).
     245
     246Notes:
     247 * For some apps, the host normalization mechanism is prone to
     248   a type of cheating called "cherry picking".
     249   A mechanism for defeating this is described below.
     250 * The host normalization mechanism reduces the claimed credit of hosts
     251   that are less efficient than average,
     252   and increases the claimed credit of hosts that are more efficient
     253   than average.
     254
     255== Computing averages ==
     256
     257Computation of averages needs to take into account:
     258
     259 * The quantities being averaged may gradually change over time
     260   (e.g. average job size may change)
     261   and we need to track this.
     262   This done as follows: for the first N samples
     263   (N = ~100 for app versions, ~10 for hosts)
     264   we take the straight average.
     265   After that we use an exponential average
     266   (with appropriate alpha for app version and host)
     267
     268 * A given sample may be wildly off,
     269   and we can't let this mess up the average.
     270   Non-first samples are capped at 10 times the current average.
     271
     272== Anonymous platform ==
     273
     274For anonymous platform apps,
     275since we don't reliably know anything about the devices involved,
     276we don't try to estimate PFC.
     277
     278For each app, we maintain min_avg_pfc(A),
     279the average PFC for the most efficient version of A.
     280
     281The claimed credit for anonymous platform jobs is
     282
     283 claimed_credit^mean^(A)*E(J)
     284
     285The server maintains host_app_version records for anonymous platform,
     286and it keeps track of elapsed time statistics there.
     287These have app_version_id = -2 for CPU, -3 for NVIDIA GPU, -4 for ATI.
     288
     289== Claimed and granted credit ==
     290
     291The '''claimed FLOPS''' for a given job J is
     292
     293 F = PFC(J) * S(V) * S(H)
     294
     295and the claimed credit (in Cobblestones) is
     296
     297 C = F*100/86400e9
     298
     299When replication is used,
     300We take the set of hosts that
     301are not anon platform and not on scale probation (see below).
     302If this set is nonempty, we grant the average of their claimed credit.
     303Otherwise we grant
     304
     305 claimed_credit^mean^(A)*E(J)
     306
     307== Cross-project version normalization ==
    180308
    181309If an application has both CPU and GPU versions,
     
    198326
    199327Projects will export the following data:
    200 {{{
    201 for each app version
     328
     329 for each app version
    202330   app name
    203331   platform name
     
    205333   plan class
    206334   scale factor
    207 }}}
    208335
    209336The BOINC server will collect these from several projects
    210337and will export the following:
    211 {{{
    212 for each plan class
     338
     339 for each plan class
    213340   average scale factor (weighted by RAC)
    214 }}}
     341
    215342We'll provide a script that identifies app versions
    216343for GPUs with no corresponding CPU app version,
     
    220347
    221348 * The "average scaling factor" is weighted by work done.
    222 
    223 == Host normalization ==
    224 
    225 The second normalization is across hosts.
    226 Assume jobs for a given app are distributed uniformly among hosts.
    227 Then the average credit per job should be the same for all hosts.
    228 To ensure this, for each app version V and host H
    229 we maintain PFC^mean^(H, A),
    230 the average of PFC(J)/E(J) for jobs completed by H using A.
    231 
    232 This yields the host scaling factor
    233 
    234  Scale(H) = (PFC^mean^(V)/PFC^mean^(H, A))
    235 
    236 There are some cases where hosts are not sent jobs uniformly:
    237 
    238  * job-size matching (smaller jobs sent to slower hosts)
    239  * GPUGrid.net's scheme for sending some (presumably larger)
    240    jobs to GPUs with more processors.
    241 
    242 The normalization by E(J) handles this
    243 (assuming that wu.fpops_est is set appropriately).
    244 
    245 Notes:
    246  * The host normalization mechanism reduces the claimed credit of hosts
    247    that are less efficient than average,
    248    and increases the claimed credit of hosts that are more efficient
    249    than average.
    250 
    251 == Claimed credit ==
    252 
    253 The '''claimed FLOPS''' for a given job J is then
    254 
    255  F = PFC(J) * S(V) * S(H)
    256 
    257 and the claimed credit (in Cobblestones) is
    258 
    259  C = F*100/86400e9
    260 
    261 == Computing averages ==
    262 
    263 We need to compute averages carefully because
    264 
    265  * The quantities being averaged may gradually change over time
    266    (e.g. average job size may change)
    267    and we need to track this.
    268  * A given sample may be wildly off,
    269    and we can't let this mess up the average.
    270 
    271 The code that does this is
    272 [http://boinc.berkeley.edu/trac/browser/trunk/boinc/lib/average.h here].
    273 
    274 == Anonymous platform ==
    275 
    276 For anonymous platform apps,
    277 since we don't reliably know anything about the devices involved,
    278 we don't try to estimate PFC.
    279 
    280 For each app, we maintain min_avg_pfc(A),
    281 the average PFC for the most efficient version of A.
    282 
    283 The claimed credit for anonymous platform jobs is
    284 
    285  claimed_credit^mean^(A)*E(J)
    286 
    287 The server maintains host_app_version records for anonymous platform,
    288 and it keeps track of elapsed time statistics there.
    289 These have app_version_id = -2 for CPU, -3 for NVIDIA GPU, -4 for ATI.
    290 
    291 == Replication ==
    292 
    293 We take the set of hosts that
    294 are not anon platform and not on scale probation (see below).
    295 If this set is nonempty, we grant the average of their claimed credit.
    296 Otherwise we grant
    297 
    298  claimed_credit^mean^(A)*E(J)
    299349
    300350== Cheat prevention ==
     
    334384doesn't deal effectively with cherry picking,
    335385
    336 We propose the following mechanism to deal with cherry picking:
     386The following mechanism deals with cherry picking:
    337387
    338388 * For each (host, app version) maintain "host_scale_time".
     
    358408Because this mechanism is punitive to hosts
    359409that experience actual failures,
    360 we'll make it selectable on a per-application basis (default off).
     410it's selectable on a per-application basis (default off).
    361411
    362412In addition, to limit the extent of cheating
    363413(in case the above mechanism is defeated somehow)
    364 the host scaling factor will be min'd with a constant (say, 3).
     414the host scaling factor will be min'd with a constant (say, 10).
    365415
    366416== Error rate, host punishment, and turnaround time estimation ==