= New credit system design =

== Peak FLOPS and efficiency ==

BOINC estimates the peak FLOPS of each processor.
For CPUs, this is the Whetstone benchmark score.
For GPUs, it's given by a manufacturer-supplied formula.

However, other factors affect application performance.
For example, applications access memory,
and the speed of a host's memory system is not reflected
in its Whetstone score.
So a given job might take the same amount of CPU time
on a 1 GFLOPS host as on a 10 GFLOPS host.
The "efficiency" of an application running on a given host
is the ratio of actual FLOPS to peak FLOPS.

GPUs typically have a much higher (50-100X) peak FLOPS than CPUs.
However, application efficiency is typically lower
(very roughly, 10% for GPUs, 50% for CPUs).

Notes:

 * The peak FLOPS of a device is single or double precision,
   whichever is higher.
   Differentiating between single and double precision would unnecessarily
   complicate things, and the distinction will disappear soon anyway.

== Credit system goals ==

Some possible goals in designing a credit system:

 * Device neutrality: similar jobs should get similar credit
   regardless of what processor or GPU they run on.

 * Project neutrality: different projects should grant
   about the same amount of credit per day for a given processor.

It's easy to show that these two goals can't be satisfied simultaneously:
if two projects' apps have different efficiencies on a given device,
granting device-neutral credit per job forces the projects to grant
different amounts of credit per day on that device.

== The first credit system ==

In the first iteration of BOINC's credit system,
"claimed credit" was defined as
{{{
C1 = H.whetstone * J.cpu_time
}}}
There were then various schemes for taking the
average or minimum claimed credit among the replicas of a job,
and using that as the "granted credit".

We call this system "Peak-FLOPS-based" because
it's based on the CPU's peak performance.

The problem with this system is that, for a given app version,
efficiency can vary widely between hosts.
In the above example,
the 10 GFLOPS host would claim 10X as much credit,
and its owner would be upset when it was granted only a tenth of that.

Furthermore, the credit granted to a given host for a
series of identical jobs could vary widely,
depending on the hosts it was paired with by replication.
This seemed arbitrary and unfair to users.

== The second credit system ==

We then switched to the philosophy that
credit should be proportional to the number of FLOPs actually performed
by the application.
We added API calls to let applications report this.
We call this approach "Actual-FLOPs-based".

SETI@home's application allowed FLOPs to be counted,
and they adopted this system,
adding a scaling factor so that average credit per job
was the same as under the first credit system.

Not all projects could count FLOPs, however.
So SETI@home published their average credit per CPU second,
and other projects continued to use benchmark-based credit,
but multiplied it by a scaling factor to match SETI@home's average.

This system had several problems:

 * It didn't address GPUs.
 * Projects that couldn't count FLOPs still had device neutrality problems.
 * It didn't prevent credit cheating when single replication was used.

== Goals of the new (third) credit system ==

 * Completely automated - projects don't have to
   change code, settings, etc.

 * Device neutrality.

 * Limited project neutrality: different projects should grant
   about the same amount of credit per CPU hour, averaged over hosts.
   Projects with GPU apps should grant credit in proportion
   to the efficiency of the apps.
   (This means that projects with efficient GPU apps will
   grant more credit on average. That's OK.)

== Peak FLOP Count (PFC) ==

This system uses the Peak-FLOPS-based approach,
but addresses its problems in a new way.

When a job is issued to a host, the scheduler specifies usage(J,D),
J's usage of processing resource D:
how many CPUs and how many GPUs (possibly fractional).

If the job is finished in elapsed time T,
we define peak_flop_count(J), or PFC(J), as
{{{
PFC(J) = T * sum over devices D of (usage(J, D) * peak_flop_rate(D))
}}}

Notes:

 * We use elapsed time instead of actual device time (e.g., CPU time).
   If a job uses a resource inefficiently
   (e.g., a CPU job that does lots of disk I/O),
   PFC() won't reflect this.  That's OK.
   The key thing is that BOINC reserved the device for the job,
   whether or not the job used it efficiently.
 * usage(J,D) may not be accurate; e.g., a GPU job may take
   more or less CPU than the scheduler thinks it will.
   Eventually we may switch to a scheme where the client
   dynamically determines the CPU usage.
   For now, though, we'll just use the scheduler's estimate.

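For concreteness, here's a minimal sketch (in C++) of the PFC
computation as defined above.
The types and names are illustrative stand-ins, not BOINC's actual code.
{{{
#include <vector>

// Hypothetical stand-in for the scheduler's per-device job data.
struct ResourceUsage {
    double usage;           // usage(J, D): devices used (possibly fractional)
    double peak_flop_rate;  // peak_flop_rate(D), in FLOPS
};

// PFC(J) = T * sum over devices D of (usage(J, D) * peak_flop_rate(D))
double peak_flop_count(double elapsed_time,
                       const std::vector<ResourceUsage>& devices) {
    double rate = 0;
    for (const auto& d : devices) {
        rate += d.usage * d.peak_flop_rate;
    }
    return elapsed_time * rate;
}
}}}
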
The granted credit for a job J is proportional to PFC(J),
but is normalized in the following ways:

== Cross-version normalization ==

If a given application has multiple versions (e.g., CPU and GPU versions),
the granted credit per job is adjusted
so that the average is the same for each version.
The adjustment is always downwards:
we maintain PFC^mean^(V), the average of PFC() over jobs,
for each app version V,
and let X be the minimum of these averages.
An app version V's jobs are then scaled by the factor

 S(V) = (X/PFC^mean^(V))

The result for a given job J
is called "Version-Normalized Peak FLOP Count", or VNPFC(J):

 VNPFC(J) = PFC(J) * (X/PFC^mean^(V))

Notes:
 * This addresses the common situation
   where an app's GPU version is much less efficient than the CPU version
   (i.e., the ratio of actual FLOPs to peak FLOPs is much lower).
   To a certain extent, this mechanism shifts the system
   towards the "Actual FLOPs" philosophy,
   since credit is granted based on the most efficient app version.
   It's not exactly "Actual FLOPs", since the most efficient
   version may not be 100% efficient.
 * There are two sources of variance in the PFC() of an app version V's jobs:
   the variation in host efficiency,
   and possibly the variation in job size.
   If we have an ''a priori'' estimate of job size
   (e.g., workunit.rsc_fpops_est)
   we can normalize by this to reduce the variance,
   and make PFC^mean^(V) converge more quickly.
 * ''a posteriori'' estimates of job size may also exist
   (e.g., an iteration count reported by the app),
   but using them for anything introduces a new cheating risk,
   so it's probably better not to.

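As a sketch of the bookkeeping involved (illustrative names,
not BOINC's actual schema):
{{{
#include <algorithm>
#include <vector>

// Hypothetical per-app-version record.
struct AppVersion {
    double pfc_mean;  // running average PFC^mean^(V)
    double scale;     // S(V), computed below
};

// The adjustment is always downward: find the minimum X of
// PFC^mean^(V) over an app's versions, then scale each version
// by S(V) = X/PFC^mean^(V).
void update_version_scales(std::vector<AppVersion>& versions) {
    if (versions.empty()) return;
    double x = versions[0].pfc_mean;
    for (const auto& v : versions) x = std::min(x, v.pfc_mean);
    for (auto& v : versions) v.scale = x / v.pfc_mean;
}

// VNPFC(J) = PFC(J) * S(V)
double vnpfc(double pfc, const AppVersion& v) {
    return pfc * v.scale;
}
}}}
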
== Cross-project normalization ==

If an application has both CPU and GPU versions,
the version normalization mechanism uses the CPU
version as a "sanity check" to limit the credit granted to GPU jobs.

Suppose a project has an app with only a GPU version,
so there's no CPU version to act as a sanity check.
If we grant credit based only on GPU peak speed,
the project will grant much more credit per GPU hour than other projects,
violating limited project neutrality.

A solution to this: if an app has only GPU versions,
then for each version V we let
S(V) be the average scaling factor
for that GPU type among projects that do have both CPU and GPU versions.
This factor is obtained from a central BOINC server.
V's jobs are then scaled by S(V) as above.

Notes:

 * Projects will run a periodic script to update the scaling factors.
 * Rather than GPU type, we'll probably use plan class,
   since e.g. the average efficiency of CUDA 2.3 apps may differ
   from that of CUDA 2.1 apps.
 * Initially we'll obtain scaling factors from large projects
   that have both GPU and CPU apps (e.g., SETI@home).
   Eventually we'll use an average (weighted by work done) over multiple projects
   (see below).

== Host normalization ==

If hosts are sent jobs for a given app uniformly,
then, for that app,
hosts should get the same average granted credit per job.
To ensure this, for each application A we maintain the average VNPFC^mean^(A),
and for each host H we maintain VNPFC^mean^(H, A).
The '''claimed FLOPS''' for a given job J is then

 F = VNPFC(J) * (VNPFC^mean^(A)/VNPFC^mean^(H, A))

and the claimed credit (in Cobblestones) is

 C = F*100/86400e9

There are some cases where hosts are not sent jobs uniformly:
 * job-size matching (smaller jobs sent to slower hosts)
 * GPUGrid.net's scheme for sending some (presumably larger)
   jobs to GPUs with more processors.
In these cases average credit per job must differ between hosts,
according to the types of jobs that are sent to them.

This can be done by dividing
each sample in the computation of VNPFC^mean^ by wu.rsc_fpops_est
(in fact, there's no reason not to always do this).

Notes:
 * The host normalization mechanism reduces the claimed credit of hosts
   that are less efficient than average,
   and increases the claimed credit of hosts that are more efficient
   than average.
 * VNPFC^mean^ is averaged over jobs, not hosts.

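Putting these pieces together, claimed credit might be computed
roughly as follows.
Note that by the formula above, one day of a 1 GFLOPS device
(86400e9 FLOPs) claims 100 Cobblestones.
The names are illustrative, not BOINC's actual code.
{{{
// Cobblestone scale, from C = F * 100/86400e9 above.
const double COBBLESTONE_SCALE = 100 / 86400e9;

// F = VNPFC(J) * (VNPFC^mean^(A) / VNPFC^mean^(H, A));
// the two averages are assumed to be maintained per app
// and per (host, app) as described above.
double claimed_credit(double vnpfc_job,
                      double vnpfc_mean_app,
                      double vnpfc_mean_host_app) {
    double f = vnpfc_job * (vnpfc_mean_app / vnpfc_mean_host_app);
    return f * COBBLESTONE_SCALE;
}
}}}
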
== Computing averages ==

We need to compute averages carefully because:

 * The quantities being averaged may gradually change over time
   (e.g. average job size may change,
   app version efficiency may change as new versions are deployed)
   and we need to track this.
 * A given sample may be wildly off,
   and we can't let this mess up the average.
 * Averages should be weighted by job size.

In addition, we may as well maintain the variance of the quantities,
although the current system doesn't use it.

The code that does all this is
[http://boinc.berkeley.edu/trac/browser/trunk/boinc/lib/average.h here].

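The linked code is authoritative; as a rough illustration of the kind
of estimator involved, here is a simple exponentially-weighted average
with a cap on wildly-off samples.
The decay constant, the cap multiple, and the job-size weighting are
assumptions, not the actual values in average.h.
{{{
#include <algorithm>

// Simplified running average; not the actual code in lib/average.h.
struct RunningAverage {
    int n = 0;
    double avg = 0;

    static constexpr double ALPHA = 0.01;   // assumed weight of new samples
    static constexpr double MAX_RATIO = 10; // assumed cap on outliers

    // 'weight' is a relative job-size weight (about 1 for an
    // average-size job).
    void sample(double x, double weight = 1) {
        // Clamp wildly-off samples so one bogus value
        // can't mess up the average.
        if (n > 0) x = std::min(x, MAX_RATIO * avg);
        n++;
        if (n == 1) {
            avg = x;
            return;
        }
        // Use the sample mean at first, then an exponential
        // average so the estimate tracks gradual change.
        double a = std::max(ALPHA * weight, 1.0 / n);
        a = std::min(a, 1.0);
        avg += a * (x - avg);
    }
};
}}}
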
== Cross-project scaling factors ==

We'll have a script that publishes a project's
accounting data (see Implementation).
The BOINC web site will collect these from a few big projects
and publish the averages.

== Replication and cheating ==

Host normalization mostly eliminates the incentive to cheat
by claiming excessive credit
(i.e., by falsifying benchmark scores or elapsed time).
An exaggerated claim will increase VNPFC^mean^(H, A),
causing subsequent claimed credit to be scaled down proportionately.
This means that no special cheat-prevention scheme
is needed when single replication is used;
granted credit = claimed credit.

For jobs that are replicated, granted credit should be
set to the minimum of the valid results
(the minimum is used instead of the average to remove the incentive
for cherry-picking; see below).

However, there are still some possible forms of cheating
(countermeasures are sketched after this list):

 * One-time cheats (like claiming 1e304) can be prevented by
   capping VNPFC(J) at some multiple (say, 10) of VNPFC^mean^(A).
 * Cherry-picking: suppose an application has two types of jobs,
   which run for 1 second and 1 hour respectively.
   Clients can figure out which is which, e.g. by running a job for 2 seconds
   and seeing if it has exited.
   Suppose a client systematically refuses the 1-hour jobs
   (e.g., by reporting a crash, or by never reporting them).
   Its VNPFC^mean^(H, A) will quickly decrease,
   and soon it will be getting several thousand times more credit
   per unit of actual work than other hosts!
   Countermeasure:
   whenever a job errors out, times out, or fails to validate,
   set the host's error rate back to the initial default,
   and set its VNPFC^mean^(H, A) to VNPFC^mean^(A) for all apps A.
   This puts the host in a state where several dozen of its
   subsequent jobs will be replicated.

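A minimal sketch of these two countermeasures
(the cap multiple is the value suggested above; the names are hypothetical):
{{{
#include <algorithm>

// Cap one-time exaggerated claims at a multiple of the app average.
double capped_vnpfc(double vnpfc_job, double vnpfc_mean_app) {
    const double CAP_MULTIPLE = 10;  // "some multiple (say, 10)"
    return std::min(vnpfc_job, CAP_MULTIPLE * vnpfc_mean_app);
}

// Hypothetical per-(host, app) record.
struct HostAppStats {
    double error_rate;
    double vnpfc_mean;  // VNPFC^mean^(H, A)
};

// On error, timeout, or validation failure: reset the host so
// that several dozen of its subsequent jobs will be replicated,
// and wipe out any scaling advantage it has accumulated.
void punish_host(HostAppStats& h, double initial_error_rate,
                 double vnpfc_mean_app) {
    h.error_rate = initial_error_rate;
    h.vnpfc_mean = vnpfc_mean_app;  // VNPFC^mean^(H,A) := VNPFC^mean^(A)
}
}}}
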
== Trickle credit ==

CPDN breaks jobs into segments,
sends a trickle-up message for each segment,
and grants credit for each completed segment.
In this case,
the trickle message handlers should not grant a fixed amount of credit.
Instead, the trickle-up messages should contain
an "incremental elapsed time" field.

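A handler might then treat each segment as a short job whose elapsed
time is the reported increment, and run it through the same credit
machinery as above (a sketch with hypothetical names, not CPDN's
actual handler):
{{{
// Grant credit for one segment from its incremental elapsed time.
double segment_credit(double incremental_elapsed_time,
                      double peak_flop_rate,  // host's peak rate for this job
                      double scale) {         // version/host scaling, as above
    const double COBBLESTONE_SCALE = 100 / 86400e9;
    double pfc = incremental_elapsed_time * peak_flop_rate;
    return pfc * scale * COBBLESTONE_SCALE;
}
}}}
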
== Job runtime estimates ==

This is unrelated to the credit proposal, but in a similar spirit.
The server will maintain ET^mean^(H, V), the statistics of
job runtimes (normalized by wu.rsc_fpops_est) per
host and application version.

The server's estimate of a job's runtime is then

 R(J, H) = wu.rsc_fpops_est * ET^mean^(H, V)

== Error rate, host punishment, and turnaround time estimation ==

This is also unrelated to the credit proposal, but in a similar spirit.

Due to hardware problems (e.g. a malfunctioning GPU),
a host may have a 100% error rate for one app version
and a 0% error rate for another.
The same is true of turnaround time.

So we'll move the "error_rate" and "turnaround_time"
fields from the host table to host_app_version.

The host punishment mechanism is designed to deal with malfunctioning hosts.
For each host the server maintains '''max_results_day'''.
This is initialized to a project-specified value (e.g. 200)
and scaled by the number of CPUs and/or GPUs.
It's decremented if the client reports a crash
(but not if the job was aborted).
It's doubled when a successful (but not necessarily valid)
result is received.

This should also be per-app-version,
so we'll move "max_results_day" from the host table to host_app_version.

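A sketch of these update rules, per (host, app version).
The floor of 1 and the cap at the initial value are assumptions;
the text doesn't say how far the value may fall or grow.
{{{
#include <algorithm>

struct HostAppVersion {
    double max_results_day;
};

const double INIT_MAX_RESULTS_DAY = 200;  // example project-specified value

// Decrement on a reported crash (but not on an abort).
void on_crash(HostAppVersion& hav) {
    hav.max_results_day = std::max(hav.max_results_day - 1, 1.0);
}

// Double on a successful (but not necessarily valid) result.
void on_success(HostAppVersion& hav) {
    hav.max_results_day =
        std::min(hav.max_results_day * 2, INIT_MAX_RESULTS_DAY);
}
}}}
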
== Cherry picking ==

Suppose an application has a mix of long and short jobs.
If a client intentionally discards
(or aborts, or reports errors from) the long jobs,
but completes the short jobs,
its host scaling factor will become large,
and it will get excessive credit for the short jobs.
This is called "cherry picking".

The host punishment mechanism described above
doesn't deal effectively with cherry picking.

We propose the following mechanism to deal with cherry picking
(see the sketch after this list):

 * For each (host, app version) maintain "host_scale_time".
   This is the earliest time at which host scaling will be applied.
 * For each (host, app version) maintain "scale_probation"
   (initially true).
 * When we send a job to a host,
   if scale_probation is true,
   set host_scale_time to now+X, where X is the app's delay bound.
 * When a job is successfully validated,
   and now > host_scale_time,
   set scale_probation to false.
 * If a job times out or errors out,
   set scale_probation to true,
   set the scale factor to max(scale factor, 1),
   and set host_scale_time to now+X.
 * When computing claimed credit for a job,
   if now < host_scale_time, don't use the host scale factor.

The idea is to apply the host scaling factor
only if there's solid evidence that the host is NOT cherry picking.

Because this mechanism is punitive to hosts
that experience actual failures,
we'll make it selectable on a per-application basis (default off).

In addition, to limit the extent of cheating
(in case the above mechanism is somehow defeated)
the host scaling factor will be min'd with a
project-wide config parameter (default, say, 3).

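A sketch of this mechanism (hypothetical names; the literal
"max(scale factor, 1)" step follows the list above):
{{{
#include <algorithm>
#include <ctime>

// Per-(host, app version) state for the proposed mechanism.
struct HostAvScale {
    bool scale_probation = true;  // initially true
    time_t host_scale_time = 0;   // earliest time scaling applies
    double scale = 1;             // host scale factor
};

// When we send a job (X = the app's delay bound, in seconds).
void on_send(HostAvScale& h, time_t now, time_t delay_bound) {
    if (h.scale_probation) h.host_scale_time = now + delay_bound;
}

// When a job is successfully validated.
void on_validate(HostAvScale& h, time_t now) {
    if (now > h.host_scale_time) h.scale_probation = false;
}

// When a job times out or errors out.
void on_failure(HostAvScale& h, time_t now, time_t delay_bound) {
    h.scale_probation = true;
    h.scale = std::max(h.scale, 1.0);
    h.host_scale_time = now + delay_bound;
}

// When computing claimed credit: skip the host scale factor during
// probation, and min it with a project-wide limit (default, say, 3).
double effective_scale(const HostAvScale& h, time_t now,
                       double max_scale = 3) {
    if (now < h.host_scale_time) return 1;
    return std::min(h.scale, max_scale);
}
}}}
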
== Implementation ==

=== Database changes ===

New table '''host_app''':
{{{
int    host_id;
int    app_id;
int    vnpfc_n;
double vnpfc_sum;
double vnpfc_exp_avg;
}}}

New table '''host_app_version''':
{{{
int    host_id;
int    app_version_id;
int    et_n;
double et_sum;
double et_exp_avg;
// some variable for recent error rate,
// replacing host.error_rate and host.max_results_day
// make sure that a host w/ 1 good and 1 bad GPU gets few GPU jobs
}}}

New fields in '''app_version''':
{{{
int    pfc_n;
double pfc_sum;