Changes between Version 2 and Version 3 of CreditNew

Author:
davea (IP: 128.32.18.181)
Timestamp:
10/30/09 15:54:58 (4 weeks ago)
  • CreditNew

    v2 v3  
    11= New credit system design = 
    22 
    3 == Introduction == 
    4  
    5 We can estimate the peak FLOPS of a given processor. 
     3== Peak FLOPS and efficiency == 
     4 
     5BOINC estimates the peak FLOPS of each processor. 
    66For CPUs, this is the Whetstone benchmark score. 
    77For GPUs, it's given by a manufacturer-supplied formula. 
    1515is the ratio of actual FLOPS to peak FLOPS. 
    1616 
    17 GPUs typically have a much higher (50-100X) peak speed than CPUs. 
     17GPUs typically have a much higher (50-100X) peak speed than CPUs. 
    1818However, application efficiency is typically lower 
    1919(very roughly, 10% for GPUs, 50% for CPUs). 
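
As a rough illustration of the figures above, the following sketch computes actual FLOPS from peak FLOPS and efficiency. The device speeds are hypothetical; only the 100X peak ratio and the 10%/50% efficiencies come from the text.

```python
def actual_flops(peak_flops: float, efficiency: float) -> float:
    """Actual FLOPS = peak FLOPS * efficiency, with efficiency in [0, 1]."""
    return peak_flops * efficiency

# Hypothetical devices: a 5 GFLOPS CPU and a GPU with 100X that peak speed.
cpu_peak = 5e9
gpu_peak = 100 * cpu_peak

cpu_actual = actual_flops(cpu_peak, 0.50)  # roughly 50% efficiency on the CPU
gpu_actual = actual_flops(gpu_peak, 0.10)  # roughly 10% efficiency on the GPU

# Ratio of useful work per second, GPU vs. CPU.
speedup = gpu_actual / cpu_actual
```

Under these assumptions the GPU does about 20X the useful work of the CPU, even though its peak speed is 100X higher.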
    2020 
     21== Credit system goals == 
     22 
     23Some possible goals in designing a credit system: 
     24 
     25 * Device neutrality: similar jobs should get similar credit 
     26   regardless of what processor or GPU they run on. 
     27 
     28 * Project neutrality: different projects should grant 
     29   about the same amount of credit per day for a given host. 
     30 
     31It's easy to show that both goals can't be satisfied simultaneously 
     32when there is more than one type of processing resource. 
     33 
    2134== The first credit system == 
    2235 
    23 In the first iteration of credit system, "claimed credit" was defined as 
     36In the first iteration of BOINC's credit system, 
     37"claimed credit" was defined as 
    2438{{{ 
    2539C1 = H.whetstone * J.cpu_time 
    2640}}} 
    2741There were then various schemes for taking the 
    28 average or min of the claimed credit of the 
    29 replicas of a job, and using that as the "granted credit". 
     42average or min of the claimed credit of the replicas of a job, 
     43and using that as the "granted credit". 
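
A minimal sketch of this scheme, assuming two replicas of the same job and a simple "min" or "average" policy. The function names, the `policy` parameter, and the example numbers are hypothetical, and any unit scaling applied by the real system is omitted.

```python
from statistics import mean

def claimed_credit(whetstone: float, cpu_time: float) -> float:
    """C1 = H.whetstone * J.cpu_time, as in the formula above (unscaled)."""
    return whetstone * cpu_time

def granted_credit(claims, policy="min"):
    """Combine the claims of a job's replicas into one granted value."""
    if policy == "min":
        return min(claims)
    if policy == "avg":
        return mean(claims)
    raise ValueError(f"unknown policy: {policy}")

# Two hosts run replicas of one job for the same CPU time,
# but their benchmark scores differ by 10X.
claim_slow = claimed_credit(whetstone=1e9, cpu_time=1000.0)
claim_fast = claimed_credit(whetstone=1e10, cpu_time=1000.0)

granted_min = granted_credit([claim_slow, claim_fast], policy="min")
granted_avg = granted_credit([claim_slow, claim_fast], policy="avg")
```

With the "min" policy, both hosts are granted the smaller claim, so the host with the higher benchmark receives a tenth of what it claimed.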
    3044 
    3145We call this system "Peak-FLOPS-based" because 
    3347 
    3448The problem with this system is that, for a given app version, 
    35 efficiency can vary widely
     49efficiency can vary widely between hosts
    3650In the above example, 
    37 host B would claim 10X as much credit, 
    38 and its owner would be upset when it was granted 
    39 only a tenth of that. 
     51the 10 GFLOPS host would claim 10X as much credit, 
     52and its owner would be upset when it was granted only a tenth of that. 
    4053 
    4154Furthermore, the credits granted to a given host for a 
    4255series of identical jobs could vary widely, 
    4356depending on the host it was paired with by replication. 
    44  
    45 So host neutrality was achieved, 
    46 but in a way that seemed arbitrary and unfair to users. 
     57This seemed arbitrary and unfair to users. 
    4758 
    4859== The second credit system == 
    4960 
    50 To address the problems with host neutrality, 
    51 we switched to the philosophy that 
     61We then switched to the philosophy that 
    5262credit should be proportional to number of FLOPs actually performed 
    5363by the application. 
    5767SETI@home had an application that allowed counting of FLOPs, 
    5868and they adopted this system. 
    59 They added a scaling factor so that the average credit 
    60 was about the same as in the first credit system. 
     69They added a scaling factor so that the average credit per job 
     70was the same as in the first credit system. 
    6171 
    6272Not all projects could count FLOPs, however. 
    6878 
    6979 * It didn't address GPUs. 
    70  * project that couldn't count FLOPs still had host neutrality problem 
    71  * didn't address single replication 
     80 * Projects that couldn't count FLOPs still had device neutrality problems. 
     81 * It didn't prevent credit cheating when single replication was used. 
    7282 
    7383 
    7787   change code, settings, etc. 
    7888 
    79  * Device neutrality: similar jobs should get similar credit 
    80    regardless of what processor or GPU they run on. 
     89 * Device neutrality 
    8190 
    8291 * Limited project neutrality: different projects should grant 
    9099== Peak FLOP Count (PFC) == 
    91100 
    92 This system uses to the Peak-FLOPS-based approach, 
     101This system goes back to the Peak-FLOPS-based approach, 
    93102but addresses its problems in a new way. 
    94103 
    95104When a job is issued to a host, the scheduler specifies usage(J,D), 
    96105J's usage of processing resource D: 
    97 how many CPUs, and how many GPUs (possibly fractional). 
     106how many CPUs and how many GPUs (possibly fractional). 
    98107 
    99108If the job is finished in elapsed time T, 
    109118   (e.g., a CPU job that does lots of disk I/O) 
    110119   PFC() won't reflect this.  That's OK. 
     120   The key thing is that BOINC reserved the device for the job, 
     121   whether or not the job used it efficiently. 
    111122 * usage(J,D) may not be accurate; e.g., a GPU job may take 
    112123   more or less CPU than the scheduler thinks it will. 
    115126   For now, though, we'll just use the scheduler's estimate. 
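
The definition of PFC(J) is not visible in this diff. The sketch below assumes it is the elapsed time multiplied by the sum, over devices, of usage(J,D) times the device's peak FLOPS. That formula, the function name, and the example numbers are assumptions, not taken from the text shown here.

```python
def peak_flop_count(elapsed_time: float, usage: dict, peak_flops: dict) -> float:
    """Assumed form: PFC(J) = T * sum over D of usage(J, D) * peak_flops(D).

    usage maps a device name to the (possibly fractional) number of
    instances the scheduler reserved for the job.
    """
    return elapsed_time * sum(usage[d] * peak_flops[d] for d in usage)

# Hypothetical GPU job: half a CPU plus one GPU, finished in 100 seconds.
usage = {"cpu": 0.5, "gpu": 1.0}
peak = {"cpu": 4e9, "gpu": 2e11}
pfc = peak_flop_count(100.0, usage, peak)
```

Because the scheduler's estimate of usage is used as-is, a job that leaves a reserved device idle still counts that device's full peak rate.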
    116127 
    117 The idea of the system is that granted credit for a job J 
    118 is proportional to PFC(J), 
     128The idea of the system is that granted credit for a job J is proportional to PFC(J), 
    119129but is normalized in the following ways: 
    120130 
    121 == Version normalization == 
     131== Cross-version normalization == 
    122132 
    123133 
    128138find the minimum X, 
    129139then scale each app version's jobs by (X/PFC*(V)). 
    130 The results is called NPFC(J). 
     140The result is called "Version-Normalized Peak FLOP Count", or VNPFC(J). 
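
A minimal sketch of this scaling, assuming PFC*(V) is the average PFC of version V's jobs and X is the minimum of PFC*(V) over the app's versions. The function names and the numbers are hypothetical.

```python
def version_scale_factors(avg_pfc_by_version: dict) -> dict:
    """Return X / PFC*(V) for each version V, where X is the minimum PFC*(V)."""
    x = min(avg_pfc_by_version.values())
    return {v: x / avg for v, avg in avg_pfc_by_version.items()}

def vnpfc(pfc_job: float, version: str, factors: dict) -> float:
    """Version-normalized PFC of a job run with the given app version."""
    return pfc_job * factors[version]

# Hypothetical app with a CPU version and a less efficient GPU version.
avg_pfc = {"cpu": 1e12, "gpu": 1e13}
factors = version_scale_factors(avg_pfc)

# A GPU job with PFC 2e13 is scaled down to the CPU version's scale.
gpu_job_vnpfc = vnpfc(2e13, "gpu", factors)
```

The version with the lowest average PFC keeps a factor of 1, and every other version is scaled down toward it.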
    131141 
    132142Notes: 
    144154   or new app versions are deployed. 
    145155 
    146 == Project normalization == 
     156== Cross-project normalization == 
    147157 
    148158If an application has both CPU and GPU versions, 
    157167 
    158168The solution to this is: if an app has only GPU versions, 
    159 then we scale its granted credit by a factor, 
    160 obtained from a central BOINC server, 
    161 which is based on the average scaling factor 
     169then we scale its granted credit by the average scaling factor 
    162170for that GPU type among projects that 
    163171do have both CPU and GPU versions. 
     172This factor is obtained from a central BOINC server. 
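
One way this could look, assuming the central server exposes, per GPU type, the scaling factors reported by projects that have both CPU and GPU versions, and that a plain unweighted mean is used. The function names, project names, and numbers are hypothetical.

```python
from statistics import mean

def gpu_only_scale(factors_by_project: dict) -> float:
    """Average the per-project scaling factors for one GPU type."""
    return mean(factors_by_project.values())

def scaled_credit(raw_credit: float, scale: float) -> float:
    """Apply the averaged factor to a GPU-only app's credit."""
    return raw_credit * scale

# Hypothetical factors for one GPU type, from two projects
# that have both CPU and GPU versions.
reported = {"project_a": 0.08, "project_b": 0.12}
scale = gpu_only_scale(reported)
credit = scaled_credit(500.0, scale)
```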
    164173 
    165174Notes: 
    176185 
    177186For a given application, all hosts should get the same average granted credit per job. 
    178 To ensure this, for each application A we maintain the average NPFC*(A), 
    179 and for each host H we maintain NPFC*(H, A). 
     187To ensure this, for each application A we maintain the average VNPFC*(A), 
     188and for each host H we maintain VNPFC*(H, A). 
    180189The "claimed credit" for a given job J is then 
    181190{{{ 
    182 NPFC(J) * (NPFC*(A)/NPFC*(H, A)) 
    183 }}} 
    184  
    185 Notes: 
    186  * NPFC* is averaged over jobs, not hosts. 
    187  * Both averages are recent averages, so that they respond to 
    188    changes in job sizes and app versions characteristics. 
     191VNPFC(J) * (VNPFC*(A)/VNPFC*(H, A)) 
     192}}} 
     193 
     194Notes: 
     195 * VNPFC* is averaged over jobs, not hosts. 
     196 * Both averages are exponential recent averages, 
     197   so that they respond to changes in job sizes and app version characteristics. 
    189198 * This assumes that all hosts are sent the same distribution of jobs. 
    190199   There are two situations where this is not the case: 
    191200   a) job-size matching, and b) GPUGrid.net's scheme for sending 
    192201   some (presumably larger) jobs to GPUs with more processors. 
    193    To deal with this, we'll weight the average by workunit.rsc_flops_est. 
     202   To deal with this, we can weight jobs by workunit.rsc_flops_est. 
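
The host normalization above can be sketched as follows. The exponential update rule and its `alpha` parameter are assumptions, since the text only says the averages are exponential recent averages; weighting by workunit.rsc_flops_est is left out for brevity.

```python
def update_recent_average(avg, sample: float, alpha: float = 0.1) -> float:
    """Exponentially weighted recent average; alpha is a hypothetical smoothing factor."""
    if avg is None:  # first sample seeds the average
        return sample
    return avg + alpha * (sample - avg)

def claimed_credit(vnpfc_job: float, avg_app: float, avg_host_app: float) -> float:
    """VNPFC(J) * (VNPFC*(A) / VNPFC*(H, A))."""
    return vnpfc_job * (avg_app / avg_host_app)

# Hypothetical host whose jobs average twice the app-wide VNPFC.
avg_app = 1e12
avg_host_app = 2e12
claim = claimed_credit(2e12, avg_app, avg_host_app)

# Folding a new sample into a recent average.
new_avg = update_recent_average(10.0, 20.0, alpha=0.1)
```

In this example the host's typical job is scaled back to the app-wide average, so its claim matches what other hosts claim for similar jobs.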
    194203 
    195204== Replication and cheating == 
    198207by claiming excessive credit 
    199208(i.e., by falsifying benchmark scores or elapsed time). 
    200 An exaggerated claim will increase NPFC*(H,A), 
     209An exaggerated claim will increase VNPFC*(H,A), 
    201210causing subsequent claimed credit to be scaled down proportionately. 
    202211This means that no special cheat-prevention scheme 
    212221 
    213222 * One-time cheats (like claiming 1e304) can be prevented by 
    214    capping NPFC(J) at some multiple (say, 10) of NPFC*(A). 
     223   capping VNPFC(J) at some multiple (say, 10) of VNPFC*(A). 
    215224 * Cherry-picking: suppose an application has two types of jobs, 
    216        which run for 1 second and 1 hour respectively. 
    217        Clients can figure out which is which, e.g. by running a job for 2 seconds 
    218        and seeing if it's exited. 
    219        Suppose a client systematically refuses the 1 hour jobs 
    220        (e.g., by reporting a crash or never reporting them). 
    221        Its NPFC*(H, A) will quickly decrease, 
    222        and soon it will be getting several thousand times more credit 
    223        per actual work than other hosts! 
    224        Countermeasure: 
    225        whenever a job errors out, times out, or fails to validate, 
    226        set the host's error rate back to the initial default, 
    227        and set its NPFC*(H, A) to NPFC*(A) for all apps A. 
    228        This puts the host to a state where several dozen of its 
    229        subsequent jobs will be replicated. 
     225  which run for 1 second and 1 hour respectively. 
     226  Clients can figure out which is which, e.g. by running a job for 2 seconds 
     227  and seeing if it's exited. 
     228  Suppose a client systematically refuses the 1 hour jobs 
     229  (e.g., by reporting a crash or never reporting them). 
     230  Its VNPFC*(H, A) will quickly decrease, 
     231  and soon it will be getting several thousand times more credit 
     232  per actual work than other hosts! 
     233  Countermeasure: 
     234  whenever a job errors out, times out, or fails to validate, 
     235  set the host's error rate back to the initial default, 
     236  and set its VNPFC*(H, A) to VNPFC*(A) for all apps A. 
     237  This puts the host into a state where several dozen of its 
     238  subsequent jobs will be replicated. 
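
The two countermeasures above might be sketched like this. The host record layout and `DEFAULT_ERROR_RATE` are hypothetical; only the cap multiple of 10 and the reset rule come from the text.

```python
CAP_MULTIPLE = 10          # "say, 10" in the text above
DEFAULT_ERROR_RATE = 0.1   # hypothetical initial default, not from the source

def capped_vnpfc(vnpfc_job: float, avg_app: float,
                 cap_multiple: float = CAP_MULTIPLE) -> float:
    """Cap VNPFC(J) at a multiple of VNPFC*(A) to block one-time cheats."""
    return min(vnpfc_job, cap_multiple * avg_app)

def reset_host_after_failure(host: dict, app_avgs: dict) -> dict:
    """On an error, timeout, or validation failure: reset the host's
    error rate and set VNPFC*(H, A) to VNPFC*(A) for all apps A."""
    host["error_rate"] = DEFAULT_ERROR_RATE
    host["avg_vnpfc"] = dict(app_avgs)
    return host

# A claim of 1e304 is cut down to 10X the app average.
capped = capped_vnpfc(1e304, 1e12)

# A cherry-picking host with an artificially low average is reset.
host = {"error_rate": 0.001, "avg_vnpfc": {"app1": 1e9}}
reset_host_after_failure(host, {"app1": 1e12, "app2": 3e12})
```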
    230239 
    231240== Implementation == 
    232241 
     242Database changes: 
     243 
     244New table "host_app_version" 
     245{{{ 
     246int host_id; 
     247int app_version_id; 
     248double avg_vnpfc;       // recent average 
     249int njobs; 
     250double total_vnpfc; 
     251}}} 
     252 
     253New fields in "app_version": 
     254{{{ 
     255double avg_vnpfc; 
     256int njobs; 
     257double total_vnpfc; 
     258}}} 
     259 
     260New fields in "app": 
     261{{{ 
     262double min_avg_vnpfc;           // min value of app_version.avg_vnpfc 
     263}}} 
