Changes between Initial Version and Version 1 of RuntimeEstimation


Timestamp:
Apr 8, 2010, 4:06:19 PM (14 years ago)
Author:
davea
Comment:

--

  • RuntimeEstimation

    v1 v1  
     1= Job runtime estimation =
     2
     3== The old system ==
     4
     5Jobs have a FLOP count estimate, wu.rsc_fpops_est.
     6
     7When sending an app version to a host,
     8the scheduler estimates its FLOPS.
     9This is either the CPU benchmark,
     10or a value assigned by the app_plan() function.
     11
     12The app_plan function is expected to predict
     13the performance of an app on all possible hosts.
     14
     15The client maintains a per-project duration correction factor (DCF),
     16which was intended to measure the efficiency of the project's apps,
     17and the systematic error in wu.rsc_fpops_est.
     18DCF was used to scale runtime estimates on both client and server side.
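
As a rough illustration of the old scheme, the sketch below computes a runtime estimate from wu.rsc_fpops_est, a host FLOPS figure, and a per-project DCF. The function name and the exact formula (FLOP count divided by FLOPS, multiplied by DCF) are assumptions made for illustration; the text above only states that DCF scaled the estimate.

```python
def old_runtime_estimate(rsc_fpops_est, host_flops, dcf=1.0):
    """Estimated runtime in seconds under the old scheme (illustrative only).

    rsc_fpops_est: the job's FLOP count estimate (wu.rsc_fpops_est)
    host_flops:    CPU benchmark or app_plan() value, in FLOPS
    dcf:           per-project duration correction factor (assumed multiplicative)
    """
    if host_flops <= 0:
        raise ValueError("host_flops must be positive")
    return dcf * rsc_fpops_est / host_flops


# A job estimated at 1e12 FLOPs on a host rated at 1e9 FLOPS, for a project
# whose jobs have tended to run 1.5x longer than predicted.
estimate = old_runtime_estimate(1e12, 1e9, dcf=1.5)
```

Because dcf is a single number per project, every app of that project gets the same correction, however different their efficiencies are.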
     19
     20Problems with the old system:
     21
     22 * Projects can have lots of apps.  A single DCF does not suffice.
     23 * Projects can't be expected to predict app performance.
     24
     25== The new system ==
     26
     27Projects still have to supply wu.rsc_fpops_est.
     28
     29The new system has a large overlap with [CreditNew the new credit system].
     30In particular, we now maintain:
     31
     32 * A '''host_app_version''' database record
     33   per (host, app version), or per (host, app, resource type) in the case of anonymous platform.
     34   This record includes the average elapsed time per wu.rsc_fpops_est.
     35 * For each app version, a '''pfc_scale''' which approximates the efficiency
     36   of the app version relative to the most efficient version.
     37The app_plan() function now returns peak FLOPS,
     38not the expected actual FLOPS.
     39
     40In the process of selecting an app version for each job,
     41the scheduler estimates its actual FLOPS.
     42This is stored in BEST_APP_VERSION.HOST_USAGE.flops.
     43
     44=== Regular case ===
     45
     46An app version's FLOPS estimate is initially the peak FLOPS.
     47We then look at the host_app_version record.
     48If it exists, and there are sufficient samples, we set
     49{{{
     50estimated_flops = 1/host_app_version.et.avg
     51}}}
     52
     53Otherwise, if app_version.pfc_scale is defined,
     54
     55{{{
     56estimated_flops *= app_version.pfc_scale
     57}}}
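
The regular case can be sketched as a single function. This is a minimal sketch, not the scheduler's code: the HostAppVersion class, the et_n sample counter, and the MIN_SAMPLES threshold are hypothetical names, since the text does not say how many samples count as sufficient.

```python
from dataclasses import dataclass
from typing import Optional

MIN_SAMPLES = 10  # assumption: the threshold for "sufficient samples" is not given in the text


@dataclass
class HostAppVersion:
    et_avg: float  # average elapsed time per unit of wu.rsc_fpops_est (seconds per FLOP)
    et_n: int      # number of samples behind et_avg (hypothetical field)


def estimate_flops_regular(peak_flops: float,
                           hav: Optional[HostAppVersion],
                           pfc_scale: Optional[float]) -> float:
    # Start from the peak FLOPS returned by app_plan().
    estimated_flops = peak_flops
    # Prefer measured per-host statistics when there are enough of them.
    if hav is not None and hav.et_n >= MIN_SAMPLES and hav.et_avg > 0:
        return 1.0 / hav.et_avg
    # Otherwise fall back to the app version's efficiency scale, if defined.
    if pfc_scale is not None:
        estimated_flops *= pfc_scale
    return estimated_flops
```

With neither statistics nor a pfc_scale, the estimate stays at the peak FLOPS.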
     58
     59=== Anonymous platform case ===
     60
     61If the host_app_version record exists and there are sufficient samples,
     62{{{
     63estimated_flops = 1/host_app_version.et.avg
     64}}}
     65
     66Otherwise, we use the estimate supplied by the client.
     67This may be specified in the app_info.xml file.
     68If not, the current client passes the peak FLOPS.
     69
     70Older clients (predating GPU support) don't pass a FLOPS estimate.
     71In this case we use the CPU benchmark.
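
The fallback order for anonymous platform can be sketched the same way. The function and parameter names are hypothetical, and the min_samples default is an assumption; client_flops is None when the client sent no estimate.

```python
def estimate_flops_anonymous(et_avg, n_samples, client_flops,
                             cpu_benchmark_flops, min_samples=10):
    """Pick a FLOPS estimate for an anonymous-platform app version (illustrative).

    et_avg:              average elapsed time per unit of wu.rsc_fpops_est, or None
    n_samples:           number of samples behind et_avg
    client_flops:        estimate passed by the client, or None for older clients
    cpu_benchmark_flops: the host's CPU benchmark, used as the last resort
    min_samples:         assumed threshold for "sufficient samples"
    """
    # 1. Measured statistics from host_app_version, when there are enough.
    if et_avg is not None and et_avg > 0 and n_samples >= min_samples:
        return 1.0 / et_avg
    # 2. The client's own estimate (from app_info.xml, or peak FLOPS).
    if client_flops is not None:
        return client_flops
    # 3. Older clients send nothing: use the CPU benchmark.
    return cpu_benchmark_flops
```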
     72
     73The estimated FLOPS is used to estimate job runtime on the server side.
     74
     75However, the only way to change the client's runtime estimate is by
     76adjusting the wu.rsc_fpops_est that we send to the client.
     77So, in the first case above, we scale wu.rsc_fpops_est by
     78{{{
     79(old estimated flops)/(new estimated flops)
     80}}}
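
A small worked example of this scaling follows. It assumes the client divides the FLOP count it receives by the old FLOPS figure to get a runtime; the function name is hypothetical.

```python
def scale_fpops_est(rsc_fpops_est, old_estimated_flops, new_estimated_flops):
    """Scale wu.rsc_fpops_est by (old estimated flops)/(new estimated flops)."""
    if new_estimated_flops <= 0:
        raise ValueError("new_estimated_flops must be positive")
    return rsc_fpops_est * old_estimated_flops / new_estimated_flops


rsc_fpops_est = 1e12   # FLOP count supplied by the project
old_flops = 1e10       # e.g. the peak FLOPS
new_flops = 2e9        # e.g. 1/host_app_version.et.avg

scaled = scale_fpops_est(rsc_fpops_est, old_flops, new_flops)

# Assumed client behaviour: runtime = received FLOP count / old FLOPS figure.
client_runtime = scaled / old_flops
# Server-side estimate using the new FLOPS figure.
server_runtime = rsc_fpops_est / new_flops
```

Under that assumption, both runtimes come out to the same value (500 seconds here), which is the point of the scaling.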