wiki:AutoFlops

Version 3 (modified by davea, 15 years ago)

Automated estimation of job and app version characteristics

Goals

  • Eliminate the need for projects to supply FLOPs estimates for jobs
  • Eliminate the need for projects to supply FLOPS estimates for app versions (in app_plan())
  • Improve job completion time estimates by using different duration correction factors for different app versions
  • Simplify the credit-granting policy

App units

App units are a project-defined measure of how much computation a completed job did. They are typically a count of iterations of the app's main loop. They should be approximately proportional to FLOPs performed, but it doesn't matter what the proportion is (i.e., you don't have to count the FLOPs in your main loop).

Apps use a new API to report how many app units they've performed (call this right before boinc_exit()):

void boinc_app_units(double x);

This is optional (the default is one). However, if an app's jobs have highly variable size and don't report app units, BOINC will tend to give too little credit for them.
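A minimal sketch of how an app might report app units. The BOINC runtime call is stubbed here (recording the value in a global) so the sketch is self-contained; in a real app you would link against the BOINC API and call boinc_exit() afterward.

```c
#include <assert.h>

/* Hypothetical stand-ins for the real BOINC runtime calls, so this
   sketch compiles on its own. */
static double reported_app_units = 0;
static void boinc_app_units(double x) { reported_app_units = x; }
static void boinc_exit_stub(void) { /* would be boinc_exit(0) */ }

/* An app whose work is n iterations of its main loop.  App units are
   the iteration count; reported once, right before exit. */
static void run_job(int n_iterations) {
    double units_done = 0;
    for (int i = 0; i < n_iterations; i++) {
        /* ... one iteration of the app's main computation ... */
        units_done += 1;
    }
    boinc_app_units(units_done);   /* call right before boinc_exit() */
    boinc_exit_stub();
}
```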

The following API functions are deprecated:

boinc_ops_per_cpu_second()
boinc_ops_cumulative()

Job submission has a new optional argument predicted_app_units: the predicted number of app units the job will take. The default is one. The rsc_fpops_est and rsc_fpops_bound arguments to job creation are deprecated.

The predictions don't have to be exact. In fact, it's OK if they're systematically too high or low, as long as there's a linear correlation. However, if predicted app units are not linearly correlated with actual app units, bad completion time estimates will result.

The server maintains the mean and variance of both actual and predicted app units, and scales prediction by the ratio of the means. It also maintains au_variance, the variance of (actual/predicted), which is used for completion time estimates and bounds.
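The scaling and the ratio variance described above can be sketched as follows; the field names and Welford's running-variance algorithm are assumptions, not the actual server implementation.

```c
#include <assert.h>

/* Scale a job's predicted app units by the ratio of observed means,
   correcting any systematic over- or under-prediction. */
static double scaled_prediction(double predicted_aus,
                                double mean_actual_aus,
                                double mean_predicted_aus) {
    return predicted_aus * (mean_actual_aus / mean_predicted_aus);
}

/* Running variance of (actual/predicted), i.e. au_variance, kept
   incrementally via Welford's algorithm. */
struct au_stats { double n, mean, m2; };

static void au_stats_update(struct au_stats *s, double actual, double predicted) {
    double ratio = actual / predicted;
    s->n += 1;
    double d = ratio - s->mean;
    s->mean += d / s->n;
    s->m2 += d * (ratio - s->mean);
}

static double au_variance(const struct au_stats *s) {
    return s->n > 1 ? s->m2 / (s->n - 1) : 0;
}
```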

Job completion time estimates and bounds

The BOINC client maintains a per-app-version estimate seconds_per_app_unit. The completion time estimate of a job J is

seconds_per_app_unit
* J.predicted_aus
* (app.mean_actual_aus/app.mean_predicted_aus + X*au_stdev)

where X is, say, 2 (meaning that for 95% of jobs, the actual completion time will be less than the estimate).

The completion time bound for a job is the same, but with X=4 or so (meaning that 99.99% of jobs will complete without being aborted due to time limit).
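The estimate and the bound differ only in X, so both fit one function; a sketch, with parameter names taken from the formula above:

```c
#include <assert.h>

/* Completion-time estimate for job J: the formula above, with X = 2
   for the ~95% estimate and X = 4 or so for the abort bound. */
static double completion_time(double seconds_per_app_unit,
                              double predicted_aus,
                              double mean_actual_aus,
                              double mean_predicted_aus,
                              double au_stdev,
                              double x) {
    return seconds_per_app_unit * predicted_aus
         * (mean_actual_aus / mean_predicted_aus + x * au_stdev);
}
```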

Estimating FLOPS per app unit

For credit-granting purposes we want to estimate the number of FLOPs per app unit.

When the scheduler dispatches a job, it knows the numbers of CPUs and GPUs used, as well as the peak hardware FLOPS of each device (for CPUs, this is the Whetstone benchmark result; for GPUs, it's derived from hardware parameters). We define peak_flops_per_sec(J) as the sum, over the devices used, of the device count times that device's peak hardware FLOPS.

The server sends the client peak_flops_per_sec(J). When the client returns a job, it includes a value raw_flops_per_sec(J). This is usually the same as peak_flops_per_sec(J) but it may be less (see note 2 below).

Suppose a job J is executed on a given host using app version V, and that it reports A app units and uses elapsed time T. We then define raw_flops(J) as T * raw_flops_per_sec(J). We define raw_flops_per_au(J) as raw_flops(J)/A.
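The two definitions compose into one expression; a direct transcription:

```c
#include <assert.h>

/* raw_flops(J)        = T * raw_flops_per_sec(J)
   raw_flops_per_au(J) = raw_flops(J) / A
   where T is elapsed time and A the reported app units. */
static double raw_flops_per_au(double elapsed_seconds,
                               double raw_flops_per_sec,
                               double app_units) {
    return elapsed_seconds * raw_flops_per_sec / app_units;
}
```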

If we run jobs on lots of different hosts, the distribution of raw_flops_per_au(J) will have variance, because applications run with different efficiency on different hardware. Suppose an app is memory-intensive. Running on machine A with a fast CPU (say 10 GFLOPS Whetstone) and a slow memory system, it might do 1 GFLOPS (10% efficiency), while on machine B with a slow CPU (say 1 GFLOPS Whetstone) and a fast memory system, it might also do 1 GFLOPS (100% efficiency). A and B would report raw_flops_per_au(J) as 10 GFLOPS and 1 GFLOPS respectively.

In the absence of FLOPS counting, we can't estimate the true FLOPS/app_unit. The best we can do is to use raw_flops_per_au(J) from the most efficient hosts. So what we will do is to use an order statistic, say the 5th percentile value, from the distribution of raw_flops_per_au(J). We'll call this est_flops_per_au(app).
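A sketch of the order statistic, using the nearest-rank method on a sorted copy of the sample (the exact percentile method is an assumption):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Value at the given percentile (0..100) of the sample, e.g. the 5th
   percentile of raw_flops_per_au(J) across hosts. */
static double percentile(const double *v, int n, double pct) {
    double *tmp = malloc(n * sizeof(double));
    memcpy(tmp, v, n * sizeof(double));
    qsort(tmp, n, sizeof(double), cmp_double);
    int idx = (int)(pct / 100.0 * (n - 1) + 0.5);   /* nearest rank */
    double r = tmp[idx];
    free(tmp);
    return r;
}
```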

Notes:

1) Typically GPUs are much less efficient than CPUs. If a large fraction of jobs (say, >95%) are being done by GPUs, then the 5th percentile value won't give us a good estimate. To deal with this we could maintain separate arrays for GPU and CPU app versions, and take the min of the two 5th-percentile values.

2) If an application has only very inefficient app versions, we'll get an inflated estimate of FLOPS/app_unit. Assuming that GPU apps tend to be more inefficient than CPU apps, this will happen with projects that have only a GPU app (e.g., GPUGRID.net). Here's a scheme to address this problem:

a) have the client report raw_flops_per_sec instead of using the server value.

b) pass the client a flag saying that an app has both CPU and GPU versions.

c) If a client has an app version V for an app with only GPU versions, and also has app versions V1...Vn for apps that have both CPU and GPU versions, it uses the average est_flops_per_au of V1...Vn as the raw_flops_per_au(J) it reports for jobs run with V.

The net effect is to propagate efficiency estimates between projects.
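Step c) above amounts to a simple mean over the efficiency estimates the client already has; a sketch:

```c
#include <assert.h>

/* For an app with only GPU versions (note 2c): report, in place of the
   host's own raw_flops_per_au, the mean est_flops_per_au of versions
   V1...Vn belonging to apps that have both CPU and GPU versions. */
static double propagated_flops_per_au(const double *est_flops_per_au, int n) {
    double sum = 0;
    for (int i = 0; i < n; i++)
        sum += est_flops_per_au[i];
    return sum / n;
}
```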

3) If an app's jobs have variance that is not reflected in app units, est_flops_per_au may be way too low (suppose, e.g., that 5% of jobs end immediately). A solution to this problem is to define a "reference job" that's executed on all machines, and use only that job to compute est_flops_per_au.

Credit

Grant each validated job credit proportional to

app.mean_app_units * app.est_flops_per_au

All jobs for a given app get the same credit. A user doesn't get more credit for a longer-than-average job, but it averages out in the end.
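One way to fix the proportionality constant is BOINC's cobblestone scale of 200 credits per GFLOPS-day; a sketch under that assumption:

```c
#include <assert.h>

/* Per-job credit for an app, proportional to its mean work.  The scale
   assumes the cobblestone: 200 credits per day of 1 GFLOPS. */
static double job_credit(double mean_app_units, double est_flops_per_au) {
    double flops = mean_app_units * est_flops_per_au;
    return flops / (1e9 * 86400.0) * 200.0;
}
```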

This removes the incentive for hackers to report exaggerated job statistics (say, app units).

However, it introduces a new form of credit cheating: "cherry picking", i.e. completing only short jobs (based on input files, or on trial execution) and aborting or discarding others.

This can be discouraged by mechanisms that reduce a host's jobs/day quota when a job is aborted or times out.

Anonymous platform notes

Implementation

Protocol

Request message:

  • for each reported job, add <app_units>
  • for each app version, add <seconds_per_app_unit>

Reply message:

  • for each app, <est_flops_per_au>
  • for each app version, <

Database

app.est_flops_per_au

Transitioner

The transitioner maintains, for each app, an array of the N most recent values of raw_flops_per_au(J) for that app. When this array is full, the transitioner computes the 5th percentile value of the array. It then exponentially averages this value with app.est_flops_per_au.
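The exponential-averaging step can be sketched as below; the smoothing weight and the first-sample handling are assumptions, not values from the source.

```c
#include <assert.h>

#define EMA_ALPHA 0.1   /* smoothing weight (assumed) */

/* When the array of recent raw_flops_per_au values fills, fold its
   5th-percentile value into the running app.est_flops_per_au. */
static double update_est_flops_per_au(double est, double pctile5) {
    if (est == 0)
        return pctile5;                        /* first estimate */
    return (1 - EMA_ALPHA) * est + EMA_ALPHA * pctile5;
}
```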

Scheduler

Job completion time estimate:

estimated_time =
	J.predicted_app_units
	* A.predicted_app_units_scale
	* app_version.seconds_per_app_unit   // as reported by host

or

	J.predicted_app_units
	* A.predicted_app_units_scale
	* A.est_raw_flops_per_au
	/ J.flops_estimate					// as estimated by scheduler

Client

For each app version, maintain

  • seconds_per_app_unit: dynamic estimate of elapsed seconds per app unit

When a job completes, update this as an exponential average. This replaces "duration correction factor".
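The update that replaces the duration correction factor might look like this; the smoothing weight is an assumed parameter.

```c
#include <assert.h>

/* Client-side update of an app version's seconds_per_app_unit after a
   job reporting app_units units finishes in elapsed seconds. */
static double update_seconds_per_au(double current, double elapsed,
                                    double app_units, double alpha) {
    double sample = elapsed / app_units;
    if (current == 0)
        return sample;                 /* first completed job */
    return (1 - alpha) * current + alpha * sample;
}
```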