Version 29 (modified by davea, 14 years ago)

New credit system design

Definitions

BOINC estimates the peak FLOPS of each processor. For CPUs, this is the Whetstone benchmark score. For GPUs, it's given by a manufacturer-supplied formula.

Other factors, such as the speed of a host's memory system, affect application performance. So a given job might take the same amount of CPU time on a 1 GFLOPS host as on a 10 GFLOPS host. The efficiency of an application running on a given host is the ratio of actual FLOPS to peak FLOPS.

GPUs typically have a much higher (50-100X) peak FLOPS than CPUs. However, application efficiency is typically lower (very roughly, 10% for GPUs, 50% for CPUs).

Notes:

For our purposes, the peak FLOPS of a device uses single or double precision, whichever is higher.

Credit system goals

Some goals in designing a credit system:

Device neutrality: similar jobs should get similar credit regardless of what processor or GPU they run on.

Project neutrality: different projects should grant about the same amount of credit per host, averaged over all hosts.

Cheat-proof: there should be a bound (say, 1.1) on the ratio of credit granted to credit deserved per user account, regardless of what the user does.

The first credit system

In the first iteration of BOINC's credit system, "claimed credit" was defined as

C1 = H.whetstone * J.cpu_time

There were then various schemes for taking the average or min claimed credit of the replicas of a job, and using that as the "granted credit".

We call this system "Peak-FLOPS-based" because it's based on the CPU's peak performance.

The problem with this system is that, for a given app version, efficiency can vary widely between hosts. In the above example, the 10 GFLOPS host would claim 10X as much credit, and its owner would be upset when it was granted only a tenth of that.
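The problem can be illustrated with a minimal sketch. The function name and the one-hour CPU time are hypothetical; only the formula C1 = H.whetstone * J.cpu_time comes from the text.

```python
def claimed_credit_v1(whetstone_flops: float, cpu_time_s: float) -> float:
    """C1 = H.whetstone * J.cpu_time, in peak FLOPs (no unit scaling applied)."""
    return whetstone_flops * cpu_time_s

# A memory-bound job that takes the same CPU time on both hosts
# (the one-hour figure is an illustrative assumption).
cpu_time_s = 3600.0
slow_claim = claimed_credit_v1(1e9, cpu_time_s)    # 1 GFLOPS host
fast_claim = claimed_credit_v1(10e9, cpu_time_s)   # 10 GFLOPS host

# The faster host claims 10X as much for the same work.
claim_ratio = fast_claim / slow_claim
```

If the granted credit is the minimum of the two claims, the 10 GFLOPS host receives a tenth of what it claimed.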

Furthermore, the credits granted to a given host for a series of identical jobs could vary widely, depending on the host it was paired with by replication. This seemed arbitrary and unfair to users.

The second credit system

We then switched to the philosophy that credit should be proportional to the number of FLOPs actually performed by the application. We added API calls to let applications report this. We call this approach "Actual-FLOPs-based".

SETI@home's application allowed counting of FLOPs, and they adopted this system, adding a scaling factor so that average credit per job was the same as the first credit system.

Not all projects could count FLOPs, however. So SETI@home published their average credit per CPU second, and other projects continued to use benchmark-based credit, but multiplied it by a scaling factor to match SETI@home's average.

This system has several problems:

It doesn't address GPUs properly; projects using GPUs have to write custom code.
Projects that can't count FLOPs still have device neutrality problems.
It doesn't prevent credit cheating when single replication is used.

Goals of the new (third) credit system

Completely automated - projects don't have to change code, settings, etc.

Device neutrality

Limited project neutrality: different projects should grant about the same amount of credit per host-hour, averaged over hosts. Projects with GPU apps should grant credit in proportion to the efficiency of the apps. (This means that projects with efficient GPU apps will grant more credit than projects with inefficient apps. That's OK).

A priori job size estimates

If we have an a priori estimate of job size, we can normalize by this to reduce the variance of various distributions (see below). This makes estimates of the means converge more quickly.

We'll use workunit.rsc_fpops_est as this a priori estimate, and denote it E(J).

(A posteriori estimates of job size may exist also, e.g., an iteration count reported by the app, but aren't cheat-proof; we don't use them.)

Peak FLOP Count (PFC)

This system uses the Peak-FLOPS-based approach, but addresses its problems in a new way.

When a job J is issued to a host, the scheduler computes peak_flops(J) based on the resources used by the job and their peak speeds.

If the job is finished in elapsed time T, we define peak_flop_count(J), or PFC(J) as

PFC(J) = T * peak_flops(J)

Notes:

PFC(J) is not cheat-proof; e.g. cheaters can falsify elapsed time.
We use elapsed time instead of actual device time (e.g., CPU time). If a job uses a resource inefficiently (e.g., a CPU job that does lots of disk I/O) PFC() won't reflect this. That's OK. The key thing is that BOINC allocated the device to the job, whether or not the job used it efficiently.
peak_flops(J) may not be accurate; e.g., a GPU job may take more or less CPU than the scheduler thinks it will. Eventually we may switch to a scheme where the client dynamically determines the CPU usage. For now, though, we'll just use the scheduler's estimate.
For projects (CPDN) that grant partial credit via trickle-up messages, substitute "partial job" for "job". These projects must include elapsed time and result ID in the trickle message.
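As a rough sketch of the PFC definition: the text only says that peak_flops(J) is based on the resources used and their peak speeds, so the usage-weighted sum below, and all parameter names, are assumptions rather than the scheduler's actual formula.

```python
def peak_flops(avg_ncpus: float, cpu_peak_flops: float,
               ngpus: float, gpu_peak_flops: float) -> float:
    """Hypothetical peak_flops(J): usage-weighted sum of device peak speeds."""
    return avg_ncpus * cpu_peak_flops + ngpus * gpu_peak_flops

def peak_flop_count(elapsed_s: float, job_peak_flops: float) -> float:
    """PFC(J) = T * peak_flops(J), using elapsed time rather than CPU time."""
    return elapsed_s * job_peak_flops

# A GPU job using half a 4 GFLOPS CPU and one 1 TFLOPS GPU for 1000 seconds
# (illustrative numbers only).
pfc = peak_flop_count(1000.0, peak_flops(0.5, 4e9, 1.0, 1e12))
```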

By default, the credit for a job J is proportional to PFC(J), but is limited and normalized in the following ways:

Sanity check

If PFC(J) is infinite or is > wu.rsc_fpops_bound, J is assigned a "default PFC" and other processing is skipped. Default PFC is determined as follows:

If min_avg_pfc(A) is defined (see below) then

D = min_avg_pfc(A) * E(J)

Otherwise

D = wu.rsc_fpops_est
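The sanity check might look like the following sketch. The function signature and the returned flag are assumptions; the two default-PFC cases follow the text, with E(J) taken to be wu.rsc_fpops_est.

```python
import math
from typing import Optional, Tuple

def sanity_checked_pfc(pfc: float, rsc_fpops_est: float, rsc_fpops_bound: float,
                       min_avg_pfc: Optional[float]) -> Tuple[float, bool]:
    """Return (PFC to use, whether the default PFC was substituted)."""
    if math.isinf(pfc) or pfc > rsc_fpops_bound:
        if min_avg_pfc is not None:
            # D = min_avg_pfc(A) * E(J)
            return min_avg_pfc * rsc_fpops_est, True
        # D = wu.rsc_fpops_est
        return rsc_fpops_est, True
    return pfc, False
```

When the flag is true, the caller would skip the remaining normalization steps for that job.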

Cross-version normalization

A given application may have multiple versions (e.g., CPU and GPU versions). If jobs are distributed uniformly to versions, all versions should get the same average credit. We adjust the credit per job so that the average is the same for each version.

We maintain the average PFC^mean(V) of PFC(J)/E(J) for each app version V. We periodically compute PFC^mean(CPU) and PFC^mean(GPU), and compute X as follows:

If there are only CPU or only GPU versions, and at least 2 versions are above a sample threshold, X is the average.

If there are both, and at least 1 of each is above a sample threshold, let X be the min of the averages.

If X is defined, then we set

min_avg_pfc(A) = X

This is an estimate of the app's average actual FLOPS.

We also set

Scale(V) = (X/PFC^mean(V))

An app version V's jobs are scaled by this factor.
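A minimal sketch of this computation follows. The Version record, the function names, and the interpretation that "only CPU or only GPU versions" refers to the versions that exist (not only those above the threshold) are assumptions; averages are taken over versions above the sample threshold.

```python
from statistics import mean
from typing import Dict, List, NamedTuple, Optional

class Version(NamedTuple):
    name: str
    resource: str      # "cpu" or "gpu"
    pfc_mean: float    # PFC^mean(V), the average of PFC(J)/E(J)
    n_samples: int

def compute_x(versions: List[Version], threshold: int) -> Optional[float]:
    kinds = {v.resource for v in versions}
    ok = [v for v in versions if v.n_samples > threshold]
    cpu = [v.pfc_mean for v in ok if v.resource == "cpu"]
    gpu = [v.pfc_mean for v in ok if v.resource == "gpu"]
    if len(kinds) == 1:
        # Only CPU or only GPU versions: need at least 2 above threshold.
        return mean(cpu + gpu) if len(ok) >= 2 else None
    if cpu and gpu:
        # Both kinds, at least 1 of each above threshold: min of the averages.
        return min(mean(cpu), mean(gpu))
    return None

def version_scales(versions: List[Version], threshold: int) -> Dict[str, float]:
    """Scale(V) = X / PFC^mean(V); empty if X is undefined."""
    x = compute_x(versions, threshold)
    if x is None:
        return {}
    return {v.name: x / v.pfc_mean for v in versions}

example = [Version("cpu_app", "cpu", 2.0, 200), Version("gpu_app", "gpu", 20.0, 200)]
scales = version_scales(example, threshold=100)
```

In this example the GPU version's PFC is ten times the CPU version's for the same work, so its jobs are scaled by 0.1 and the CPU version's by 1.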

Notes:

Version normalization is only applied if at least two versions are above sample threshold.
Version normalization addresses the common situation where an app's GPU version is much less efficient than the CPU version (i.e. the ratio of actual FLOPs to peak FLOPs is much less). To a certain extent, this mechanism shifts the system towards the "Actual FLOPs" philosophy, since credit is granted based on the most efficient app version. It's not exactly "Actual FLOPs", since the most efficient version may not be 100% efficient.
If jobs are not distributed uniformly among versions (e.g. if SETI@home VLAR jobs are done only by the CPU version) then this mechanism doesn't work as intended. One solution is to create separate apps for separate types of jobs.
Cheating or erroneous hosts can influence PFC^mean(V) to some extent. This is limited by the Sanity Check mechanism, and by the fact that only validated jobs are used. The effect on credit will be negated by host normalization (see below). There may be an effect on cross-version normalization. This could be eliminated by computing PFC^mean(V) as the sample-median value of PFC^mean(H, V) (see below).

Host normalization

The second normalization is across hosts. Assume jobs for a given app are distributed uniformly among hosts. Then the average credit per job should be the same for all hosts. To ensure this, for each app version V and host H we maintain PFC^mean(H, V), the average of PFC(J)/E(J) for jobs completed by H using V.

This yields the host scaling factor

Scale(H) = (PFC^mean(V)/PFC^mean(H, V))

There are some cases where hosts are not sent jobs uniformly:

job-size matching (smaller jobs sent to slower hosts)
GPUGrid.net's scheme for sending some (presumably larger) jobs to GPUs with more processors.

The normalization by E(J) handles this (assuming that wu.rsc_fpops_est is set appropriately).

Notes:

For some apps, the host normalization mechanism is prone to a type of cheating called "cherry picking". A mechanism for defeating this is described below.
The host normalization mechanism reduces the claimed credit of hosts that are less efficient than average, and increases the claimed credit of hosts that are more efficient than average.
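The effect of host scaling can be sketched as follows. The function name and the numbers are hypothetical; the formula is the ratio of the version-wide average to the host's own average of PFC(J)/E(J).

```python
def host_scale(pfc_mean_version: float, pfc_mean_host: float) -> float:
    """Scale(H): version-wide average of PFC/E divided by this host's average."""
    return pfc_mean_version / pfc_mean_host

# A host that is half as efficient as average accumulates twice the PFC per
# unit of estimated work, so its claims are scaled by 0.5.
version_avg = 2.0
inefficient_host_avg = 4.0
scale_h = host_scale(version_avg, inefficient_host_avg)
normalized_claim = inefficient_host_avg * scale_h
```

After scaling, the host's average claim per unit of estimated work matches the version-wide average.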

Computing averages

Computation of averages needs to take into account:

The quantities being averaged may gradually change over time (e.g. average job size may change) and we need to track this. This is done as follows: for the first N samples (N = ~100 for app versions, ~10 for hosts) we take the straight average. After that we use an exponential average (with appropriate alpha for app version and host).

A given sample may be wildly off, and we can't let this mess up the average. Non-first samples are capped at 10 times the current average.
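One way to combine these two rules is sketched below. The class name, the parameter names, and the alpha value in the example are assumptions; the text does not give the exact update formula.

```python
class RunningAverage:
    """Straight average for the first n_straight samples, exponential after."""

    def __init__(self, n_straight: int, alpha: float, cap_factor: float = 10.0):
        self.n_straight = n_straight
        self.alpha = alpha
        self.cap_factor = cap_factor
        self.n = 0
        self.avg = 0.0

    def update(self, x: float) -> float:
        if self.n > 0:
            # Non-first samples are capped at cap_factor times the current average.
            x = min(x, self.cap_factor * self.avg)
        self.n += 1
        if self.n <= self.n_straight:
            self.avg += (x - self.avg) / self.n       # incremental straight average
        else:
            self.avg += self.alpha * (x - self.avg)   # exponential average
        return self.avg

avg = RunningAverage(n_straight=3, alpha=0.5)
history = [avg.update(x) for x in (1.0, 3.0, 1000.0, 8.0, 0.0)]
```

In the example, the outlier 1000.0 is capped at 20.0 (10 times the average of 2.0 at that point) before it is averaged in.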

Anonymous platform

For anonymous platform apps, since we don't reliably know anything about the devices involved, we don't try to estimate PFC.

For each app, we maintain min_avg_pfc(A), the average PFC for the most efficient version of A.

The claimed credit for anonymous platform jobs is

claimed_credit^mean(A)*E(J)

The server maintains host_app_version records for anonymous platform, and it keeps track of elapsed time statistics there. These have app_version_id = -2 for CPU, -3 for NVIDIA GPU, -4 for ATI.

Claimed and granted credit

The claimed FLOPS for a given job J is

F = PFC(J) * S(V) * S(H)

where S(V) and S(H) are the version and host scaling factors Scale(V) and Scale(H) defined above,

and the claimed credit (in Cobblestones) is

C = F*100/86400e9

When replication is used, we take the set of hosts that are not anon platform and not on scale probation (see below). If this set is nonempty, we grant the average of their claimed credit. Otherwise we grant

claimed_credit^mean(A)*E(J)
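The claimed and granted credit rules can be sketched together. The function names and the tuple layout are hypothetical; the fallback value claimed_credit^mean(A)*E(J) is passed in by the caller rather than computed here.

```python
from typing import List, Tuple

COBBLESTONE_SCALE = 100 / 86400e9   # from C = F*100/86400e9

def claimed_flops(pfc: float, s_v: float, s_h: float) -> float:
    """F = PFC(J) * S(V) * S(H)."""
    return pfc * s_v * s_h

def claimed_credit(f: float) -> float:
    """C = F*100/86400e9, in Cobblestones."""
    return f * COBBLESTONE_SCALE

def granted_credit(claims: List[Tuple[float, bool, bool]], fallback: float) -> float:
    """claims holds (claimed credit, is_anon_platform, on_scale_probation) per replica."""
    eligible = [c for c, anon, probation in claims if not anon and not probation]
    if eligible:
        return sum(eligible) / len(eligible)
    return fallback

one_day_at_1_gflops = claimed_credit(claimed_flops(86400e9, 1.0, 1.0))
```

With this constant, one day of computing at 1 GFLOPS with both scale factors equal to 1 yields about 100 Cobblestones.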

Cross-project version normalization

If an application has both CPU and GPU versions, the version normalization mechanism uses the CPU version as a "sanity check" to limit the credit granted to GPU jobs (or vice versa).

Suppose a project has an app with only a GPU version, so there's no CPU version to act as a sanity check. If we grant credit based only on GPU peak speed, the project will grant much more credit per GPU hour than other projects, violating limited project neutrality.

A solution to this: if an app has only GPU versions, then for each version V we let S(V) be the average scaling factor for that resource type among projects that have both CPU and GPU versions. This factor is obtained from a central BOINC server. V's jobs are then scaled by S(V) as above.

Projects will export the following data:

for each app version

app name, platform name, recent average granted credit, plan class, scale factor

The BOINC server will collect these from several projects and will export the following:

for each plan class

average scale factor (weighted by RAC)

We'll provide a script that identifies app versions for GPUs with no corresponding CPU app version, and sets their scaling factor based on the above.

Notes:

The "average scaling factor" is weighted by work done.
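A minimal sketch of the central aggregation, assuming each exported record is reduced to a (plan class, scale factor, RAC) tuple. The function name and record layout are hypothetical, and the export format itself is not specified here.

```python
from typing import Dict, Iterable, Tuple

def average_scale_by_plan_class(
        records: Iterable[Tuple[str, float, float]]) -> Dict[str, float]:
    """RAC-weighted average scale factor for each plan class."""
    totals: Dict[str, Tuple[float, float]] = {}
    for plan_class, scale, rac in records:
        weight, weighted_sum = totals.get(plan_class, (0.0, 0.0))
        totals[plan_class] = (weight + rac, weighted_sum + rac * scale)
    # Plan classes with zero total RAC have no defined average and are omitted.
    return {pc: ws / w for pc, (w, ws) in totals.items() if w > 0}

# Two projects reporting the same plan class (illustrative values).
avg_scales = average_scale_by_plan_class([
    ("cuda", 0.25, 300.0),
    ("cuda", 0.5, 100.0),
])
```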

Cheat prevention

Host normalization mostly eliminates the incentive to cheat by claiming excessive credit (i.e., by falsifying benchmark scores or elapsed time). An exaggerated claim will increase PFC^mean(H, V), causing subsequent credit to be scaled down proportionately.

This means that no special cheat-prevention scheme is needed for single replications; in this case, granted credit = claimed credit.

However, two kinds of cheating still have to be dealt with:

One-time cheats

For example, claiming a PFC of 1e304.

If PFC(J) exceeds some multiple (say, 20) of PFC^mean(V), the host's error rate is set to the initial value, so it won't do single replication for a while, and scale_probation (see below) is set to true.
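This check might be sketched as follows. The record type and function name are hypothetical, and the sketch assumes the comparison uses PFC(J)/E(J), since PFC^mean(V) is an average of normalized values; the initial error rate is left as a parameter because the text does not state it.

```python
from dataclasses import dataclass

@dataclass
class HostRecord:
    error_rate: float
    scale_probation: bool = False

def check_one_time_cheat(host: HostRecord, pfc_norm: float, pfc_mean_v: float,
                         initial_error_rate: float, multiple: float = 20.0) -> bool:
    """Flag the host if its normalized PFC exceeds multiple * PFC^mean(V)."""
    if pfc_norm > multiple * pfc_mean_v:
        host.error_rate = initial_error_rate   # no single replication for a while
        host.scale_probation = True
        return True
    return False

cheater = HostRecord(error_rate=0.001)
flagged = check_one_time_cheat(cheater, 1e304, 2.0, initial_error_rate=0.1)
```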

Cherry picking

Suppose an application has a mix of long and short jobs. If a client intentionally discards (or aborts, or reports errors from) the long jobs, but completes the short jobs, its host scaling factor will become large, and it will get excessive credit for the short jobs. This is called "cherry picking".

The host punishment mechanism doesn't deal effectively with cherry picking.

The following mechanism deals with cherry picking:

For each (host, app version) maintain "host_scale_time". This is the earliest time at which host scaling will be applied.
For each (host, app version) maintain "scale_probation" (initially true).
When sending a job to a host, if scale_probation is true, set host_scale_time to now+X, where X is the app's delay bound.
When a job is successfully validated, and now > host_scale_time, set scale_probation to false.
If a job times out or errors out, set scale_probation to true, max the scale factor with 1, and set host_scale_time to now+X.
When computing claimed credit for a job, if now < host_scale_time, don't use the host scale factor.

The idea is to apply the host scaling factor only if there's solid evidence that the host is NOT cherry picking.

Because this mechanism is punitive to hosts that experience actual failures, it's selectable on a per-application basis (default off).

In addition, to limit the extent of cheating (in case the above mechanism is defeated somehow) the host scaling factor will be min'd with a constant (say, 10).
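The probation rules can be sketched as a small state machine. The record layout, the function names, and the stored `scale` field are hypothetical (in the design the factor is derived from the PFC averages); times are plain numbers in seconds.

```python
from dataclasses import dataclass

@dataclass
class HostAppVersion:
    host_scale_time: float = 0.0   # earliest time host scaling is applied
    scale_probation: bool = True   # initially true
    scale: float = 1.0             # host scale factor (hypothetical stored copy)

def on_send(hav: HostAppVersion, now: float, delay_bound: float) -> None:
    if hav.scale_probation:
        hav.host_scale_time = now + delay_bound

def on_validated(hav: HostAppVersion, now: float) -> None:
    if now > hav.host_scale_time:
        hav.scale_probation = False

def on_failure(hav: HostAppVersion, now: float, delay_bound: float) -> None:
    hav.scale_probation = True
    hav.scale = max(hav.scale, 1.0)   # "max the scale factor with 1", as written
    hav.host_scale_time = now + delay_bound

def effective_scale(hav: HostAppVersion, now: float, cap: float = 10.0) -> float:
    if now < hav.host_scale_time:
        return 1.0                    # host scale factor not used yet
    return min(hav.scale, cap)        # min'd with a constant (say, 10)

hav = HostAppVersion(scale=4.0)
on_send(hav, now=0.0, delay_bound=100.0)
during_probation = effective_scale(hav, now=50.0)
on_validated(hav, now=150.0)
after_probation = effective_scale(hav, now=150.0)
```

A host only benefits from its scale factor after it has returned a validated job later than the delay bound of a job sent while on probation.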

Error rate, host punishment, and turnaround time estimation

Unrelated to the credit proposal, but in a similar spirit.

Due to hardware problems (e.g. a malfunctioning GPU) a host may have a 100% error rate for one app version and a 0% error rate for another. The same applies to turnaround time.

So we'll move the "error_rate" and "turnaround_time" fields from the host table to host_app_version.

The host punishment mechanism is designed to deal with malfunctioning hosts. For each host the server maintains max_results_day. This is initialized to a project-specified value (e.g. 200) and scaled by the number of CPUs and/or GPUs. It's decremented if the client reports a crash (but not if the job was aborted). It's doubled when a successful (but not necessarily valid) result is received.
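The quota adjustment can be sketched as below. The event names, the floor of 1, and the cap at the project-specified initial value are assumptions; the text only states the decrement and doubling rules.

```python
def update_max_results_day(current: int, event: str, initial: int) -> int:
    """Adjust a host's daily quota after a reported result."""
    if event == "crash":
        return max(current - 1, 1)          # assumed floor of 1
    if event == "success":
        return min(current * 2, initial)    # assumed cap at the initial value
    return current                          # e.g. "aborted": unchanged

quota = 200
for _ in range(3):
    quota = update_max_results_day(quota, "crash", initial=200)
recovered = update_max_results_day(quota, "success", initial=200)
```

A malfunctioning host loses quota gradually, while a single successful result restores it quickly.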

This should also be per-app-version, so we'll move "max_results_day" from the host table to host_app_version.

App plan functions

App plan functions no longer have to make a FLOPS estimate. They just have to return the peak device FLOPS.

The scheduler adjusts this, using the elapsed time statistics, to get the app_version.flops_est it sends to the client (from which job durations are estimated).

Job runtime estimates

Unrelated to the credit proposal, but in a similar spirit. The server will maintain ET^mean(H, V), the statistics of job runtimes (normalized by wu.rsc_fpops_est) per host and application version.

The server's estimate of a job's runtime is then

R(J, H) = wu.rsc_fpops_est * ET^mean(H, V)
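As a small sketch of this estimate: a completed job contributes a sample of elapsed time divided by wu.rsc_fpops_est, and the mean of those samples is multiplied by the new job's estimate. The function names and numbers are illustrative assumptions.

```python
def et_sample(elapsed_s: float, rsc_fpops_est: float) -> float:
    """Normalized runtime sample: elapsed time / wu.rsc_fpops_est."""
    return elapsed_s / rsc_fpops_est

def estimated_runtime(rsc_fpops_est: float, et_mean: float) -> float:
    """R(J, H) = wu.rsc_fpops_est * ET^mean(H, V), in seconds."""
    return rsc_fpops_est * et_mean

# A host that took 2000 s for a 1e12-FLOP job is predicted to take
# about 6000 s for a 3e12-FLOP job.
et_mean = et_sample(2000.0, 1e12)
runtime_estimate_s = estimated_runtime(3e12, et_mean)
```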

Implementation

Database changes

New table host_app_version:

int    host_id;
int    app_version_id;		// generalized for anon platform
AVERAGE pfc;
AVERAGE_VAR et;				// elapsed time / wu.rsc_fpops_est
double host_scale_time;
bool scale_probation;
double error_rate;
int max_jobs_per_day;
int n_jobs_today;
AVERAGE_VAR turnaround;

New fields in app_version:

AVERAGE pfc;
double pfc_scale;
double expavg_credit;
double expavg_time;

New fields in app:

double min_avg_pfc;
bool host_scale_check;		// whether to do scale probation
int max_jobs_in_progress;
int max_gpu_jobs_in_progress;
int max_jobs_per_rpc;
int max_jobs_per_day_init;

Scheduler changes

When dispatching an anonymous platform app job, set result.app_version_id to -2/-3/-4 depending on the resource.
Update host_app_version.host_scale_time for app versions for which jobs are being sent and for which scale_probation is set.

Validator changes

To reduce DB access, validator maintains a vector of app_versions. This is appended to by assign_credit_set(). At the start of every validator pass, the pfc and expavg_credit fields of the app versions are saved. Updates are accumulated, and at the end of the validator pass (before sleep()) the incremental changes are written to the DB. This scheme works correctly even with multiple validators per app.
The updating of app_versions is done in such a way that we pick up changes to pfc_scale by the feeder.
The app record is reread at the start of each scan, in case its min_avg_pfc has been changed by the feeder.
check_set() no longer returns credit (the argument is left in place for now).
Update host_app_version.scale_probation in is_valid().
Don't grant credit in is_valid().
Compute and grant credit in handle_wu().

Feeder changes

If we're the "main feeder" (mod = 0, or mod not used), update app_version.pfc_scale and app.min_avg_pfc every 10 minutes.
