Version 5 (modified by davea, 12 years ago)


Validation, credit, and replication

The execution of a job produces:

  • The output files;
  • The amount of CPU time used; this can be used to determine how much credit to grant for the result.

In general, neither of these can be trusted, because:

  • Some hosts have consistent or sporadic hardware problems, typically causing errors in floating-point computation.
  • Some volunteers may maliciously return wrong results; they may even reverse-engineer your application, deciphering and defeating any internal validation mechanism it might contain.
  • Some volunteers may return correct results but falsify the CPU time.

BOINC offers several mechanisms for validating results and determining credit. However, there is no "one size fits all" solution. The choice depends on your requirements and on the nature of your applications (you can use different mechanisms for different applications).

For each of your applications, you must supply two server-side programs:

  • A "validator", which decides whether results are correct;
  • An "assimilator", which handles validated results

BOINC provides examples of each of these, as well as a framework that makes it easy to develop your own.
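At its core, a validator comes down to a pair of predicates: one that judges a single result on its own, and one that decides whether two results agree. A minimal sketch, in which the type and function names are illustrative rather than BOINC's actual API:

```cpp
#include <vector>

// Illustrative stand-in for one completed job instance; BOINC's real
// framework passes richer structures from the database.
struct ResultData {
    std::vector<double> values;   // parsed output of the job
};

// Replication case: decide whether two results from different hosts agree.
// A real validator would first parse each result's output files.
bool results_agree(const ResultData& a, const ResultData& b) {
    return a.values == b.values;  // strict comparison
}

// No-replication case: decide whether a single result is plausible
// on its own (hypothetical sanity check).
bool result_is_plausible(const ResultData& r) {
    return !r.values.empty();
}
```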

No replication

The first option is not to use replication: each job is done only once, and the validator examines results individually.

This approach is useful if you have some way (application-specific) of detecting wrong results with high probability.

The credit question remains. Some possibilities:

  • Grant fixed credit (feasible if your jobs are uniform).
  • Put a cap on granted credit (this still allows cheating, up to the cap).
  • If claimed credit exceeds a threshold, replicate the job.
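The last two options can be combined. A hypothetical policy sketch, with both thresholds invented for illustration:

```cpp
#include <algorithm>

// Hypothetical policy combining the options above: cap the granted
// credit, and flag the job for replication when the claim looks
// suspiciously high. Both thresholds are invented for illustration.
struct CreditDecision {
    double granted;
    bool   replicate;   // re-run the job to check the claim
};

CreditDecision decide_credit(double claimed,
                             double cap = 100.0,
                             double suspicious = 80.0) {
    return CreditDecision{ std::min(claimed, cap), claimed > suspicious };
}
```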


Replication

BOINC supports replication: each job is run on N different hosts, and a result is considered valid if it is returned by a strict majority of them.

Replication also provides a good solution to credit cheating, even for non-uniform apps: grant the average of the claimed credits, first discarding the lowest and highest (if N=2, grant the minimum).
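This credit rule can be sketched as follows, assuming the claims arrive as a simple list of numbers:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Sketch of the credit rule described above: with three or more
// instances, drop the lowest and highest claims and average the rest;
// with two, grant the minimum.
double grant_credit(std::vector<double> claims) {
    if (claims.size() <= 1) return claims.empty() ? 0.0 : claims[0];
    if (claims.size() == 2) return std::min(claims[0], claims[1]);
    std::sort(claims.begin(), claims.end());
    double sum = std::accumulate(claims.begin() + 1, claims.end() - 1, 0.0);
    return sum / (claims.size() - 2);
}
```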

One problem with replication is that different computers do floating-point arithmetic slightly differently. This makes it hard to determine when two results "agree"; two numerically different results may be equally correct.

There are several different ways of dealing with this problem.

Eliminate discrepancies

It may be possible to eliminate numerical discrepancies entirely. To do so, you'll need to select an appropriate compiler, compiler options, and math libraries, and to make sure that your checkpoint files store values at full precision.

This lets you compare results bitwise. However, eliminating discrepancies is difficult and generally reduces your application's performance.
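Once results are bit-identical, the comparison itself is trivial: a byte-for-byte file comparison. A minimal sketch, where the file paths are whatever your validator resolves for each result instance:

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// With numerically identical results, validation reduces to comparing
// output files byte for byte.
bool files_identical(const std::string& path_a, const std::string& path_b) {
    std::ifstream a(path_a, std::ios::binary), b(path_b, std::ios::binary);
    if (!a || !b) return false;   // treat unreadable files as a mismatch
    std::vector<char> da((std::istreambuf_iterator<char>(a)),
                         std::istreambuf_iterator<char>());
    std::vector<char> db((std::istreambuf_iterator<char>(b)),
                         std::istreambuf_iterator<char>());
    return da == db;
}
```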

Fuzzy comparison

If your application is numerically stable (i.e., small discrepancies lead to small differences in the result) you can write a "fuzzy comparison function" for the validator that considers two results as equivalent if they agree within some tolerance.
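A sketch of such a function, using a relative tolerance; the 1e-5 value is only an example, and the right tolerance depends on your application's numerical behavior:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Treat two values as equivalent if they agree within a relative
// tolerance (falling back to absolute tolerance near zero).
bool fuzzy_equal(double x, double y, double rel_tol = 1e-5) {
    double scale = std::max({std::fabs(x), std::fabs(y), 1.0});
    return std::fabs(x - y) <= rel_tol * scale;
}

// Two results agree if they have the same length and every pair of
// corresponding values is within tolerance.
bool fuzzy_results_agree(const std::vector<double>& a,
                         const std::vector<double>& b,
                         double rel_tol = 1e-5) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); i++)
        if (!fuzzy_equal(a[i], b[i], rel_tol)) return false;
    return true;
}
```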

Homogeneous replication

With this variant of replication, once an instance of a job has been sent to a host, additional instances are sent only to hosts that are "numerically equivalent" (i.e. that will return bit-identical results).

The notion of "numerical equivalence" depends on your application and how it was compiled. BOINC supplies two pre-defined equivalence relations, "coarse" and "fine". Use either of these ("coarse" is preferable, if it's fine enough for your app) or define your own if needed.

Adaptive replication

This is a refinement of the replication policy. It randomly decides whether to replicate jobs, based on the measured error rate of hosts. If the first instance of a job is sent to a host with a low error rate, then with high probability no further instances will be sent.

Adaptive replication is independent of the comparison policy: you can use it with bitwise comparison, fuzzy comparison, or homogeneous replication.
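As an illustration of the policy's shape (the 5% cutoff and linear ramp are invented for this sketch; BOINC's actual computation differs):

```cpp
// Replicate with a probability that grows with the host's measured
// error rate. Hosts at or above the cutoff are always replicated;
// reliable hosts are replicated only rarely.
double replication_probability(double host_error_rate) {
    if (host_error_rate >= 0.05) return 1.0;  // untrusted: always replicate
    return host_error_rate / 0.05;            // trusted: replicate rarely
}

// The uniform [0,1) draw is passed in so the decision is deterministic
// and testable; a real scheduler would draw it from its RNG.
bool should_replicate(double host_error_rate, double random01) {
    return random01 < replication_probability(host_error_rate);
}
```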