
Validation, credit, and replication

A computational result consists of:

  • The result itself (the output files)
  • The amount of CPU time used; this may determine how much credit is granted for the result.

In general, neither of these can be trusted:

  • Some hosts have hardware problems (consistent or sporadic) that cause errors (usually in floating-point computation).
  • Some volunteers may maliciously return wrong results; they may even reverse-engineer your application, deciphering and defeating any sort of internal validation mechanism you might include in it.
  • Some volunteers may return correct results but falsify the CPU time.

BOINC offers several mechanisms for validating results and credit. However, there is no "one size fits all" solution. The choice depends on your requirements, and on the nature of your applications (you can use different mechanisms for different applications).

For each of your applications, you must supply two server-side programs:

  • A "validator", which decides whether results are correct;
  • An "assimilator", which handles validated results

[Figure: a different validator and assimilator for each application]

BOINC provides examples of each of these, as well as a framework that makes it easy to develop your own.
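
As a concrete starting point, here is a sketch of the three functions a project-specific validator supplies to the framework. The header name and function signatures follow the sample validators in the BOINC source tree and may differ between BOINC versions; the parsing and comparison logic is left as comments.

    // Sketch of a project-specific validator built on the BOINC validator
    // framework. Assumes the BOINC server headers; not self-contained.
    #include "validate_util2.h"

    // Read this result's output file(s) and store whatever is needed for
    // comparison in 'data' (e.g., a pointer to a parsed summary structure).
    int init_result(RESULT& result, void*& data) {
        // ... open and parse the result's output files, set data ...
        return 0;
    }

    // Decide whether two results are equivalent.
    int compare_results(RESULT& r1, void* data1, RESULT const& r2, void* data2, bool& match) {
        // ... compare the two parsed summaries (bitwise or fuzzily) ...
        match = true;
        return 0;
    }

    // Free whatever init_result() allocated.
    int cleanup_result(RESULT const& result, void* data) {
        // ... delete the parsed summary ...
        return 0;
    }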

No replication

The first option is to not use replication. Each job gets done once. The validator examines single results.

This approach is useful if you have some way (application-specific) of detecting wrong results with high probability.

The credit question remains. Some possibilities (combined in the sketch after this list):

  • Grant fixed credit (feasible if your jobs are uniform).
  • Grant the claimed credit up to a cap (this still allows cheating, up to the cap).
  • If claimed credit exceeds a threshold, replicate the job.
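
For illustration, the following sketch combines these policies for a single, unreplicated result. It is not part of BOINC; the function, structure, and constant names are made up, and the numbers are placeholders.

    #include <algorithm>

    // Hypothetical per-result credit policy for a project without replication.
    // FIXED_CREDIT, CREDIT_CAP and REPLICATE_THRESHOLD are project-chosen values.
    const double FIXED_CREDIT = 50;
    const double CREDIT_CAP = 200;
    const double REPLICATE_THRESHOLD = 500;

    struct CreditDecision {
        double granted;    // credit to grant now
        bool replicate;    // true: issue another instance of the job instead
    };

    CreditDecision decide_credit(double claimed, bool jobs_are_uniform) {
        if (claimed > REPLICATE_THRESHOLD) {
            // Suspicious claim: re-run the job before granting anything.
            return {0.0, true};
        }
        if (jobs_are_uniform) {
            return {FIXED_CREDIT, false};
        }
        // Otherwise grant the claim, but capped; note that this still allows
        // cheating up to the cap.
        return {std::min(claimed, CREDIT_CAP), false};
    }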

Replication

BOINC supports replication: each job gets done on N different hosts, and a result is considered valid if a strict majority of hosts return it.
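
The following is a minimal sketch of the underlying idea: given the replicas' results and some equivalence test, look for a result that a strict majority of the replicas agree with. It is purely illustrative; the BOINC validator framework performs this kind of grouping for you, and the names here are made up.

    #include <vector>

    // Return the index of a result that a strict majority of replicas agree
    // with, or -1 if there is none. 'equivalent' is the comparison function
    // (bitwise or fuzzy).
    template <typename R, typename Equiv>
    int find_majority_result(const std::vector<R>& results, Equiv equivalent) {
        for (size_t i = 0; i < results.size(); i++) {
            size_t agree = 0;
            for (size_t j = 0; j < results.size(); j++) {
                if (equivalent(results[i], results[j])) agree++;  // counts i itself
            }
            if (2 * agree > results.size()) return (int)i;  // strict majority
        }
        return -1;
    }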

Replication also provides a good solution to credit cheating, even for non-uniform apps: grant the average of the claimed credits, first discarding the lowest and highest (if N=2, grant the minimum of the two).
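
A minimal sketch of this credit rule (the function name is made up):

    #include <algorithm>
    #include <numeric>
    #include <vector>

    // Given the claimed credits of the replicas of one job: if N == 2, grant
    // the minimum; if N > 2, discard the lowest and highest claims and grant
    // the average of the rest.
    double granted_credit(std::vector<double> claimed) {
        if (claimed.empty()) return 0;
        if (claimed.size() == 1) return claimed[0];
        std::sort(claimed.begin(), claimed.end());
        if (claimed.size() == 2) return claimed[0];   // the minimum of the two
        double sum = std::accumulate(claimed.begin() + 1, claimed.end() - 1, 0.0);
        return sum / (claimed.size() - 2);
    }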

One problem with replication is that there are discrepancies in the way different computers do floating point math. This makes it hard to determine when two results "agree"; two different results may be equally correct.
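
One source of such discrepancies is that floating-point arithmetic is not associative, so results depend on the order of operations chosen by the compiler and hardware. A small self-contained illustration:

    #include <cstdio>

    // Reassociating a sum changes the bits of the result: (a + b) + c and
    // a + (b + c) are mathematically equal but numerically different here.
    int main() {
        double a = 1e16, b = -1e16, c = 1.0;
        double x = (a + b) + c;   // 1.0
        double y = a + (b + c);   // 0.0: c is lost when added to b
        printf("%.17g %.17g bitwise equal: %d\n", x, y, x == y);
        return 0;
    }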

There are several different ways of dealing with this problem.

Eliminate discrepancies

By selecting the right compiler, compiler options, and math libraries, it may be possible to eliminate numerical discrepancies. This lets you do bitwise comparison of results. However, it is difficult and generally reduces the performance of your application.

Fuzzy comparison

If your application is numerically stable (i.e., small discrepancies lead to small differences in the result) you can write a "fuzzy comparison function" for the validator that considers two results as equivalent if they agree within some tolerance.
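
For example, a validator might compare the numbers in two output files pairwise with a tolerance test like the following (a sketch; the tolerances are placeholders and must be chosen per project):

    #include <cmath>

    // Two values "agree" if they are within an absolute or relative tolerance.
    const double ABS_TOL = 1e-9;
    const double REL_TOL = 1e-6;

    bool values_agree(double a, double b) {
        double diff = std::fabs(a - b);
        if (diff <= ABS_TOL) return true;
        return diff <= REL_TOL * std::fmax(std::fabs(a), std::fabs(b));
    }

Two results would then be considered equivalent if every pair of corresponding values agrees.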

Homogeneous replication

Floating-point discrepancies arise mainly between different platforms (different operating systems, processor types, and math libraries); hosts with the same platform generally compute identical results. BOINC's homogeneous redundancy feature addresses this: it divides hosts into classes that are expected to be numerically equivalent, and sends all instances of a given job to hosts of a single class, so their results can be compared bitwise.

In summary, the main options are:

  • No replication; possibly bound the granted credit.
  • Replication with exact (bitwise) matching. This requires either exact computation on all hosts (as in the SixTrack example: careful choice of compiler, compiler options such as disabling extended-precision floating-point hardware, and numerical libraries) or homogeneous redundancy, together with exact checkpointing so that a result does not depend on where the computation was interrupted and resumed.
  • Replication with fuzzy matching.