BOINC's default replication policy replicates every job, even if one of the hosts it is sent to is known to be highly reliable. The overhead of replication is high: because each job runs at least twice, at least 50% of total CPU time is spent checking validity.
Adaptive replication is an optional policy that avoids replicating a job if it has been sent to a highly reliable host. The goal of this policy is to provide a target level of confidence with minimal overhead - perhaps only 5% or 10% of total CPU time.
BOINC maintains an estimate E(H) of host H's recent error rate, updated as follows:
- It is initialized to 0.1.
- It is multiplied by 0.95 when H reports a correct (replicated) result.
- It is incremented by 0.1 when H reports an incorrect (replicated) result.
Thus, it takes a long time to earn a good reputation and a short time to lose it.
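The asymmetry of this update rule can be made concrete with a short Python sketch. It is an illustration only, not BOINC's scheduler code: the names `update_error_rate` and `results_until_below` are hypothetical, and any clamping the real scheduler may apply to the estimate is not modelled.

```python
INITIAL_ERROR_RATE = 0.1   # starting value of E(H)
DECAY = 0.95               # multiplier applied after a correct replicated result
PENALTY = 0.1              # increment applied after an incorrect replicated result


def update_error_rate(e, correct):
    """Return the new estimate E(H) after one validated (replicated) result."""
    return e * DECAY if correct else e + PENALTY


def results_until_below(threshold, e=INITIAL_ERROR_RATE):
    """Count the consecutive correct results needed before E(H) drops below threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    n = 0
    while e >= threshold:
        e = update_error_rate(e, correct=True)
        n += 1
    return n
```

Under these assumptions, `results_until_below(0.05)` shows that a new host needs 14 consecutive correct results to bring its estimate from 0.1 to below 0.05. A single incorrect result at that point adds 0.1, after which 22 more consecutive correct results are needed to get back below 0.05.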
The adaptive replication policy is as follows:
- Each job is initially marked as unreplicated.
- On each request, the scheduler decides whether to trust the host as follows:
  - If E(H) > A, where A is a fixed error-rate threshold, don't trust the host.
  - Otherwise, trust the host with probability 1 - sqrt(E(H)/A).
- If we decide to trust the host, preferentially send it unreplicated jobs.
- Otherwise, preferentially send it replicated jobs. If we have to send it an unreplicated job, mark it as replicated and create new instances accordingly.
In the current code base (as of r18056), A is hardcoded to 0.05 as ER_MAX in sched_send.cpp.
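The trust decision can be sketched in Python as follows. This is a minimal illustration of the rule stated above, not the code in sched_send.cpp; the function names `trust_probability` and `is_trusted` and the `rng` parameter are assumptions made for the example.

```python
import math
import random

ER_MAX = 0.05  # threshold A; the value this document reports for sched_send.cpp


def trust_probability(error_rate, threshold=ER_MAX):
    """Probability of trusting a host with the given error-rate estimate E(H)."""
    if error_rate > threshold:
        return 0.0
    return 1.0 - math.sqrt(error_rate / threshold)


def is_trusted(error_rate, rng=random, threshold=ER_MAX):
    """Make one randomized trust decision for a scheduler request."""
    return rng.random() < trust_probability(error_rate, threshold)
```

With A = 0.05, a host whose estimate is 0.0125 is trusted about half the time, since sqrt(0.0125/0.05) = 0.5. A new host starting at 0.1 is never trusted, and the probability is also 0 when E(H) equals A exactly, so a host is only sometimes trusted once its estimate falls strictly below the threshold.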
Using adaptive replication
To use adaptive replication for a given app:
- Set app.target_nresults to 2 in the database.
- Create jobs with target_nresults=1 and min_quorum=1; i.e., include
  <target_nresults>1</target_nresults>
  <min_quorum>1</min_quorum>
  in the input template file.
Implementation notes
- Add a "target_nresults" field to the app table. The default is zero (the app doesn't use adaptive replication).
- Decide whether to trust the host as described above.
- If we send an unreplicated job (i.e., target_nresults=1 and app.target_nresults>1) to an untrusted host, set wu.target_nresults = app.target_nresults and flag the WU for transitioning.
- Don't update host.error_rate for unreplicated results (i.e., wu.target_nresults=1 and app.target_nresults>1).
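The last two rules can be sketched in Python. This is an illustration under stated assumptions rather than the scheduler's actual code: the `App` and `Workunit` classes are simplified stand-ins for the database records, `needs_transition` is a hypothetical flag standing in for however the WU is flagged for transitioning, and the function names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class App:
    target_nresults: int = 0  # 0 means the app does not use adaptive replication


@dataclass
class Workunit:
    target_nresults: int = 1
    needs_transition: bool = False  # hypothetical stand-in for flagging the WU


def is_unreplicated(wu, app):
    """A job is unreplicated if it asks for one result while its app asks for more."""
    return wu.target_nresults == 1 and app.target_nresults > 1


def on_send(wu, app, host_trusted):
    """Escalate an unreplicated job to full replication when the host is untrusted."""
    if is_unreplicated(wu, app) and not host_trusted:
        wu.target_nresults = app.target_nresults
        wu.needs_transition = True


def should_update_error_rate(wu, app):
    """Only results of replicated jobs feed back into the host's error rate."""
    return not is_unreplicated(wu, app)
```

For example, with `App(target_nresults=2)` and a fresh `Workunit()`, calling `on_send` with a trusted host leaves the job unreplicated and its result does not affect the host's error rate. Calling it with an untrusted host raises `wu.target_nresults` to 2 and sets the flag, after which the result is validated against a replica and does count.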