Changes from Version 1 of JobReplication

Author:
KSMarksPsych (IP: 67.149.82.49)
Timestamp:
04/20/07 12:40:58 (3 years ago)
  • JobReplication

    v v1  
     1= Redundancy and errors = 
     2         
     3A BOINC 'result' represents one instance of a computation, which may not have been performed yet. Typically, a BOINC server sends 'results' to clients, and the clients perform the computation and reply to the server. But many things can happen to a result: 
     4 
     5    * The client computes the result correctly and returns it. 
     6    * The client computes the result incorrectly and returns it. 
     7    * The client fails to download or upload files. 
     8    * The application crashes on the client. 
     9    * The client never returns anything because it breaks or stops running BOINC. 
     10    * The scheduler isn't able to send the result because it requires more resources than any client has.  
     11 
     12BOINC provides a form of redundant computing in which each computation is performed on multiple clients, the results are compared, and a result is accepted only when a 'consensus' is reached. In some cases new results must be created and sent. 
     13 
     14BOINC manages most of the details; however, there are two places where the application developer gets involved: 
     15 
     16    * '''Validation:''' This performs two functions. First, when a sufficient number (a 'quorum') of successful results have been returned, it compares them and sees if there is a 'consensus'. The method of comparing results (which may need to take into account platform-varying floating point arithmetic) and the policy for determining consensus (e.g., best two out of three) are supplied by the application. If a consensus is reached, a particular result is designated as the 'canonical' result. Second, if a result arrives after a consensus has already been reached, the new result is compared with the canonical result; this determines whether the user gets credit. 
     17    * '''Assimilation:''' This is the mechanism by which the project is notified of the completion (successful or unsuccessful) of a work unit. It is performed exactly once per work unit. If the work unit was completed successfully (i.e. if there is a canonical result), the project-supplied function reads the output file(s) and handles the information, e.g. by recording it in a database. If the work unit failed, the function might write an entry in a log, send an email, etc.  
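
The validation step described above can be sketched roughly as follows. This is only an illustration, not BOINC's validator interface: the `Result` class, `outputs_agree`, `find_canonical`, `grants_credit`, and the relative tolerance are all hypothetical names and values chosen for the example. It assumes each result's output is a list of floats and that the consensus policy is "at least `min_quorum` results agree".

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    """Hypothetical stand-in for one returned result and its output values."""
    result_id: int
    output: List[float]


def outputs_agree(a: Result, b: Result, rel_tol: float = 1e-6) -> bool:
    """Fuzzy comparison, to tolerate platform-varying floating point arithmetic."""
    return len(a.output) == len(b.output) and all(
        math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a.output, b.output)
    )


def find_canonical(results: List[Result], min_quorum: int) -> Optional[Result]:
    """Return the first result that at least min_quorum results agree with
    (counting itself), or None if no consensus exists yet."""
    if len(results) < min_quorum:
        return None  # not enough successful results to attempt validation
    for candidate in results:
        matches = sum(1 for other in results if outputs_agree(candidate, other))
        if matches >= min_quorum:
            return candidate
    return None


def grants_credit(late: Result, canonical: Result) -> bool:
    """A result arriving after consensus is compared only with the canonical result."""
    return outputs_agree(late, canonical)
```

With `min_quorum = 2` and three results, this policy behaves like "best two out of three": any two agreeing results are enough to pick a canonical result.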
     18 
     19---- 
     20 
     21In the following example, the project creates a workunit with 
     22min_quorum = 2 
     23target_nresults = 3 
     24max_delay = 10 
     25 
     26BOINC automatically creates three results, which are sent at various times. At time 8, two successful results have returned so the validator is invoked. It finds a consensus, so the work unit is assimilated. At time 10 result 3 arrives; validation is performed again, this time to check whether result 3 gets credit. 
     27 
     28{{{ 
     29time        0   1   2   3   4   5   6   7   8   9   10  11  12  13  14 
     30 
     31            created                          validate; assimilate 
     32WU          x                                x  x 
     33                created sent            success 
     34result 1        x       x---------------x 
     35                created sent                success 
     36result 2        x       x-------------------x 
     37                created     sent                    success 
     38result 3        x           x-----------------------x 
     39}}} 
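
The timing in this example can be reproduced with a small sketch. It assumes that the validator is first invoked when the `min_quorum`-th successful result has been returned, and that every later success triggers another validation pass for credit. The function name and the list of return times (read off the diagram) are illustrative only.

```python
from typing import List, Optional


def first_validation_time(success_times: List[int], min_quorum: int) -> Optional[int]:
    """Time at which the min_quorum-th successful result has been returned,
    or None if there are not yet enough successes."""
    ordered = sorted(success_times)
    if len(ordered) < min_quorum:
        return None
    return ordered[min_quorum - 1]


# Return times of results 1, 2 and 3, as read from the diagram above.
success_times = [7, 8, 10]

quorum_time = first_validation_time(success_times, min_quorum=2)

# Successes arriving after the quorum are validated again, only to decide credit.
late_arrivals = [t for t in success_times if t > quorum_time]
```

Here `quorum_time` is 8 and `late_arrivals` contains only time 10, matching the description of result 3.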
     40 
     41---- 
     42 
     43In the next example, result 2 is lost (i.e., there's no reply to the BOINC scheduler). When result 3 arrives a consensus is found and the work unit is assimilated. At time 13 the scheduler 'gives up' on result 2 (this allows it to delete the canonical result's output files, which are needed to validate late-arriving results). 
     44 
     45{{{ 
     46time        0   1   2   3   4   5   6   7   8   9   10  11  12  13  14 
     47 
     48            created                                  validate; assimilate 
     49WU          x                                        x  x 
     50                created sent            success 
     51result 1        x       x---------------x 
     52                created sent    lost                            giveup 
     53result 2        x       x--------                               x 
     54                created     sent                    success 
     55result 3        x           x-----------------------x 
     56}}} 
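
The give-up time in this diagram is consistent with a simple deadline rule. The sketch below assumes that the scheduler gives up on a result once `max_delay` time units have passed since it was sent (result 2 is sent at time 3, and 3 + 10 = 13), and that this example uses the same `max_delay = 10` as the first one. The function names are hypothetical.

```python
def give_up_time(sent_time: int, max_delay: int) -> int:
    """Assumed rule: a result is abandoned max_delay time units after it was sent."""
    return sent_time + max_delay


def is_given_up(sent_time: int, max_delay: int, now: int, returned: bool) -> bool:
    """True once the deadline has passed without any reply from the client."""
    return (not returned) and now >= give_up_time(sent_time, max_delay)


# Result 2 in the diagram above: sent at time 3, never returned.
deadline = give_up_time(sent_time=3, max_delay=10)
```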
     57 
     58---- 
     59 
     60In the next example, result 2 returns an error at time 5. This reduces the number of outstanding results to 2; because target_nresults is 3, BOINC creates another result (result 4). A consensus is reached at time 9, before result 4 is returned. 
     61 
     62{{{ 
     63time        0   1   2   3   4   5   6   7   8   9   10  11  12  13  14 
     64 
     65            created                              validate; assimilate 
     66WU          x                                    x  x 
     67                created sent            success 
     68result 1        x       x---------------x 
     69                created sent    error 
     70result 2        x       x-------x 
     71                created     sent                success 
     72result 3        x           x-------------------x 
     73                                 created     sent           success 
     74result 4                         x   x----------------------x 
     75}}} 
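
The replacement logic in this example can be approximated as follows. This is a simplified sketch and not BOINC's actual rule: it assumes that every result which has not failed (unsent, in progress, or successful) counts toward `target_nresults`, and that one new result is created for each shortfall. The state names and the function are hypothetical.

```python
from typing import List


def results_to_create(states: List[str], target_nresults: int) -> int:
    """Number of new results needed so that non-failed results reach target_nresults.

    states holds one entry per existing result:
    'unsent', 'in_progress', 'success' or 'error'.
    """
    viable = sum(1 for state in states if state != "error")
    return max(0, target_nresults - viable)


# Situation at time 5 in the diagram above: result 2 has just returned an error.
states_at_time_5 = ["in_progress", "error", "in_progress"]
needed = results_to_create(states_at_time_5, target_nresults=3)
```

Here `needed` is 1, which corresponds to the creation of result 4. Once result 4 exists, the same call returns 0 and no further results are created.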
