| 64 | |
| 65 | == Batch scheduling == |
| 66 | |
| 67 | The second requirement of science portals is support for '''batches''': |
| 68 | groups of jobs that have value only when all of the jobs have been completed. |
| 69 | The interval between batch submission and completion is called '''makespan'''. |
| 70 | |
| 71 | === Minimizing makespan === |
| 72 | |
| 73 | BOINC's scheduler should be modified to reduce makespan. |
| 74 | The scheduler has control over two things: |
| 75 | |
| 76 | * what hosts to send which job instances to (and when to create new instances) |
| 77 | * what deadline to assign to each job |
| 78 | |
| 79 | A variety of techniques are possible: |
| 80 | |
| 81 | * Use slower hosts for the first jobs of a batch, |
| 82 | with the expectation that the host will finish one job |
| 83 | in the duration of the batch |
| 84 | (and by extension, very slow hosts should possibly |
| 85 | not get any jobs for some batches). |
| 86 | * Use faster, more reliable hosts for the last jobs of a batch. |
| 87 | * Use increased replication for the last jobs of a batch |
| 88 | to reduce the expected minimum turnaround. |
| 89 | |
| 90 | These techniques can be combined into policies. |
| 91 | To study the relative performance of these policies, |
| 92 | we can develop a simulator that takes as input a description |
| 93 | of a real volunteer host population (e.g., SETI@home) |
| 94 | and the parameters of a random batch arrival process, |
| 95 | and computes the statistics of makespan, |
| 96 | and perhaps the difference between predicted and actual makespan (see below). |
| 97 | |
| 98 | === Estimating makespan === |
| 99 | |
| 100 | Given the description of a possible new batch, |
| 101 | the system should be able to provide a reasonably accurate estimate |
| 102 | of its makespan (perhaps with error bars). |
| 103 | The computation of this estimate must take into account |
| 104 | |
| 105 | * The computing resource (i.e. the distributions of speed and availability |
| 106 | of the volunteer hosts, and their number). |
| 107 | * The existing workload (batches currently in progress, and their state) |
| 108 | * The parameters of the batch |
| 109 | * The computing share and debt of the requesting user |
| 110 | * The batch scheduling policy |
| 111 | |
| 112 | In addition, the system should provide users with a web-based interface |
| 113 | for viewing batches in progress, |
| 114 | showing their fraction done, |
| 115 | an updated estimate of their completion time, |
| 116 | information about errors, |
| 117 | and a button for aborting the batch. |
| 118 | |
| 119 | == Job DAGs == |
| 120 | |
| 121 | Batches are a special case of Directed Acyclic Graphs (DAGs) of jobs. |
| 122 | The nodes in such a graph represent jobs; |
| 123 | there is an edge from job A to job B if A must be completed |
| 124 | before B can be started. |
| 125 | |
| 126 | A batch can be viewed as a graph consisting of |
| 127 | a "start node" fanning out to N job nodes, |
| 128 | then fanning back in to a "finish node". |
| 129 | |
| 130 | Science portals may offer the ability to submit collections of work |
| 131 | that can be described as DAGs but not as batches: |
| 132 | for example, a collection of units, each of which consists of |
| 133 | running program A and using its output as input to program B. |
| 134 | |
| 135 | (Note: BOINC's wrapper mechanism, which allows multiple applications |
| 136 | to be run in sequence as part of a BOINC job, |
| 137 | supports some simple cases, but it not general because it |
| 138 | requires dependent jobs to run on a single host). |
| 139 | |
| 140 | As an extension of the work described above, |
| 141 | we could support DAGs rather than just batches. |