[[PageOutline]] = Scheduling for multi-user projects = == Computing share == Demand for a multi-user project's computing power may exceed supply, in which case we need a mechanism for dividing the supply among users. Assume that every job is associated with a particular user, and that each such user has a numeric '''quota''' representing the fraction of the computing resource they should get. What are the precise semantics of quotas? Ideally the mechanism should accommodate two classes of scientists: * '''Throughput-oriented''': those who have an infinite stream of jobs, want maximum throughput, and don't care about the turnaround times of individual jobs. * '''Latency-oriented''': those who occasionally have large batches of jobs, and who want the entire batch to be finished fast. Such a user, for example, might not submit any jobs for a month, then submit a batch that takes a day to complete using the entire resource. === Scheduling batches === Scheduling batches involves two decisions: * which jobs to send to which hosts * what deadlines to assign to jobs Arnaud Legrand suggested the following general technique for making these decisions at any point during the batch's execution. * for each host, estimate the delay until it can start processing jobs, and the rate at which it can process jobs * find the minimum time T at which, if all hosts begin processing jobs as soon as possible, the batch will be completed * Let H be the set of hosts which complete at least one job by time T. * Send jobs only to hosts in H, and give them deadline T. This algorithm can be repeated as jobs complete or time out, and as new hosts arrive. == Prioritizing batches == Goals: * Give short batches priority over long batches * But don't let a long stream of short batches starve long batches * Enforce quotas over the long term For each user U, we maintain a "logical start time" LST(U). LST(U) is always at least the current time. When U submits a batch B, LST(U) is incremented by R / share(U) where R is the expected runtime of B given the project's entire resource. Each batch B has a "logical end time" LET(B). This is set when the batch is submitted as LST(U) + R Priority is given to batches for which LET(B) is least. === Prioritizing a user's batches === With the above design, the batches of a particular user are processed in FCFS (possibly with overlap). It's possible to refine the mechanism to let users prioritize their own batches. Example: suppose a user U has a long batch A in progress, with LST(A)=x, and they submit a short batch B, and they want B to have priority over A. Then: let LST(B) = x and add R(B)/share(U) to both LST(A) and LET(A). In effect, B inherits A's initial global position, and A's position is moved back accordingly. === Notes === The scheme uses a-priori batch size estimates. These may be wildly wrong, perhaps intentionally. We need a way to adjust logical start and end times while batches are done (or even in progress) to compensate for bad initial estimates. For throughput-oriented users (i.e., infinite streams of single jobs) the scheme should handle them OK by viewing each job as a 1-element batch. == Combining multiple scheduling types == BOINC potentially supports the following types of work, each with its own scheduling mechanism: * Job-stream: the input is a stream of jobs; the goal is to maximize throughput and eventually finish all jobs. * Batch: described above. * Locality: a variant of job-stream in which jobs are assigned according to the files resident on the client. * Co-scheduling: like batch, except all jobs in each batch must run simultaneously (e.g., MPI applications). A single project (such as a portal) may have work of some or all types. How should we handle multiple scheduling types - in particular, how can we maintain resource shares in their presence? One simple approach is to maintain, for each scheduling type T, the least LET(T) of any unsent job of type T. Then use this top-level policy: {{{ for each scheduling type T in order of increasing LET(T) send jobs of type T }}} How to maintain LET(T): * Job-stream: maintain in the feeder (consider only jobs in the cache) * Locality: constraint: only 1 user can use locality scheduling * Batch: maintain in batch module * Co-scheduling: maintain in co-scheduling module