Changes between Version 1 and Version 2 of JobPrioritization

09/18/12 18:05:34
= Server job scheduling =

This document describes proposed changes to server scheduling policies:

 * Feeder enumeration order
 * Of the jobs in shared memory, which to send to a given host
 * Which app versions to use
 * What deadlines to assign

These policies work in conjunction with [PortalFeatures batch scheduling policies].

== Quality of service types ==

We seek to handle the following types of QoS:
 * Non-batch, throughput-oriented, with a fixed latency bound.
 * Batches with long deadlines (greater than the turnaround time of almost all hosts).
 * Batches to be completed "as fast as possible" (AFAP), with no a priori deadline.
 * Batches with a short deadline.

== Goals ==

The goals of the policies include:
 * Support the QoS features.
 * Avoid assigning "tight" deadlines unnecessarily, because:
   * doing so may make it impossible to assign tight deadlines to jobs that actually need them;
   * tight deadlines often force the client to preempt other jobs, which irritates some volunteers.
 * Avoid long delays between the completion of a job instance and its validation. These delays irritate volunteers and increase server disk usage.
 * Minimize server configuration.

== Host statistics ==

We need a way to identify hosts that can turn around jobs quickly and reliably. Note:

 * This is a property of (host, app version), not of the host alone.
 * It is not the same as processor speed. A host may have a high turnaround time for various reasons:
   * a large minimum work buffer size;
   * attachment to lots of other projects;
   * long periods of unavailability or network disconnection.

We propose the following. For each app A:

 * For each (host, app version), let X be the percentile of turnaround time.
 * For each (host, app version), let Y be the percentile of "consecutive valid results" (or +infinity if > 10), over all active hosts and all current app versions.
 * Let P(H, AV) = min(X, Y).

This will be computed periodically (say, every 24 hours) by a utility program.

When a new app version is deployed, the host_app_version records for the previous version should be copied, on the assumption that hosts reliable for one version will be reliable for the next.
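As an illustration, the percentile computation could look like the following sketch. This is not actual BOINC code; the `HostAppVersion` record and its field names are assumptions standing in for whatever per-(host, app version) statistics the utility program reads.

```python
from dataclasses import dataclass

@dataclass
class HostAppVersion:
    # Hypothetical per-(host, app version) statistics record.
    avg_turnaround: float   # average turnaround time, in seconds
    consecutive_valid: int  # current run of consecutively valid results

def percentile_rank(value, values, lower_is_better):
    # Percentage of the population that this value beats.
    if lower_is_better:
        beaten = sum(1 for v in values if v > value)
    else:
        beaten = sum(1 for v in values if v < value)
    return 100.0 * beaten / len(values)

def compute_p(havs):
    # Returns P(H, AV) = min(X, Y) for each record, in input order.
    turnarounds = [h.avg_turnaround for h in havs]
    valids = [h.consecutive_valid for h in havs]
    result = []
    for h in havs:
        # X: percentile of turnaround time (shorter is better).
        x = percentile_rank(h.avg_turnaround, turnarounds, True)
        # Y: percentile of consecutive valid results,
        # or +infinity if greater than 10.
        if h.consecutive_valid > 10:
            y = float("inf")
        else:
            y = percentile_rank(h.consecutive_valid, valids, False)
        result.append(min(x, y))
    return result
```

Note that a record with more than 10 consecutive valid results gets Y = +infinity, so its P is determined entirely by its turnaround percentile.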

== Batch completion estimation ==

The proposed policies require estimates C(B) of batch completion time. I'm not sure exactly how to compute these, but:

 * They should be based on completed and validated jobs rather than on a priori FLOPs estimates.
 * They should reflect (host, app version) information (e.g. turnaround and elapsed time statistics) both for the hosts that have completed jobs and for the host population as a whole.
 * They should be computed by a daemon process, triggered by the passage of time and by the validation of jobs in the batch.

Notes:

 * C(B) is different from the "logical end time" of the batch used in batch scheduling.
 * For long-deadline batches, C(B) should probably be at least the original delay bound plus the greatest dispatch time of the first instance of any job. That is, if it takes a long time to dispatch the first instances, adjust the deadline accordingly to avoid creating a deadline crunch.
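In code, that lower bound amounts to something like this sketch (the names are mine, not from the BOINC source; `first_dispatch_times` is assumed to hold, for each job, the time at which its first instance was dispatched, measured from batch creation):

```python
def cb_lower_bound(delay_bound, first_dispatch_times):
    # C(B) should be at least the original delay bound plus the
    # latest dispatch time of any job's first instance, so that a
    # slow start doesn't create a deadline crunch.
    return max(first_dispatch_times) + delay_bound
```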

== Proposed feeder policy ==

Proposed enumeration order:

(LET(J) asc, nretries desc)

where LET(J) is the logical end time of the job's batch.
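Expressed as a sort key, this is roughly the following (an illustrative sketch; in the real feeder this would be the ORDER BY clause of the job enumeration query, and the field names here are assumptions):

```python
def feeder_order(jobs):
    # Ascending logical end time of the job's batch; among jobs with
    # equal LET, descending retry count, so retries go out first.
    return sorted(jobs, key=lambda j: (j["let"], -j["nretries"]))
```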

== Proposed scheduler policy ==

For each processor type T (CPU and GPU) we have a "busy time" BT: the time already committed to high-priority jobs. For a given job J we can compute the estimated runtime R. The earliest we can expect to finish J is then BT + R, so that's the earliest deadline we can assign; call this MD(J, T).

For each app A and processor type T, compute the best app version BAV(A, T) at the start of handling each request.

The rough policy then is:

{{{
for each job J in the array, belonging to batch B
    for each usable app version AV of type T
        if B is AFAP and there's no estimate yet
            if P(H, AV) > 50%
                send J using AV, with deadline BT + R
        else
            x = MD(J, T)
            if x < C(B)
                send J using AV, with deadline C(B)
            else if P(H, AV) > 90%
                send J using AV, with deadline x
}}}

Make an initial pass through the array, sending only jobs that have a percentile requirement.
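The dispatch decision for a single job could be fleshed out roughly as follows. This is a sketch under assumed data structures, not actual scheduler code; the 50/90 thresholds and the field names (`est_runtime`, `afap`, etc.) are placeholders for logic the proposal leaves open.

```python
def choose_dispatch(batch, app_versions, busy_time, p, c_of_batch):
    # Decide whether to send a job of `batch`, and with what deadline.
    #   busy_time:  BT for the processor type T.
    #   p[av_id]:   percentile P(H, AV) of this host per app version.
    #   c_of_batch: estimated batch completion C(B), or None if no
    #               estimate yet (assumed only for AFAP batches here).
    # Returns (app_version, deadline), or None meaning "don't send".
    for av in app_versions:
        r = av["est_runtime"]   # estimated runtime R of the job under AV
        md = busy_time + r      # MD(J, T): earliest feasible deadline
        if batch["afap"] and c_of_batch is None:
            # AFAP batch with no estimate yet: use only top-50% hosts.
            if p[av["id"]] > 50:
                return av, busy_time + r
        else:
            if md < c_of_batch:
                # The normal deadline C(B) is achievable on this host.
                return av, c_of_batch
            elif p[av["id"]] > 90:
                # Tight deadline: trust only top-10% hosts.
                return av, md
    return None
```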

Notes:

 * The 50% and 90% thresholds can be parameterized.
 * Retries are not handled differently at this level, although we could add a restriction such as sending them only to top-50% hosts.
 * In the startup case (e.g. a new app), no hosts will be high-percentile. How do we avoid starvation?
 * I think that score-based scheduling is now deprecated. The feasibility and/or desirability of a job may depend on what other jobs we're sending, so it doesn't make sense to assign it a score in isolation. It's simpler to scan jobs and make a final decision for each one. There are a few properties we need to give priority to:
   * limited locality scheduling
   * beta jobs
   We can handle these in separate passes, as we're doing now.