Changes between Initial Version and Version 1 of JobSizeMatching

Timestamp: Feb 13, 2013, 1:13:27 PM
Author: davea
Comment: --

+= Job size matching =
+The difference in throughput between a slow resource
+(e.g. an Android device that runs infrequently)
+and a fast resource (e.g. a GPU that's always on)
+can be a factor of 1,000 or more.
+Having a single job size can therefore present problems:
+ * If the size is too small, hosts with GPUs get huge numbers of jobs
+   (which causes various problems) and there is a high DB load on the server.
+ * If the size is too large, slow hosts can't get jobs,
+   or they get jobs that take weeks to finish.
+This document describes a set of mechanisms that address these issues.
+== Regulating the flow of jobs into shared memory ==
+Let's suppose that an app's work generator can produce several sizes of job -
+say, small, medium, and large.
+'''We won't address the issue of how to pick these sizes.'''
+How can we prevent shared memory from becoming "clogged" with jobs of one size?
+One approach would be to allocate slots for each size.
+This would be complex because we already have two allocation schemes
+(for HR and all_apps).
+We could modify the work generator so that it polls the number of unsent
+jobs of each size, and creates a few more jobs of a given size when this
+number falls below a threshold.
+Problem: this might not be able to handle a large spike in demand.
+We'd like to be able to have a large buffer of unsent jobs in the DB.
+Solution:
+ * When jobs are created (in the transitioner), set their state to
+  INACTIVE rather than UNSENT
+  (a per-app flag would indicate that this should be done).
+ * Have a new daemon (call it the "regulator") that polls for the number of unsent
+  jobs of each size, and changes a few jobs of that size from INACTIVE to UNSENT.
+ * Add a "size_class" field to workunit and result to indicate S/M/L.
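The regulator idea above can be sketched as one polling pass over the jobs. This is only an illustration, not BOINC code: the in-memory `Job` list stands in for the DB, and the names `regulate`, `low_water`, and `batch` are hypothetical. Only the INACTIVE/UNSENT states and the "size_class" field come from the text above.

```python
from collections import Counter
from dataclasses import dataclass

INACTIVE, UNSENT = "INACTIVE", "UNSENT"


@dataclass
class Job:
    id: int
    size_class: str          # "S", "M", or "L"
    state: str = INACTIVE    # new jobs start INACTIVE (per-app flag assumed set)


def regulate(jobs, size_classes=("S", "M", "L"), low_water=2, batch=3):
    """One polling pass of a hypothetical regulator.

    For each size class with fewer than `low_water` UNSENT jobs,
    move up to `batch` INACTIVE jobs of that class to UNSENT.
    Returns the ids of the jobs that were activated.
    """
    unsent = Counter(j.size_class for j in jobs if j.state == UNSENT)
    activated = []
    for size in size_classes:
        if unsent[size] >= low_water:
            continue  # enough unsent jobs of this size already
        candidates = [j for j in jobs
                      if j.size_class == size and j.state == INACTIVE]
        for job in candidates[:batch]:
            job.state = UNSENT
            activated.append(job.id)
    return activated
```

Because the large buffer of jobs stays INACTIVE in the DB, a spike in demand for one size only requires the regulator to activate more of the existing jobs, not to wait for the work generator.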
+== Scheduler changes ==
+We need to revamp the scheduler.
+Here's how things currently work:
+ * The scheduler makes up to 5 passes through the job array in shared memory:
+  * "need reliable" jobs
+  * beta jobs
+  * previously infeasible jobs
+  * locality scheduling lite (job uses file already on client)
+  * unrestricted
+ * We maintain a data structure that maps app to the "best" app version for that app.
+  * In the "need reliable" phase this includes only reliable app versions;
+    the map is cleared at the end of the phase.
+  * If we satisfy the request for a particular resource and the best app version
+    uses that resource, we clear the entry.
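The pass order and the "best app version" map described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the scheduler's actual code: the class `BestVersionCache`, the dict keys (`app`, `resource`, `reliable`, `speed`), the use of `speed` to define "best", and the rule that a satisfied resource is skipped on recomputation are all hypothetical.

```python
# Pass names paraphrase the list above; the identifiers are hypothetical.
PASSES = ["need_reliable", "beta", "previously_infeasible",
          "locality_lite", "unrestricted"]


class BestVersionCache:
    """Maps app -> 'best' app version, computed lazily and cached."""

    def __init__(self, versions):
        self.versions = versions   # list of dicts describing app versions
        self.best = {}             # app name -> chosen version (or None)
        self.satisfied = set()     # resources whose request is already met

    def get(self, app, reliable_only=False):
        if app not in self.best:
            candidates = [
                v for v in self.versions
                if v["app"] == app
                and v["resource"] not in self.satisfied
                and (v["reliable"] or not reliable_only)
            ]
            # Assumption: "best" means highest speed.
            self.best[app] = max(candidates, key=lambda v: v["speed"],
                                 default=None)
        return self.best[app]

    def end_reliable_phase(self):
        # Entries chosen from reliable versions only must not leak
        # into later passes, so the whole map is cleared.
        self.best.clear()

    def resource_satisfied(self, resource):
        # Clear entries whose best version uses the satisfied resource.
        self.satisfied.add(resource)
        stale = [a for a, v in self.best.items()
                 if v is not None and v["resource"] == resource]
        for app in stale:
            del self.best[app]
```

In this sketch the scheduler would call `get(app, reliable_only=True)` during the first pass, `end_reliable_phase()` when that pass finishes, and `resource_satisfied(...)` whenever a resource request has been filled.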