Changes between Version 1 and Version 2 of GpuSched


Timestamp: Oct 13, 2008, 10:33:23 AM
Author: davea

= Client CPU/GPU scheduling =

Prior to version 6.3, the BOINC client assumed that each running application uses 1 CPU.
Starting with version 6.3, this is generalized:
 * Apps may use coprocessors (such as GPUs).
 * The number of CPUs used by an app may be more or less than one, and it need not be an integer.
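
Concretely, per-app resource usage might be represented by something like the following C++ struct (a sketch only; the type and field names are invented for illustration, not the client's actual data structures):
{{{
// Sketch only: invented names, not the BOINC client's actual data structures.
struct ResourceUsage {
    double avg_ncpus;   // average number of CPUs used; may be fractional, e.g. 0.5
    int ngpus;          // number of coprocessor (GPU) instances used; 0 for CPU-only apps
};
}}}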
     
== The way things used to work ==

The old scheduling policy is:

 * Order runnable jobs by "importance" (determined by whether the job is in danger of missing its deadline, and the long-term debt of its project).
 * Run jobs in order of decreasing importance.  Skip those that would exceed RAM limits.  Keep going until we're running NCPUS jobs.

There's a bit more to it than that - e.g., we avoid preempting jobs that haven't checkpointed recently -
but that's the basic idea.
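
In rough C++ terms, the old policy amounts to something like this (a sketch of the description above; Job, importance, and the other identifiers are invented for illustration, not the client's actual code):
{{{
// Illustrative sketch of the pre-6.3 policy (invented identifiers).
#include <algorithm>
#include <vector>

struct Job {
    double importance;   // higher = more important (deadline danger, long-term debt)
    double ram_needed;   // RAM the job needs, in bytes
};

// Choose jobs to run on a host with ncpus CPUs and ram_free bytes of free RAM.
std::vector<Job*> choose_jobs_old(std::vector<Job>& jobs, int ncpus, double ram_free) {
    // Order runnable jobs by decreasing importance.
    std::sort(jobs.begin(), jobs.end(),
        [](const Job& a, const Job& b) { return a.importance > b.importance; });

    std::vector<Job*> to_run;
    for (Job& j : jobs) {
        if ((int)to_run.size() >= ncpus) break;   // stop once we're running NCPUS jobs
        if (j.ram_needed > ram_free) continue;    // skip jobs that would exceed RAM limits
        ram_free -= j.ram_needed;
        to_run.push_back(&j);
    }
    // (The real client has more rules, e.g. it avoids preempting jobs
    // that haven't checkpointed recently.)
    return to_run;
}
}}}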
== How things work in 6.3 ==

The main design goal of the new scheduler is to use all resources.
In particular, we try to always use the GPU even if that means
overcommitting the CPU.
"Overcommitting" means running a set of apps whose demand for CPUs exceeds
the actual number of CPUs.

The new policy is:
 * Scan the set of runnable jobs in decreasing order of importance.
 * If a job uses a resource that's not already fully utilized, and fits in RAM, run it.

Example: suppose we're on a machine with 1 CPU and 1 GPU,
and that we have the following runnable jobs (in order of decreasing importance):
{{{
Job 1: uses 1 CPU
Job 2: uses 1 CPU
Job 3: uses the GPU and 0.5 CPUs
}}}
The GPU is much faster than the CPU,
and it seems like we should use it if at all possible.

The new policy will do the following:

 * Run job 1.
 * Skip job 2 because the CPU is already fully utilized.
 * Run job 3 because the GPU is not fully utilized.

So we end up running jobs whose CPU demand is 1.5.
That's OK - they just run slower than if running alone.
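
Putting this together, the 6.3 policy can be sketched like so (again illustrative C++ with invented identifiers, not the client's actual code):
{{{
// Illustrative sketch of the 6.3 policy (invented identifiers).
#include <algorithm>
#include <vector>

struct Job {
    double importance;   // higher = more important
    double ncpus;        // CPUs used; may be fractional (e.g. 0.5)
    int ngpus;           // GPU instances used; 0 for CPU-only jobs
    double ram_needed;   // RAM the job needs, in bytes
};

// Choose jobs to run on a host with ncpus CPUs, ngpus GPUs, and ram_free bytes free.
std::vector<Job*> choose_jobs_63(std::vector<Job>& jobs, double ncpus, int ngpus, double ram_free) {
    // Scan runnable jobs in decreasing order of importance.
    std::sort(jobs.begin(), jobs.end(),
        [](const Job& a, const Job& b) { return a.importance > b.importance; });

    double cpus_used = 0;
    int gpus_used = 0;
    std::vector<Job*> to_run;
    for (Job& j : jobs) {
        // Run the job if it uses some resource that isn't fully utilized yet...
        bool cpu_available = j.ncpus > 0 && cpus_used < ncpus;
        bool gpu_available = j.ngpus > 0 && gpus_used < ngpus;
        if (!cpu_available && !gpu_available) continue;
        // ...and it fits in the remaining RAM.
        if (j.ram_needed > ram_free) continue;
        cpus_used += j.ncpus;   // may overcommit the CPU (e.g. 1.5 on a 1-CPU host)
        gpus_used += j.ngpus;
        ram_free -= j.ram_needed;
        to_run.push_back(&j);
    }
    return to_run;
}
}}}
On the example above this selects jobs 1 and 3, giving a total CPU demand of 1.5 on the 1-CPU host.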
== Unresolved issues ==

A GPU application also has a CPU part, which keeps the GPU supplied with work.
If that CPU part doesn't get enough CPU time,
the GPU sits idle and the entire program runs slowly.

The CPU scheduler on Windows doesn't work well,
and when the CPU is overcommitted the CPU part of GPU applications
doesn't run as often as it needs to in order to keep the GPU "fed".
As a result the GPU is underutilized and the program runs slowly.
(This seems to happen even if the GPU app is run at high priority
while other apps run at low priority.)

If we can't resolve this we'll have to change the scheduling policy
to avoid overcommitting the CPU in the presence of GPU apps.