Changes between Version 18 and Version 19 of GpuWorkFetch

Timestamp:: Dec 29, 2008, 3:53:31 PM (15 years ago)
Author:: davea
Comment:: --

GpuWorkFetch

-                      v18
+                      v19
  * If there is no CPU shortfall, no work will be fetched even if GPUs are idle.
  * If a GPU is idle, we should get work from a project that potentially has jobs for it.
  * If a project has both CPU and GPU jobs, the client should be able to tell it to send only GPU (or only CPU) jobs.
  * LTD is computed solely on the basis of CPU time used, so it doesn't provide a meaningful comparison between projects that use only GPUs, or between GPU and CPU projects.
 …
 However, it is straightforward to extend the design to handle additional GPU types.
+== Example ==
+Suppose that:
+ * Project A has only GPU jobs and project B has both GPU and CPU jobs.
+ * A host is attached to projects A and B with equal resource shares.
+ * The host's GPU is twice as fast as its CPU.
+In this case, the target behavior is for the host to use
+100% of the CPU for project B,
+25% of the GPU for project B,
+and 75% of the GPU for project A.
+This provides equal processing to the two projects.
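The arithmetic behind this example can be checked with a short sketch. This is only an illustration, not part of the client: the function name `equal_share_allocation` and the relative speed units are hypothetical, and it assumes exactly two projects with equal resource shares, one CPU and one GPU, and a GPU at least as fast as the CPU.

```python
def equal_share_allocation(cpu_speed, gpu_speed):
    """Split one CPU and one GPU between project A (GPU jobs only)
    and project B (CPU and GPU jobs) so both get equal processing.

    Speeds are in arbitrary relative units (hypothetical).
    Assumes gpu_speed >= cpu_speed, so A's share fits on the GPU.
    Returns (B's CPU fraction, B's GPU fraction, A's GPU fraction).
    """
    total = cpu_speed + gpu_speed
    per_project = total / 2  # equal resource shares

    # Project A can only use the GPU, so its whole share comes from there.
    a_gpu = per_project / gpu_speed

    # Project B gets the rest of the GPU and makes up the difference on the CPU.
    b_gpu = 1.0 - a_gpu
    b_cpu = (per_project - b_gpu * gpu_speed) / cpu_speed
    return b_cpu, b_gpu, a_gpu


# GPU twice as fast as the CPU, as in the example above.
b_cpu, b_gpu, a_gpu = equal_share_allocation(cpu_speed=1.0, gpu_speed=2.0)
print(b_cpu, b_gpu, a_gpu)  # 1.0 0.25 0.75
```

With a GPU twice as fast as the CPU, each project is owed 1.5 units out of 3, which gives A 75% of the GPU and leaves B with the remaining 25% of the GPU plus the whole CPU.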
 == Terminology ==
 …
 '''double ninstances_cuda''': send enough jobs to occupy this many CUDA devices
+For compatibility with old servers, the message still has '''work_req_seconds''',
+which is the max of (cpu,cuda)_req_seconds.
+The semantics are: a scheduler should send jobs for a resource type
+only if the request for that type is nonzero.
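A minimal sketch of these request semantics follows. It is not the actual scheduler code: the `WorkRequest` class and the `resources_to_serve` helper are hypothetical, only the `cpu_req_seconds`, `cuda_req_seconds` and `work_req_seconds` field names come from the text, and "nonzero request" is simplified here to mean a positive number of requested seconds.

```python
from dataclasses import dataclass


@dataclass
class WorkRequest:
    """Hypothetical stand-in for the per-resource fields of a scheduler request."""
    cpu_req_seconds: float = 0.0
    cuda_req_seconds: float = 0.0

    @property
    def work_req_seconds(self):
        # Kept for old servers: the max of the per-resource requests.
        return max(self.cpu_req_seconds, self.cuda_req_seconds)


def resources_to_serve(req):
    """Return the resource types a scheduler should send jobs for.

    Simplifying assumption: a request is 'nonzero' when its
    requested seconds are positive.
    """
    resources = []
    if req.cpu_req_seconds > 0:
        resources.append("cpu")
    if req.cuda_req_seconds > 0:
        resources.append("cuda")
    return resources


# A client with an idle GPU and no CPU shortfall asks for GPU work only.
req = WorkRequest(cpu_req_seconds=0.0, cuda_req_seconds=3600.0)
print(req.work_req_seconds)     # 3600.0
print(resources_to_serve(req))  # ['cuda']
```

An old server that only reads '''work_req_seconds''' still sees a nonzero request in this case, while a new server can tell that only CUDA jobs are wanted.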
 New fields in the scheduler reply message (these are not currently used):
 …
 There are two processing resource types: CPU and CUDA.
+=== Long-term debt ===
+We'll continue to use the idea of '''long-term debt''' (LTD).
+LTD represents how much work is "owed" to each project.
+A project's LTD increases over time in proportion to its resource share,
+and decreases as the project uses resources.
+Simplified summary: when we need work for a resource,
+we ask the project that may have that type of job and whose LTD is greatest.
+The idea of using RAC as a surrogate for LTD was set aside for various reasons.
+The notion of LTD needs to span resources;
+otherwise, in the above example, projects A and B would each get 50% of the GPU.
+On the other hand, if there's a single cross-resource LTD,
+and only one project has GPU jobs,
+then its LTD would go unboundedly negative,
+and the others would go unboundedly positive.
+This is undesirable.
+It could be fixed by limiting the LTD to a finite range,
+but this would lose information.
+So the current plan is:
+ * There is a separate LTD for each resource
+ * The "overall LTD", which is used in the work-fetch decision, is the sum of the resource LTDs, weighted by the speed of the resource (FLOPs per instance-second).
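The plan above can be sketched roughly as follows. This is an illustration under stated assumptions, not the client's implementation: the function names, the dictionary layout of a project, and the sample speeds and debts are all hypothetical. Per-resource LTD is taken to be in instance-seconds, so weighting by FLOPs per instance-second gives an overall figure in FLOPs.

```python
def overall_ltd(resource_ltd, flops_per_instance_sec):
    """Sum the per-resource LTDs, weighted by each resource's speed.

    resource_ltd: resource name -> LTD in instance-seconds (assumed unit)
    flops_per_instance_sec: resource name -> speed of one instance
    """
    return sum(ltd * flops_per_instance_sec[res]
               for res, ltd in resource_ltd.items())


def pick_project(projects, resource, flops_per_instance_sec):
    """Among projects that may have jobs for `resource`, pick the one
    with the greatest overall LTD. Returns None if no project qualifies.
    """
    candidates = [p for p in projects if resource in p["may_have"]]
    if not candidates:
        return None
    return max(candidates,
               key=lambda p: overall_ltd(p["ltd"], flops_per_instance_sec))


# Hypothetical numbers: the GPU is twice as fast as the CPU.
speeds = {"cpu": 1.0, "cuda": 2.0}
projects = [
    {"name": "A", "may_have": {"cuda"}, "ltd": {"cpu": 0.0, "cuda": -10.0}},
    {"name": "B", "may_have": {"cpu", "cuda"}, "ltd": {"cpu": 5.0, "cuda": 3.0}},
]
print(pick_project(projects, "cuda", speeds)["name"])  # B
```

Here project A has overall LTD -20 and project B has 11, so the next GPU request goes to B even though A has only GPU jobs.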
 === Per-resource-type backoff ===