Changes between Version 18 and Version 19 of GpuWorkFetch


Timestamp:
Dec 29, 2008, 3:53:31 PM (16 years ago)
Author:
davea
Comment:

--

  • GpuWorkFetch

    v18 v19  
    1717
    1818 * If there is no CPU shortfall, no work will be fetched even if GPUs are idle.
    19 
    2019 * If a GPU is idle, we should get work from a project that potentially has jobs for it.
    21 
    2220 * If a project has both CPU and GPU jobs, the client should be able to tell it to send only GPU (or only CPU) jobs.
    23 
    2421 * LTD is computed solely on the basis of CPU time used, so it doesn't provide a meaningful comparison between projects that use only GPUs, or between a GPU project and a CPU project.
    2522
     
    2926However, it is straightforward to extend the design to handle additional GPU types.
    3027
     28== Example ==
     29
     30Suppose that:
     31 * Project A has only GPU jobs and project B has both GPU and CPU jobs.
     32 * A host is attached to projects A and B with equal resource shares.
     33 * The host's GPU is twice as fast as its CPU.
     34
     35In this case, the target behavior is for the host to use
     36100% of the CPU for project B,
     3725% of the GPU for project B,
     38and 75% of the GPU for project A.
     39This provides equal processing to the two projects.
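The arithmetic behind this split can be checked with a short sketch. The speeds (CPU = 1, GPU = 2, in arbitrary units) come from the example above; the function name `equal_split` is a hypothetical helper for illustration, not part of the BOINC client, and it assumes the GPU is at least as fast as the CPU.

```python
def equal_split(cpu_speed, gpu_speed):
    """Return (B's CPU fraction, B's GPU fraction, A's GPU fraction).

    Assumptions (from the example): project A has only GPU jobs,
    project B has both CPU and GPU jobs, resource shares are equal,
    and gpu_speed >= cpu_speed.
    """
    total = cpu_speed + gpu_speed
    target = total / 2  # processing each project should receive

    # Only B can use the CPU, so B gets all of it.
    b_cpu = 1.0
    # B's remaining entitlement is served from the GPU.
    b_gpu = (target - cpu_speed) / gpu_speed
    # A's whole entitlement is served from the GPU.
    a_gpu = target / gpu_speed
    return b_cpu, b_gpu, a_gpu


if __name__ == "__main__":
    print(equal_split(1.0, 2.0))
```

With a total capacity of 3 units, each project is entitled to 1.5 units: B gets 1 from the CPU plus 0.5 from the GPU, and A gets 1.5 from the GPU.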
     40
    3141== Terminology ==
    3242
     
    4959'''double ninstances_cuda''': send enough jobs to occupy this many CUDA devs
    5060
    51 For compatibility with old servers, the message still has '''work_req_seconds''';
    52 this is the max of (cpu,cuda)_req_seconds.
     61For compatibility with old servers, the message still has '''work_req_seconds''',
     62which is the max of (cpu,cuda)_req_seconds.
     63
     64The semantics are: a scheduler should send jobs for a resource type
     65only if the request for that type is nonzero.
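These request semantics can be sketched as follows. This is a minimal illustration, not the actual scheduler code: the field names `cpu_req_seconds` and `cuda_req_seconds` are inferred from "(cpu,cuda)_req_seconds", and the `WorkRequest` class and `resources_to_send` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WorkRequest:
    # Per-resource requests, in seconds of work (names inferred from the text).
    cpu_req_seconds: float = 0.0
    cuda_req_seconds: float = 0.0

    @property
    def work_req_seconds(self) -> float:
        # Legacy field kept for old servers: max of the per-resource requests.
        return max(self.cpu_req_seconds, self.cuda_req_seconds)


def resources_to_send(req: WorkRequest) -> list:
    """Resource types a scheduler should send jobs for:
    only those whose request is nonzero."""
    wanted = []
    if req.cpu_req_seconds > 0:
        wanted.append("cpu")
    if req.cuda_req_seconds > 0:
        wanted.append("cuda")
    return wanted
```

For instance, a request with `cpu_req_seconds = 0` and `cuda_req_seconds = 3600` asks for GPU jobs only, while an old server would still see `work_req_seconds = 3600`.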
    5366
    5467New fields in the scheduler reply message (these are not currently used):
     
    6376There are two processing resource types: CPU and CUDA.
    6477
    65 The notion of long-term debt
     78=== Long-term debt ===
     79
     80We'll continue to use the idea of '''long-term debt''' (LTD).
     81LTD represents how much work is "owed" to each project.
     82A project's LTD increases over time in proportion to its resource share,
     83and decreases as the project uses resources.
     84Simplified summary: when we need work for a resource,
     85we ask the project that may have that type of job and whose LTD is greatest.
     86
     87The idea of using RAC as a surrogate for LTD was set aside for various reasons.
     88
     89The notion of LTD needs to span resources;
     90otherwise, in the above example, projects A and B would each get 50% of the GPU.
     91
     92On the other hand, if there's a single cross-resource LTD,
     93and only one project has GPU jobs,
     94then its LTD would go unboundedly negative,
     95and the others would go unboundedly positive.
     96This is undesirable.
     97It could be fixed by limiting the LTD to a finite range,
     98but this would lose information.
     99
     100So the current plan is:
     101
     102 * There is a separate LTD for each resource.
     103 * The "overall LTD", which is used in the work-fetch decision, is the sum of the resource LTDs, weighted by the speed of the resource (FLOPs per instance-second).
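A rough sketch of how the overall LTD might be computed and used in the work-fetch decision is shown below. All names (`overall_ltd`, `pick_project`, the dictionary layout) are hypothetical; the sketch assumes each per-resource LTD is measured in instance-seconds, so that weighting by FLOPs per instance-second makes the terms comparable.

```python
def overall_ltd(resource_ltd, speed):
    """Sum of per-resource LTDs, each weighted by that resource's speed.

    resource_ltd: {resource name: LTD for that resource}
    speed:        {resource name: FLOPs per instance-second}
    """
    return sum(resource_ltd[r] * speed[r] for r in resource_ltd)


def pick_project(projects, resource, speed):
    """Choose the project to ask for work for `resource`.

    Simplified rule from the text: among projects that may have jobs
    for this resource, take the one whose overall LTD is greatest.
    Returns None if no project qualifies.
    """
    candidates = [p for p in projects if resource in p["may_have_jobs"]]
    if not candidates:
        return None
    return max(candidates, key=lambda p: overall_ltd(p["ltd"], speed))
```

Because the per-resource LTDs are stored separately, nothing is lost when only one project has GPU jobs; the weighted sum is computed only at decision time.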
     104
     105
    66106
    67107=== Per-resource-type backoff ===