
Work fetch and GPUs

This document describes changes to BOINC's work fetch mechanism, in the 6.7 client and the scheduler as of [17024].

Problems with the old work fetch policy

The old work-fetch policy is essentially:

  • Do a weighted round-robin simulation, computing the CPU shortfall (i.e., the idle CPU time we expect during the work-buffering period).
  • If there's a CPU shortfall, request work from the project with highest long-term debt (LTD).

The scheduler request has a single "work_req_seconds" indicating the total duration of jobs being requested.

This policy has some problems:

  • There's no way for the client to say "I have N idle CPUs; send me enough jobs to use them all".

And various problems related to GPUs:

  • If there is no CPU shortfall, no work will be fetched even if GPUs are idle.
  • If a GPU is idle, we should get work from a project that potentially has jobs for it.
  • If a project has both CPU and GPU jobs, the client should be able to tell it to send only GPU (or only CPU) jobs.
  • LTD is computed solely on the basis of CPU time used, so it doesn't provide a meaningful comparison between projects that use only GPUs, or between a GPU project and a CPU project.

Examples

In the following, A and B are projects.

Example 1

Suppose that:

  • A has only GPU jobs and B has both GPU and CPU jobs.
  • The host is attached to A and B with equal resource shares.
  • The host's GPU is twice as fast as its CPU.

The target behavior is:

  • the CPU is used 100% by B
  • the GPU is used 75% by A and 25% by B

This provides equal total processing to A and B.
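
To check the arithmetic: if the CPU delivers 1 unit of computing per second, the GPU delivers 2. B gets 1.0 from the CPU plus 0.25 × 2 = 0.5 from the GPU, i.e. 1.5 units/sec; A gets 0.75 × 2 = 1.5 units/sec from the GPU alone. The totals are equal, matching the equal resource shares.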

Example 2

A has a 1-year CPU job with no slack, so it runs in high-priority mode. B has jobs available.

Goal: after A's job finishes, B gets the CPU for a year.

Variation: a new project C is attached when A's job finishes. It should immediately share the CPU with B.

Example 3

A has GPU jobs but B doesn't. After a year, B gets a GPU app.

Goal: A and B immediately share the GPU.

Resource types

New abstraction: the "processing resource type", or just "resource type". Examples of resource types:

  • CPU
  • A coprocessor type (a kind of GPU, or the SPE processors in a Cell)

A job sent to a client is associated with an app version, which uses some number (possibly fractional) of CPUs, and some number of instances of a particular coprocessor type.
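
As a concrete illustration of this association (a minimal sketch; the struct and field names below are hypothetical, not the actual client structures), an app version's resource usage might look like:

#include <string>

// Illustrative sketch only; names are hypothetical.
struct COPROC_USAGE {
    std::string type;     // e.g. "CUDA", or the SPEs of a Cell
    double ninstances;    // coprocessor instances used per job
};

struct APP_VERSION_USAGE {
    double avg_ncpus;     // CPUs used per job; may be fractional, e.g. 0.5
    COPROC_USAGE coproc;  // coprocessor usage; ninstances == 0 if none
};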

Scheduler request and reply message

New fields in the scheduler request message:

double cpu_req_secs
number of CPU seconds requested
double cpu_req_instances
send enough jobs to occupy this many CPUs

And for each coprocessor type:

double req_secs
number of instance-seconds requested
double req_instances
send enough jobs to occupy this many instances

The semantics: a scheduler should send jobs for a resource type only if the request for that type is nonzero.

For compatibility with old servers, the message still has work_req_seconds, which is set to the maximum of the req_secs values.
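
As a sketch of how a client might assemble these fields (the struct names and helper below are illustrative, not the actual request code):

#include <algorithm>
#include <vector>

// Illustrative sketch of the new request fields.
struct COPROC_REQUEST {
    double req_secs;        // instance-seconds requested for this coprocessor type
    double req_instances;   // send enough jobs to occupy this many instances
};

struct SCHEDULER_REQUEST {
    double cpu_req_secs;       // CPU seconds requested
    double cpu_req_instances;  // send enough jobs to occupy this many CPUs
    std::vector<COPROC_REQUEST> coprocs;
    double work_req_seconds;   // old-server compatibility field

    // work_req_seconds is the max of the per-resource req_secs values.
    void set_work_req_seconds() {
        work_req_seconds = cpu_req_secs;
        for (const COPROC_REQUEST& c : coprocs) {
            work_req_seconds = std::max(work_req_seconds, c.req_secs);
        }
    }
};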

Per-resource-type backoff

We need to handle the situation where e.g. there's a GPU shortfall but no projects are supplying GPU work (for either permanent or transient reasons). We don't want an overall work-fetch backoff from those projects.

Instead, we maintain a separate backoff timer per (project, resource type). The backoff interval is doubled up to a limit whenever we ask for work of that type and don't get any work; it's cleared whenever we get a job of that type.

There is still an overall backoff timer for each project. This is triggered by:

  • backoff requests from the project
  • RPC failures
  • job errors

and so on.

Note: if we decide to ask a project for work for resource A, we may ask it for resource B as well, even if it's backed off for B.
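
A minimal sketch of the per-(project, resource type) backoff described above, assuming an illustrative minimum interval of 60 seconds (the actual client constants may differ):

#include <algorithm>

// Illustrative sketch; one instance per (project, resource type).
struct RSC_BACKOFF {
    double backoff_interval = 0;  // seconds; 0 means not backed off
    double backoff_time = 0;      // don't ask for this resource until this time

    // We asked this project for work of this type and got none.
    void backoff(double now) {
        const double min_interval = 60;       // assumed minimum
        const double max_interval = 86400;    // 24-hour maximum, per the text
        backoff_interval = (backoff_interval == 0)
            ? min_interval
            : std::min(backoff_interval * 2, max_interval);
        backoff_time = now + backoff_interval;
    }

    // We got a job of this type from this project.
    void clear() { backoff_interval = 0; backoff_time = 0; }

    bool backed_off(double now) const { return now < backoff_time; }
};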

Long-term debt

We continue to use the idea of long-term debt (LTD), representing how much work (measured in device instance-seconds) is "owed" to each project P. This increases over time in proportion to P's resource share, and decreases as P uses resources. Simplified summary of the new policy: when we need work for a resource R, we ask the project that is not backed off for R and whose LTD is greatest.

The notion of LTD needs to span resource types; otherwise, in Example 1 above, projects A and B would each get 50% of the GPU.

On the other hand, if there's a single cross-resource LTD, and only one project has GPU jobs, then its LTD would go unboundedly negative, and the others would go unboundedly positive. This is undesirable. It could be fixed by limiting the LTD to a finite range, but this would lose information.

In the new model:

  • There is a separate LTD for each resource type
  • The "overall LTD", used in the work-fetch decision, is the sum of the resource LTDs, weighted by the speed of the resource (FLOPs per instance-second).

Per-resource LTD is maintained as follows:

A project P is "debt eligible" for a resource R if:

  • P is not backed off for R, and the backoff interval is not at the max.
  • P is not suspended via GUI, and "no more tasks" is not set

Debt is adjusted as follows:

  • For each debt-eligible project P, the debt is increased by the amount it's owed (delta T times its resource share relative to other debt-eligible projects) minus the amount it got (the number of instance-seconds).
  • An offset is added to debt-eligible projects so that the net change is zero. This prevents debt-eligible projects from drifting away from other projects.
  • An offset is added so that the maximum debt across all projects is zero (this ensures that when a new project is attached, it starts out debt-free).
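
A sketch of this adjustment (struct and variable names are hypothetical; scaling the "owed" term by the number of instances is an assumption, made so that both terms are in instance-seconds):

#include <algorithm>
#include <vector>

// Illustrative sketch of the per-resource debt adjustment; not the actual client code.
struct PROJ {
    bool debt_eligible;
    double resource_share;
    double debt;       // LTD for this resource, in instance-seconds
    double secs_used;  // instance-seconds this project used during the interval
};

void adjust_debts(std::vector<PROJ>& projects, double dt, int ninstances) {
    double share_sum = 0;
    int n_eligible = 0;
    for (const PROJ& p : projects) {
        if (p.debt_eligible) { share_sum += p.resource_share; n_eligible++; }
    }
    if (n_eligible == 0 || share_sum == 0) return;

    // 1) Each eligible project gains what it's owed minus what it got.
    //    (Scaling by ninstances is an assumption; the text says dt times relative share.)
    double delta_sum = 0;
    for (PROJ& p : projects) {
        if (!p.debt_eligible) continue;
        double owed = dt * ninstances * p.resource_share / share_sum;
        double delta = owed - p.secs_used;
        p.debt += delta;
        delta_sum += delta;
    }

    // 2) Offset eligible projects so their net change is zero.
    double offset = delta_sum / n_eligible;
    for (PROJ& p : projects) {
        if (p.debt_eligible) p.debt -= offset;
    }

    // 3) Offset all projects so the maximum debt is zero
    //    (a newly attached project then starts out debt-free).
    double max_debt = projects[0].debt;
    for (const PROJ& p : projects) max_debt = std::max(max_debt, p.debt);
    for (PROJ& p : projects) p.debt -= max_debt;
}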

Client data structures

RSC_WORK_FETCH

Work-fetch state for a particular resource type. Data members:

ninstances
number of instances of this resource type

Used/set by rr_simulation():

double shortfall
shortfall for this resource
double nidle
number of currently idle instances

Member functions:

rr_init()
called at the start of RR simulation. Compute project shares for this PRSC, and clear overall and per-project shortfalls.
set_nidle()
called by RR sim after the initial job assignment; sets nidle to the # of idle instances.

accumulate_shortfall()
called by RR sim for each time interval during the work-buffering period.

shortfall += dt*(ninstances - instances in use)
for each project p not backed off for this PRSC
    p->RSC_PROJECT_WORK_FETCH.accumulate_shortfall(dt)
select_project()
select the best project from which to request work of this type: the project that is not backed off for this PRSC and for which LTD + p->shortfall is largest, also taking into account overworked projects, etc. (see the sketch after this list).
accumulate_debt(dt)

for each project P:
    x = instances of this device used by P's running jobs
    y = P's share of this device
    update P's LTD
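
A rough C++ rendering of the two central functions above (types and names are illustrative only, not the actual client code):

#include <vector>

// Illustrative sketch only; names do not match the actual client code.
struct PROJECT_SKETCH {
    bool backed_off;    // backed off for this resource type?
    double ltd;         // this project's LTD for this resource
    double shortfall;   // per-project shortfall from the RR simulation
};

struct RSC_WORK_FETCH_SKETCH {
    int ninstances = 0;
    double shortfall = 0;
    double nidle = 0;

    // Called by the RR sim for each interval dt of the work-buffering period.
    void accumulate_shortfall(double dt, double ninstances_in_use) {
        shortfall += dt * (ninstances - ninstances_in_use);
    }

    // Pick the non-backed-off project with the largest LTD + shortfall.
    PROJECT_SKETCH* select_project(std::vector<PROJECT_SKETCH>& projects) {
        PROJECT_SKETCH* best = nullptr;
        for (PROJECT_SKETCH& p : projects) {
            if (p.backed_off) continue;
            if (!best || p.ltd + p.shortfall > best->ltd + best->shortfall) best = &p;
        }
        return best;
    }
};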

RSC_PROJECT_WORK_FETCH

State for a (resource type, project) pair. It has the following "persistent" members (i.e., saved in the state file):

backoff_interval
how long to wait before asking the project for work specifically for this PRSC. Double it (up to a 24-hour maximum) any time we ask for work for this PRSC and get none; clear it when we ask for work for this PRSC and get a job.
backoff_time
back off until this time
debt
long-term debt

And the following transient members (used by rr_simulation()):

double runnable_share
# of instances this project should get based on resource share, relative to the set of projects not backed off for this PRSC.

instances_used
# of instances currently being used

PROJECT_WORK_FETCH

Per-project work fetch state. Members:

overall_debt
weighted sum of per-resource debts

WORK_FETCH

Overall work-fetch state.

Pseudo-code

The top-level function is:

WORK_FETCH::choose_project()
rr_simulation()

if cuda_work_fetch.nidle
   cpu_work_fetch.shortfall = 0
   p = cuda_work_fetch.select_project()
   if p
      send_req(p)
      return
if cpu_work_fetch.nidle
   cuda_work_fetch.shortfall = 0
   p = cpu_work_fetch.select_project()
   if p
      send_req(p)
      return
if cuda_work_fetch.shortfall
   p = cuda_work_fetch.select_project()
   if p
      send_req(p)
      return
if cpu_work_fetch.shortfall
   p = cpu_work_fetch.select_project()
   if p
      send_req(p)
      return

void send_req(p)
   req.cpu_req_secs = cpu_work_fetch.shortfall
   req.cpu_req_instances = cpu_work_fetch.nidle
   req.cuda_req_secs = cuda_work_fetch.shortfall
   req.cuda_req_instances = cuda_work_fetch.nidle
   req.work_req_seconds = max(req.cpu_req_secs, req.cuda_req_secs)

Debt accumulation, done for each scheduling interval dt:

for each resource type R
   for each project P
      if P is not backed off for R
         P.R.LTD += P's relative resource share * dt
   for each running job J, project P
      for each resource R used by J
         P.R.LTD -= (instances of R used by J) * dt

RR simulation

cpu_work_fetch.rr_init()
cuda_work_fetch.rr_init()

compute initial assignment of jobs
cpu_work_fetch.set_nidle();
cuda_work_fetch.set_nidle();

do the simulation as currently done
on completion of an interval dt
   cpu_work_fetch.accumulate_shortfall(dt)
   cuda_work_fetch.accumulate_shortfall(dt)

Work fetch

Handling scheduler reply

if no jobs returned
   double backoff for each requested PRSC
else
   clear backoff for the PRSC of each returned job
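
An illustrative sketch of this reply handling (resource types are identified by name here purely for simplicity; none of these names are from the actual client):

#include <algorithm>
#include <map>
#include <set>
#include <string>

// For each resource type we requested, either clear or double its backoff.
void update_backoffs(
    const std::set<std::string>& requested_types,     // types we asked work for
    const std::set<std::string>& returned_job_types,  // types of jobs we got
    std::map<std::string, double>& backoff_interval,  // per-type backoff, seconds
    double max_interval = 86400
) {
    for (const std::string& t : requested_types) {
        if (returned_job_types.count(t)) {
            backoff_interval[t] = 0;  // got a job of this type: clear backoff
        } else {
            double& b = backoff_interval[t];
            b = (b == 0) ? 60 : std::min(b * 2, max_interval);  // got none: double
        }
    }
}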

Scheduler changes

global vars
   have_cpu_app_versions
   have_cuda_app_versions
per-request vars
   bool coproc_request
   ncpu_jobs_sending
   ncuda_jobs_sending
   ncpu_seconds_to_fill
   ncuda_seconds_to_fill
   seconds_to_fill
      (backwards compat; used if !coproc_request)
overall startup
   scan app versions, set have_x vars
req startup
   if send_only_cpu and no CPU app versions, don't send work
   if send_only_cuda and no CUDA app versions, don't send work
work_needed()
   need_more_cpu_jobs =
      ncpu_jobs_sending < ninstances_cpu
      or ncpu_seconds_to_fill > 0
   same for CUDA
   return false if we need neither more CPU nor more CUDA jobs
get_app_version
   if send_only_cpu, ignore CUDA versions
   if send_only_cuda, ignore CPU versions
when committing a job
   update n*_jobs_sending,
      n*_seconds_to_fill,
      seconds_to_fill
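
A rough sketch of work_needed() under this scheme (variable names follow the outline above; ninstances_cpu and ninstances_cuda stand for the requested instance counts; this is illustrative, not the actual scheduler code):

// Illustrative sketch; fields mirror the per-request vars above.
struct SCHED_REQUEST_STATE {
    bool coproc_request;
    int ncpu_jobs_sending;
    int ncuda_jobs_sending;
    double ncpu_seconds_to_fill;
    double ncuda_seconds_to_fill;
    double seconds_to_fill;   // backwards compat; used if !coproc_request
    int ninstances_cpu;       // CPU instances the client asked to fill
    int ninstances_cuda;      // CUDA instances the client asked to fill
};

bool work_needed(const SCHED_REQUEST_STATE& s) {
    if (!s.coproc_request) {
        // Old-style request: just fill seconds_to_fill.
        return s.seconds_to_fill > 0;
    }
    bool need_more_cpu_jobs =
        s.ncpu_jobs_sending < s.ninstances_cpu || s.ncpu_seconds_to_fill > 0;
    bool need_more_cuda_jobs =
        s.ncuda_jobs_sending < s.ninstances_cuda || s.ncuda_seconds_to_fill > 0;
    return need_more_cpu_jobs || need_more_cuda_jobs;
}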

Notes

The idea of using recent average credit (RAC) as a surrogate for LTD was discussed and set aside for various reasons.

This design does not accommodate:

  • jobs that use more than one coprocessor type
  • jobs that change their resource usage dynamically (e.g. coprocessor jobs that decide to use the CPU instead).