wiki:GpuSched

Version 1 (modified by davea, 15 years ago)

--

Client CPU/GPU scheduling

Prior to version 6.3, the BOINC client assumed that a running application uses 1 CPU. Starting with version 6.3, this is generalized.

  • Apps may use coprocessors (such as GPUs)
  • The number of CPUs used by an app may be more or less than one, and it need not be an integer.

For example, an app might use 2 CUDA GPUs and 0.5 CPUs. This information is visible in the BOINC Manager.
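As a rough sketch, per-app resource usage can be thought of as a small record (field names here are illustrative, not BOINC's actual data structures):

```cpp
#include <cassert>

// Sketch of per-app resource usage; fields are illustrative only.
struct ResourceUsage {
    double avg_ncpus;   // CPUs used; may be fractional, e.g. 0.5
    int    ncudas;      // number of CUDA GPUs used
};

// The example from the text: an app using 2 CUDA GPUs and 0.5 CPUs.
ResourceUsage example = {0.5, 2};
```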

The client's scheduler (i.e., the decision of which apps to run) has been modified to accommodate this diversity of apps.

The way things used to work

The old scheduling policy:

  • Make a list of runnable jobs, ordered by "importance" (as determined by whether the job is in danger of missing its deadline, and the long-term debt of its project).
  • Run jobs in order of decreasing importance. Skip those that would exceed RAM limits. Keep going until we're running NCPUS jobs.

There's a bit more to it than that - e.g., we avoid preempting jobs that haven't checkpointed - but that's the basic idea.
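The old policy can be sketched as follows (a simplified model that assumes a single numeric "importance" and simple RAM accounting; the real client is more involved):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical job record; "importance" stands in for the
// deadline-danger / long-term-debt ordering described above.
struct Job {
    std::string name;
    double importance;
    double ram;         // working-set size in MB
};

// Old policy: sort by decreasing importance, run jobs until NCPUS are
// busy, skipping (but not stopping at) jobs that would exceed RAM.
std::vector<std::string> old_schedule(std::vector<Job> jobs,
                                      int ncpus, double ram_limit) {
    std::sort(jobs.begin(), jobs.end(),
              [](const Job& a, const Job& b) {
                  return a.importance > b.importance;
              });
    std::vector<std::string> running;
    double ram_used = 0;
    for (const Job& j : jobs) {
        if ((int)running.size() == ncpus) break;   // all CPUs busy
        if (ram_used + j.ram > ram_limit) continue; // skip, keep going
        running.push_back(j.name);
        ram_used += j.ram;
    }
    return running;
}
```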

How things work in 6.3

Suppose we're on a machine with 1 CPU and 1 GPU, and that we have the following runnable jobs (in order of decreasing importance):

1) 1 CPU, 0 GPU
2) 1 CPU, 0 GPU
3) .5 CPU, 1 GPU

What should we run? Under the old policy we'd just run 1), and the GPU would be idle. This is bad: the GPU is typically 50X faster than the CPU, so we should use it if at all possible.
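One plausible reading of such a GPU-aware policy, sketched here as an assumption rather than the exact 6.3 algorithm: schedule coprocessor jobs first so the GPU is kept busy, then fill the CPUs with CPU-only jobs, allowing the last one a slight CPU overcommit:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical job record; field names are illustrative.
struct Job {
    std::string name;
    double cpus;        // CPU usage, possibly fractional
    int gpus;           // GPUs required
    double importance;
};

// Sketch of a GPU-aware policy (an assumption, not necessarily the
// exact 6.3 rule): GPU jobs go first so the GPU never sits idle;
// remaining CPU capacity is filled with CPU-only jobs.
std::vector<std::string> schedule(std::vector<Job> jobs,
                                  double ncpus, int ngpus) {
    std::sort(jobs.begin(), jobs.end(),
              [](const Job& a, const Job& b) {
                  return a.importance > b.importance;
              });
    double cpu_free = ncpus;
    int gpu_free = ngpus;
    std::vector<std::string> running;
    // Pass 1: GPU jobs, in importance order, while GPUs remain.
    for (const Job& j : jobs) {
        if (j.gpus > 0 && j.gpus <= gpu_free) {
            running.push_back(j.name);
            gpu_free -= j.gpus;
            cpu_free -= j.cpus;
        }
    }
    // Pass 2: CPU-only jobs; the last one may slightly overcommit.
    for (const Job& j : jobs) {
        if (j.gpus == 0 && cpu_free > 0) {
            running.push_back(j.name);
            cpu_free -= j.cpus;
        }
    }
    return running;
}
```

On the example above (1 CPU, 1 GPU), this runs job 3) on the GPU and job 1) on the CPU, instead of leaving the GPU idle.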

This leads to the following policy:

Unresolved issues

Apps that use GPUs use the CPU as well. The CPU part is typically a polling loop: it starts a "kernel" on the GPU, waits for it to finish (checking, say, once per .01 sec), then starts another kernel.

If there's a delay between when the kernel finishes and when the CPU starts another one, the GPU sits idle and the entire program runs slowly.
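The polling loop described above might look roughly like this (the GPU interface is stubbed out for illustration; a real app would make CUDA calls at these points):

```cpp
#include <chrono>
#include <thread>

// Stubbed GPU interface: simulates a kernel that takes a few poll
// intervals to finish. A real app would launch a CUDA kernel and
// query the stream/event instead.
static int polls_left = 0;
void launch_next_kernel() { polls_left = 3; }
bool kernel_done() { return --polls_left <= 0; }

// CPU side of a typical GPU app: launch a kernel, poll once per
// 10 ms until it finishes, then launch the next one. If this thread
// is delayed between kernels, the GPU sits idle.
int run_kernels(int nkernels) {
    int total_polls = 0;
    for (int i = 0; i < nkernels; i++) {
        launch_next_kernel();
        while (!kernel_done()) {
            total_polls++;
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
        }
    }
    return total_polls;
}
```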