wiki:SchedMatch

Version 1 (modified by davea, 16 years ago) (diff)

--

Scheduler job matchmaking

Currently the scheduler's work-sending algorithm is:

  • if host is reliable, scan entire job array, looking for retries
  • if using HR, scan entire job array, looking for jobs committed to this HR class
  • if still need work, scan entire job array with no restrictions

This is bad for several reasons:

  • inefficient: repeated array scans
  • inflexible policy: why should reliable be more important than HR?
  • hard to add new policies, like sending "hard" jobs to fasts hosts

To solve these problems we're going to change this part of the scheduler. Basic idea: given a job J and a host H, there is a function V(J, H) that represents the value of sending J to H. V(J, H) might reflect various factors:

  • the computational "hardness" of J
    • CPU/deadline ratio
    • RAM or disk requirements
  • H already has files required by J(or jobs already planned for sending have such files)
  • J is a retry and H is a fast/reliable host
  • J has already been committed to H's HR class

BOINC includes a default value function. Projects can tweak its weights, or define their own value function.

Details

functions:

fast_filter()
feasibility checks that can be done with no DB access
slow_filter()
feasibility checks that need DB access

When we scan a slot, the possibilities are:

  • slot is empty
  • slot is locked by another scheduler
  • job fails fast filter
  • job fails slow filter
  • none of the above

Parameters:

N
scan at least this many slots (if scan N slots and have enough jobs to send, stop)
M
scan at most this many slots (even if don't have enough jobs to send yet)
L
if scan this many locked slots, print warning msg (should increase shmem size)

logic:

acquire semaphore
i = random index in shmem
x = ordered list of jobs to send (empty)
slots_scanned = 0
slots_locked = 0
loop
	i = i+1 % array_size
	slots_scanned++
	if slots_scanned > M
		break
	if shmem[i] is empty
		continue
	if shmem[i] is locked
		slots_locked++
		continue
	j = shmem[i]
	if !fast_filter(j) continue;
	v = v(h, j)
	if v > lowest score in x
		lock j
		release semaphore
		if !slow_filter(j)
			acquire semaphore
			unlock j
			continue
		acquire semaphore
		if x satisfies client's work request
			unlock and remove lowest-score element of x
		add j to x
	if x satisfies client's work request and slots_scanned >= N
		break;
for each job j in x
	mark slot j as empty
release semaphore
if slots_locked > L
	print "need bigger array" message