Changes between Version 7 and Version 8 of JobSizeMatching


Timestamp:
Apr 19, 2013, 1:15:48 PM
Author:
davea
Comment:

--

  • JobSizeMatching

    v7 v8  
    77Having a single job size can therefore present problems:
    88
    9  * If the size is too small, hosts with GPUs get huge numbers of jobs.
     9 * If the size is small, hosts with GPUs get huge numbers of jobs.
    1010   This causes performance problems on the client
    1111   and a high DB load on the server.
    12  * If the size is too large, slow hosts can't get jobs,
     12 * If the size is large, slow hosts can't get jobs,
    1313   or they get jobs that take weeks to finish.
    1414
     
    1818
    1919We'll assume that jobs for a given application can be generated
    20 in several discrete '''size classes'''
    21 (the number of size classes is a parameter of the application).
     20in several discrete '''size classes''';
     21the number of size classes is a parameter of the application.
    2222
    2323BOINC will try to send jobs of size class i
    2424to devices whose effective speed is in the ith quantile,
    25 where 'effective speed' is the product of the
    26 device speed and the host's on-fraction.
     25where 'effective speed' is the product of the device speed and the host's on-fraction.
    2726
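
As a rough illustration of the quantile matching described above, the following Python sketch computes a device's effective speed and maps it to a size class. The function names, the equal-count way of choosing cut points, and the use of Python are all assumptions made for illustration; none of this is taken from BOINC itself.

```python
from bisect import bisect_right


def effective_speed(device_speed, on_fraction):
    """Effective speed is the device speed times the host's on-fraction."""
    return device_speed * on_fraction


def quantile_bounds(speeds, n_size_classes):
    """Return n_size_classes - 1 cut points that split the sorted speeds
    into roughly equal-count groups (an assumed, simple method).
    Expects a non-empty list of speeds."""
    ordered = sorted(speeds)
    return [ordered[(len(ordered) * k) // n_size_classes]
            for k in range(1, n_size_classes)]


def size_class(speed, bounds):
    """0-based index of the quantile that a given effective speed falls in."""
    return bisect_right(bounds, speed)
```

With these hypothetical helpers, a device whose effective speed lies at or above the highest cut point is matched with the largest size class, and one below the lowest cut point with the smallest.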
    2827This involves 3 new integer DB fields:
     
    3231
    3332The size class of a job is specified in the call to create_work().
     33
     34Apps with n_size_classes > 1 are called '''multi-size apps'''.
     35A project can have both multi-size and non-multi-size apps.
    3436
    3537Notes:
     
    4749The order statistics of device effective speed will be computed
    4850by a new program '''size_census'''.
    49 For each app with n_size_classes>1 this does:
     51For each multi-size app this does:
    5052
    5153 * enumerate host_app_versions for that app
     
    5961== Scheduler changes ==
    6062
    61 When the scheduler sends jobs of a given app to a given processor,
     63When the scheduler sends jobs of a given multi-size app to a given processor,
    6264it should preferentially send jobs whose size class matches
    6365the quantile of the processor.
     
    7981 * For each job, compute a "score" that includes various factors.
    8082   (reliable, beta, previously infeasible, locality scheduling lite).
    81  * Include a factor for job size;
     83 * For multi-size apps, include a factor for job size;
    8284   decrement the score of jobs that are too small,
    8385   and decrement more for jobs that are too large.
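
The job-size factor in the last bullet could be sketched as follows. This is only an assumed shape for the adjustment: the function name, the penalty values, and the choice to scale the penalty by the distance between size classes are hypothetical, since the text says only that too-small jobs are decremented and too-large jobs are decremented more.

```python
def size_score_adjustment(job_size_class, host_quantile,
                          too_small_penalty=1.0, too_large_penalty=2.0):
    """Return a non-positive amount to add to a job's score.

    A job whose size class matches the host's quantile is not penalized.
    Too-small jobs lose too_small_penalty per class of difference;
    too-large jobs lose the (larger) too_large_penalty per class.
    """
    diff = job_size_class - host_quantile
    if diff < 0:
        # Job is smaller than ideal for this host.
        return -too_small_penalty * (-diff)
    if diff > 0:
        # Job is larger than ideal: penalized more heavily.
        return -too_large_penalty * diff
    return 0.0
```

The adjustment would be added to the other score factors only for multi-size apps; for other apps it would simply be skipped.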
     
    9698  and the resource load maintaining a job array of that size.
    9799 * All other factors being equal, the scheduler will send jobs of other apps
    98   rather than send a wrong-size job.
    99   This could potentially lead to starvation issues; we'll have to see.
     100  rather than send a job of non-optimal size class.
     101  This could potentially lead to starvation issues; we'll have to see if this is a problem.
    100102
    101103== Regulating the flow of jobs into shared memory ==
     
    114116
    115117Instead, we'll do the following:
    116  * when jobs are created (in the transitioner) set their state to
    117   INACTIVE rather than UNSENT.
    118   This is done if app.n_size_classes > 1
     118 * when jobs are created for a multi-size app (in the transitioner),
     119  set their state to INACTIVE rather than UNSENT.
    119120 * have a new daemon ('''size_regulator''') that polls for the number of unsent
    120121  jobs of each type, and changes a few jobs from INACTIVE to UNSENT