Bossa overview

Bossa is a software framework for "distributed thinking": the use of volunteers on the Internet to perform tasks that require human intelligence, knowledge, or cognitive skills. Examples of such projects include Stardust@home and GalaxyZoo. Bossa simplifies the task of creating projects like these. It serves roughly the same function as Amazon's Mechanical Turk, but is simpler and more powerful, does not involve payment, and is open source.

Volunteers have different skill levels; they may do tasks well or poorly, and a few of them may intentionally do tasks incorrectly. One can achieve an overall level of accuracy that is higher than the population average using replication - having multiple volunteers do each task, and comparing the results. Replication is also useful for tasks that do not have a unique correct answer, and for which you want to collect alternatives.
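To make the benefit of replication concrete, here is a minimal sketch in Python (not part of Bossa). It assumes a yes/no task, volunteers who answer independently, and the same accuracy p for every volunteer; the function name majority_accuracy is hypothetical. It computes the probability that a strict majority of n volunteers gives the correct answer.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent volunteers,
    each correct with probability p, gives the correct yes/no answer."""
    if n % 2 == 0:
        raise ValueError("use an odd number of replicas to avoid ties")
    k_min = n // 2 + 1  # smallest number of correct answers that wins the vote
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, round(majority_accuracy(0.8, n), 4))
```

Under these assumptions, volunteers who are each right 80% of the time reach about 89.6% accuracy with three replicas and about 94.2% with five. Real volunteers are neither independent nor equally skilled, so these figures are only indicative.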

Distributed thinking projects have widely varying properties and hence different requirements. Bossa doesn't directly provide all needed features; rather, it makes it easy for you to implement what you need. The division of labor is as follows:

  • Bossa provides the mechanisms, including database (MySQL) support.
  • You (the project) define the policies by writing PHP functions; you need to have basic knowledge of PHP to use Bossa.

The software structure of Bossa is as follows.

Bossa provides two web pages: bossa_get_job.php to get a new job, and bossa_job_finished.php to handle a finished job. These call various application-supplied callback functions, which in turn call Bossa API functions.

We now discuss the various policies you can control, and the corresponding Bossa mechanisms.

Job and result representation

The data needed to display a job may include various filenames, numbers, etc. This info is stored in an opaque data structure, an arbitrary PHP structure that you supply when you create the job. Bossa stores the opaque data in a serialized (textual) form in the database. When a job is to be displayed to a volunteer, Bossa calls one of your callback functions (job_show()), passing it the opaque data.
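The round trip described above can be sketched as follows. This is an illustration in Python, not Bossa's PHP code: the dictionary db stands in for the database, JSON stands in for PHP serialization, and store_job, display_job, and the field names image and zoom are all hypothetical. Only the callback name job_show() comes from the text.

```python
import json


def store_job(db: dict, job_id: int, opaque: dict) -> None:
    """Save the application's opaque job data as text (stand-in for a DB row)."""
    db[job_id] = json.dumps(opaque)


def job_show(opaque: dict) -> str:
    """Application callback: only the application knows what the fields mean."""
    return f"Show {opaque['image']} at zoom {opaque['zoom']}"


def display_job(db: dict, job_id: int) -> str:
    """Framework side: decode the stored text and hand it to the callback."""
    return job_show(json.loads(db[job_id]))


if __name__ == "__main__":
    db = {}
    store_job(db, 1, {"image": "tile_0042.png", "zoom": 2})
    print(display_job(db, 1))
```

The point of the design is that the framework never inspects the fields; it only stores and returns them.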

Similarly, the results of a completed job are stored in an opaque data structure defined by your application.

Job distribution policy

Applications may require different job distribution policies, i.e. different orders in which jobs are distributed. Consider the two cases:

  • Project A has a limited set of jobs and lots of volunteers.
  • Project B has an unbounded stream of jobs and a limited set of volunteers.

In project A the goal is to perform all jobs about the same number of times. Hence its best distribution policy is to issue all jobs once, then issue them all a second time, and so on. The longer the project runs, the more accurate the results become.

In contrast, project B has a predetermined "accuracy threshold". Its best distribution policy is to issue the first job to a set of volunteers sufficient to meet the accuracy threshold, then issue the second job to another set, and so on. The longer the project runs, the more jobs are finished.

Bossa's mechanism is that each job has a floating-point priority, and jobs are issued in order of decreasing priority. The manipulation of these priorities is up to you. Policies are encoded in PHP project policy functions that are invoked at various points (e.g., when jobs are issued, when they complete, and when they time out).
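The two policies can both be expressed as rules for adjusting priorities. The following Python sketch is an illustration only, not Bossa's API: the Job class and the functions next_job, on_issue_policy_a, on_issue_policy_b, and simulate are hypothetical, and a fixed replica count stands in for project B's accuracy threshold.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: float = 0.0
    times_issued: int = 0


def next_job(jobs):
    """Issue the job with the highest priority (ties go to the earliest job)."""
    return max(jobs, key=lambda j: j.priority)


def on_issue_policy_a(job):
    """Project A: lower the priority on every issue, so each job is issued
    once before any job is issued a second time."""
    job.times_issued += 1
    job.priority -= 1.0


def on_issue_policy_b(job, replicas_needed):
    """Project B: keep the priority until the job has been issued enough
    times, then retire it so the next job comes to the front."""
    job.times_issued += 1
    if job.times_issued >= replicas_needed:
        job.priority = float("-inf")


def simulate(policy, n_issues):
    """Return the order in which three equal-priority jobs are issued."""
    jobs = [Job("j1"), Job("j2"), Job("j3")]
    order = []
    for _ in range(n_issues):
        job = next_job(jobs)
        policy(job)
        order.append(job.name)
    return order


if __name__ == "__main__":
    print(simulate(on_issue_policy_a, 6))
    print(simulate(lambda j: on_issue_policy_b(j, 2), 6))
```

With policy A the jobs cycle (j1, j2, j3, j1, ...); with policy B each job receives all of its replicas before the next one starts.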

Volunteer assessment

Projects may want to assess the ability of each volunteer, and use this to determine how many replicas of each job to perform. Several factors might contribute to the ability estimate:

  • The volunteer's performance on a training course.
  • The volunteer's performance on a stream of calibration jobs (with known answers) intermixed with the job stream.
  • The extent to which the volunteer's response agrees with the "correct" response, as determined by replication.

In addition, the way in which ability is described may vary.

  • In simple cases it might be a single number, e.g. an error rate.
  • For tasks that involve feature detection we might want to track the rates of false positives and false negatives separately.
  • For more complex tasks, ability could have arbitrarily many dimensions.

Bossa lets you associate an opaque data structure with each volunteer. This can be initialized and updated however you want.
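As an example of what such a per-volunteer structure might hold, the sketch below tracks false positives and false negatives separately for a yes/no feature-detection task. It is written in Python for illustration; the record layout and the names new_ability, update_ability, and error_rates are assumptions, not something Bossa defines.

```python
def new_ability():
    """Initial per-volunteer record: outcome counts on jobs with known answers."""
    return {"true_pos": 0, "false_pos": 0, "true_neg": 0, "false_neg": 0}


def update_ability(ability, answer: bool, truth: bool):
    """Record one response; 'truth' is the known or agreed-upon answer."""
    if truth:
        key = "true_pos" if answer else "false_neg"
    else:
        key = "false_pos" if answer else "true_neg"
    ability[key] += 1
    return ability


def error_rates(ability):
    """Return (false-positive rate, false-negative rate).
    A rate is None until the volunteer has seen a job of that kind."""
    negatives = ability["false_pos"] + ability["true_neg"]
    positives = ability["false_neg"] + ability["true_pos"]
    fpr = ability["false_pos"] / negatives if negatives else None
    fnr = ability["false_neg"] / positives if positives else None
    return fpr, fnr
```

A replication policy could then, for instance, weight a volunteer's "feature present" answers by the false-positive rate and "feature absent" answers by the false-negative rate.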

Bossa provides a calibration job mechanism; jobs can be designated as calibration jobs, and Bossa can be instructed to randomly mix a given rate of calibration jobs into the job stream.
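One simple way to mix calibration jobs at a given rate is an independent random draw per request. The Python sketch below illustrates that idea only; it does not claim to be how Bossa implements it, and pick_job_kind and calibration_rate are hypothetical names.

```python
import random


def pick_job_kind(rng: random.Random, calibration_rate: float) -> str:
    """Return 'calibration' with probability calibration_rate, else 'regular'."""
    return "calibration" if rng.random() < calibration_rate else "regular"


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    kinds = [pick_job_kind(rng, 0.1) for _ in range(10_000)]
    print(kinds.count("calibration") / len(kinds))
```

Because each draw is independent, a volunteer cannot predict which jobs are calibration jobs from their position in the stream.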

Replication policy

An application's replication policy decides when additional instances of jobs are needed. Examples:

  • Get at least N finished instances of each job.
  • Based on the responses and skill levels of the volunteers who have completed instances, decide whether the error probability is below a given threshold.

An application's replication policy is embodied in its job_finished(), a callback function that is called when an instance is finished. Using the Bossa API, this function can get the set of other instances and their associated users, and can decide whether new instances are needed, and if so, at what priority.
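The decision at the heart of such a callback might look like the following Python sketch. It is an illustration of the logic, not Bossa's PHP API: needs_more_instances and its parameters are hypothetical, and a simple agreement fraction stands in for a real error-probability estimate.

```python
from collections import Counter


def needs_more_instances(responses, min_instances=3, min_agreement=0.75):
    """Decide whether another instance of a job should be issued.

    responses: answers from the instances finished so far.
    Returns True if there are too few responses, or if the most common
    answer is shared by less than min_agreement of them.
    """
    if len(responses) < min_instances:
        return True
    top_count = Counter(responses).most_common(1)[0][1]
    return top_count / len(responses) < min_agreement


if __name__ == "__main__":
    print(needs_more_instances(["spiral", "spiral"]))
    print(needs_more_instances(["spiral", "elliptical", "spiral"]))
    print(needs_more_instances(["spiral", "elliptical", "spiral", "spiral"]))
```

A policy that takes volunteer skill into account would replace the agreement fraction with a weighted estimate.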

Use of experts

There are various ways in which "experts" might be used. Two general possibilities:

  • Experts do the same job, only better. For example, experts might be used to resolve cases in which no consensus is reached by non-experts. Or they might be used to verify rare features found by non-experts.
  • Experts do more sophisticated jobs. For example, non-experts might look for features, while experts classify them.

Bossa provides the following mechanism: volunteers can be assigned integer levels. Each job has a vector of priorities, one per level. Jobs are sent to a given volunteer in order of decreasing level-specific priority.

For example, suppose you want to use experts to resolve jobs for which N non-experts have failed to reach a consensus. Use level 0 for non-experts and level 1 for experts. In your job_finished() callback function, check for the case of N instances without consensus, and set the job's priority for level 1 to 2, a value chosen to be higher than the level-1 priority of regular jobs.

In this scenario, experts will always get unresolved jobs when any are available; otherwise, they will get regular jobs.
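The expert scenario can be sketched in Python as follows. This is an illustration, not Bossa's API: jobs are plain dictionaries, the names job_for_volunteer and mark_unresolved are hypothetical, regular jobs are assumed to have priority 1 at both levels, and a priority of 0 is assumed to mean "do not issue at this level".

```python
NON_EXPERT, EXPERT = 0, 1


def job_for_volunteer(jobs, level):
    """Pick the job with the highest priority at the volunteer's level.
    Jobs whose priority at that level is 0 are not issued (an assumption)."""
    eligible = [j for j in jobs if j["priority"][level] > 0]
    if not eligible:
        return None
    return max(eligible, key=lambda j: j["priority"][level])


def mark_unresolved(job):
    """N non-experts disagreed: stop issuing to level 0, and raise the
    level-1 priority to 2 so experts see this job before regular ones."""
    job["priority"] = [0.0, 2.0]


if __name__ == "__main__":
    jobs = [
        {"name": "regular", "priority": [1.0, 1.0]},
        {"name": "disputed", "priority": [1.0, 1.0]},
    ]
    mark_unresolved(jobs[1])
    print(job_for_volunteer(jobs, EXPERT)["name"])
    print(job_for_volunteer(jobs, NON_EXPERT)["name"])
```

Once the disputed job is resolved and its priorities are set to 0, experts fall back to regular jobs.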

Last modified on 07/30/08 10:32:32
