Research projects involving BOINC
- Improved scheduling policies
- Symbiotic job scheduling
- Data-intensive volunteer computing
- Virtualizing volunteer computing
- Analyze and improve adaptive replication
- Globalize resource share
- Generalize the BOINC credit system
- Predictable and accelerated batch/DAG completion
- Intra-project resource allocation
- Volunteer data archival
- Invisible GPU computing
- Homogeneous redundancy and GPUs
- Interfacing BOINC to Grid and Cloud systems
- Tracking and modeling user behavior and host availability
This page lists possible research projects involving BOINC and volunteer computing, appropriate for senior-level class projects or Master's theses. If you're interested, please contact David Anderson.
An out-of-date list of conferences related to volunteer computing is here.
The following software tools may be useful:
- The BOINC client emulator models a single client attached to one or more projects. It's useful for studying client scheduling policies.
- The EmBOINC BOINC server emulator models a BOINC server and a population of clients. It's useful for studying server scheduling policies.
- The scheduler driver can be used to measure the maximum throughput (jobs per second) of a BOINC back end.
Improved scheduling policies
Dozens of changes to BOINC's client and server scheduling policies have been proposed (see, for example, the archives of the boinc_alpha email list). Use the BOINC client emulator and/or EmBOINC to study some of these. Examples:
- Use variance (rather than just mean) throughout the scheduling system.
- Track and incorporate upload/download times in client scheduling.
- Change the scheduling model so that the scheduler can send a client short, low-latency jobs even if its buffer is full with long high-latency jobs.
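As an illustration of the first example above (using variance rather than just the mean), here is a minimal sketch of a deadline check. The function name, the sample runtimes, and the rule "mean plus k standard deviations" are assumptions made for illustration; they are not taken from the BOINC code.

```python
from statistics import mean, stdev


def fits_deadline(past_runtimes, seconds_to_deadline, k=0.0):
    """Return True if the estimate mean + k * stddev fits before the deadline.

    k = 0 reproduces a mean-only estimate; k > 0 adds a safety margin
    proportional to the observed spread. Both the rule and k are
    illustrative assumptions, not BOINC's actual policy.
    """
    mu = mean(past_runtimes)
    sigma = stdev(past_runtimes) if len(past_runtimes) > 1 else 0.0
    return mu + k * sigma <= seconds_to_deadline


# Two hypothetical applications with the same mean runtime (3600 s).
steady = [3600, 3650, 3550, 3600]
erratic = [1000, 6200, 1500, 5700]

for name, runtimes in (("steady", steady), ("erratic", erratic)):
    mean_only = fits_deadline(runtimes, 4000)
    with_margin = fits_deadline(runtimes, 4000, k=1.0)
    print(name, mean_only, with_margin)
```

A mean-only policy treats both applications identically, while the variance-aware check accepts only the steady one for a job due in 4000 seconds.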
Symbiotic job scheduling
It has been observed (see UCSD work) that some combinations of applications run more efficiently than others on multiprocessors, because of their respective memory access patterns. Instrument the BOINC client to measure this effect (i.e., to maintain logs of the rate of progress of applications running in different combinations). Estimate the overall throughput increase that could be gained by doing "symbiosis-aware" job scheduling. If it's worth doing, implement it.
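The logging and estimation steps might look like the sketch below. The class and function names and all rate values are hypothetical; a real study would take the rates from instrumented client logs.

```python
from collections import defaultdict


class ProgressLog:
    """Records an app's progress rate keyed by what else ran alongside it."""

    def __init__(self):
        self._samples = defaultdict(list)

    @staticmethod
    def _key(app, co_runners):
        # Sort co-runners so that order does not matter but multiplicity does.
        return (app, tuple(sorted(co_runners)))

    def record(self, app, co_runners, rate):
        self._samples[self._key(app, co_runners)].append(rate)

    def mean_rate(self, app, co_runners):
        samples = self._samples.get(self._key(app, co_runners))
        if not samples:
            raise KeyError(f"no samples for {app} with {co_runners}")
        return sum(samples) / len(samples)


def combo_throughput(log, combo):
    """Sum of mean progress rates when the apps in `combo` run together."""
    total = 0.0
    for i, app in enumerate(combo):
        others = combo[:i] + combo[i + 1:]
        total += log.mean_rate(app, others)
    return total


# Made-up rates (fraction of full speed): A is memory-bound and slows
# down next to another copy of itself; B is largely unaffected.
log = ProgressLog()
log.record("A", ("A",), 0.60)
log.record("A", ("B",), 0.90)
log.record("B", ("B",), 0.95)
log.record("B", ("A",), 0.95)

# Two dual-core hosts, two A jobs and two B jobs.
same_app = combo_throughput(log, ("A", "A")) + combo_throughput(log, ("B", "B"))
mixed = 2 * combo_throughput(log, ("A", "B"))
print(f"same-app pairing: {same_app:.2f}, mixed pairing: {mixed:.2f}")
```

Comparing the two totals gives a first estimate of what symbiosis-aware scheduling could gain for this particular job mix.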
Data-intensive volunteer computing
Currently, most BOINC projects work as follows:
- Data is stored on the server
- Input files are sent to the client, and jobs are run against them. When done, the files are deleted from the client.
- Output files are sent back to the server.
This architecture doesn't scale well for data-intensive computing. There are various alternatives:
- Workflows: DAGs of tasks connected by intermediate temporary files. Schedule them so that temporary files remain local to the client most of the time.
- Stream computing: e.g., IBM Infosphere
- Models that involve computing against a large static dataset: e.g. MapReduce, or Amazon's scheme in which they host common scientific datasets, and you can use EC2 to compute against them.
BOINC has some features that may be useful in these scenarios: e.g., locality scheduling and sticky files. It lacks some features that may be needed: e.g., awareness of client proximity, or the ability to transfer files directly between clients.
Virtualizing volunteer computing
The volunteer computing host population is highly heterogeneous in terms of software environment (operating system type and version, system libraries, installed packages). Projects are faced with the difficult task of building application versions for all these different environments; this is a significant barrier to the usage of volunteer computing.
This problem can be mitigated using virtual machine technology. In this approach, a hypervisor such as VMware or VirtualBox is installed (manually or automatically) on volunteer hosts. An application consists of a virtual machine image containing the application proper together with the required libraries and packages. A "wrapper" program provides an interface between the BOINC client and the hypervisor, so that, for example, the application can be suspended and resumed in accordance with user preferences.
Some of this has already been done; see VmApps.
A related goal is to minimize VM image sizes; people at CERN are working on this.
A higher level goal is to implement a "volunteer cloud" model, and to develop tools to facilitate its use by computational scientists. A collaboration between CERN and INRIA exists in this area.
Analyze and improve adaptive replication
Because volunteer hosts may be error-prone or malicious, volunteer computing requires result validation. One way to do this is by replication: run each job on 2 computers and make sure the results agree.
To reduce the overhead of two-fold replication, in which half of all computation is redundant, BOINC has a mechanism called "adaptive replication" that runs jobs with no replication on hosts with low error rates, while continuing to randomly intersperse replicated jobs.
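A simplified model of such a policy is sketched below. The trust rule (a run of consecutive valid results), the threshold, and the spot-check probability are all assumptions chosen for illustration; they are not BOINC's actual rule or parameters.

```python
import random


def should_replicate(consecutive_valid, rng, threshold=10, spot_check_prob=0.1):
    """Decide whether a job sent to a host also needs a second instance.

    Hosts below the (hypothetical) trust threshold are always replicated;
    trusted hosts are spot-checked at random.
    """
    if consecutive_valid < threshold:
        return True
    return rng.random() < spot_check_prob


def expected_overhead(trusted_fraction, spot_check_prob):
    """Expected extra instances per job, relative to no replication."""
    untrusted_fraction = 1.0 - trusted_fraction
    return untrusted_fraction * 1.0 + trusted_fraction * spot_check_prob


rng = random.Random(1)
print(should_replicate(0, rng))      # a new host is always replicated
print(expected_overhead(0.9, 0.1))   # extra instances per job in this model
```

In this model, full two-fold replication corresponds to an overhead of 1.0 extra instance per job; with 90% of jobs going to trusted hosts and a 10% spot-check rate, it drops to about 0.19. The research questions above concern how an adversary could exploit the trusted state, which this sketch does not model.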
The project is to identify possible counter-strategies for adaptive replication, to establish bounds on the overall effectiveness of adaptive replication, and to identify refinements that increase the effectiveness.
A related project is to prove the effectiveness (or ineffectiveness) of BOINC's mechanism to defeat 'cherry picking': completing only short jobs in an effort to get credit unfairly.
Globalize resource share
Currently resource share is enforced on a per-host basis; each host's processing resources are allocated in a way that matches resource share as closely as possible. This can be improved in 2 ways:
First, enforce resource share on a per-volunteer basis. Suppose a volunteer is attached to projects A and B, and has two hosts, H and J. Suppose A runs efficiently on H (e.g. because H has a GPU supported by A) and B runs efficiently on J. Then we can increase overall throughput by using H entirely for A, and J entirely for B.
Second, enforce resource share on a global (cross-volunteer) basis. Suppose volunteer V supports project A but has hosts that run project B efficiently, and that V trusts B. Suppose volunteer W is in the symmetric situation. Then we can increase overall throughput by using V's hosts for B, and W's hosts for A. (This idea is due to Arnaud Legrand from INRIA; please contact him).
In both cases, a study should be done before implementing anything.
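Such a study could start from a back-of-the-envelope calculation like the one below, which works through the two-host example with equal resource shares. The speeds (jobs per day) are invented, and the function name is hypothetical.

```python
def throughput(speed, time_fraction):
    """Jobs/day each project receives, given per-host time fractions."""
    totals = {}
    for host, fractions in time_fraction.items():
        for project, frac in fractions.items():
            totals[project] = totals.get(project, 0.0) + frac * speed[host][project]
    return totals


# Invented speeds: H is fast for project A, J is fast for project B.
speed = {
    "H": {"A": 10.0, "B": 2.0},
    "J": {"A": 2.0, "B": 10.0},
}

# Per-host enforcement: every host splits its time 50/50.
per_host = throughput(speed, {
    "H": {"A": 0.5, "B": 0.5},
    "J": {"A": 0.5, "B": 0.5},
})

# Per-volunteer enforcement: each host runs only the project it suits.
per_volunteer = throughput(speed, {
    "H": {"A": 1.0, "B": 0.0},
    "J": {"A": 0.0, "B": 1.0},
})

print("per-host:     ", per_host)
print("per-volunteer:", per_volunteer)
```

With these numbers both schemes give the two projects equal throughput, but the per-volunteer allocation completes 20 jobs per day in total rather than 12.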
Generalize the BOINC credit system
The idea of "credit" - a numerical measure of work done - is essential to volunteer computing, as it provides volunteers an incentive and a basis for competition. Currently, BOINC's credit mechanism is based on the number of floating-point operations performed.
The project is to design a new credit system where a) credit can be given for resources other than computing (e.g., storage, network bandwidth); b) the credit given per FLOP can depend on factors such as RAM size and job turnaround time. Ideally the system allows a game-theoretic proof that it leads to an optimal allocation of resources.
Predictable and accelerated batch/DAG completion
Support the notion of a batch (or, more generally, a DAG) of jobs. Allow users (i.e., scientists using a project) to get a priori estimates of batch completion time. Develop scheduling policies that optimize batch completion (i.e., minimize makespan). This may require major changes to BOINC's scheduling mechanisms, which were originally designed to maximize the number of jobs completed per day, not minimize the turnaround time of individual jobs.
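As a starting point for a priori estimates, the sketch below computes the makespan of a batch of independent jobs under a greedy earliest-finish-time assignment. It assumes hosts are always available, never fail, and have known speeds, none of which holds in volunteer computing; the function name and units are hypothetical.

```python
def estimate_makespan(job_sizes, host_speeds):
    """Greedy earliest-finish-time estimate of batch completion time.

    job_sizes: work per job (e.g. in GFLOPs); host_speeds: work per second.
    Jobs are placed largest first on whichever host would finish them soonest.
    """
    if not host_speeds:
        raise ValueError("at least one host is required")
    free_at = [0.0] * len(host_speeds)
    for size in sorted(job_sizes, reverse=True):
        finish_times = [free_at[h] + size / host_speeds[h]
                        for h in range(len(host_speeds))]
        best = min(range(len(host_speeds)), key=finish_times.__getitem__)
        free_at[best] = finish_times[best]
    return max(free_at)


# Four equal jobs on one fast host and one slow host.
print(estimate_makespan([4, 4, 4, 4], [2, 1]))
```

A realistic estimator would also have to model host availability, job errors, and retries, which is part of what makes this a research project.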
Intra-project resource allocation
Currently BOINC has only crude mechanisms for allocating resources within a project (e.g., a project that serves multiple scientists). Design a quota system. See PortalFeatures.
Volunteer data archival
While BOINC is currently used for computation, it also provides primitives for distributed data storage: file transfers, queries, and deletion. The project is to develop a system that uses these primitives to implement a distributed data archival system that uses replication to achieve target levels of reliability and availability.
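To give a feel for the replication levels involved, the sketch below computes how many replicas a file needs so that at least one copy is reachable with a target probability. It assumes every host is available independently with the same probability, which is a strong simplification; the function name is hypothetical.

```python
def replicas_needed(host_availability, target):
    """Smallest n such that 1 - (1 - p)**n >= target.

    Assumes hosts are available independently, each with probability p.
    """
    if not (0.0 < host_availability < 1.0 and 0.0 < target < 1.0):
        raise ValueError("both arguments must lie strictly between 0 and 1")
    miss = 1.0 - host_availability
    n = 1
    while 1.0 - miss ** n < target:
        n += 1
    return n


# Hosts that are reachable half the time, target availability of 99.9%.
print(replicas_needed(0.5, 0.999))
```

Correlated downtime (for example, hosts in the same time zone being switched off at night) would raise the required number of replicas, and durability against permanent host departure needs a separate analysis.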
Invisible GPU computing
BOINC has recently added support for GPU computing, and several projects now offer applications for NVIDIA and ATI GPUs. One problem with this is that GPU usage is not prioritized, so when a science application is running, the performance of user-visible applications is noticeably degraded. As a result, BOINC's default behavior is that science applications are not run while the computer is in use (i.e., while there has been recent mouse or keyboard activity).
The project (in collaboration with NVIDIA and possibly AMD/ATI) is to make changes to BOINC and to the GPU drivers so that the GPU can be used as much as possible, even while the computer is in use, without impacting the performance of user-visible applications.
Somehow detect when a non-BOINC app wants the GPU, and stop BOINC apps?
Possibly limit the number of GPU processors used by BOINC apps - is this possible?
Homogeneous redundancy and GPUs
As with CPUs, different GPU models, vendors, and language systems can potentially compute different results for a given job. Generalize the homogeneous redundancy mechanism to handle GPU and multithreaded applications.
Design a library that can be linked to an existing application, providing a lightweight BOINC client. When the application is run, the library contacts projects and gets "microjobs" (small or variable-size jobs). It computes while the app is open. When the app exits, the library reports whatever computation has been done.
Interfacing BOINC to Grid and Cloud systems
At a high level, BOINC can be seen as a way to create and access pools of computing nodes (albeit untrusted, sporadically available, etc.). Research the issues in interfacing these resource pools to existing Grid- and Cloud-based systems for distributed computing.
Tracking and modeling user behavior and host availability
Monitor and maintain statistics of the user input process (mouse, keyboard activity). If the user has been idle for X seconds, with what probability will they be idle for another Y seconds? Use this info to make better decisions about when to run jobs.
Similarly, monitor and maintain statistics about host availability (not just the available fraction) and use this to improve scheduling decisions, e.g. for jobs that don't checkpoint.
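The idle-time question above can be answered empirically from a log of past idle periods. The sketch below estimates the conditional probability directly from counts; the function name and the sample durations are made up for illustration.

```python
def prob_stays_idle(idle_durations, x, y):
    """Estimate P(idle period >= x + y | idle period >= x) from past periods.

    Returns None if no recorded period lasted at least x seconds.
    """
    reached_x = [d for d in idle_durations if d >= x]
    if not reached_x:
        return None
    return sum(d >= x + y for d in reached_x) / len(reached_x)


# Made-up lengths (in seconds) of past idle periods on one host.
durations = [30, 45, 60, 400, 900, 1200, 3600, 20, 15, 5000]

print(prob_stays_idle(durations, 0, 600))    # at the start of an idle period
print(prob_stays_idle(durations, 300, 600))  # after 5 minutes of idleness
```

In this sample, the chance of another 10 idle minutes is 0.4 at the start of an idle period but 0.8 once the user has already been idle for 5 minutes, which is the kind of signal a client could use when deciding whether to start a job.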