API paper

A Runtime System for Volunteer Computing

David P. Anderson
Carl Christensen
Bruce Allen

Abstract

Volunteer computing is a form of distributed computing in which processing and storage resources are volunteered, primarily by members of general public. A middleware system for volunteer computing includes server components (which distribute and collect tasks) and software that runs on the volunteered hosts. At the heart of the latter software is a runtime system that includes process management, checkpoint/restart support, and file access functions. This layer has several requirements: 1) it must handle applications with widely varying process structure, file configuration, and task duration; 2) it must support 'invisible computing', usage preferences, screensaver graphics, and other volunteer-oriented features; 3) it should work on many platforms.
BOINC is a middleware system for volunteer computing, and it is currently used by 20 projects doing a variety of computational research, with about 500,000 volunteers and 800,000 computers supplying 350 TeraFLOPS of processing power. The BOINC client software runs on Windows, Macintosh, OS/2, and most Unix platforms. This paper describes the problems in designing a runtime system for volunteer computing, and the ways these problems are solved in BOINC.

1. Introduction

1.1 Volunteer computing and BOINC

Volunteer computing is a form of distributed computing in which processing and storage resources are volunteered, primarily by members of general public. Early volunteer computing projects include the Great Internet Mersenne Prime Search [10], SETI@home [1], Distributed.net [6] and Folding@home [12].

BOINC (Berkeley Open Infrastructure for Network Computing) is a middleware system for volunteer computing [2]. BOINC is being used by a number of projects, including SETI@home, Climateprediction.net [5], LHC@home [13], Predictor@home [16], and Einstein@Home [7]. Volunteer computing is being used in high-energy physics, molecular biology, medicine, astrophysics, climate study, and other areas.

Volunteers participate by running BOINC client software on their computers ('volunteer hosts', or 'hosts'). They can 'attach' each computer to any set of projects, and can control the resource fraction devoted to each project (see Figure 1).

Figure 1: BOINC volunteers can attach their hosts to any set of projects, and can assign a resource share to each project.

BOINC software includes server-side components, such as scheduler and related daemon programs that manage task distribution and collection [REF], and web-based interfaces for volunteers and project administrators. In this paper we are concerned only with the part of the BOINC software that runs on volunteer hosts.

1.2 What volunteers want

Volunteer computing requires volunteers, and that requires attracting them and keeping them happy. How can middleware contribute to this?

First, it is critical that BOINC's activity be 'invisible': that is, volunteers should not perceive processing or network slowness, exhaustion of disk space, and so forth, as a result of BOINC's activity. Volunteer hosts are mostly desktop PCs. BOINC runs scientific applications except while the host is in use (and in some cases even then).

Second, volunteers should be able to limit when and how their computing resources are used - to specify, in effect, the tradeoff between invisibility and productivity. In BOINC, volunteers can specify general preferences, including:

The hours during which computation can be done.
The minimum interval between periods of disk activity (this is useful for laptops whose disks spin down to conserve power).
Whether applications should be left in memory while not executing.
How many CPUs (on a multiprocessor system) BOINC should use.

Different preferences can be associated with different hosts; for example, a volunteer can define different preferences for hosts at home and at work. Preferences are maintained on a project's database, and edited using a web-based interface. They are automatically propagated to

In addition, volunteers can specify project-specific preferences for each project. These can specify, for example, to control the graphics generated by that project's applications.

Third, should have the option of seeing graphics that show what the application is doing, either in a windows or as a screensaver.

Finally, volunteers must be given credit for the work done by their hosts. BOINC has a system in which work done (a combination of computation, storage and network usage) is represented by a numeric measure, or 'credit'. This is used to display web-based leaderboards, graphs in the client GUI, and other visual forms. BOINC uses various mechanisms to ensure that credit is granted fairly.

1.3 Structure of BOINC client software

The BOINC client software consists of the following components (see Figure 2):

Figure 2: the BOINC client software structure

Applications are project-supplied programs.
A core client program coordinates everything. It communicates with project scheduling servers to get and report work. It uploads and downloads files. It executes applications, and acts as a CPU scheduler for them. On a multiprocess with N CPUs, it attempts to keep N applications running at once. It does preemptive round-robin scheduling among applications, so that volunteers see a variety of activity.
The BOINC Manager provides a graphical user interface allowing volunteers to view and control computation status. For each task, it shows the fraction done and the estimated time to completion (see Figure X). It communicates with the core client through a set of remote procedure calls (GUI RPCs).
A BOINC screensaver (if selected by the volunteers) runs when the computer is active. It normally doesn't generate screensaver graphics itself, but rather communicates with the core client, asking that one of the running applications generate full-screen graphics.

1.4 The BOINC runtime system

BOINC provides an Application Programming Interface (API) that provides functions such as process management, checkpoint/restart support, and file access. These functions are implemented by a runtime system consisting of a library that is linked with applications, together with code in the BOINC core client that interacts with this library.

BOINC's application runtime system corresponds to analogous layers of systems for batch processing, distributed computing, and Grid computing. However, BOINC has several types of requirements that differ from those of other systems:

BOINC must give volunteers what they want: it must compute as invisibly as possible, respect the volunteers's preferences, show graphics if needed, and show information such as credit and fraction done.
BOINC must handle applications that differ along many axes. Some consist of a single process, while others are a pipeline of programs sequenced by a separate coordinator. Some are legacy FORTRAN programs, others are newly-written C++. Some involve tasks that take only a few CPU seconds, while others use several months per task. Some applications are not CPU-intensive, and must be scheduled accordingly.
BOINC seeks to support as many platforms as possible. It runs on operating systems (such as Windows, OS/2 and Unix) that have differing primitives for process creation, control and communication.
Volunteer hosts are unmanaged. BOINC must recover from all failure situations without user involvement.

This paper

2. Process and IPC structure

2.1 Shared-memory message-passing

The central design decision in the implementation of the BOINC API is: how should the core client interact with applications? Operating systems offer a variety of mechanisms for interprocess communication, process control, and synchronization. For example, POSIX-compliant systems such as Linux [REF] have signals, semaphores, and pipes. Windows has mutexes, messages, and various system calls for process and thread control.

We chose to base the BOINC runtime system on shared-memory message-passing. For each application it executes, the core client creates a shared memory segment shared by itself and the application. The shared memory contains a data structure representing a number of unidirectional message channels. Each channel consists of a fixed-size buffer and a flag indicating whether a message is present. The sender checks if the flag is clear. If so, it copies the message to the buffer and sets the flag.

If the sender finds the message buffer full, in some cases it is OK to discard the message. Otherwise, the sender can maintain a queue of unsent messages and sends them from a polling function.

The BOINC API implementation uses eight message channels, four in each direction. For example, one channel carries process control messages (telling the application to suspend, resume or quit) while another one conveys graphics-related messages (telling the application to create or destroy a graphics window).

Figure 3: shared-memory message passing

Compared with alternatives (such as using pipes, sockets, or signals) the use of shared-memory message passing has several advantages:

All operating systems supported by BOINC have a shared-memory facility.
Handling compound applications is fairly simple using shared memory. Each of the components of the application is typically interested in some subset of the message channels; for example, the graphics program would monitor the graphics channel, while the master program would monitor the process control channel.

2.2. Application thread structure

The thread structure of the BOINC runtime system is shown in Figure 3.

Figure 3: The thread structure of the BOINC runtime system

The process structure is slightly different between Unix and Windows.

In Windows, the worker thread executes the main application, and a graphics thread executes a GUI event loop and graphics code (see below). A third 'multimedia timer' thread is created to handle timers.
In Unix, a worker thread executes the main application, a graphics thread executes a GUI event loop. A SIGALRM signal handler executes in the worker thread; to do this, the SIGALRM signal is blocked in each of the other threads, forcing it to be handled in the worker thread. This is necessary because the Unix pthreads interface does not provide a mechanism that allows one thread to get the CPU time of another thread. Also, to suspend/resume the worker thread it is necessary to have an asynchronous signal handler. A separate thread executes a timer function once per second. This performs periodic actions such as sending messages to the core client (see below). This is done in a separate thread (rather than in the SIGALRM handler) because it uses some C library functions (such as printf) that cannot safely be called from an asynchronous signal handler.

2.3. Simple and compound applications

Simple applications are those that consist of a single program; scientific code, graphics code, and the BOINC runtime library reside and execute in a single address space. Compound applications consist of a master program that runs one or more worker programs. The master program, for example, might run pre-processing, main, and post-processing programs in sequence. It might launch one or more other processes (e.g. coupled climate models) that communicate via shared memory but are controlled by the master program. It might run a graphics program concurrently with a scientific program. The master program typically handles the suspend/quit/resume functionality (described below), as well as managing the reporting of CPU time and fraction done (see Section X).

Figure 4: Compound application

3. Process control and synchronization

3.1 Process control

The core client needs to suspend, resume, and quit applications for various reasons:

Preemptive scheduling (suspend or exit)
Computation suspended because of time of day or user active
User exits BOINC

These functions are performed by sending messages to the process control channel. In Unix this is monitored by the timer thread; it sets a 'suspended' flag which is checked by the SIGALRM handler. If set, it sleeps for a second. In Windows, it is monitored by the timer event, which controls the worker thread using calls such as SuspendThread().

There are various project-configurable options for handling process control. In the simplest case, the BOINC client can handle all of the suspend/resume/quit functionality for a project's application. But it is also possible for an application to receive and handle process control messages from BOINC. In this way, for example, the various sub-processes of a compound application can be shut down more "cleanly", i.e. a common shared-memory segment can be detached and closed first before quitting. In the event of a compound application not responding to a quit request in a suitable time period, the BOINC client can force a kill of the application and sub-processes (via a KILL signal in Unix or TerminateProcess() in Windows).

Override: send KILL signal

3.2 Heartbeats

Sometimes the core client exits unexpectedly; for example, when it crashes. In these situations, a mechanism is needed that will cause applications to exit eventually. We do this using heartbeat messages, which are sent once per second from the core client to each application. If an application doesn't get a heartbeat message for 30 seconds, it exits.

In some cases, processes need to make an action atomic with respect to exiting. For example, they might need to write a file or group of files. BOINC provides API functions

boinc_begin_critical_section();
boinc_end_critical_section();

boinc_begin_critical_section() sets a variable in the runtime library. Any code that would otherwise exit (heartbeat timeout, handling of 'quit' messages') checks this variable and defers quitting if it's set.

4. Sandboxing

The BOINC runtime system doesn't 'sandbox' applications. However, it guards against certain types of application misbehavior. Each task has specified limits on memory usage, disk usage, and computation (number of floating-point operations). The runtime system periodically measures these quantities and reports them to the core client. If the limits are exceeded, the core client kills the application.

5. Status and credit reporting

5.1 CPU time and VM usage

CPU time reporting

5.2 Fraction done

   boinc_fraction_done(double fraction_done);

The fraction_done argument is an estimate of the workunit fraction complete (0 to 1). This function is fast and can be called frequently.

The core client GUI displays the percent done of workunits in progress. To keep this display current, an application should periodically call
Used to estimate time remaining (important for scheduling purposes)

5.3 Credit reporting

boinc_fpops_per_cpu_sec() boinc_fpops_total()

6. Directory structure and file access

The directory structure used by the BOINC runtime system is as follows (see Figure 5).

For each project to which the host is attached, there is a directory (under 'project') that contains all files (input, output, executable) for that project.
For each task that is currently executing or is preempted, there is a directory (under 'slots') in which that task executes. Slot directories typically do not contain data files, but rather 'link files' (see below).

Figure 5: Directory structure

To access files, applications call

    int boinc_resolve_filename(char *logical_name, char *physical_name);

to convert logical file names to physical names. For example, instead of

    f = fopen(\"my_file\", \"r\");

the application might use

    string resolved_name;
    retval = boinc_resolve_filename(\"my_file\", resolved_name);
    if (retval) fail(\"can't resolve filename\");
    f = boinc_fopen(resolved_name.c_str(), \"r\");

boinc_resolve_filename() doesn't need to be used for temporary files.

Applications should replace fopen() calls with

boinc_fopen(char* path, char* mode);

This deals with platform-specific problems. On Windows, where security and indexing programs can briefly lock files, boinc_fopen() does several retries at 1-second intervals. On Unix, where signals can cause fopen() to fail with EINTR, boinc_fopen checks for this and does a few retries.

7. Checkpointing

Computations that use a significant amount of time per work unit may want to periodically write the current state of the computation to disk. This is known as checkpointing. The state file must include everything required to start the computation again at the same place it was checkpointed. On startup, the application reads the state file to determine where to begin computation. If the BOINC client quits or exits, the computation can be restarted from the most recent checkpoint.

Frequency of checkpointing is a user preference (e.g. laptop users might want to checkpoint infrequently). An application must call

    int boinc_time_to_checkpoint();

whenever it reaches a point where it is able to checkpoint. If this returns nonzero, the application should call a 'checkpoint function' that writes the state file and flushes all output files, then should call

    void boinc_checkpoint_completed();

boinc_time_to_checkpoint() is fast, so it can be called frequently (hundreds or thousands of times a second). These functions automatically make checkpointing into a critical section (see section X).

7.1 Output file integrity

Typical BOINC applications write incrementally to output files. It an application is preempted by quitting at a point when has written to an output file since the last checkpoint, the output will be written a second time when the task runs again, producing an erroneous output file. There are several ways of dealing with this:

Have the application checkpoint function copy output files (potentially inefficient).
Store the size of output files in the checkpoint file, and seek to these offsets on restart.
Use a set of printf()-replacement functions (supplied by BOINC) that buffer output in memory. Flush these buffers in the checkpoint function.

8. Graphics

BOINC applications can optionally provide graphics, which are displayed either in an application window or in a full-screen window (when acting as a screensaver). Requests to open and close graphics windows originate from the BOINC Manager and screensaver, and are relayed to applications by the core client.

The recommended way for an application to provide graphics is for it to provide a set of 'graphics functions', which are called as needed by the runtime library.

app_graphics_init();    // initialize graphics
app_graphics_render();  // render a frame
app_graphics_resize();  // called when window is resized
app_mouse_move();       // called when mouse moved
app_mouse_button();     // called when mouse button clicked
app_key_press();        // called when key pressed
app_key_release();      // called when key released

To use this approach, applications call an initialization function (see section X) that creates a 'graphics thread'. This thread executes code (in the BOINC runtime library) that monitors the graphics message queue, and that implements a window-system event handling loop. This code handles the creation and destruction of windows (using native calls on Windows, and GLUT on other platforms). It calls the application-supplied graphics functions as needed. Typically it calls the render function 10 times a second, allowing applications to generate moving, animated graphics. It uses a throttling mechanism that limits the fraction of CPU time used by the graphics thread. On hosts without graphics acceleration, the graphics will gracefully degrade to one frame per second or so.

Applications typically implement graphics using OpenGL [REF], since it is available on most platforms and offers hardware acceleration where available. There is a complication, however, on Unix-based systems. Applications must link dynamically to the needed graphics libraries (OpenGL, X11); otherwise they won't get hardware acceleration. However, it an executable that dynamically links a library is run on a system that doesn't have the library, it will fail to start. Some Unix systems have no graphics libraries.

BOINC's solution to this is to divide the application into a main program and a shared library. The shared library contains the application's graphics code. It dynamically links system graphics libraries, but the main program does not. BOINC's graphics-initialization function attempts to load the shared library; if system graphics libraries are missing, this will fail, in which the main program continues to execute normally, but without graphics. This structure is not needed on Windows, since OpenGL libraries are available on all versions of Windows.

In the above approach, the worker and graphics thread run in the same address space. This makes it easy to show a detailed, up-to-the-second view of the scientific calculation. For some applications, this is not important; it may be easier to implement graphics as a separate program that uses information in output or checkpoint files. The BOINC runtime system supports this also.

9. Long-running applications

Example: CPDN [REF].

9.1 Trickle messages

Trickle messages let applications communicate with the server during the execution of a workunit. They are intended for applications that have long work units (multiple days). Trickle messages may go in either direction: 'trickle up' messages go from application to server, 'trickle down' messages go from server to application. Typical uses of this mechanism:

The application periodically sends trickle-up messages containing its current CPU usage, so that users can be granted incremental credit (rather than waiting until the end of the work unit).
The application sends a trickle-up message containing a summary of the computational state, so that server logic can decide if the computation should be aborted.
The server sends a trickle-down message telling the application to abort.
The server sends a trickle-down message containing the user's current total credit.

Trickle messages are asynchronous and reliable. Trickle messages are conveyed in scheduler RPC messages, so they may not be delivered immediately after being generated.

The trickle-message API is as follows:

int boinc_send_trickle_up(char* variety, char* text)

sends a trickle message of the given variety. Returns zero if success.

bool boinc_receive_trickle_down(char* buf, int len)

receives a trickle message. Returns true if there was a message. Messages are delivered in order.

9.2 Intermediate file upload

Long-running applications can upload their output files before the result as a whole is finished. To initiate the upload of an output file, call

   extern int boinc_upload_file(char* name);

where 'name' is the logical name of the file. The application cannot modify the file after making this call.

To check on the status of a file being uploaded, call

extern int boinc_upload_status(char* name);

This will return zero if the upload of the file is finished successfully.

10. Non-CPU-intensive applications

Most BOINC applications are CPU-intensive, i.e. when they run they use as much CPU time as they can get. However, some applications are not CPU intensive. These applications typically are designed so that they always run, but have short and infrequent periods of activity, sleeping the rest of the time. Examples include:

Applications that study network structure and performance. When active, they establish one or more network connections for a short period, measuring the latency and/or bandwidth on these connections, or counting the number of hops to 'landmark' hosts. An example is DIMES [REF].
Applications that study the dynamics of computer usage. These applications typically run periodically. When they run, they measure system metrics such as CPU load, physical and virtual memory usage, and I/O rates.
Applications that provide a network service. These applications wait for a network connection, then carry out a service (typically involving little CPU time).

BOINC allows projects to flag themselves as non-CPU-intensive. Such projects are treated specially. The core client always maintains one in-progress result, and always runs the corresponding application.

Some non-CPU-intensive require that, during their periods of activity, no other BOINC-related work should occur. In particular, no network transfers should take place, and all other BOINC applications should exit. For this purpose, BOINC supplies the following API:

int boinc_suspend_other_activities();
int boinc_resume_other_activities();

The application should call these at the start and end of the its period of activity.

11. Initialization, finalization and error reporting

To start the initial execution of a task, the core client does the following:

Create a shared-memory segment for communicating with the task.
Allocate a slot directory to the task, creating or clearing it as needed.
Create a 'runtime data' file in the slot directory. This is an XML file containing various information needed by the runtime library, such as the name of the shared-memory segment.
Create link files (see section X) for the task's input and output files.
Execute the application in the slot directory (in Windows this uses CreateProcess(); in Unix, fork() and execv(); in OS/2, spawnv()

Applications call

    int boinc_init();

before calling other BOINC functions or doing I/O. This does the following:

Acquire the slot lock file to ensure that another process is not running in the same slot.
Parse the runtime data file.
Map the shared-memory segment.
Create the timer thread.

When the application has completed it calls

    int boinc_finish(int status);

status is nonzero if an error was encountered. This does the following:

Send a final status message and (if needed) trickle message.
Create a file named 'boinc_finished' in the slot directory.
Unlock the slot lock file
flush I/O buffers
On Windows, call TerminateProcess(), which kills other threads in the address space. Otherwise the graphics thread may continue to run while exit() is cleaning up the malloc heap, causing crashes.
call exit(status).

The boinc_finished file solves the following problem: On some versions of Windows, when a program is killed externally by the user, this is indistinguishable (from the parent process's viewpoint) from a call to exit(0). If the core client detects that a program has exited unexpectedly but no boinc_finished file is found, it logs a warning message and starts the application again.

12. Related work

Condor: migration; remote system call technology; traps library calls, sent over network to machine where job was submitted. No changes to source code, no recompile (just relink). ClassAds: form of preferences.

UD, Entropia

XtremWeb, Banyihan

13. Conclusion

non-goal: binary compatibility

not handled: allocation of non-CPU resources e.g. could have an app that checks for a GPU, uses it (and not CPU) if found. On a host with GPU, would want to run this together with a CPU-intensive app.