Version 1 (modified by davea, 12 years ago)

Volunteer data archival

Volunteer data archival means using disk space on volunteered home computers to store large data files. This document describes the design of a system to provide volunteer data archival on BOINC. We assume the goals include:

  • Storing large (e.g. petabyte) files. Files may be thousands of times larger than the space available on any individual computer.
  • Storing files for long periods.
  • Reducing the probability of data loss to arbitrarily small levels.

Properties of the volunteer host population include:

  • A host may be sporadically available because it is turned off, or because the user has suspended network activity. Unavailable periods may range from minutes to several days.
  • The upload and download speeds of hosts vary widely, and can be fairly low (e.g. 1 Mbps) in some cases.
  • The amount of disk space available to a project on a given host may fluctuate over time, because of the user's own disk usage or disk usage by other BOINC projects to which the host is attached.
  • The population is dynamic: hosts are constantly arriving and leaving. The mean lifetime of a host may be fairly short (on the order of 100 days).
  • Many hosts are behind firewalls. We assume that all communication is initiated by the BOINC client, and involves HTTP requests to trusted project servers. We don't consider direct client-to-client communication.

There are two basic techniques for achieving reliable storage using unreliable resources:

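  • Replication: a file is stored in full on several hosts, so that it can be recovered as long as at least one copy remains available.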
  • Coding: with Reed-Solomon coding, a file is divided into N 'packets', and an additional K checksum packets are generated. The original data can be reconstructed from any N of these N+K packets.
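
The "any N of N+K" property of coding can be illustrated with a small sketch. This is not BOINC code and not a production Reed-Solomon codec: it assumes a Reed-Solomon-style construction based on polynomial interpolation over the prime field of size 257 (chosen here only so that every byte value is a field element), and each 'packet' is reduced to a single symbol. The names P, encode and decode are hypothetical. A real system would apply the same idea to every symbol position of much larger packets, typically over GF(2^8).

```python
# Illustrative sketch only: Reed-Solomon-style erasure coding by
# polynomial interpolation over the prime field GF(257).
# Requires Python 3.8+ (for pow(x, -1, P)); assumes N + K <= 257.
P = 257  # assumption: prime modulus, so byte values 0..255 are field elements


def _interpolate(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    that passes through the given (xi, yi) points, modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse of den
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data, k):
    """Return N + K symbols: the N data symbols followed by K checksum symbols."""
    n = len(data)
    points = list(enumerate(data))  # data symbol i is the polynomial's value at x = i
    return list(data) + [_interpolate(points, x) for x in range(n, n + k)]


def decode(survivors, n):
    """Rebuild the N data symbols from any N surviving (index, symbol) pairs."""
    if len(survivors) < n:
        raise ValueError("need at least N surviving symbols")
    points = survivors[:n]
    return [_interpolate(points, x) for x in range(n)]


if __name__ == "__main__":
    data = [72, 101, 108, 108, 111]   # N = 5 data symbols
    coded = encode(data, 3)           # K = 3 checksum symbols, 8 in total
    # Suppose the hosts holding symbols 0, 2 and 4 have disappeared.
    survivors = [(i, coded[i]) for i in (1, 3, 5, 6, 7)]
    print(decode(survivors, 5) == data)
```

In this sketch, losing any K of the N+K symbols is tolerated at a storage cost of (N+K)/N times the file size. Tolerating K losses with replication alone requires K+1 full copies.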