Changes between Version 4 and Version 5 of VolunteerDataArchival


Timestamp:
Nov 25, 2011, 12:45:03 AM (12 years ago)
Author:
davea

  • VolunteerDataArchival

    v4 v5  
    6565required to maintain reliability.
    6666
     67Also, note that data recovery uses network bandwidth.
     68It's conceivable that the capacity of the system is limited
     69not by client disk space, but by network bandwidth at the server.
     70
    6771== Increasing reliability ==
    6872
     
    108112
    109113 * Regenerating a chunk requires reassembling the entire file on the server,
    110   defeating the purpose of distributed storage.
     114  imposing a high storage and network communication overhead.
    111115
    112116== Hybrid reliability mechanisms ==
     
    118122=== Multi-level coding ===
    119123
     124One way to reduce the reconstruction overhead of coding
     125is to divide the file into M parts, and encode each part separately.
     126That way, if a packet is lost, only 1/M of the file needs to be reconstructed on the server.
     127
     128However, if one of these M parts is lost, the file is lost.
     129To remedy this, we can use coding at the top level as well:
     130in addition to the M parts, generate an additional K "checksum parts",
     131and encode these parts in the same way.
     132
     133If we use this 2-level encoding scheme with parameters M=40 and K=20,
     134we can recover from any 400 simultaneous host failures,
     135with a space overhead of 125%.
     136
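The arithmetic behind these figures can be checked with a short sketch. It assumes 40 data parts and 20 checksum parts at each level, and that any 20 losses within a group of 60 are recoverable (an idealized erasure code). The function name `multilevel_stats` and its arguments are hypothetical and are not part of VDAB.

```python
def multilevel_stats(m, k, levels=2):
    """Return (space overhead as a fraction, minimum failures that can lose the file).

    Assumes every level splits its input into m data parts plus k checksum
    parts, and that a group of m + k parts survives any k losses.
    """
    # Each level multiplies the stored size by (m + k) / m.
    expansion = ((m + k) / m) ** levels
    # Losing one group takes k + 1 losses among its children, so the cheapest
    # way to lose the file is k + 1 failures at each level, multiplied up.
    min_fatal = (k + 1) ** levels
    return expansion - 1.0, min_fatal


overhead, fatal = multilevel_stats(40, 20)
print(f"overhead {overhead:.0%}, tolerates any {fatal - 1} failures")
```

Under these assumptions the overhead comes out at 125%, and the worst-case tolerance is at least the 400 failures quoted above.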
     137The scheme can be extended to any number of levels of encoding.
     138
    120139=== Coding plus replication ===
    121140
     141To achieve high reliability, we need to use fairly large values of the coding parameters N and K,
     142in the range 10-50.
     143This means that recovering from a packet loss requires uploading and downloading
     14410-50 packets, which is a large overhead.
     145
     146We can potentially use replication at the bottom level to reduce this overhead.
     147Suppose, for example, that we use 2-fold replication for the bottom-level
     148packets of multi-level encoding.
     149Then, in many (and maybe even almost all) cases
     150we'll just have to do 1 upload and 2 downloads to restore the packet.
     151Although this doubles the client storage requirement,
     152it could potentially increase system capacity
     153by reducing network bandwidth at the server.
     154
    122155== The VDAB simulator ==
     156
     157We have developed a simulator for VDAB.
     158The simulator models a set of hosts.
     159The parameters of the host population include:
     160
     161 * Arrival rate of hosts
     162 * Distribution of host lifetimes (currently exponential, with adjustable mean)
     163 * Distribution of upload and download network bandwidth
     164 * Distribution of amount of free disk space
     165
     166The simulator models the arrival of one or more files,
     167each with a given size.
     168
     169The simulator is able to model the following storage policies:
     170
     171 * M-level coding
     172 * Different values of N and K at each level of coding
     173 * R-fold replication at the bottom level
     174
     175The simulator outputs:
     176 * statistics of server disk space usage
     177 * statistics of network bandwidth usage
     178 * statistics of "vulnerability": how many host failures would be needed
     179   to cause the loss of each file.
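
As an illustration of the host population model only, the following sketch draws host arrivals at a constant average rate and gives each host an exponentially distributed lifetime, then counts the hosts still alive at the end of the run. It is not the VDAB simulator: the function name, the parameters, the time units, and the choice of Poisson arrivals are all assumptions made for this example, and bandwidth, disk space and files are left out.

```python
import random


def hosts_alive_at_end(arrival_rate, mean_lifetime, duration, seed=0):
    """Count hosts alive at time `duration` (hypothetical model, not VDAB code).

    arrival_rate  -- average number of host arrivals per unit time
    mean_lifetime -- mean of the exponential host lifetime distribution
    duration      -- length of the simulated period, in the same time unit
    """
    rng = random.Random(seed)
    t = 0.0
    alive = 0
    while True:
        # Assumed Poisson arrivals: exponential gaps between arrivals.
        t += rng.expovariate(arrival_rate)
        if t >= duration:
            break
        # Exponential lifetime with the given mean (rate = 1 / mean).
        death = t + rng.expovariate(1.0 / mean_lifetime)
        if death > duration:
            alive += 1
    return alive


# Example: 10 arrivals per day, 100-day mean lifetime, 1000 simulated days.
print(hosts_alive_at_end(10.0, 100.0, 1000.0))
```

With these example values the population should settle near arrival rate times mean lifetime, i.e. roughly 1000 hosts, which gives a quick sanity check on the chosen parameters.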