Version 3 (modified by davea, 11 years ago)

Remote management of input files

For a file to be used as an input file of a BOINC job, it must be available to BOINC clients via HTTP. The standard way to do this is to put the file in the project's "download directory" on the project server.

For projects that use remote job submission, job submitters don't have login access to the server, so they can't store files there directly. Instead, BOINC provides two mechanisms that allow job submitters to place files on the BOINC server.

Each of these mechanisms deals with two issues:

  • File immutability: BOINC requires that a file of a given name can never be changed. Job submitters can't be expected to obey this rule: they must be able to submit one job with an input file of a given name, and a second job with an input file of the same name but different contents.
  • File cleanup: There must be some way to clean up files on the server when they are no longer needed.

Content-based file management

This system is used by the Condor/BOINC interface. It may be useful for other systems as well. In this system, the name of a file on the BOINC server is based on its MD5 hash; thus file immutability is automatic.
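As a minimal sketch of why hash-based naming gives immutability, the following derives a server-side name from an MD5 digest alone. The function names and the "jf_" prefix are assumptions made for illustration; this page does not specify the actual naming scheme.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// True if s looks like an MD5 digest in hex: exactly 32 hex digits.
bool is_md5_hex(const std::string& s) {
    if (s.size() != 32) return false;
    for (unsigned char c : s) {
        if (!std::isxdigit(c)) return false;
    }
    return true;
}

// Hypothetical naming rule: the server-side name depends only on the
// file's content hash, never on the submitter's local file name.
// The "jf_" prefix is an assumption made for this sketch.
std::string server_file_name(const std::string& md5_hex) {
    assert(is_md5_hex(md5_hex));
    return "jf_" + md5_hex;
}
```

Under this rule, two jobs can both use a local file called input.dat: if the contents differ, the hashes differ, so the two files get different names on the server and neither overwrites the other.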

File cleanup is based on file/batch associations. Each file can be associated with one or more batches. Files that are no longer associated with an active batch are automatically deleted from the server.
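The cleanup rule can be modeled with a small in-memory sketch. The class and method names are hypothetical, and the real server presumably keeps this state in its database; the sketch only shows the rule that a file survives as long as at least one associated batch is still active.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory model of file/batch associations.
class JobFileRegistry {
public:
    // Associate a file (identified by its hash-based name) with a batch.
    void associate(const std::string& file_name, int batch_id) {
        assert(!file_name.empty());
        batches_of_[file_name].insert(batch_id);
    }

    // Mark a batch as no longer active.
    void retire_batch(int batch_id) {
        retired_.insert(batch_id);
    }

    // Remove every file that has no remaining active batch;
    // return the names of the removed files.
    std::vector<std::string> cleanup() {
        std::vector<std::string> deleted;
        for (auto it = batches_of_.begin(); it != batches_of_.end(); ) {
            bool has_active = false;
            for (int b : it->second) {
                if (retired_.count(b) == 0) { has_active = true; break; }
            }
            if (has_active) {
                ++it;
            } else {
                deleted.push_back(it->first);
                it = batches_of_.erase(it);
            }
        }
        return deleted;
    }

    bool has_file(const std::string& file_name) const {
        return batches_of_.count(file_name) != 0;
    }

private:
    std::map<std::string, std::set<int>> batches_of_;  // file -> batches
    std::set<int> retired_;                            // inactive batches
};
```

A file shared by batches 1 and 2 is kept when only batch 1 is retired, and is removed by the next cleanup after batch 2 is retired as well.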

The system uses two Web RPCs. These are implemented as XML sent via HTTP POST; the RPC handler is html/user/job_files.php.

The following C++ interfaces are provided (in samples/condor/job_rpc.cpp). They are to be called on the job submission host; the files must exist on that host, and their MD5s must have already been computed.

extern int query_files(
    const char* project_url,
    const char* authenticator,
    int batch_id,
    vector<string> &md5s,
    vector<string> &paths,
    vector<int> &absent_files    // output
);

Inputs:

  • project_url: the project's master URL
  • authenticator: the job submitter's authenticator
  • paths: a list of file paths on the calling host.
  • md5s: a list of the MD5s of the files, in the same order as paths.
  • batch_id: the ID of a batch whose jobs will reference the files (these jobs need not exist yet).

Action: for each file, see if it exists on the server. If it does, create an association to the given batch.

Output:

  • return value: nonzero on error
  • absent_files: a list of files not present on the server (represented as indices into the paths and md5s vectors).

extern int upload_files(
    const char* project_url,
    const char* authenticator,
    vector<string> &paths,
    vector<string> &md5s,
    int batch_id
);

Inputs:

  • project_url, authenticator: as above.
  • paths: a list of paths of files to be uploaded
  • md5s: a list of MD5 hashes of these files
  • batch_id: the ID of a batch with which the files are associated

Action: Upload the files, and create associations to the given batch.

Output:

  • return value: nonzero on error
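The two calls are presumably meant to be used together: query first, then upload only the files reported as absent. The helper below is a sketch of the step in between; select_absent is a hypothetical name and is not part of BOINC. It turns the indices in absent_files into the path and MD5 lists to pass to upload_files.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of BOINC): given the indices returned in
// absent_files by query_files(), collect the matching paths and MD5s.
// Returns false if an index is out of range; the outputs are then incomplete.
bool select_absent(
    const std::vector<std::string>& paths,
    const std::vector<std::string>& md5s,
    const std::vector<int>& absent_files,
    std::vector<std::string>& absent_paths,   // output
    std::vector<std::string>& absent_md5s     // output
) {
    // paths and md5s are parallel lists, so they must have the same length.
    assert(paths.size() == md5s.size());
    absent_paths.clear();
    absent_md5s.clear();
    for (int i : absent_files) {
        if (i < 0 || static_cast<std::size_t>(i) >= paths.size()) return false;
        absent_paths.push_back(paths[i]);
        absent_md5s.push_back(md5s[i]);
    }
    return true;
}

// Intended call sequence, assuming the signatures shown on this page:
//
//   vector<int> absent;
//   int retval = query_files(url, auth, batch_id, md5s, paths, absent);
//   if (retval) return retval;
//   vector<string> up_paths, up_md5s;
//   if (!select_absent(paths, md5s, absent, up_paths, up_md5s)) return -1;
//   if (!up_paths.empty()) {
//       retval = upload_files(url, auth, up_paths, up_md5s, batch_id);
//   }
```

Files already on the server are associated with the batch by query_files itself, so only the absent ones need to be transferred.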

If you use this system, periodically run the script html/ops/delete_job_files. This will delete files that are no longer associated with an active batch.

Per-user file sandbox