Version 8 (modified by davea, 8 years ago)

Remote management of input files

Input files of BOINC jobs must be available on a public web server. Usually this is the project's BOINC server.

For projects that use remote job submission, job submitters don't have login access to the BOINC server, so they can't store files there directly. Instead, BOINC provides two mechanisms that allow them to store and manage files on the BOINC server.

  • Job-based file management: files are transferred automatically as part of remote job submission. This mechanism is intended to be built into a job submission system; job submitters need not be aware of it.
  • Per-user file sandbox: job submitters explicitly maintain a set of files on the server via a web interface.

Each mechanism deals with several issues:

  • File immutability: BOINC requires that a file of a given name can never be changed. Job submitters can't be expected to obey this rule: they must be able to submit one job with an input file of a given name, and a second job with an input file of the same name but different contents.
  • File cleanup: files on the server must be deleted when they are no longer needed.
  • Authorization: only users authorized to submit jobs should be able to move files to the server.

Note: both mechanisms upload files via a PHP script, and PHP's default maximum file upload size is 2 MB. To raise this limit, edit /etc/php.ini and set, for example:

upload_max_filesize = 64M
post_max_size = 64M
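
As a rough pre-upload check, the sketch below converts php.ini size values to bytes and tests whether a file would fit under both limits. The function names are illustrative and are not part of BOINC or PHP; only the K, M, and G suffixes are handled, and special values (such as a limit of 0) are ignored.

```python
# PHP shorthand sizes use binary multiples: K = 1024, M = 1024**2, G = 1024**3.
_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def php_size_to_bytes(value):
    """Convert a php.ini size such as '64M' or '2048' to a byte count."""
    value = value.strip()
    suffix = value[-1:].upper()
    if suffix in _UNITS:
        return int(value[:-1]) * _UNITS[suffix]
    return int(value)


def fits_upload_limits(file_size, upload_max_filesize, post_max_size):
    """Return True if a file of file_size bytes fits under both limits.

    The POST body carries form overhead in addition to the file itself,
    so the file must be strictly smaller than post_max_size.
    """
    return (file_size <= php_size_to_bytes(upload_max_filesize)
            and file_size < php_size_to_bytes(post_max_size))
```

For example, fits_upload_limits(3 * 1024 * 1024, "2M", "64M") is False because the file exceeds upload_max_filesize.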

Job-based file management

In this system, the BOINC name of a file (i.e. its name on the BOINC server) includes its hash; thus file immutability is automatic.
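
To illustrate why hash-based names make immutability automatic, here is a minimal sketch that derives a BOINC name from a file's MD5. The helper names and the exact layout (prefix, then hash, then suffix) are assumptions for illustration; a submission system may arrange the name differently as long as it contains the hash.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_boinc_name(path, prefix="", suffix=""):
    """Build a name that embeds the content hash (hypothetical layout)."""
    return prefix + file_md5(path) + suffix
```

Two files with the same local name but different contents get different BOINC names, so a name never refers to changed contents.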

File cleanup is based on file/batch associations. Each file can be associated with one or more batches. Files that are no longer associated with an active batch are automatically deleted from the server.
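
The cleanup rule can be pictured with a small in-memory model. This is only a sketch of the idea, not how the BOINC server stores associations; the class and method names are hypothetical.

```python
class FileBatchIndex:
    """Toy model of file/batch associations."""

    def __init__(self):
        # Maps a BOINC file name to the set of batch IDs that reference it.
        self._batches_by_file = {}

    def associate(self, boinc_name, batch_id):
        """Record that batch_id references boinc_name."""
        self._batches_by_file.setdefault(boinc_name, set()).add(batch_id)

    def deletable_files(self, active_batch_ids):
        """Return the files not associated with any active batch, sorted by name."""
        active = set(active_batch_ids)
        return sorted(
            name
            for name, batches in self._batches_by_file.items()
            if not (batches & active)
        )
```

A file shared by several batches stays on the server until the last of those batches is no longer active.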

The system uses two Web RPCs. These are implemented as XML sent via HTTP POST; the RPC handler is html/user/job_files.php.

C++ interface

The following C++ functions are provided (in lib/remote_submit.cpp). They are to be called on the job submission host; the files must exist on that host, and their MD5s must have already been computed.

extern int query_files(
    const char* project_url,
    const char* authenticator,
    std::vector<std::string> &boinc_names,
    int batch_id,
    std::vector<int> &absent_files,    // output
    std::string& error_message         // output
);

Inputs:

  • project_url: the project's master URL
  • authenticator: the job submitter's authenticator.
  • boinc_names: a list of the "BOINC names" of the files. Each name must include the file's MD5 hash and can have a prefix or suffix if needed.

  • batch_id: the ID of a batch whose jobs will reference the files (these jobs need not exist yet). The operation will fail if the user is not authorized to submit jobs to the batch's application.

Action: for each file, see if it exists on the server. If it does, create an association to the given batch.

Output:

  • return value: nonzero on error
  • absent_files: a list of files not present on the server (represented as indices into the boinc_names vector).
  • error_message: if error, an explanatory string.

extern int upload_files(
    const char* project_url,
    const char* authenticator,
    std::vector<std::string> &paths,
    std::vector<std::string> &boinc_names,
    int batch_id,
    std::string& error_message         // output
);

Inputs:

  • project_url, authenticator: as above.
  • paths: a list of paths of files to be uploaded
  • boinc_names: a list of BOINC names of these files (see above).
  • batch_id: the ID of a batch with which the files are associated. The operation will fail if the user is not authorized to submit jobs to the batch's application.

Action: Upload the files, and create associations to the given batch.

Output:

  • return value: nonzero on error
  • error_message: if error, an explanatory string.
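
The two calls are naturally used together: query first, then upload only what the server lacks. The sketch below shows the step in between, turning the absent_files indices into the paths and boinc_names lists to upload. It is written in Python for brevity, and select_absent is a hypothetical helper, not part of the BOINC library.

```python
def select_absent(paths, boinc_names, absent_indices):
    """Pick the entries that the query step reported as missing.

    paths and boinc_names are parallel lists; absent_indices holds
    indices into boinc_names, as described for absent_files above.
    """
    if len(paths) != len(boinc_names):
        raise ValueError("paths and boinc_names must have the same length")
    absent_paths = [paths[i] for i in absent_indices]
    absent_names = [boinc_names[i] for i in absent_indices]
    return absent_paths, absent_names
```

If the index list is empty, both returned lists are empty and the upload step can be skipped.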

If you use this system, periodically run the script html/ops/delete_job_files. This will delete files that are no longer associated with an active batch.

Python interface

The Python interface does both RPCs in one function:

from submit_api import UPLOAD_FILES_REQ, upload_files

# project_url and get_auth() must be supplied by the caller.
req = UPLOAD_FILES_REQ()
req.project = project_url
req.authenticator = get_auth()
req.batch_id = 271
req.local_names = ('updater.cpp', 'kill_wu.cpp')
req.boinc_names = ('xxx_updater.cpp', 'xxx_kill_wu.cpp')
r = upload_files(req)
if r[0].tag == 'error':
    print('error: ' + r[0].find('error_msg').text)
else:
    print('success')

Per-user file sandbox

This mechanism allows job submitters to explicitly upload files via a web interface: PROJECT_URL/sandbox.php.

Links to the files are stored in a "sandbox directory", PROJECT_ROOT/sandbox/USERID/. Each entry in this directory has the contents

size MD5

The actual files are stored in the download directory, under the name sb_userid_MD5.
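
As a sketch of how a link entry maps to the stored file, the helpers below parse an entry and build the corresponding name in the download directory. They assume an entry holds exactly the two whitespace-separated fields shown above; the actual on-disk format may differ, and both function names are hypothetical.

```python
def parse_sandbox_entry(text):
    """Split a sandbox link entry of the form 'size MD5' into (size, md5)."""
    fields = text.split()
    if len(fields) != 2:
        raise ValueError("expected 'size MD5', got %r" % text)
    return int(fields[0]), fields[1]


def sandbox_physical_name(user_id, md5):
    """Return the name under which the file is stored in the download directory."""
    return "sb_{}_{}".format(user_id, md5)
```

Because the stored name includes both the user ID and the MD5, two users uploading identical contents get separate stored files under this scheme.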

Currently, files in the sandbox are not cleaned up automatically. The web interface allows users to delete their files.