Version 10 (modified by tonig, 13 years ago)

--

Remote job submission

A group from Universitat Pompeu Fabra has developed RBoinc, a system for remote job submission and monitoring. This system allows scientists to submit jobs (or groups of jobs) from a convenient command-line interface.

In the following, we use the term scientist to denote the individual who submits and administers workunits from their workstation through the RBoinc 'client tools'. RBoinc client tools are not to be confused with BOINC clients (i.e. the slaves of the distributed computing architecture); for clarity, we prefer the terms scientist and scientist workstation to indicate the user of the RBoinc client and their machine.

The system (Perl-based) is in boinc/rboinc/.

Warning: this system has been used only by its developers. It will take some work to get it working on other projects.

PowerPoint slides describing the system are here. For details please see the paper T. Giorgino, M. J. Harvey and G. De Fabritiis, Distributed computing as a virtual supercomputer: Tools to run and manage large-scale BOINC simulations, Comp. Phys. Commun. 181, 1402 (2010). PDF: http://boinc.berkeley.edu/rboinc.pdf

Summary

(This section to be fixed.) The software should be fairly self-explanatory, but installation may be tricky. Here is a general overview:

  • boinc_retrieve_server and boinc_submit_server run as CGI scripts. The former also handles all administrative requests (stop, purge).
  • boinc_retrieve and boinc_submit are the client components (likewise, boinc_retrieve handles the administrative requests).
  • Files are exchanged between client and server through the WebDAV HTTP extensions (a scratch area needs to be set up for this).
  • WU naming is important and is enforced as follows: NNN-UUU_GGG-XX-YY-RNDzzzz, where
    • NNN is the name of the workunit (sub-group)
    • UUU is the submitter id
    • GGG is the group
    • XX is the current step in the chain
    • YY is the total number of steps
    • zzzz is a random number (not actually needed)
  • WUs are kept in a "workflow_directory", a subdir of the project dir, as per slide 22 of the Powerpoint.
  • Inside each directory a "process" bash file is created, which is executed by the assimilator with the name of the assimilated WU as its argument. It runs create_work to submit the next step for execution.
  • The main reason for using Perl is that I preferred to use the XML::Simple module for converting data structures to and from XML over the network; it was useful for adding features on the fly while keeping backwards compatibility.
  • I implemented basic functions for authentication, but this is not finished yet.
  • File storage is optimized through hardlinking and pooling. Network transfers are not (but they could be).
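
The naming convention above can be illustrated with a short parser. This is a minimal Python sketch, not part of RBoinc (which is Perl-based): the regular expression, the function name parse_wu_name and the sample workunit name are all hypothetical, and the sketch assumes that the name and group fields contain no '-' and that the submitter id contains no '_'.

```python
import re
from typing import NamedTuple

# Hypothetical pattern for NNN-UUU_GGG-XX-YY-RNDzzzz.
# Assumption: name and group contain no '-', submitter id contains no '_'.
WU_NAME = re.compile(
    r"^(?P<name>[^-]+)-(?P<user>[^_-]+)_(?P<group>[^-]+)"
    r"-(?P<step>\d+)-(?P<total>\d+)-RND(?P<rnd>\d+)$"
)


class WuName(NamedTuple):
    name: str    # NNN: workunit (sub-group) name
    user: str    # UUU: submitter id
    group: str   # GGG: group
    step: int    # XX: current step in the chain
    total: int   # YY: total number of steps
    rnd: str     # zzzz: random suffix, kept as text


def parse_wu_name(wu: str) -> WuName:
    """Split a workunit name into its fields, or raise ValueError."""
    m = WU_NAME.match(wu)
    if m is None:
        raise ValueError(f"not an RBoinc-style workunit name: {wu!r}")
    return WuName(m["name"], m["user"], m["group"],
                  int(m["step"]), int(m["total"]), m["rnd"])


if __name__ == "__main__":
    # Sample name invented for illustration only.
    print(parse_wu_name("run1-alice_testgrp-2-10-RND4821"))
```

A helper of this kind would let a script such as the per-directory "process" file recover the step counters from the WU name it receives as its argument.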

Annotating the WU template files

First, workunit template files should be marked as RBoinc-enabled at the top. This is achieved by prepending the following tag to the relevant workunit template:

<rboinc application="md"
        description="Standard ACEMD run with optional DCD and PLUMED"/>

The above tag marks the template as RBoinc-enabled and thus visible to the scientist as an application. The application attribute is the user-visible name of the application (which may or may not coincide with a BOINC application name). The scientist selects this template through the -app command line switch of the boinc_submit operation.

Additionally, input files in the workunit template are augmented with RBoinc-related settings. In the WU template, each file_ref element should have a child rboinc element as follows:

   <file_ref>
        <file_number>3</file_number>
        <open_name>input.vel</open_name>
        <copy_file/>
        <rboinc parameter_name="vel_file"
                parameter_description="Binary velocities"
                [ optional="true" ]
                [ immutable="true" ]
                [ encode="true" ]
                />
    </file_ref>

The parameter_name attribute is the command line parameter that will be required by the boinc_submit command for that file. The argument passed by the scientist on the command line to that parameter will be interpreted as a local file, transferred to the BOINC server, and associated with the given BOINC-handled file (in this case, number 3, with BOINC open name "input.vel").

The parameter_description is a descriptive text returned by the command line client when the scientist requests help for the attributes supported by the given application.

The optional flag, itself optional, specifies whether the scientist may omit the given file upon submission. If the file is omitted, it is replaced by a (server-supplied) default file.

Likewise, the immutable flag, also optional, specifies that the given file is always a server-supplied default file; the submitter cannot override it.

Finally, if encode is true, the file is subject to a (server-defined) encoding before being sent. The server will store both the original and the encoded version (suffixed with _enc).
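
To make the role of these attributes concrete, the following Python sketch reads the rboinc annotations from a workunit template and works out which parameters a submitter still has to supply. It is an illustration only, not the RBoinc implementation: the function names, the wrapping template root element, the second file_ref (coor_file) and the rule that only the literal value "true" enables a flag are all assumptions.

```python
import xml.etree.ElementTree as ET

# Template fragment wrapped in a single root element so that it parses;
# the <template> wrapper and the "coor_file" entry are hypothetical.
TEMPLATE = """
<template>
  <file_ref>
    <file_number>2</file_number>
    <open_name>input.coor</open_name>
    <rboinc parameter_name="coor_file"
            parameter_description="Binary coordinates"/>
  </file_ref>
  <file_ref>
    <file_number>3</file_number>
    <open_name>input.vel</open_name>
    <copy_file/>
    <rboinc parameter_name="vel_file"
            parameter_description="Binary velocities"
            optional="true"/>
  </file_ref>
</template>
"""


def describe_parameters(template_xml):
    """Return one dict per RBoinc-annotated file_ref in the template."""
    root = ET.fromstring(template_xml)
    params = []
    for ref in root.iter("file_ref"):
        tag = ref.find("rboinc")
        if tag is None:
            continue  # file_ref without an RBoinc annotation
        params.append({
            "parameter": tag.get("parameter_name"),
            "description": tag.get("parameter_description", ""),
            "file_number": int(ref.findtext("file_number")),
            "open_name": ref.findtext("open_name"),
            # Assumption: a flag is set only by the literal value "true".
            "optional": tag.get("optional") == "true",
            "immutable": tag.get("immutable") == "true",
            "encode": tag.get("encode") == "true",
        })
    return params


def missing_parameters(params, supplied):
    """Names of parameters that must be supplied but were not."""
    return [p["parameter"] for p in params
            if not p["optional"] and not p["immutable"]
            and p["parameter"] not in supplied]


if __name__ == "__main__":
    params = describe_parameters(TEMPLATE)
    print(missing_parameters(params, {"vel_file"}))
```

In this sketch, optional and immutable files never count as missing, since in both cases the server can fall back on its default file.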

Annotating the result template files

Result template files are annotated with RBoinc-specific tags that identify which results should be transferred back to the scientist's workstation. The same tags can be used to build output-input chains, i.e. to automatically submit new workunits as continuations of successfully completed ones.

The syntax for the results template is as follows:

<file_info>
    <name><OUTFILE_0/></name>
    <generated_locally/>
    <upload_when_present/>
    <max_nbytes>100000000</max_nbytes>
    <url><UPLOAD_URL/></url>
    <gzip_when_done/>
    <rboinc aliases=".vel .vel.gz" 
          [ chain="3" ]  />
</file_info>

The optional chain attribute indicates that, upon successful WU completion, that output file should be used as the input file with the given number (here, the third) for the next step in the chain.
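
As a rough illustration of how the chain attribute could be read, the Python sketch below extracts a mapping from output files to next-step input file numbers. It is not the RBoinc code: the function name chained_outputs, the wrapping result_template root element and the use of the file_info position as the output index are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# The file_info block from above, with the optional chain attribute set.
# The <result_template> wrapper is an assumption added so it parses.
RESULT_TEMPLATE = """
<result_template>
<file_info>
    <name><OUTFILE_0/></name>
    <generated_locally/>
    <upload_when_present/>
    <max_nbytes>100000000</max_nbytes>
    <url><UPLOAD_URL/></url>
    <gzip_when_done/>
    <rboinc aliases=".vel .vel.gz" chain="3"/>
</file_info>
</result_template>
"""


def chained_outputs(template_xml):
    """Map output index -> input file number of the next step.

    Assumption: outputs are indexed by the position of their file_info
    element in the template, starting from 0.
    """
    root = ET.fromstring(template_xml)
    chain = {}
    for index, info in enumerate(root.iter("file_info")):
        tag = info.find("rboinc")
        if tag is not None and tag.get("chain") is not None:
            chain[index] = int(tag.get("chain"))
    return chain


if __name__ == "__main__":
    print(chained_outputs(RESULT_TEMPLATE))
```

Outputs without a chain attribute are simply left out of the mapping, which matches the attribute being optional.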

Upon retrieval, files have BOINC-assigned outfile names ending in '_1', '_2', and so on. The aliases attribute contains a space-separated list of extensions that are considered when deciding whether a file has already been downloaded. When aliases are specified as above, a file ending in _1 is considered already downloaded, and its retrieval is skipped, if a file with the same name but ending in _1.vel or _1.vel.gz is present in the retrieve directory.
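
The skip rule can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the actual client logic: the function name already_downloaded and the sample file names are hypothetical, and the check of the bare, un-aliased name is assumed rather than stated above.

```python
import os


def already_downloaded(outfile, aliases, retrieve_dir):
    """Decide whether retrieval of `outfile` can be skipped.

    `aliases` is the space-separated extension list from the rboinc tag,
    e.g. ".vel .vel.gz". Assumption: the bare name counts as well.
    """
    candidates = [outfile] + [outfile + ext for ext in aliases.split()]
    return any(os.path.exists(os.path.join(retrieve_dir, name))
               for name in candidates)


if __name__ == "__main__":
    # Hypothetical outfile name; checks the current directory.
    print(already_downloaded("example_wu_1", ".vel .vel.gz", "."))
```

With this rule, a scientist who has renamed or compressed a retrieved file to one of the aliased extensions does not trigger a second download of the same result.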

For details on the chaining mechanism, please see the paper T. Giorgino, M. J. Harvey and G. De Fabritiis, Distributed computing as a virtual supercomputer: Tools to run and manage large-scale BOINC simulations, Comp. Phys. Commun. 181, 1402 (2010). PDF: http://boinc.berkeley.edu/rboinc.pdf