Version 1 (modified by davea, 12 years ago)

A web-based system for LAMMPS jobs

This document describes a system that allows scientists to submit and monitor groups of LAMMPS jobs using BOINC. The system has the following properties:

  • Users (i.e. scientists) interact entirely through a web-based interface. They don't need to log into the project server, and they don't need to know anything about BOINC.
  • Users are authenticated by BOINC project accounts. They do not need login accounts on the project server.
  • Users can submit parameter sweeps consisting of thousands of jobs as easily as submitting a single job.
  • Users can get an estimate for the completion of a group of jobs prior to submitting it, and can get updated completion estimates as the batch is processed.

This system was developed for researchers at Tsinghua University. It can be modified to meet the needs of other projects using LAMMPS, and many parts of it can be used to build similar systems for other applications.

Authentication, access control, and quotas

Users (that is, job submitters) must create an account on the BOINC project; this is done using a form on the project web site. A project administrator must then grant the user the right to submit jobs for LAMMPS (and potentially for other applications). Optionally, a designated user may be given the ability to grant access rights to other users. Each user has an associated "quota" that determines their share of processing power.

Per-user file sandbox

LAMMPS input files can be large, and it would be inconvenient to upload these files each time jobs are submitted. Instead, we allow users to maintain a set of files on the project server; this is called the user's "file sandbox".

Using a web interface, users can:

  • upload files from their PC to the sandbox
  • view the files in their sandbox, including each file's size and MD5 checksum
  • download files from the sandbox to their PC
  • delete files from the sandbox

Files in the sandbox can be modified, and all old versions are retained on the server. When a batch of jobs is submitted, it uses the input file versions at the moment of submission, even if the files are then modified while the batch is in progress.
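
One way to get this behaviour is to store every uploaded version under its content hash and have a batch record the hashes it saw at submission. The following is a minimal in-memory sketch of that idea, not the project's actual storage code; the `Sandbox` class and its method names are hypothetical.

```python
import hashlib


class Sandbox:
    """Toy per-user sandbox (hypothetical): every uploaded version is kept."""

    def __init__(self):
        self.blobs = {}    # MD5 hex digest -> file contents (old versions retained)
        self.current = {}  # logical file name -> digest of the latest version

    def upload(self, name, data):
        """Store a new version of `name` and make it the current one."""
        digest = hashlib.md5(data).hexdigest()
        self.blobs[digest] = data
        self.current[name] = digest
        return digest

    def snapshot(self, names):
        """Pin the versions of `names` as they exist right now."""
        return {name: self.current[name] for name in names}

    def read(self, digest):
        """Return the contents of one specific stored version."""
        return self.blobs[digest]


sandbox = Sandbox()
sandbox.upload("in.lammps", b"run 1000\n")
batch_inputs = sandbox.snapshot(["in.lammps"])  # taken when the batch is submitted
sandbox.upload("in.lammps", b"run 5000\n")      # later edit by the user

# The batch still resolves to the version it was submitted with.
pinned = sandbox.read(batch_inputs["in.lammps"])
```

Because the batch holds digests rather than file names, later uploads under the same name cannot change what its jobs read.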

LAMMPS job submission

Batches of LAMMPS jobs can be submitted using a web interface. This process has two steps. First, the user fills out a form specifying the following files, which must be in the sandbox:

  • The atomic structure file
  • The LAMMPS command script
  • A zipped file containing the potential files needed for the simulation
  • A file containing command lines to be passed to LAMMPS. One job will be created for each line of this file.
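
The "one job per line" rule for the command-line file can be sketched as follows. This is an illustration only: the function name is hypothetical, skipping blank lines is an assumption, and the example lines simply use LAMMPS's `-var name value` switch to vary two parameters.

```python
import shlex


def jobs_from_command_file(text):
    """Return one argument list per non-blank line of the command-line file."""
    jobs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines do not create jobs
        jobs.append(shlex.split(line))  # shell-style splitting, honours quotes
    return jobs


# A two-job parameter sweep over temperature and random seed.
example = "-var temp 300 -var seed 1\n-var temp 350 -var seed 2\n"
jobs = jobs_from_command_file(example)
```

A sweep of thousands of jobs is then just a file with thousands of lines, which is easy to generate with a short script.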

The user clicks the "Prepare" button on this form. This validates the input files and estimates the resource requirements of the batch. If there is an error in the input files, the user sees the corresponding LAMMPS error messages. Otherwise, they are shown an estimated completion time for the batch, and an estimate of its disk usage both on the server and on volunteer computers. If either of these is excessive, the user may opt to not submit the batch. Otherwise, they submit the batch by clicking the "Submit" button.

Input validation and runtime estimation are done by running LAMMPS on the project server, checking the output for error messages, aborting the run after a few time steps, and measuring the average CPU time per time step. From this, the number of floating-point operations (FLOPs) required by each job is estimated, and (based on the performance statistics of the volunteer host population) the completion time of the batch is estimated.
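
The arithmetic behind such an estimate might look like the sketch below. It is a deliberately simplified model, not the system's actual estimator: the function names, the server speed `server_flops`, and the aggregate volunteer throughput `pool_flops` are all assumptions, and it ignores scheduling delays, job replication, and host availability.

```python
def estimate_job_flops(cpu_sec_per_step, n_steps, server_flops):
    """FLOPs for one job: measured time per step, scaled to the full run."""
    return cpu_sec_per_step * n_steps * server_flops


def estimate_batch_seconds(job_flops, pool_flops):
    """Completion time if the volunteer pool delivers `pool_flops` FLOPs/sec."""
    return sum(job_flops) / pool_flops


# Hypothetical numbers: 0.02 s per step measured on a 2 GFLOPS server core,
# 100,000 steps per job, 1,000 jobs, 1 TFLOPS of usable volunteer throughput.
per_job = estimate_job_flops(0.02, 100_000, 2e9)        # about 4e12 FLOPs
batch_seconds = estimate_batch_seconds([per_job] * 1000, 1e12)
```

With these made-up figures the batch would take roughly 4,000 seconds; a real estimate would have to account for the factors left out above.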

Batch monitoring, control, and output retrieval

Users can monitor and control batches through a web interface. While a batch is in progress, the user can see its fraction done and an updated estimate of its completion time. In addition, the user can see the status of each of its component jobs: unsent, in progress, failed, or completed. When a job is completed, the user can download its output files.
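
The per-job states and the batch-level "fraction done" can be modelled as in the sketch below. The `JobState` enum mirrors the four states listed above; computing the fraction as completed jobs over total jobs is an assumption, since the real system could weight jobs by size or count partial progress.

```python
from collections import Counter
from enum import Enum


class JobState(Enum):
    UNSENT = "unsent"
    IN_PROGRESS = "in progress"
    FAILED = "failed"
    COMPLETED = "completed"


def summarize(states):
    """Return per-state counts and the fraction of jobs that have completed."""
    counts = Counter(states)
    total = len(states)
    fraction_done = counts[JobState.COMPLETED] / total if total else 0.0
    return counts, fraction_done


states = ([JobState.COMPLETED] * 3 + [JobState.FAILED]
          + [JobState.UNSENT] * 4)
counts, fraction_done = summarize(states)
```

A high count of failed jobs in such a summary is the kind of signal that might prompt a user to abort the batch.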

A batch can be aborted at any time (for example, because outputs of completed jobs are seen to be erroneous, or because many jobs are failing). If this is done, no further jobs from that batch will be issued.

The user can, at any time, download a zipped file of all the output files of all completed jobs in the batch.
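
Bundling the outputs of completed jobs into a single archive can be done with Python's standard `zipfile` module, as in this sketch. The function name, the `job/file` layout inside the archive, and the in-memory input mapping are assumptions made for illustration.

```python
import io
import zipfile


def zip_outputs(outputs):
    """Build a zip archive (as bytes) from {job name: {file name: bytes}}.

    `outputs` is assumed to contain completed jobs only.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for job, files in sorted(outputs.items()):
            for fname, data in sorted(files.items()):
                # One folder per job keeps identically named outputs apart.
                zf.writestr(f"{job}/{fname}", data)
    return buf.getvalue()


archive = zip_outputs({
    "job_0": {"log.lammps": b"step 0\n"},
    "job_1": {"log.lammps": b"step 0\n"},
})
names = zipfile.ZipFile(io.BytesIO(archive)).namelist()
```

Since the archive is rebuilt from whatever has completed so far, the same call serves both partial downloads during a run and the final download.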

After a batch is completed or aborted, and all desired output files have been downloaded, the user can "retire" the batch. This causes its output files and database records to be deleted from the server.