Automation: Schemes

Schemes were introduced to relion-3.1, where they were called Schedules but to prevent confusion with scheduled jobs in the main GUI, they were renamed to Schemes in relion-4.0 and their functionality was further improved. Schemes aim to provide a generalised methodology for automatic submission of relion jobs. This is useful for creating standardised workflows, for example to be used in on-the-fly processing. The relion_it.py script that was introduced in relion-3.1 has been re-written to work with the Schemes.

The Schemes framework is built around the following key concepts: a directed graph that represents the logic of a series of subsequent relion job-types is encoded in Nodes and Edges. Nodes can be either a relion job or a so-called Operator; Edges form the connections between Nodes. In addition, Schemes have their own Variables.

All information for each Scheme is stored in its own subdirectory of the Schemes/ directory in a relion project, e.g. Schemes/prep. Within each Scheme’s directory, the scheme.star file contains information about all the Variables, Edges, Operators and Jobs.

Variables

Three different types of Variables exist: floatVariables are numbers; booleanVariables are either True or False; and stringVariables are text. Each Variable has a VariableName; a so-called VariableResetValue, at which the value is initialised; and a VariableValue, which may change during execution of the Scheme through the actions of Operators, as outlined below.

One special stringVariable is called email. When this is set, upon completion or upon encountering an error, the Scheme will send an email (through the Linux mail command) to the value of the email stringVariable.

Jobs

Jobs are the first type of Node. They can be of any of the jobtypes defined in the relion pipeliner, i.e. Import, Motion correction, etc, including the new External. Any Variable defined in the Scheme can be set as a parameter in a Job, by using two dollar signs on the GUI or in the job.star file. For example, one could define a floatVariable voltage and use $$voltage on the corresponding input line of an Import job. Upon execution of the job inside the Scheme, the $$voltage will be replaced with the current value of the voltage floatVariable.

Jobs within a Scheme each have a JobName and a JobNameOriginal. The latter is defined upon creation of the job (see next section); the former depends on the execution status of the Scheme, and will be set to the executed relion job’s name, e.g. CtfFind/job004. In addition, each job has a JobMod and a jobHasStarted status. There are two types of JobMode:

new

regardless of jobHasStarted, a new job will be created, with its own new JobName, every time the Schemer passes through this Node.

continue

if jobHasStarted is False, a new job, with its own new JobName, will be created. If jobHasStarted is True, the job will be executed as a continue job inside the existing JobName directory.

When a Scheme executes a Job, it always sets jobHasStarted to True. When a Scheme is reset, the jobHasStarted status for all jobs is set to False.

Operators

Operators are the second type of Node. Each operator within a Scheme has a unique name and a type. Operators can also have an output Variable: output, on which they act, and up to two input Variables: input1 and input2. Most, but not all operators change the value of their output Variable.

The following types of operators act on an output that is a floatVariable:

float=set

output = floatVariable input1

float=plus

output = floatVariable input1 + floatVariable input2

float=minus

output = floatVariable input1 - floatVariable input2

float=mult

output = floatVariable input1 × floatVariable input2

float=divide

output = floatVariable input1 / floatVariable input2

float=round

output = ROUND(floatVariable input1)

float=count_images

sets output to the number of images in the STAR file with the filename in stringVariable input1. stringVariable input2 can be particles, micrographs or movies, depending on what type of images need to be counted.

float=count_words

sets output to the number of words in stringVariable input1, where individual words need to be separated with a , (comma) sign.

float=read_star

sets output to the value of a double or integer that is read from a STAR file. stringVariable input1 defines which variable to read as: starfilename,tablename,metadatalabel. If tablename is a table instead of a list, then floatVariable input2 defines the line number, with the default of zero being the first line.

float=star_table_max

sets output to the maximum value of a column in a starfile table, where stringVariable input1 specifies the column as starfilename,tablename,metadatalabel.

float=star_table_min

sets output to the minimum value of a column in a starfile table, where stringVariable input1 specifies the column as starfilename,tablename,metadatalabel.

float=star_table_avg

sets output to the average value of a column in a starfile table, where stringVariable input1 specifies the column as starfilename,tablename,metadatalabel.

float=star_table_sort_idx

a sorting will be performed on the values of a column in a starfile table, where stringVariable input1 specifies the column as starfilename,tablename,metadatalabel. stringVariable input2 specifies the index in the ordered array: the lowest number is 1, the second lowest is 2, the highest is -1 and the one-but-highest is -2. Then, output is set to the corresponding index in the original table.

The following types of operators act on an output that is a booleanVariable:

bool=set

output = booleanVariable input1

bool=and

output = booleanVariable input1 AND booleanVariable input2

bool=or

output = booleanVariable input1 OR booleanVariable input2

bool=not

output = NOT booleanVariable input1

bool=gt

output = floatVariable input1 > floatVariable input2

bool=lt

output = floatVariable input1 < floatVariable input2

bool=ge

output = floatVariable input1 >= floatVariable input2

bool=le

output = floatVariable input1 <= floatVariable input2

bool=eq

output = floatVariable input1 == floatVariable input2

bool=file_exists

output = True if a file with the filename stored in stringVariable input1 exists on the file system; False otherwise

bool=read_star

reads output from a boolean that is stored inside a STAR file. stringVariable input1 defines which variable to read as: starfilename,tablename,metadatalabel. If tablename is a table instead of a list, then floatVariable input2 defines the line number, with the default of zero being the first line.

The following types of operators act on an output that is a stringVariable:

string=set

output = stringVariable input1

string=join

output = concatenate stringVariable input1 and stringVariable input2

string=before_first

sets output to the substring of stringVariable input1 that occurs before the first instance of substring stringVariable input2.

string=after_first

sets output to the substring of stringVariable input1 that occurs after the first instance of substring stringVariable input2.

string=before_last

sets output to the substring of stringVariable input1 that occurs before the last instance of substring stringVariable input2.

string=after_last

sets output to the substring of stringVariable input1 that occurs after the last instance of substring stringVariable input2.

string=read_star

reads output from a string that is stored inside a STAR file. stringVariable input1 defines which variable to read as: starfilename,tablename,metadatalabel. If tablename is a table instead of a list, then floatVariable input2 defines the line number, with the default of zero being the first line.

string=glob

output = GLOB(stringVariable input1), where input1 contains a Linux wildcard and GLOB is the Linux function that returns all the files that exist for that wildcard. Each existing file will be separated by a comma in the output string.

string=nth_word

output = the Nth substring in stringVariable input1, where N=*floatVariable* input2, and substrings are separated by commas. Counting starts at one, and negative values for input2 mean counting from the end, e.g. input2=-2 means the second-last word.

The following types of operators do not act on any variable:

touch_file

performs touch input1 on the file system

copy_file

performs cp input1 input2 on the file system. stringVariable input1 may contain a linux wildcard. If stringVariable input2 contains a directory structure that does not exist yet, it will be created.

move_file

performs mv input1 input2 on the file system. stringVariable input1 may contain a linux wildcard. If stringVariable input2 contains a directory structure that does not exist yet, it will be created.

delete_file

performs rm -f input1 on the file system. stringVariable input1 may contain a linux wildcard.

email

sends an email, provided a stringVariable with the name email exists and the Linux command mail is functional. The content of the email has the current value of stringVariable input1, and optionally also stringVariable input2.

wait

waits floatVariable input1 seconds since the last time this operator was executed. The first time it is executed, this operator only starts the counter and does not wait. Optionally, if output is defined as a floatVariable, then the elapsed number of seconds since last time is stored in output.

exit_maxtime

terminates the execution of the Scheme after the number of hours have passed since its start as stored in floatVariable input1.

exit

terminates the execution of the Scheme.

Edges

Two types of Edges exist. The first type is a normal Edge, which connects an inputNode to an ouputNode, thereby defining their consecutive execution.

The second type is called a Fork. A Fork has one inputNode, an outputNode, an outputNodeIfTrue, and an associated booleanVariable. Whether one or the other output Node is executed depends on the current value of the booleanVariable that is associated with the Fork. The fork with lead from the inputNode, an outputNode if the booleanVariable is False. The fork will lead from the inputNode, an outputNodeIfTrue if the booleanVariable is True. Thereby, Forks are the main instrument of making decisions in Schemes.

Create a Scheme

The combination of the Variables, Nodes and Edges allows one to create complicated sequences of jobs. It is probably a good idea to draw out a logical flow-chart of your sequence before creating a Scheme. Then, use your favourite text editor to manually edit the files Schemes/SCHEMENAME/scheme.star and all the files Schemes/SCHEMENAME/JOBNAMES/job.star for all the jobs in that Scheme. Following the prep and proc examples in the scripts directory of your relion installation is probably the easiest way to get started.

In the Schemes/SCHEMENAME/scheme.star file, first add all the different variables and operators that you will need.

Note that any variable names that contain a JobNameOriginal of any of the Jobs inside any Scheme that is present in the ProjectDirectory, will be replaced by the current JobName upon execution of an operator. For example, a stringVariable with the value Schemes/prep/ctffind/micrographs_ctf.star will be replaced to something like CtfFind/job003/micrographs_ctf.star upon execution of the job that uses it, assuming that the current JobName of that job is CtfFind/job003/ in the Schemes/prep/scheme.star file.

Then, add your jobs. You can use the normal relion GUI to fill in all parameters of each job that you need and then use the Save job.star options from the Jobs menu to save the job.star file in the corresponding Schemes/SCHEMENAME/JOBNAMES/ directory. Jobs use the same mechanism as described for the Variables above. So, if an Auto-picking job depends on its micrographs STAR file input on a CTF estimation job called ctffind, and this CTF estimation job is part of a Scheme called prep, then the micrographs STAR file input for the Auto-picking job should be set to Schemes/prep/ctffind/micrographs_ctf.star, and this will be converted to CtfFind/job003/micrographs_ctf.star upon execution of the job. In addition, a corresponding edge will be added to the default_pipeliner.star upon execution of the Scheme. Also note that parameters in job.star files may be updated with the current values of Variables from the Scheme by using the $$ prefix, followed by the name of the corresponding Variable, as also mentioned above.

In addition, the JobMode needs to be chosen from options: new or continue. Typically, in on-the-fly-processing procedures that iterate over ever more movies, jobs like Import, Motion correction, CTF estimation, Auto-picking and Particle extraction are set as continue, whereas most other jobs are set as new.

Finally, once all the Variables, Operators and Jobs are in place, one should define all the Edges between them.

The Scheme will be initialised (and reset) to the left-hand Node of the first defined Edge. If the Scheme is not an infinite loop, it is recommended to add the exit Operator as the last Node.

Once a Scheme has been created, it may be useful for more than one relion project. Therefore, you may want to store it in a tar-ball:

tar -zcvf preprocess_scheme.tar.gz Schemes/preprocess

That tar-ball can then be extracted in any new relion project directory:

tar -zxvf preprocess_scheme.tar.gz

Executing a Scheme

Once a Scheme has been created using the –scheme argument to the GUI, it is no longer necessary to provide that argument. One can instead launch the GUI normally (and have slider bars for numbers and Yes/No pull-down menues for booleans):

relion &

The Scheme can then be accessed through the ‘Scheduling’ menu at the top of the GUI, where all defined Schemes are available through the ‘Schemes’ sub-menu. The same GUI can be toggled back into the normal ‘pipeline’ mode from the same menu (or by pressing ALT+’p’ on Linux). If one wants to start a Scheme from scratch, one would typically press the Reset button first, and then press the Run! button. This will lock the Scheme directory from further writing by the GUI and to reflect this, the lower part of the GUI will be de-activated. Once the Scheme finishes, the lock (in effect a hidden directory with the name .relion_lock_scheme_NAME) will be removed and the bottom part of the GUI will be re-activated. One can safely toggle between the pipeliner and the schemer mode during execution of any Scheme, and multiple (different) Schemes can run simultaneously.

When a Scheme for whatever reason dies in error, the lock will not be automatically removed. If this happens, use the Unlock button to remove the lock manually. Be careful not to remove the lock on a running Scheme though, as this itself will cause it to die with an error.

If one would like to stop a running Scheme for whatever reason, one can press the Abort button. This will send an abort signal (i.e. it will create files called RELION_JOB_ABORT_NOW in the job directory of the currently running job, and in the directory of the Scheme itself), which will cause the Scheme to stop, and the lock to be removed. If one were to press the Run! button again, the same Scheme would continue the same execution as before, from the point where it was aborted. Most likely though, one has aborted because one would like to change something in the Scheme execution. For example, one could change parameters of a specific Job. To do so, select that Job by clicking on it in the list of Jobs in the lower part of the GUI. Then, edit the corresponding parameters on the relevant tabs of that Job on the top part of the GUI. Then, one may want to set jobHasStarted status to False, in order to make these options effective for all data processed in the Scheme thus far. For example, after running a Scheme for automated pre-processing for a while, one might want to change the threshold for picking particles in a Auto-picking job. One would then reset the jobHasStarted status of the Auto-picking job to False, while one would leave the jobHasStarted status of other jobs like Motion correction and CTF estimation to True. Thereby, upon a re-start of the Scheme, only new movies would be subjected to Motion correction and CTF estimation inside the same output directories as generated previously, but a new Auto-picking directory would be created, in which all movies acquired thus far would be processed.