Quickstart

In this quickstart you will learn to: 1. create a workflow 2. generate jobs 3. execute the workflow

1. Create a workflow

We assume you have downloaded and unzipped the Molgenis Compute command line distribution and are now in the directory where you unzipped it.

You can generate a template for a new workflow using the command:

  sh molgenis_compute.sh --create myfirst_workflow

This will create a new directory for the workflow:

  cd myfirst_workflow
  ls

The directory contains a typical Molgenis Compute workflow structure:

  /protocols              #folder with bash 'protocol' scripts
  /protocols/step1.sh     #example of a protocol shell script
  /protocols/step2.sh     #example of a protocol shell script
  workflow.csv            #file listing steps and parameter flow
  workflow.defaults.csv   #default parameters for workflow.csv (optional)
  parameters.csv          #parameters you want to run analysis on
  header.ftl              #user extra script header (optional)
  footer.ftl              #user extra script footer (optional)

Define workflow

You can define a workflow of steps using the workflow.csv file. For example:

  step,protocol,dependencies
  step1,protocols/step1.sh,
  step2,protocols/step2.sh,step1

This example consists of two steps, 'step1' and 'step2', where 'step2' depends on 'step1'. The protocol for 'step1' is in the file protocols/step1.sh and the protocol for 'step2' is in protocols/step2.sh.
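
Additional steps follow the same pattern. For example, a hypothetical 'step3' (not part of the generated template) that depends on 'step2' would be listed as:

  step,protocol,dependencies
  step1,protocols/step1.sh,
  step2,protocols/step2.sh,step1
  step3,protocols/step3.sh,step2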

If we want parameter values to flow between steps, we can also map the parameters:

  step,protocol,parameterMapping
  step1,protocols/step1.sh,in=input
  step2,protocols/step2.sh,wf=workflowName;date=creationDate;strings=step1.out;in=input

Define parameters

Parameter values are fed to your workflow via simple CSV files. In this example, one parameter 'input' has two values, 'hello' and 'bye':

  input
  hello
  bye
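
Each extra row gives 'step1' one more job ('step2' still gets a single job, because it consumes the outputs as one list); for example, with a hypothetical third value 'bonjour':

  input
  hello
  bye
  bonjour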

Define step contents

Finally, you need to implement what happens at each step. We therefore define a 'protocol' for each step. Protocols are simply bash scripts containing the commands you want to run.

For example protocols/step1.sh:

  #string in
  #output out
  echo ${in}_hasBeenInStep1
  out=${in}_hasBeenInStep1

Given the parameters above, 'in' (mapped from 'input') will be substituted with the values 'hello' or 'bye'. In addition, the contents of 'out' will be available to the next step.
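
Because a protocol is plain bash (the '#' lines are ordinary comments), you can sanity-check it by hand; a minimal sketch, assuming you supply 'in' yourself:

  # 'in=hello' is a manual stand-in for the value compute would substitute
  in=hello bash protocols/step1.sh    # prints: hello_hasBeenInStep1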

Inputs are declared either as '#string' for variables with a single value or '#list' for variables with multiple values. Outputs are declared with the flag '#output'.

In the same way, we can map the outputs of one step to the inputs of the next steps. In our example, the output 'out' of 'step1' becomes the list 'strings' in 'step2', whose protocol step2.sh reads:

  #string wf
  #string date
  #list strings
  echo "Workflow name: ${wf}"
  echo "Created: ${date}"
  echo "Result of step1.sh:"
  for s in "${strings[@]}"
  do
    echo ${s}
  done
  echo "(FOR TESTING PURPOSES: your runid is ${runid})"

In our example, the variables 'date' and 'wf' are defined in an additional parameters file, workflow.defaults.csv:

  workflowName,creationDate
  myFirstWorkflow,today

In this way, parameters can be divided into several groups and re-used in different workflows. If you prefer not to map parameters, use the same names in the protocols and the parameter files; the parameters then act as a kind of global.
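
For example, a sketch of step1.sh rewritten against the global name 'input', which would make the 'in=input' mapping unnecessary:

  #string input
  #output out
  # 'input' is the global parameter name taken straight from parameters.csv
  echo ${input}_hasBeenInStep1
  out=${input}_hasBeenInStep1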

2. Generate jobs

Once you have defined your workflow you can generate thousands of jobs; just change the parameter values to produce different runs.

N.B. always use full paths to the parameter files, workflow, etc.

  cd ..
  sh molgenis_compute.sh --generate --parameters myfirst_workflow/parameters.csv --workflow myfirst_workflow/workflow.csv --defaults myfirst_workflow/workflow.defaults.csv

or with the short command-line version:

  cd ..
  sh molgenis_compute.sh -g -p myfirst_workflow/parameters.csv -w myfirst_workflow/workflow.csv -defaults myfirst_workflow/workflow.defaults.csv
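
One way to follow the full-path advice from the N.B. above is to expand the working directory with $(pwd); a sketch reusing the same options:

  # $(pwd) turns the relative paths into absolute ones
  sh molgenis_compute.sh -g \
    -p $(pwd)/myfirst_workflow/parameters.csv \
    -w $(pwd)/myfirst_workflow/workflow.csv \
    -defaults $(pwd)/myfirst_workflow/workflow.defaults.csv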

The directory rundir is created.

  ls rundir/

It contains a number of files:

  doc        
  step1_0.sh    
  step1_1.sh    
  step2_0.sh    
  submit.sh    
  user.env

The .sh files are the actual scripts generated from the specified workflow. 'step1' has two scripts (one per parameter value) and 'step2' only one, because it treats the outputs of the 'step1' scripts as a list, which is specified in step2.sh by

    #list strings

user.env contains all the actual parameter mappings. In this example:

  #
  ## User parameters
  #
  creationDate[0]="today"
  creationDate[1]="today"
  input[0]="hello"
  input[1]="bye"
  workflowName[0]="myFirstWorkflow"
  workflowName[1]="myFirstWorkflow"
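
Since user.env uses plain bash array syntax, you can inspect it from any shell; a minimal sketch (not part of the compute workflow itself):

  source rundir/user.env
  echo "${input[0]} ${input[1]}"    # prints: hello bye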

Parameters that are known beforehand can be connected via the environment file or weaved directly into the protocols (if the 'weave' flag is set in the command-line options). In our example, two shell scripts are generated for 'step1'. The weaved versions of the generated files are shown below.

step1_0.sh:

  #string in
  #output out
  # Let's do something with string 'in'
  echo "hello_hasBeenInStep1"
  out=hello_hasBeenInStep1

and step1_1.sh

  #string in
  #output out
  # Let's do something with string 'in'
  echo "bye_hasBeenInStep1"
  out=bye_hasBeenInStep1

The output values of the first step are not known beforehand, so 'strings' cannot be weaved and stays in the generated 'step2' script as it was. However, the 'wf' and 'date' values are weaved.

step2_0.sh:

  #string wf
  #string date
  #list strings
  echo "Workflow name: myFirstWorkflow"
  echo "Created: today"
  echo "Result of step1.sh:"
  for s in "${strings[@]}"
  do
      echo ${s}
  done

If the values were known, the script would have the following content.

step2_0.sh with all known values:

  #string wf
  #string date
  #list strings
  echo "Workflow name: myFirstWorkflow"
  echo "Created: today"
  echo "Result of step1.sh:"
  for s in "hello" "bye"
  do
      echo ${s}
  done

If the 'weave' flag is not set, the step1_0.sh file, for example, looks as follows:

  # Connect parameters to environment
  input="hello"
  #string input
  # Let's do something with string 'in'
  echo "${input}_hasBeenInStep1"
  out=${input}_hasBeenInStep1

In this way, users can choose what the generated files look like. In the current implementation, values are first taken from the parameter files. If a value is not present there, compute checks whether it can become known at run-time by analysing all previous steps of the workflow (in our example, 'strings' is absent from the parameter files, but 'step1' declares '#output out', so the value will exist at run-time). If a value cannot be known at run-time, compute gives a generation error.

3. Execute workflow

Execute locally

Compute can execute the jobs locally with the command:

  sh molgenis_compute.sh --run
  ls rundir/

Now rundir contains more files:

  doc                
  step1_0.sh            
  step1_1.sh            
  step2_0.sh
  submit.sh
  step1_0.sh.started        
  step1_1.sh.started        
  step2_0.sh.started
  step1_0.env            
  step1_1.env            
  step2_0.env    
  step1_0.sh.finished        
  step1_1.sh.finished        
  step2_0.sh.finished
  molgenis.bookkeeping.log        
  user.env

The .started and .finished files are created when the corresponding jobs are started and finished, respectively.
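
This convention makes it easy to check progress from a shell; a minimal sketch that lists jobs that have started but not yet finished:

  for f in rundir/*.sh.started
  do
    [ -e "${f%.started}.finished" ] || echo "still running: ${f%.sh.started}"
  done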

In our example, the 'strings' variable in 'step2' requires run-time values produced by 'step1'. These values are taken from the step1_X.env files. For example:

step1_0.env:

  step1__has__out[0]=hello_hasBeenInStep1

In the workflow.csv file, this mapping is written with a simple '.'

  strings=step1.out

and the '.' is substituted with '__has__' in the generated script files.
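
Putting it together, a sketch of how 'step2' effectively fills its 'strings' list (assuming, by analogy with step1_0.env above, that step1_1.env defines step1__has__out[1]):

  # not verbatim compute output; just the net effect of the mapping
  source rundir/step1_0.env
  source rundir/step1_1.env
  strings=("${step1__has__out[@]}")
  echo "${strings[@]}"    # prints: hello_hasBeenInStep1 bye_hasBeenInStep1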
