PEPPRO pipeline step-by-step guide

In this guide, we'll walk you through running a tutorial PRO-seq dataset through the pipeline, step by step. The output from this process is the same as you see in the example PRO-seq output we've provided. To use this tutorial, you should have a basic familiarity with working in a command-line environment. You also need to have already installed the PEPPRO prerequisites, which you can do by following one of the installation instructions.

1. Set up folders

From an open terminal, let's first create a directory we'll use to run through this guide:

mkdir peppro_tutorial

Let's move into our newly created directory and create a few more folders that we'll use later.

cd peppro_tutorial/
mkdir data
mkdir processed
mkdir divvy_templates
mkdir tools
cd tools/
git clone https://github.com/databio/peppro.git

2. Download tutorial read files

We're going to work with some files a little larger than the test data included in the pipeline so we can see all the features included in a full run of the pipeline. Go ahead and download the tutorial_r1.fq.gz and tutorial_r2.fq.gz files.

wget http://big.databio.org/peppro/fastq/tutorial_r1.fq.gz
wget http://big.databio.org/peppro/fastq/tutorial_r2.fq.gz
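
If you'd like to confirm the downloads completed and peek at the data, a quick check is to list the files and print the first read of R1 (zcat assumes a GNU toolchain; on macOS, use gunzip -c instead):

ls -lh tutorial_r1.fq.gz tutorial_r2.fq.gz
zcat tutorial_r1.fq.gz | head -4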

To simplify the rest of this tutorial, let's put those files in a standard location we'll use for the rest of this guide.

mv tutorial_r1.fq.gz peppro/examples/data/
mv tutorial_r2.fq.gz peppro/examples/data/
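
If you want to double-check that the files landed where the project configuration expects them, list that directory:

ls peppro/examples/data/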

3. Configure project files

We're going to use looper to analyze our data. For that, we need to pass looper a configuration file. This project config file describes your project; see the looper docs for details. A configuration file has been provided for you in the pipeline itself, conveniently named tutorial.yaml. This configuration file also points to our sample; in this case, we've provided a sample for you with the pipeline. You don't have to do anything else at this point and may skip right to running the sample if you'd like. Otherwise, we'll briefly touch on what those configuration files look like.

You can open the configuration file in your favorite text editor if you'd like to look closer. For the purposes of this tutorial, you may safely skip this step if you choose.

cd peppro/examples/meta/
nano tutorial.yaml

The following is what you should see in that configuration file.

# Run tutorial samples through PEPPRO
name: tutorial

pep_version: 2.0.0
sample_table: tutorial.csv

looper:
  output_dir: "$PROCESSED/tutorial"
  pipeline_interfaces: ["$CODEBASE/peppro/project_pipeline_interface.yaml"]

sample_modifiers:
  append:
    pipeline_interfaces: ["$CODEBASE/peppro/sample_pipeline_interface.yaml"] 
  derive:
    attributes: [read1, read2]
    sources:
        R1: "$CODEBASE/peppro/examples/data/{sample_name}_r1.fq.gz"
        R2: "$CODEBASE/peppro/examples/data/{sample_name}_r2.fq.gz"
  imply:
    - if:
        organism: ["human", "Homo sapiens", "Human", "Homo_sapiens"]
      then:
        genome: "hg38"
        prealignments: ["human_rDNA"]

There is also a sample annotation file referenced in our configuration file. The sample annotation file contains metadata and other information about our sample. Just like before, this file, named tutorial.csv, has been provided. You may check it out if you wish; otherwise, we're all set. If you open tutorial.csv, you should see the following:

sample_name,organism,protocol,read_type,read1,read2
tutorial,human,PROSEQ,paired,R1,R2
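
Note that the read1 and read2 columns hold the keys R1 and R2 rather than file paths. The derive block in tutorial.yaml expands these keys using the sample_name, so for this sample they resolve to the files we downloaded earlier (CODEBASE is an environment variable we'll set in the next step):

$CODEBASE/peppro/examples/data/tutorial_r1.fq.gz
$CODEBASE/peppro/examples/data/tutorial_r2.fq.gz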

That's it! Let's analyze that sample!

4. Create environment variables

We also need to create some environment variables to help point looper to where we keep our data files and our tools. You may either set the environment variables, as we're going to do now, or simply hard-code the necessary locations in the configuration files. First, let's create a PROCESSED variable that represents the location where we want to save output.

export PROCESSED="/path/to/peppro_tutorial/processed/"

Second, we'll create a variable representing the root path to all our tools named CODEBASE.

export CODEBASE="/path/to/peppro_tutorial/tools/"

(Add these environment variables to your .bashrc or .profile so you don't have to repeat this step in every new shell; one way to do that is sketched below.) Fantastic! Now that we have the pipeline and its requirements installed, we're ready to set up looper and run the pipeline.
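
For example, assuming you use bash and a ~/.bashrc, you could append the exports like this (the paths below are placeholders; adjust them to your setup):

echo 'export PROCESSED="/path/to/peppro_tutorial/processed/"' >> ~/.bashrc
echo 'export CODEBASE="/path/to/peppro_tutorial/tools/"' >> ~/.bashrc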

5. Use looper to run the pipeline

Looper requires a few variables and configuration files that are specific to your computing environment. Looper uses divvy to manage computing resource configuration so that projects and pipelines can easily travel among environments. For more detailed information, check out the looper docs. Let's set that up now.

cd /path/to/peppro_tutorial/
export DIVCFG="/path/to/peppro_tutorial/compute_config.yaml"
divvy init $DIVCFG
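
To confirm that divvy picked up the new file, you can ask it to list the compute packages it knows about; recent versions of divvy provide a list subcommand for this (if yours doesn't, simply open the file as shown below):

divvy list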

You can open that initialized file in your favorite text editor if you want to learn more about its structure. If you need to edit this file further for your own setup, you can learn more about that in the looper docs.

nano compute_config.yaml

# Use this to change your cluster manager (SLURM, SGE, LSF, etc).
# Relative paths are relative to this compute environment configuration file.
# Compute resource parameters fill the submission_template file's fields.
adapters:
  CODE: looper.command
  JOBNAME: looper.job_name
  CORES: compute.cores
  LOGFILE: looper.log_file
  TIME: compute.time
  MEM: compute.mem
  DOCKER_ARGS: compute.docker_args
  DOCKER_IMAGE: compute.docker_image
  SINGULARITY_IMAGE: compute.singularity_image
  SINGULARITY_ARGS: compute.singularity_args
compute_packages:
  default:
    submission_template: divvy_templates/localhost_template.sub
    submission_command: .
  local:
    submission_template: divvy_templates/localhost_template.sub
    submission_command: .
  slurm:
    submission_template: divvy_templates/slurm_template.sub
    submission_command: sbatch
  singularity:
    submission_template: divvy_templates/localhost_singularity_template.sub
    submission_command: .
    singularity_args: ""
  singularity_slurm:
    submission_template: divvy_templates/slurm_singularity_template.sub
    submission_command: sbatch
    singularity_args: ""
  bulker_local:
    submission_template: divvy_templates/localhost_bulker_template.sub
    submission_command: sh
  docker:
    submission_template: divvy_templates/localhost_docker_template.sub
    submission_command: .
    docker_args: |
      --user=$(id -u):$(id -g) \
      --env="DISPLAY" \
      --volume="/etc/group:/etc/group:ro" \
      --volume="/etc/passwd:/etc/passwd:ro" \
      --volume="/etc/shadow:/etc/shadow:ro"  \
      --volume="/etc/sudoers.d:/etc/sudoers.d:ro" \
      --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" \
      --workdir="`pwd`" \

(Remember to add DIVCFG to your .bashrc or .profile to ensure it persists.) The looper environment configuration file points to submission template(s) in order to know how to run samples locally or using cluster resources. If you'd like to learn more, check out the DIVCFG configuration file and submission templates. We're going to set up a simple local template for the purposes of this tutorial. You can easily create templates for cluster or container use as well! Let's change to our divvy_templates/ directory to make our first submission template.

cd /path/to/peppro_tutorial/divvy_templates/
nano localhost_template.sub

Paste the following into localhost_template.sub. When looper submits a sample, divvy fills in the {CODE} and {LOGFILE} placeholders using the adapters section of compute_config.yaml shown above: {CODE} becomes the pipeline command looper builds for the sample, and {LOGFILE} becomes the path to that sample's log file.

#!/bin/bash

echo 'Compute node:' `hostname`
echo 'Start time:' `date +'%Y-%m-%d %T'`

{
{CODE}
} | tee {LOGFILE}

Save and close that file, and return to the pipeline repository directory.

cd /path/to/peppro_tutorial/tools/peppro/

Now, we'll use looper to run the sample pipeline locally.

looper run examples/meta/tutorial.yaml

Congratulations! Your first sample should be running through the pipeline now. It takes around 25 minutes for this process to complete using a single core, and it peaks at about 3.5 GB of memory.
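
While the sample runs (or once it finishes), you can ask looper to summarize sample statuses for the project; looper's check subcommand is one way to do that (assuming a reasonably recent looper version):

looper check examples/meta/tutorial.yaml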

We will also use looper to run the project pipeline locally. At the project level, we can aggregate all the samples in our project (just one in this simple case) and view everything together.

looper runp examples/meta/tutorial.yaml

After the pipeline is finished, we can look through the output directory together. We've provided a breakdown of that directory in the browse output page.
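
One quick way to browse it from the command line is to list the output directory we configured with the PROCESSED variable:

ls /path/to/peppro_tutorial/processed/tutorial/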

6. Generate an HTML report using looper

Let's take full advantage of looper and generate a pipeline HTML report that makes all our results easy to view and browse. If you'd like to skip right to the results and see what it looks like, check out the tutorial results. Otherwise, let's generate a report ourselves. Using our same configuration file we used to run the samples through the pipeline, we'll now employ the report function of looper.

looper report examples/meta/tutorial.yaml

That's it! Easy, right? Looper conveniently provides you with the location where the HTML report is produced. You may either open the report in your preferred web browser using the path provided, or change directories to the report's location and open it there. Let's go ahead and change into the directory that contains the report.

cd /path/to/peppro_tutorial/processed/tutorial/
firefox tutorial_summary.html
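
If Firefox isn't available, any web browser will work; on many Linux desktops, for example, xdg-open launches the default browser:

xdg-open tutorial_summary.html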

The HTML report contains a summary page that integrates the project-level summary table and any project-level objects, including raw aligned reads, percent aligned reads, and TSS enrichment scores. The status page lists all the samples in this project along with their current status, a link to their log files, the time it took to run each sample, and the peak memory used during the run. The objects page provides links to separate pages for each object type, and each object page presents all the individual samples' objects. Similarly, the samples page contains links to individual pages for each sample. The sample pages list the individual summary statistics for that sample as well as links to log files, command logs, and summary files. The sample pages also provide links and thumbnails for any individual objects generated for that sample. Of course, all of these files are present in the sample directory, but the report provides easy access to them all.