Run PEPPRO in a container.

A popular approach is installing all dependencies in a container and just use that single container. This container can be used with either docker or singularity. You can run PEPPRO as an individual pipeline on a single sample using the container with docker run or singularity exec. Or, you can rely on looper, which is already set up to run any pipeline in existing containers using the divvy templating system.

Running PEPPRO using a single, monolithic container.

1: Clone the PEPPRO pipeline

git clone https://github.com/databio/peppro.git

2: Get genome assets

We recommend refgenie to manage all required and optional genome assets. However, PEPPRO can also accept file paths to any of the assets.

2a: Initialize refgenie and download assets

PEPPRO can use refgenie assets for alignment and annotation. Because assets are user-dependent, these files must still exist outside of a container system. We need to install and initialize a refgenie config file.. For example:

pip install refgenie
export REFGENIE=/path/to/your_genome_folder/genome_config.yaml
refgenie init -c $REFGENIE

Add the export REFGENIE line to your .bashrc or .profile to ensure it persists.

Next, pull the assets you need. Replace hg38 in the example below if you need to use a different genome assembly. If these assets are not available automatically for your genome of interest, then you'll need to build them.

refgenie pull hg38/fasta hg38/bowtie2_index hg38/refgene_anno hg38/ensembl_gtf hg38/ensembl_rb
refgenie build hg38/feat_annotation

PEPPRO also requires a fasta and bowtie2_index asset for any pre-alignment genomes:

refgenie pull human_rDNA/fasta human_rDNA/bowtie2_index

2b: Download assets manually

If you prefer not to use refgenie, you can also download and construct assets manually. Again, because these are user-defined assets, they must exist outside of any container system. The minimum required assets for a genome includes:
- a chromosome sizes file: a text file containing "chr" and "size" columns.
- a bowtie2 genome index. - an ensembl_gtf asset used to build other derived assets including a comprehensive TSS annotation and gene body annotation. - an [ensembl_rb] (http://refgenie.databio.org/en/latest/available_assets/#ensembl_rb) asset containing known genomic features such as promoters and used to produce derived assets such as genomic feature annotations. - a refgene_anno asset used to produce derived assets including transcription start sites (TSSs), exons, introns, and premature mRNA sequences. - a genomic feature annotation file (which may also be built locally through the refgenie build <genome_name>/feat_annotation)

You can still obtain the pre-constructed assets from the refgenie servers. Refgenie uses algorithmically derived genome digests under-the-hood to unambiguously define genomes. That's what you'll see being used in the example below when we manually download these assets. Therefore, 2230c535660fb4774114bfa966a62f823fdb6d21acf138d4 is the digest for the human readable alias, "hg38", and b769bcf2deaf9d061d94f2007a0e956249905c64653cb5c8 is the digest for "human_rDNA."

wget -O hg38.fasta.tgz http://refgenomes.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/fasta?tag=default
wget  -O hg38.bowtie2_index.tgz http://refgenomes.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/bowtie2_index?tag=default
wget  -O hg38.ensembl_gtf.tgz http://refgenomes.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/ensembl_gtf?tag=default
wget  -O hg38.ensembl_rb.tgz http://refgenomes.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/ensembl_rb?tag=default
wget  -O hg38.refgene_anno.tgz http://refgenomes.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/refgene_anno?tag=default
wget -O hg38.feat_annotation.gz http://big.databio.org/peppro/hg38_annotations.bed.gz
wget  -O human_rDNA.fasta.tgz http://refgenomes.databio.org/v3/assets/archive/b769bcf2deaf9d061d94f2007a0e956249905c64653cb5c8/fasta?tag=default
wget  -O human_rDNA.bowtie2_index.tgz http://refgenomes.databio.org/v3/assets/archive/b769bcf2deaf9d061d94f2007a0e956249905c64653cb5c8/bowtie2_index?tag=default

Then, extract these files:

tar xf hg38.fasta.tgz
tar xf hg38.bowtie2_index.tgz
tar xf hg38.ensembl_gtf.tgz
tar xf hg38.ensembl_rb.tgz
tar xf hg38.refgene_anno.tgz
mv hg38.feat_annotation.gz default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.annotation.bed.gz
tar xf human_rDNA.fasta.tgz
tar xf human_rDNA.bowtie2_index.tgz

3. Pull the container image.

Docker: You can pull the docker databio/peppro image from dockerhub like this:

docker pull databio/peppro

Or build the image using the included Dockerfile (you can use a recipe in the included Makefile in the peppro/ repository):

make docker

Singularity: You can download the singularity image or build it from the docker image using the Makefile:

make singularity

Now you'll need to tell the pipeline where you saved the singularity image. You can either create an environment variable called $SIMAGES that points to the folder where your image is stored, or you can tweak the pipeline_interface.yaml file so that the compute.singularity_image attribute is pointing to the right location on disk.

6. Confirm installation

After setting up your environment to run PEPPRO using containers, you can confirm the pipeline is now executable with your container system using the included checkinstall script. This can either be run directly from the peppro/ repository...

./checkinstall

or from the web:

curl -sSL https://raw.githubusercontent.com/databio/peppro/checkinstall | bash

4. Run individual samples in a container

Individual jobs can be run in a container by simply running the peppro.py command through docker run or singularity exec. You can run containers either on your local computer, or in an HPC environment, as long as you have docker or singularity installed. You will need to include any volumes that contain data required by the pipeline. For example, to utilize refgenie assets you'll need to ensure the volume containing those files is available. In the following example, we are including an environment variable ($GENOMES) which points to such a directory.

For example, run it locally in singularity like this:

singularity exec $SIMAGES/peppro pipelines/peppro.py --help

With docker, you can use:

docker run --rm -it databio/peppro pipelines/peppro.py --help

5. Running multiple samples in a container with looper

To run multiple samples in a container, you simply need to configure looper to use a container-compatible template. The looper documentation has instructions for running jobs in containers.

Container details

Using docker

The pipeline has been successfully run in both a Linux and MacOS environment. With docker you need to bind mount your volume that contains the pipeline and your genome assets locations, as well as provide the container the same environment variables your host environment is using.

In the first example, we're mounting our home user directory (/home/jps3ag/) which contains the parent directories to our genome assets and to the pipeline itself. We'll also provide the pipeline environment variables, such as $HOME.

Here's that example command in a Linux environment to run the test example through the pipeline (using the manually downloaded genome assets):

docker run --rm -it --volume /home/jps3ag/:/home/jps3ag/ \
  -e HOME='/home/jps3ag/' \
  databio/peppro \
  /home/jps3ag/src/peppro/pipelines/peppro.py --single-or-paired single \
  --prealignment-index human_rDNA=default/b769bcf2deaf9d061d94f2007a0e956249905c64653cb5c8 \
  --genome hg38 \
  --genome-index /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4 \
  --chrom-sizes /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.chrom.sizes \
  --sample-name test \
  --input /home/jps3ag/src/peppro/examples/data/test_r1.fq.gz \
  --TSS-name /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_TSS.bed \
  --anno-name /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.annotation.bed.gz \
  --pre-name /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_pre-mRNA.bed \
  --exon-name /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_exons.bed \
  --intron-name /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_introns.bed \
  --pi-tss /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_ensembl_TSS.bed \
  --pi-body /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_ensembl_gene_body.bed \
  -O $HOME/peppro_test

In this second example, we'll perform the same command in a MacOS environment using Docker for Mac.

This necessitates a few minor changes to run that same example:

  • replace /home/ with /Users/ format
  • e.g. --volume /Users/jps3ag/:/Users/jps3ag/

Be sure to allocate sufficient memory (6-8GB should generally be adequate) in Docker for Mac.

docker run --rm -it --volume /Users/jps3ag/:/Users/jps3ag/ \
  -e HOME="/Users/jps3ag/" \
  databio/peppro \
  /Users/jps3ag/src/peppro/pipelines/peppro.py --single-or-paired single \
  --prealignment-index human_rDNA=/Users/jps3ag/src/peppro/default/b769bcf2deaf9d061d94f2007a0e956249905c64653cb5c8 \
  --genome hg38 \
  --genome-index /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4 \
  --chrom-sizes /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.chrom.sizes \
  --sample-name test \
  --input /Users/jps3ag/src/peppro/examples/data/test_r1.fq.gz \
  --TSS-name /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_TSS.bed \
  --anno-name /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.annotation.bed.gz \
  --pre-name /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_pre-mRNA.bed \
  --exon-name /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_exons.bed \
  --intron-name /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_introns.bed \
  --pi-tss /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_ensembl_TSS.bed \
  --pi-body /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_ensembl_gene_body.bed \
  -O peppro_test

Using singularity

First, build a singularity container from the docker image and create a running instance:

singularity build peppro docker://databio/peppro:latest
singularity instance start -B /home/jps3ag/:/home/jps3aq/ peppro peppro_instance

Second, run your command.

singularity exec instance://peppro_instance \
  /home/jps3ag/src/peppro/pipelines/peppro.py --single-or-paired single \
  --prealignment-index human_rDNA=/Users/jps3ag/src/peppro/default/b769bcf2deaf9d061d94f2007a0e956249905c64653cb5c8 \
  --genome hg38 \
  --genome-index /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4 \
  --chrom-sizes /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.chrom.sizes \
  --sample-name test \
  --input /home/jps3ag/src/peppro/examples/data/test_r1.fq.gz \
  --TSS-name /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_TSS.bed \
  --anno-name /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.annotation.bed.gz \
  --pre-name /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_pre-mRNA.bed \
  --exon-name /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_exons.bed \
  --intron-name /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_introns.bed \
  --pi-tss /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_ensembl_TSS.bed \
  --pi-body /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_ensembl_gene_body.bed \
  -O peppro_test

Third, close your instance when finished.

singularity instance stop peppro_instance