Run PEPPRO in a container.
A popular approach is to install all dependencies in a single container and use that container for every run. The container works with either docker or singularity. You can run PEPPRO as an individual pipeline on a single sample using the container with docker run or singularity exec, or you can rely on looper, which is already set up to run any pipeline in existing containers using the divvy templating system.
Running PEPPRO using a single, monolithic container.
1: Clone the PEPPRO pipeline
git clone https://github.com/databio/peppro.git
2: Get genome assets
We recommend refgenie to manage all required and optional genome assets. However, PEPPRO can also accept file paths to any of the assets.
2a: Initialize refgenie and download assets
PEPPRO can use refgenie assets for alignment and annotation. Because assets are user-dependent, these files must exist outside of the container. First, install refgenie and initialize a refgenie config file. For example:
pip install refgenie
export REFGENIE=/path/to/your_genome_folder/genome_config.yaml
refgenie init -c $REFGENIE
Add the export REFGENIE line to your .bashrc or .profile to ensure it persists.
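One way to make the export persistent without duplicating it on repeated runs is a small idempotent append. This is only a sketch: persist_line is a hypothetical helper (not part of PEPPRO or refgenie), and the target file is whichever profile your shell actually reads.

```shell
# Hypothetical helper (not part of PEPPRO or refgenie):
# append a line to a file only if that exact line is not already there.
persist_line() {
  line="$1"
  file="$2"
  touch "$file"
  grep -qxF -e "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example call; the path is the placeholder used in the text above.
# persist_line 'export REFGENIE=/path/to/your_genome_folder/genome_config.yaml' "$HOME/.bashrc"
```

Calling the helper twice with the same line leaves a single copy in the file, so it is safe to rerun from a setup script.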
Next, pull the assets you need. Replace hg38 in the example below if you need to use a different genome assembly. If these assets are not available automatically for your genome of interest, then you'll need to build them.
refgenie pull hg38/fasta hg38/bowtie2_index hg38/refgene_anno hg38/ensembl_gtf hg38/ensembl_rb
refgenie build hg38/feat_annotation
PEPPRO also requires a fasta and bowtie2_index asset for any pre-alignment genomes:
refgenie pull human_rDNA/fasta human_rDNA/bowtie2_index
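If you switch genome assemblies often, the list of registry paths can be generated rather than typed. The sketch below only prints the paths; peppro_assets is a hypothetical helper, the asset names are copied from the pull command above, and mm10 is just an illustrative genome name (the sketch does not check that those assets exist on the server).

```shell
# Hypothetical helper: print the refgenie registry paths for one genome.
# Asset names are the ones pulled for hg38 in the command above.
peppro_assets() {
  genome="$1"
  out=""
  for asset in fasta bowtie2_index refgene_anno ensembl_gtf ensembl_rb; do
    out="${out:+$out }$genome/$asset"
  done
  printf '%s\n' "$out"
}

# Show what would be requested for an illustrative genome name.
peppro_assets mm10

# With refgenie installed, the output can be passed on, e.g.:
# refgenie pull $(peppro_assets mm10)
```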
2b: Download assets manually
If you prefer not to use refgenie, you can also download and construct assets manually. Again, because these are user-defined assets, they must exist outside of any container system. The minimum required assets for a genome include:
- a chromosome sizes file: a text file containing "chr" and "size" columns.
- a bowtie2 genome index.
- an ensembl_gtf asset used to build other derived assets including a comprehensive TSS annotation and gene body annotation.
- an [ensembl_rb](http://refgenie.databio.org/en/latest/available_assets/#ensembl_rb) asset containing known genomic features such as promoters and used to produce derived assets such as genomic feature annotations.
- a refgene_anno asset used to produce derived assets including transcription start sites (TSSs), exons, introns, and premature mRNA sequences.
- a genomic feature annotation file (which may also be built locally with refgenie build <genome_name>/feat_annotation).
You can still obtain the pre-constructed assets from the refgenie servers. Under the hood, refgenie uses algorithmically derived genome digests to define genomes unambiguously, and those digests appear in the download URLs below. Here, 2230c535660fb4774114bfa966a62f823fdb6d21acf138d4 is the digest for the human-readable alias "hg38", and b769bcf2deaf9d061d94f2007a0e956249905c64653cb5c8 is the digest for "human_rDNA".
wget -O hg38.fasta.tgz "http://refgenomes.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/fasta?tag=default"
wget -O hg38.bowtie2_index.tgz "http://refgenomes.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/bowtie2_index?tag=default"
wget -O hg38.ensembl_gtf.tgz "http://refgenomes.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/ensembl_gtf?tag=default"
wget -O hg38.ensembl_rb.tgz "http://refgenomes.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/ensembl_rb?tag=default"
wget -O hg38.refgene_anno.tgz "http://refgenomes.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/refgene_anno?tag=default"
wget -O hg38.feat_annotation.gz "http://big.databio.org/peppro/hg38_annotations.bed.gz"
wget -O human_rDNA.fasta.tgz "http://refgenomes.databio.org/v3/assets/archive/b769bcf2deaf9d061d94f2007a0e956249905c64653cb5c8/fasta?tag=default"
wget -O human_rDNA.bowtie2_index.tgz "http://refgenomes.databio.org/v3/assets/archive/b769bcf2deaf9d061d94f2007a0e956249905c64653cb5c8/bowtie2_index?tag=default"
Then, extract these files:
tar xf hg38.fasta.tgz
tar xf hg38.bowtie2_index.tgz
tar xf hg38.ensembl_gtf.tgz
tar xf hg38.ensembl_rb.tgz
tar xf hg38.refgene_anno.tgz
mv hg38.feat_annotation.gz default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.annotation.bed.gz
tar xf human_rDNA.fasta.tgz
tar xf human_rDNA.bowtie2_index.tgz
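The archive URLs above all follow one pattern (digest, asset name, tag), so the download commands can be generated in a loop. The following is a dry-run sketch that prints the wget commands instead of running them; asset_url is a hypothetical helper, and the URL pattern is simply copied from the commands above rather than taken from any API documentation.

```shell
# Hypothetical helper: build an archive URL from a genome digest and an
# asset name, following the pattern of the wget commands above.
asset_url() {
  printf 'http://refgenomes.databio.org/v3/assets/archive/%s/%s?tag=default\n' "$1" "$2"
}

# Digest for the "hg38" alias, as given in the text above.
HG38=2230c535660fb4774114bfa966a62f823fdb6d21acf138d4

# Dry run: print each download command rather than executing it.
for asset in fasta bowtie2_index ensembl_gtf ensembl_rb refgene_anno; do
  echo wget -O "hg38.$asset.tgz" "\"$(asset_url "$HG38" "$asset")\""
done
```

The printed URL is wrapped in quotes because the ? character is a shell wildcard; quoting keeps the shell from trying to expand it when the commands are pasted and run.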
3. Pull the container image.
Docker: You can pull the docker databio/peppro image from dockerhub like this:
docker pull databio/peppro
Or build the image using the included Dockerfile (you can use a recipe in the included Makefile in the peppro/ repository):
make docker
Singularity: You can download the singularity image or build it from the docker image using the Makefile:
make singularity
Now you'll need to tell the pipeline where you saved the singularity image. You can either create an environment variable called $SIMAGES that points to the folder where your image is stored, or you can tweak the pipeline_interface.yaml file so that the compute.singularity_image attribute is pointing to the right location on disk.
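Before running anything, it can help to confirm that $SIMAGES really points at the folder holding the image. The check below is a sketch under one assumption: the image file is named peppro, matching the $SIMAGES/peppro path used in the singularity command in this guide. check_simage is a hypothetical helper.

```shell
# Hypothetical helper: confirm that a folder (default: $SIMAGES) contains
# the image. The file name "peppro" is an assumption based on the
# $SIMAGES/peppro path used elsewhere in this guide.
check_simage() {
  dir="${1:-$SIMAGES}"
  if [ -f "$dir/peppro" ]; then
    echo "found image: $dir/peppro"
  else
    echo "no image at: $dir/peppro" >&2
    return 1
  fi
}

# Example (placeholder path):
# export SIMAGES=/path/to/singularity_images
# check_simage
```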
6. Confirm installation
After setting up your environment to run PEPPRO using containers, you can confirm the pipeline is now executable with your container system using the included checkinstall script. This can either be run directly from the peppro/ repository...
./checkinstall
or from the web:
curl -sSL https://raw.githubusercontent.com/databio/peppro/master/checkinstall | bash
4. Run individual samples in a container
Individual jobs can be run in a container by running the peppro.py command through docker run or singularity exec. You can run containers either on your local computer or in an HPC environment, as long as you have docker or singularity installed. You will need to mount any volumes that contain data required by the pipeline. For example, to use refgenie assets, make sure the volume containing those files is available inside the container; an environment variable such as $GENOMES that points to that directory is a convenient way to refer to it.
For example, run it locally in singularity like this:
singularity exec $SIMAGES/peppro pipelines/peppro.py --help
With docker, you can use:
docker run --rm -it databio/peppro pipelines/peppro.py --help
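The two commands above only print the help text, so they need no data volumes. For a real run, the directory holding the genome assets has to be mounted. The sketch below prints (rather than runs) a docker command that mounts that directory at the same path inside the container and passes it along as GENOMES. peppro_docker_cmd is a hypothetical helper and /path/to/genomes is a placeholder; the image name and flags are the ones already used in this guide.

```shell
# Dry-run sketch: print (do not run) a docker command that mounts the
# directory holding the genome assets at the same path inside the container.
# Assumes the path contains no spaces.
peppro_docker_cmd() {
  genomes="$1"
  shift
  printf 'docker run --rm -it --volume %s:%s -e GENOMES=%s databio/peppro %s\n' \
    "$genomes" "$genomes" "$genomes" "$*"
}

# Placeholder path; substitute the directory your $GENOMES points to.
peppro_docker_cmd /path/to/genomes pipelines/peppro.py --help
```

Mounting the directory at an identical path on both sides means that asset paths recorded on the host stay valid inside the container.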
5. Running multiple samples in a container with looper
To run multiple samples in a container, you simply need to configure looper to use a container-compatible template. The looper documentation has instructions for running jobs in containers.
Container details
Using docker
The pipeline has been run successfully in both Linux and macOS environments. With docker, you need to bind mount the volume that contains the pipeline and your genome assets, and pass the container the same environment variables your host environment uses.
In the first example, we're mounting our home user directory (/home/jps3ag/) which contains the parent directories to our genome assets and to the pipeline itself. We'll also provide the pipeline environment variables, such as $HOME.
Here's that example command in a Linux environment to run the test example through the pipeline (using the manually downloaded genome assets):
docker run --rm -it --volume /home/jps3ag/:/home/jps3ag/ \
-e HOME='/home/jps3ag/' \
databio/peppro \
/home/jps3ag/src/peppro/pipelines/peppro.py --single-or-paired single \
--prealignment-index human_rDNA=/home/jps3ag/src/peppro/default/b769bcf2deaf9d061d94f2007a0e956249905c64653cb5c8 \
--genome hg38 \
--genome-index /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4 \
--chrom-sizes /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.chrom.sizes \
--sample-name test \
--input /home/jps3ag/src/peppro/examples/data/test_r1.fq.gz \
--TSS-name /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_TSS.bed \
--anno-name /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.annotation.bed.gz \
--pre-name /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_pre-mRNA.bed \
--exon-name /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_exons.bed \
--intron-name /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_introns.bed \
--pi-tss /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_ensembl_TSS.bed \
--pi-body /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_ensembl_gene_body.bed \
-O $HOME/peppro_test
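Most arguments in the command above repeat the same directory and digest. One way to cut the repetition, sketched here with a hypothetical hg38_asset helper, is to keep both in variables and build each path from its suffix. The values are copied from the Linux example; the sketch only prints a few of the resulting arguments.

```shell
# Values copied from the Linux example above.
PEPPRO_DIR=/home/jps3ag/src/peppro
HG38=2230c535660fb4774114bfa966a62f823fdb6d21acf138d4

# Hypothetical helper: build the path of an hg38 asset from its suffix.
hg38_asset() {
  printf '%s/default/%s%s\n' "$PEPPRO_DIR" "$HG38" "$1"
}

# Print a few of the arguments the command above passes.
echo "--chrom-sizes $(hg38_asset .chrom.sizes)"
echo "--TSS-name $(hg38_asset _TSS.bed)"
echo "--anno-name $(hg38_asset .annotation.bed.gz)"
```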
In this second example, we'll run the same command in a macOS environment using Docker for Mac.
This requires one minor change to that same example:
- replace /home/ with the /Users/ format, e.g. --volume /Users/jps3ag/:/Users/jps3ag/
Be sure to allocate sufficient memory (6-8GB should generally be adequate) in Docker for Mac.
docker run --rm -it --volume /Users/jps3ag/:/Users/jps3ag/ \
-e HOME="/Users/jps3ag/" \
databio/peppro \
/Users/jps3ag/src/peppro/pipelines/peppro.py --single-or-paired single \
--prealignment-index human_rDNA=/Users/jps3ag/src/peppro/default/b769bcf2deaf9d061d94f2007a0e956249905c64653cb5c8 \
--genome hg38 \
--genome-index /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4 \
--chrom-sizes /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.chrom.sizes \
--sample-name test \
--input /Users/jps3ag/src/peppro/examples/data/test_r1.fq.gz \
--TSS-name /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_TSS.bed \
--anno-name /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.annotation.bed.gz \
--pre-name /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_pre-mRNA.bed \
--exon-name /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_exons.bed \
--intron-name /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_introns.bed \
--pi-tss /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_ensembl_TSS.bed \
--pi-body /Users/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_ensembl_gene_body.bed \
-O peppro_test
Using singularity
First, build a singularity container from the docker image and create a running instance:
singularity build peppro docker://databio/peppro:latest
singularity instance start -B /home/jps3ag/:/home/jps3ag/ peppro peppro_instance
Second, run your command.
singularity exec instance://peppro_instance \
/home/jps3ag/src/peppro/pipelines/peppro.py --single-or-paired single \
--prealignment-index human_rDNA=/home/jps3ag/src/peppro/default/b769bcf2deaf9d061d94f2007a0e956249905c64653cb5c8 \
--genome hg38 \
--genome-index /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4 \
--chrom-sizes /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.chrom.sizes \
--sample-name test \
--input /home/jps3ag/src/peppro/examples/data/test_r1.fq.gz \
--TSS-name /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_TSS.bed \
--anno-name /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.annotation.bed.gz \
--pre-name /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_pre-mRNA.bed \
--exon-name /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_exons.bed \
--intron-name /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_introns.bed \
--pi-tss /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_ensembl_TSS.bed \
--pi-body /home/jps3ag/src/peppro/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4_ensembl_gene_body.bed \
-O peppro_test
Third, close your instance when finished.
singularity instance stop peppro_instance
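The three steps above (start, exec, stop) can be wrapped so that the instance is always stopped, even when the pipeline command fails. This is a sketch, not something shipped with PEPPRO: run_in_instance is a hypothetical helper, and the SINGULARITY variable is an assumption added so that the binary can be swapped out (for example with echo, to preview the commands).

```shell
# Binary to call; override with SINGULARITY=echo to preview the commands.
SINGULARITY="${SINGULARITY:-singularity}"

# Hypothetical helper: start an instance, run a command in it, then stop
# the instance and return the exit status of the command.
# Arguments: image, instance name, bind spec, then the command to run.
run_in_instance() {
  image="$1"
  name="$2"
  bind="$3"
  shift 3
  "$SINGULARITY" instance start -B "$bind" "$image" "$name" || return 1
  "$SINGULARITY" exec "instance://$name" "$@"
  status=$?
  "$SINGULARITY" instance stop "$name"
  return "$status"
}

# Example (names and paths taken from the commands above):
# run_in_instance peppro peppro_instance /home/jps3ag/:/home/jps3ag/ \
#   /home/jps3ag/src/peppro/pipelines/peppro.py --help
```

Returning the status of the exec step, rather than of the stop step, lets a calling script detect a failed pipeline run.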