Install and run
1: Clone the PEPPRO repository
git clone https://github.com/databio/peppro.git
2: Install required software
PEPPRO requires a set of Python and R packages to run.
To install the Python packages, or bring them up to date, use a user-specific install:
cd peppro
pip install --user -r requirements.txt
PEPPRO uses R to produce QC plots, and we include an R package for these functions. The PEPPRO R package relies on a handful of additional packages.
To install the prerequisite packages from the command line:
Rscript -e 'install.packages("devtools")'
Rscript -e 'devtools::install_github("pepkit/pepr")'
Rscript -e 'install.packages("BiocManager")'
Rscript -e 'BiocManager::install("GenomicRanges")'
Rscript -e 'devtools::install_github("databio/GenomicDistributions")'
Rscript -e 'BiocManager::install(c("BSgenome", "GenomicFeatures", "ensembldb"))'
Rscript -e 'install.packages("http://big.databio.org/GenomicDistributionsData/GenomicDistributionsData_0.0.1.tar.gz", repos=NULL)'
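Before moving on, it can help to confirm that the prerequisite R packages actually load. The following is a minimal sketch, not part of PEPPRO itself: it assumes the package names match the install commands above and simply asks R whether each namespace is available.

```shell
# Package names taken from the install commands above.
pkgs="devtools pepr BiocManager GenomicRanges GenomicDistributions BSgenome GenomicFeatures ensembldb GenomicDistributionsData"

if command -v Rscript >/dev/null 2>&1; then
    for pkg in $pkgs; do
        # requireNamespace() returns FALSE when the package cannot be loaded;
        # in that case R exits with status 1 and we report the package as missing.
        if Rscript -e "if (!requireNamespace('$pkg', quietly = TRUE)) quit(status = 1)" >/dev/null 2>&1; then
            echo "ok:      $pkg"
        else
            echo "missing: $pkg"
        fi
    done
else
    echo "Rscript is not on PATH; install R first"
fi
```

Any package reported as missing can be reinstalled with the matching command from the list above.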
Then, install the PEPPRO R package. From the peppro/ directory:
Rscript -e 'devtools::install(file.path("PEPPROr/"), dependencies=TRUE, repos="https://cloud.r-project.org/")'
The pipeline also relies on a set of publicly available bioinformatics tools. If you don't want to install this prerequisite software natively, you can learn to run PEPPRO using containers and skip this step.
Otherwise, you'll need to install the following: bedtools, bigWigCat, bowtie2, fastq-pair, flash, picard, preseq, seqkit, samtools, seqtk, and wigToBigWig. If you need help, we provide detailed installation instructions for these tools.
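A quick way to see which of these tools are already available is to look each one up on your PATH. This is a sketch under one assumption: that every tool is exposed as an executable with exactly the name listed above. That may not hold on your system (picard is often distributed as a jar, and fastq-pair may install its binary as fastq_pair), so treat a "missing" line as a prompt to check, not as proof.

```shell
# Executable names are assumed to match the tool names listed above.
tools="bedtools bigWigCat bowtie2 fastq-pair flash picard preseq seqkit samtools seqtk wigToBigWig"
missing=""
checked=0

for tool in $tools; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
        missing="$missing $tool"
    fi
    checked=$((checked + 1))
done

if [ -z "$missing" ]; then
    echo "all $checked tools found"
else
    echo "not found on PATH:$missing"
fi
```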
3: Download refgenie assets
PEPPRO uses refgenie assets for alignment. If you haven't already, initialize a refgenie config file like this:
export REFGENIE=your_genome_folder/genome_config.yaml
refgenie init -c $REFGENIE
Add the export REFGENIE line to your .profile to ensure it persists.
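If you script this step, it is worth making the append idempotent so that repeated runs do not stack duplicate lines in your startup file. The helper below, append_once, is a hypothetical name introduced for illustration. The sketch demonstrates it on a scratch file; point it at your real .profile once you are happy with the result.

```shell
# append_once LINE FILE: add LINE to FILE unless an identical line already exists.
# (append_once is an illustrative helper, not part of refgenie or PEPPRO.)
append_once() {
    ao_line=$1
    ao_file=$2
    touch "$ao_file"
    grep -qxF -- "$ao_line" "$ao_file" || printf '%s\n' "$ao_line" >> "$ao_file"
}

# Demonstrate on a scratch file instead of the real ~/.profile.
profile_demo=$(mktemp)
append_once 'export REFGENIE=your_genome_folder/genome_config.yaml' "$profile_demo"
append_once 'export REFGENIE=your_genome_folder/genome_config.yaml' "$profile_demo"

# The second call is a no-op, so the file holds the line exactly once.
cat "$profile_demo"
```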
Next, pull the assets you need. Replace hg38 in the example below if you need to use a different genome assembly. If these assets are not available automatically for your genome of interest, then you'll need to build them. Download these required assets with this command:
refgenie pull -g hg38 -a fasta bowtie2_index ensembl_gtf ensembl_rb refgene_anno feat_annotation
PEPPRO also requires a bowtie2_index asset for any pre-alignment genomes:
refgenie pull -g human_rDNA -a bowtie2_index
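After pulling, you may want to confirm that each asset resolves to a local path. The sketch below uses refgenie seek for this and assumes two things that you should verify against your refgenie version: that seek accepts the same -g and -a flags as pull, and that it exits with a non-zero status when an asset is absent. It also assumes REFGENIE is exported as described above.

```shell
genome="hg38"
assets="fasta bowtie2_index ensembl_gtf ensembl_rb refgene_anno feat_annotation"

if command -v refgenie >/dev/null 2>&1; then
    for asset in $assets; do
        # Assumption: seek prints the asset path on success and fails otherwise.
        if asset_path=$(refgenie seek -g "$genome" -a "$asset" 2>/dev/null); then
            echo "ok:      $genome/$asset -> $asset_path"
        else
            echo "missing: $genome/$asset"
        fi
    done
else
    echo "refgenie is not on PATH; install it before pulling assets"
fi
```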
PEPPRO can mix and match tools for adapter removal, read trimming, deduplication, and reverse complementation. fqdedup, in particular, is useful if you wish to minimize memory use at the expense of speed. We suggest using the default tools because the fastx toolkit has not been supported since 2012.
seqOutBias can be used to take into account the mappability at a given read length when filtering the sample signal.
4: Run an example project through PEPPRO
Start by running the example project (peppro_test.yaml) in the examples/meta/ folder. PEPPRO uses a project management tool called looper to run the pipeline across samples in a project. Let's use the -d argument to do a dry run, which will create job scripts for every sample in a project, but will not execute them:
cd peppro
looper run -d examples/meta/peppro_test.yaml
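If the looper executable is not in your $PATH, add the directory that contains it to $PATH in your shell startup file (for example .profile). With a pip --user install on Linux, that directory is typically ~/.local/bin.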
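A minimal sketch of such a PATH addition follows. The directory ~/.local/bin is an assumption: it is where pip usually places scripts for a --user install on Linux, but your system may differ (python -m site --user-base prints the base directory your Python uses). The case statement avoids adding the directory twice.

```shell
# Assumption: pip --user placed the looper script in ~/.local/bin.
user_bin="$HOME/.local/bin"

# Append the directory only if it is not already a PATH entry.
case ":$PATH:" in
    *":$user_bin:"*) ;;
    *) PATH="$PATH:$user_bin"; export PATH ;;
esac

if command -v looper >/dev/null 2>&1; then
    echo "looper found at: $(command -v looper)"
else
    echo "looper still not found; check where pip installed it"
fi
```

To make the change permanent, place the equivalent export line in your startup file.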
If that worked, let's actually run the example by taking out the -d flag:
looper run examples/meta/peppro_test.yaml
Or, if you're using bulker to run the pipeline in containers:
bulker activate databio/peppro
looper run examples/meta/peppro_test.yaml
There are lots of other cool things you can do with looper, like doing dry runs, summarizing results, checking on pipeline run status, cleaning intermediate files to save disk space, lumping multiple samples into one job, and more. For details, consult the looper documentation.
5: Configure your project files
To run your own samples, you'll need to organize them in PEP format, which is explained in how to create a PEP and is universal to all pipelines that read PEPs, including PEPPRO. To get you started, there are examples you can adapt in the examples/ folder (e.g. example test PEP). In short, you need two files for your project:
- project config file -- describes output locations, pointers to data, etc.
- sample annotation file -- comma-separated value (CSV) list of your samples.
The sample annotation file must specify these columns:
- library (e.g. 'PRO', 'PROSEQ', 'PRO-seq', 'GRO', 'GROSEQ', 'GRO-seq')
- organism (e.g. 'human' or 'mouse')
- read2 (if paired)
- anything else you wish to include
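As an illustration, the sketch below writes a small sample annotation file and checks that every row has the same number of fields as the header. The file name samples.csv, the sample names, and the FASTQ paths are all hypothetical. The sample_name and read1 columns are assumptions as well (they are not in the list above), so check the example PEP in examples/ for the exact columns your version expects.

```shell
# Hypothetical samples and paths; replace them with your own.
cat > samples.csv <<'EOF'
sample_name,library,organism,read1,read2
sample1,PRO,human,data/sample1_R1.fq.gz,data/sample1_R2.fq.gz
sample2,GRO,mouse,data/sample2_R1.fq.gz,
EOF

# Sanity check: count rows whose field count differs from the header's.
# The single-end sample keeps an empty read2 field, so it still has five fields.
bad_rows=$(awk -F',' 'NR == 1 { n = NF; next } NF != n { c++ } END { print c + 0 }' samples.csv)
echo "rows with a wrong field count: $bad_rows"
```

A check like this catches stray or missing commas before looper reads the file.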
This is just the beginning. For your next step, the extended tutorial will walk you through a real project. Or, take a look at one of the other detailed user guides linked in the side menu.