Getting started

1: Clone the PEPPRO pipeline

Clone the pipeline:

git clone https://github.com/databio/peppro.git

2: Install required software

PEPPRO requires a set of common, publicly available bioinformatics tools: samtools, bedtools, bowtie2, seqkit, fastp, seqtk, preseq, fastq-pair, picard, wigToBigWig, and bigWigCat.
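Before running the pipeline, it can help to confirm that these tools resolve on your PATH. The following is a minimal sketch, not part of PEPPRO: missing_tools is a hypothetical helper, and the check assumes each tool is installed as an executable under exactly the name listed above. Some installations provide picard only as a .jar file, in which case it will be reported as missing even though it is available.

```shell
# Sketch: print the name of every given tool that is not found on PATH.
# missing_tools is a hypothetical helper, not part of PEPPRO.
missing_tools() {
  for _tool in "$@"; do
    # command -v succeeds only if the shell can resolve the name
    command -v "$_tool" >/dev/null 2>&1 || printf '%s\n' "$_tool"
  done
  return 0
}

# Tool names copied from the list of required software.
required="samtools bedtools bowtie2 seqkit fastp seqtk preseq fastq-pair picard wigToBigWig bigWigCat"

# $required is intentionally unquoted so it splits into separate names.
missing=$(missing_tools $required)
if [ -n "$missing" ]; then
  printf 'Not found on PATH:\n%s\n' "$missing"
else
  echo "All required tools were found on PATH."
fi
```

The same helper can be reused for the optional tools by passing their names as arguments.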

Python packages

PEPPRO uses several Python packages under the hood. Install or update them with a user-specific install:

cd peppro
pip install --user -r requirements.txt

R package

PEPPRO uses R to produce QC plots, and we include an R package for these functions. From the peppro/ directory:

Rscript -e 'install.packages("PEPPROr", repos=NULL, type="source")'

Optional software

Optionally, PEPPRO can mix and match tools for adapter removal, read trimming, deduplication, and reverse complementation. fqdedup, in particular, is useful if you want to minimize memory use at the expense of speed. We suggest using the default tools because fastx toolkit has not been supported since 2012.

seqOutBias can account for mappability at a given read length when filtering the sample signal.

Optional tools: fqdedup, fastx toolkit, seqOutBias, fastqc, and pigz (v2.3.4+).

3: Download refgenie assemblies

The pipeline relies on refgenie assemblies for alignment. First, initialize a folder for genome indexes and the refgenie config file.

export REFGENIE=your_genome_folder/genome_config.yaml
refgenie init -c $REFGENIE

Then pull the assets you need:

refgenie pull -g hg38 -a bowtie2
refgenie pull -g rCRSd -a bowtie2
refgenie pull -g human_repeats -a bowtie2

Add REFGENIE to your .bashrc or .profile so that it persists across sessions. Alternatively, you can skip the REFGENIE variable and simply change the value of the resources.genome_config option in the pipeline_config.yaml file to point to the folder where you stored the assemblies.
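One way to make the variable persistent is to append the export line to your startup file. The sketch below is an illustration only: persist_refgenie is a hypothetical helper (not part of refgenie or PEPPRO), and it assumes a startup file that uses `export NAME=value` syntax, as .bashrc and .profile do. It skips the write if the file already contains an export line for REFGENIE.

```shell
# Sketch: append an export line for REFGENIE to a shell startup file,
# skipping the write if that file already exports the variable.
# persist_refgenie is a hypothetical helper, not part of refgenie or PEPPRO.
persist_refgenie() {
  config_path=$1   # path to genome_config.yaml
  rc_file=$2       # startup file, e.g. "$HOME/.bashrc"
  if [ -f "$rc_file" ] && grep -q '^export REFGENIE=' "$rc_file"; then
    echo "REFGENIE is already set in $rc_file"
  else
    printf 'export REFGENIE="%s"\n' "$config_path" >> "$rc_file"
  fi
}

# Example call (uncomment and adjust both paths):
# persist_refgenie "$HOME/your_genome_folder/genome_config.yaml" "$HOME/.bashrc"
```

Open a new shell, or source the startup file, for the change to take effect.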

4: Run the pipeline script directly

At its core, the pipeline is just a Python script, and you can run it on the command line for a single sample (see command-line usage; the same information is available by running pipelines/peppro.py --help). You just need to pass a few command-line parameters to specify the sample name, reference genome, input files, etc. Here's the basic command to run the included small test example through the pipeline from the peppro/ directory:

./pipelines/peppro.py \
  --sample-name test \
  --genome hg38 \
  --input examples/data/test_r1.fq.gz \
  --single-or-paired single \
  -O $HOME/peppro_example/

This test example takes less than 5 minutes to complete. Read more about how to run the test sample using Looper with the included example peppro_test.yaml file.

5: Next steps

This is just the beginning. For your next step, take a look at one of these user guides: