Detailed installation instructions

This guide walks you through the minutiae of how to install each prerequisite component. We'll presume you're installing this in a Linux environment. If not the case, you'll need to go to each tool's respective site to find alternative installation approaches and options.

1: Clone the PEPPRO pipeline

To begin, we need to get the PEPPRO pipeline itself. The pipeline is hosted on github. If you don't have git installed, follow the git installation instructions, and here is a brief introduction to git. To install PEPPRO, you can use one of the following methods:

  • using SSH:
git clone [email protected]:databio/peppro.git
  • using HTTPS:
git clone https://github.com/databio/peppro.git

We'll use HTTPS in this example. From an open terminal, let's first create a directory we'll use to run through this guide:

mkdir peppro_tutorial

Let's move into our newly created directory and create a few more folders that we'll use later.

cd peppro_tutorial/
mkdir data
mkdir genomes
mkdir processed
mkdir templates
mkdir tools
cd tools/

Time to get PEPPRO!

git clone https://github.com/databio/peppro.git

Success! If you had any issues, feel free to reach out to us with questions. Otherwise, let's move on to installing additional software.

2: Install required software

You have two options for installing the software prerequisites: 1) use a container, in which case you need only either docker or singularity; or 2) install all prerequisites natively. We'll install everything natively in this guide.

To use PEPPRO, we need the following software: Python packages. The pipeline uses pypiper to run a single sample, looper to handle multi-sample projects (for either local or cluster computation), pararead for parallel processing sequence reads, refgenie to organize and build reference assemblies, cutadapt to remove adapters for single-end data or optionally for paired-end, and the common python libraries numpy and pandas. You can do a user-specific install using the included requirements.txt file in the pipeline directory:

pip install --user -r requirements.txt

Remember to add your user specific install location to your PATH.

export PATH="$PATH:$HOME/.local/bin/"

Required executables. We will need some common bioinformatics tools installed. The complete list (including optional tools) is specified in the pipeline configuration file (pipelines/peppro.yaml) tools section. The following tools are used by the pipeline:

bedtools

We'll install each of these pieces of software before moving forward. Let's start right at the beginning and install bedtools. We're going to install from source, but if you would prefer to install from a package manager, you can follow the instructions in the bedtools' installation guide.

cd tools/
wget https://github.com/arq5x/bedtools2/releases/download/v2.25.0/bedtools-2.25.0.tar.gz
tar -zxvf bedtools-2.25.0.tar.gz
rm bedtools-2.25.0.tar.gz
cd bedtools2
make

Now, let's add bedtools to our PATH environment variable. Look here to learn more about the concept of environment variables if you are unfamiliar.

export PATH="$PATH:/path/to/peppro_tutorial/tools/bedtools2/bin/"

bowtie2

Next, let's install bowtie2. For more more specific instruction, read the author's installation guide.

cd ../
wget https://downloads.sourceforge.net/project/bowtie-bio/bowtie2/2.3.4.1/bowtie2-2.3.4.1-source.zip
unzip bowtie2-2.3.4.1-source.zip
rm bowtie2-2.3.4.1-source.zip
cd bowtie2-2.3.4.1
make

Again, let's add bowtie2 to our PATH environment variable:

export PATH="$PATH:/path/to/peppro_tutorial/tools/bowtie2-2.3.4.1/"

Great! On to the next one.

samtools

Next up, samtools.

wget https://sourceforge.net/projects/samtools../files/samtools/1.9/samtools-1.9.tar.bz2
tar xvfj samtools-1.9.tar.bz2
rm samtools-1.9.tar.bz2
cd samtools-1.9
./configure

Alternatively, if you do not have the ability to install samtools to the default location, you can specify using the --prefix=/install/destination/dir/ option. Learn more about the --prefix option here.

make
make install

As for our other tools, add samtools to our PATH environment variable:

export PATH="$PATH:/path/to/peppro_tutorial/tools/samtools-1.9/"

seqkit

Let's grab seqkit now. Check out the author's installation guide for more instruction if necessary.

cd ../
wget https://github.com/shenwei356/seqkit/releases/download/v0.10.1/seqkit_linux_amd64.tar.gz
tar -zxvf seqkit_linux_amd64.tar.gz

And then make sure that executable is in our PATH.

export PATH="$PATH:/path/to/peppro_tutorial/tools/"

fastp

Next on our list is fastp. Check out their install instructions if necessary.

git clone https://github.com/OpenGene/fastp.git
cd fastp/
make
make install

Add to PATH!

export PATH="$PATH:/path/to/peppro_tutorial/tools/fastp/"

preseq

The pipeline uses preseq to calculate library complexity. Check out the author's page for more instruction.

cd ../
wget http://smithlabresearch.org/downloads/preseq_linux_v2.0.tar.bz2
tar xvfj preseq_linux_v2.0.tar.bz2

Add to PATH!

export PATH="$PATH:/path/to/peppro_tutorial/tools/preseq_v2.0/"

fastq_pair

Finally, because PRO-seq treats read1 differently than read2 in paired-end data, we need to resync paired-end files after processing. We use fastq_pair to do so efficiently.

git clone https://github.com/linsalrob/fastq-pair.git
cd fastq-pair/
mkdir build
cd build/
cmake3 ..
make
make install

picard

PEPPRO is built using PyPiper and relies upon the PyPiper NGSTK tool kit which itself employs Picard. Read the picard installation guide for more assistance.

cd ../../
wget https://github.com/broadinstitute/picard/releases/download/2.20.3/picard.jar
chmod +x picard.jar

Create an environmental variable pointing to the picard.jar file called $PICARD. Alternatively, update the peppro.yaml file with the full PATH to the picard.jar file.

export PICARD="/path/to/peppro_tutorial/tools/picard.jar"

UCSC utilities

Finally, we need a few of the UCSC utilities. You can install the entire set of tools should you choose, but here we'll just grab the subset that we need.

wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/wigToBigWig
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bigWigCat
chmod 755 wigToBigWig
chmod 755 bigWigCat

Add our tools/ directory to our PATH environment variable.

export PATH="$PATH:/path/to/peppro_tutorial/tools/"

That should do it! Now we'll install some optional packages. Of course, these are not required, but for the purposes of this tutorial we're going to be completionists.

Optional software

PEPPRO uses R to generate quality control plots. These are optional and the pipeline will run without them, but you would not get any QC plots. If you need to don't have R installed, you can follow these instructions. We'll use and install the necessary packages in this example. Here is the list of required packages:

To install the needed packages, enter the following command in the pipeline folder:

Rscript -e 'install.packages("PEPPROr", repos=NULL, type="source")'

To extract files quicker, PEPPRO can also utilize pigz in place of gzip if you have it installed. Let's go ahead and do that now. It's not required, but it can help speed everything up when you have many samples to process.

cd /path/to/peppro_tutorial/tools/
wget http://zlib.net/pigz/pigz-2.4.tar.gz
tar xvfz pigz-2.4.tar.gz
rm pigz-2.4.tar.gz
cd pigz-2.4/
make

Don't forget to add this to your PATH too!

export PATH="$PATH:/path/to/peppro_tutorial/tools/pigz-2.4/"

That's it! Everything we need to run PEPPRO to its full potential should be installed. If you are interested and have experience using containers, you can check out the alternate installation methods.

Create environment variables

We also need to create some environment variables to help point looper to where we keep our data files and our tools. You may either set the environment variables up, like we're going to do now, or you may simply hard code the necessary locations in our configuration files. First, let's create a PROCESSED variable that represents the location where we want to save output.

export PROCESSED="/path/to/peppro_tutorial/processed/"

Second, we'll create a variable representing the root path to all our tools named CODEBASE.

export CODEBASE="/path/to/peppro_tutorial/tools/"

(Add these environment variables to your .bashrc or .profile so you don't have to always do this step). Fantastic! Now that we have the pipeline and its requirements installed, we're ready to get our reference genome(s).

3: Download a reference genome

Before we analyze anything, we also need a reference genome. PEPPRO uses refgenie genomes. For the purposes of this tutorial, we'll just download pre-built genomes. Follow the 'refgenie instructions if you'd like to build your own reference genome. First, let's change into our genomes/ folder.

cd /path/to/peppro_tutorial/genomes/
wget http://big.databio.org/refgenomes/hg38.tgz
wget http://cloud.databio.org.s3.amazonaws.com/refgenomes/human_repeats_170502.tgz
wget http://cloud.databio.org.s3.amazonaws.com/refgenomes/rCRSd_170502.tgz
tar xvfz hg38.tgz
tar xvfz human_repeats_170502.tgz
tar xvfz rCRSd_170502.tgz
rm hg38.tgz
rm human_repeats_170502.tgz
rm rCRSd_170502.tgz

4: Point the pipeline to your Refgenie assemblies

Let's also create another environment variable that points to our genomes.

export GENOMES="/path/to/peppro_tutorial/genomes/

(Don't forget to add this to your .bashrc or .profile to ensure it persists).

5: Download or create annotation files

To calculate TSS enrichments, you will need a TSS annotation file in your reference genome directory. If a pre-built version for your genome of interest isn't present, you can quickly create that file yourself. In the reference genome directory, you can perform the following commands for in this example, hg38:

wget -O hg38_TSS_full.txt.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz \
zcat hg38_TSS_full.txt.gz | \
  awk  '{if($4=="+"){print $3"\t"$5"\t"$5"\t"$4"\t"$13}else{print $3"\t"$6"\t"$6"\t"$4"\t"$13}}' | \
  LC_COLLATE=C sort -k1,1 -k2,2n -u > hg38_TSS.tsv

We also have downloadable pre-built genome annotation files for hg38, hg19, mm10, and mm9 that you can use to annotate the reads and peaks. These files annotate 3' and 5' UTR, Exonic, Intronic, Intergenic, Promoter, and Promoter Flanking Regions of the corresponding genome as indicated in Ensembl or UCSC. Simply move the corresponding genome annotation file into the peppro/anno folder. Once present in the peppro/anno folder you don't need to do anything else as the pipeline will look there automatically. Alternatively, you can use the --anno-name pipeline option to directly point to this file when running. You can also learn how to create a custom annotation file to calculate coverage using your own features of interest.

Alright! Time to setup the pipeline configuration files and run our sample.