Detailed installation instructions
This guide walks you through the minutiae of how to install each prerequisite component. We'll presume you're installing this in a Linux environment. If that is not the case, you'll need to go to each tool's respective site to find alternative installation approaches and options.
Install required software
You have two options for installing the software prerequisites: 1) use a container, in which case you need only a container engine such as singularity; or 2) install all prerequisites natively. We'll install everything natively in this guide. If you want to try the container approach, read PEPPRO in containers.
To run PEPPRO, we need the following software:
Python packages. The pipeline uses pypiper to run a single sample, looper to handle multi-sample projects (for either local or cluster computation), pararead for parallel processing of sequence reads, refgenie to organize, build, and manage reference genome assets, cutadapt to remove adapters, and the common pandas library. You can do a user-specific install using the included requirements.txt file in the pipeline directory:
pip install --user -r requirements.txt
Remember to add your user-specific install location to your PATH.
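As a minimal sketch of that step, the snippet below asks Python for its user-level install base and appends the matching bin directory to PATH. It assumes a python3 executable is available and that pip placed its scripts under the user base (typically ~/.local on Linux); the USER_BASE variable name is illustrative only.

```shell
# Ask Python where "pip install --user" puts things (typically ~/.local on Linux).
# Assumption: python3 is the interpreter you used for the pip install above.
USER_BASE="$(python3 -m site --user-base)"

# Console scripts installed with --user land in the bin/ subdirectory.
export PATH="$PATH:$USER_BASE/bin"
```

This only affects the current shell session; add the export line to your shell startup file if you want it to persist.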
Required executables. We will need some common bioinformatics tools installed. The complete list (including optional tools) is specified in the pipeline configuration file (pipelines/peppro.yaml) tools section. The following tools are used by the pipeline:
- bedtools (v2.25.0+)
- bowtie2 (v2.2.9+)
- samtools (v1.7)
- Two specific UCSC tools (v3.5.1)
We'll install each of these pieces of software before moving forward. Let's start right at the beginning and install bedtools. We're going to install from source, but if you would prefer to install from a package manager, you can follow the instructions in the bedtools installation guide.
cd tools/
wget https://github.com/arq5x/bedtools2/releases/download/v2.25.0/bedtools-2.25.0.tar.gz
tar -zxvf bedtools-2.25.0.tar.gz
rm bedtools-2.25.0.tar.gz
cd bedtools2
make
Now, let's add bedtools to our PATH environment variable. Look here to learn more about the concept of environment variables if you are unfamiliar.
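A sketch of what this looks like in practice, assuming the tutorial's tools/ directory lives at /path/to/peppro_tutorial/tools (replace with your own location). The TOOLS_DIR variable is introduced here only for illustration; the same pattern applies to the other tools installed later in this guide.

```shell
# Assumed location of the tutorial's tools/ directory; adjust to your system.
TOOLS_DIR="/path/to/peppro_tutorial/tools"

# "make" places the bedtools executables in bedtools2/bin.
# Append that directory to PATH for the current shell session.
export PATH="$PATH:$TOOLS_DIR/bedtools2/bin"
```

Running `command -v bedtools` afterwards is a quick way to confirm that the shell can find the executable.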
Next, let's install bowtie2. For more specific instructions, read the author's installation guide.
cd ../ wget https://downloads.sourceforge.net/project/bowtie-bio/bowtie2/126.96.36.199/bowtie2-188.8.131.52-source.zip unzip bowtie2-184.108.40.206-source.zip rm bowtie2-220.127.116.11-source.zip cd bowtie2-18.104.22.168 make cd ../
Again, let's add bowtie2 to our PATH environment variable.
Great! On to the next one.
Next, because PRO-seq treats read1 differently than read2 in paired-end data, we need to resync paired-end files after processing. We use fastq_pair to do so efficiently.
git clone https://github.com/linsalrob/fastq-pair.git
cd fastq-pair/
mkdir build
cd build/
cmake3 ..
make
make install
cd ../../
To obtain a plot to evaluate library quality when we have paired-end reads, we use FLASH to generate a distribution of reads.
wget http://ccb.jhu.edu/software/FLASH/FLASH-1.2.11-Linux-x86_64.tar.gz
tar xvfz FLASH-1.2.11-Linux-x86_64.tar.gz
And let's add FLASH to our PATH environment variable.
PEPPRO is built using PyPiper and relies upon the PyPiper NGSTK tool kit, which itself employs Picard. Read the picard installation guide for more assistance.
wget https://github.com/broadinstitute/picard/releases/download/2.20.3/picard.jar
chmod +x picard.jar
Create an environment variable called $PICARD that points to the picard.jar file. Alternatively, update the peppro.yaml file with the full path to the picard.jar file.
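For the environment variable route, a minimal sketch follows. The directory shown is a placeholder, assuming picard.jar was downloaded into the tutorial's tools/ directory; use the absolute path on your own system.

```shell
# Placeholder location; replace with the absolute path to your downloaded picard.jar.
export PICARD="/path/to/peppro_tutorial/tools/picard.jar"
```

As with PATH changes, this lasts only for the current session unless you also add the line to your shell startup file.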
The pipeline uses preseq to calculate library complexity. Check out the author's page for more instructions.
wget http://smithlabresearch.org/downloads/preseq_linux_v2.0.tar.bz2
tar xvfj preseq_linux_v2.0.tar.bz2
Next, let's install samtools.
wget https://github.com/samtools/samtools/releases/download/1.10/samtools-1.10.tar.bz2
tar xvfj samtools-1.10.tar.bz2
rm samtools-1.10.tar.bz2
cd samtools-1.10/
./configure
Alternatively, if you do not have the ability to install samtools to the default location, you can specify a different destination by passing the --prefix=/install/destination/dir/ option to ./configure. Learn more about the --prefix option here.
make
make install
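To illustrate the --prefix route described above, here is a sketch of a build that installs into a user-writable directory. The destination $HOME/peppro_tools/samtools is a hypothetical choice, and the guard simply skips the build when the snippet is not run from inside the unpacked samtools-1.10/ directory.

```shell
# Hypothetical install destination; any directory you can write to works.
PREFIX="$HOME/peppro_tools/samtools"

# Run from inside the unpacked samtools-1.10/ directory.
if [ -x ./configure ]; then
    ./configure --prefix="$PREFIX"
    make
    make install
fi

# "make install" places the samtools executable under $PREFIX/bin,
# so that is the directory to add to PATH.
export PATH="$PATH:$PREFIX/bin"
```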
As for our other tools, add samtools to our PATH environment variable.
Let's install seqkit now. Check out the author's installation guide for more instructions if necessary.
cd ../
wget https://github.com/shenwei356/seqkit/releases/download/v0.10.1/seqkit_linux_amd64.tar.gz
tar -zxvf seqkit_linux_amd64.tar.gz
And then make sure that executable is in our PATH.
Finally, we need a few of the UCSC utilities. You can install the entire set of tools should you choose, but here we'll just grab the subset that we need.
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/wigToBigWig
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bigWigCat
chmod 755 wigToBigWig
chmod 755 bigWigCat
Add the tools/ directory to our PATH environment variable.
That should do it! Now we'll install some optional packages. Of course, these are not required, but for the purposes of this tutorial we're going to be completionists.
PEPPRO uses R to generate quality control plots. These are optional and the pipeline will run without them, but you would not get any QC plots. If you don't have R installed, you can follow these instructions. We'll install and use the necessary packages in this example. Here is the list of required packages:
- data.table (v1.11.2)
- GenomicDistributions (v0.5)
- ggplot2 (v2.2.1)
- pepr (v0.2.1)
- optigrab (v0.9.2.1)
To install the needed packages, enter the following command in the pipeline folder:
Rscript -e 'install.packages("devtools")'
Rscript -e 'devtools::install_github("pepkit/pepr")'
Rscript -e 'install.packages("BiocManager")'
Rscript -e 'BiocManager::install("GenomicRanges")'
Rscript -e 'devtools::install_github("databio/GenomicDistributions")'
Rscript -e 'install.packages("http://big.databio.org/GenomicDistributionsData/GenomicDistributionsData_0.0.1.tar.gz", repos=NULL)'
Rscript -e 'devtools::install(file.path("PEPPROr/"), dependencies=TRUE, repos="https://cloud.r-project.org/")'
To extract files more quickly, PEPPRO can also utilize pigz in place of gzip if you have it installed. Let's go ahead and do that now. It's not required, but it can help speed everything up when you have many samples to process.
cd /path/to/peppro_tutorial/tools/
wget http://zlib.net/pigz/pigz-2.4.tar.gz
tar xvfz pigz-2.4.tar.gz
rm pigz-2.4.tar.gz
cd pigz-2.4/
make
Don't forget to add this to your PATH.
PEPPRO uses refgenie assets for alignment, quality control reports, and some outputs. You can initialize a refgenie config file like this:
export REFGENIE=your_genome_folder/genome_config.yaml
refgenie init -c $REFGENIE
Add the export REFGENIE line to your .profile to ensure it persists.
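One way to do this without creating duplicate entries is sketched below. The PROFILE_FILE and LINE variables are illustrative names, and the snippet assumes ~/.profile is the startup file your login shell reads; an absolute path to the config file is generally safer than the relative placeholder shown here.

```shell
# Startup file to modify; override PROFILE_FILE to target a different file.
PROFILE_FILE="${PROFILE_FILE:-$HOME/.profile}"

# The exact line to persist (placeholder path taken from the example above).
LINE='export REFGENIE=your_genome_folder/genome_config.yaml'

# Append the line only if an identical line is not already present.
grep -qxF "$LINE" "$PROFILE_FILE" 2>/dev/null || printf '%s\n' "$LINE" >> "$PROFILE_FILE"
```

Because of the grep check, running the snippet repeatedly leaves a single copy of the line in the file.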
Next, pull the assets you need. Replace hg38 in the example below if you need to use a different genome assembly. If these assets are not available automatically for your genome of interest, then you'll need to build them. Download these required assets with this command:
refgenie pull -g hg38 -a bowtie2_index ensembl_gtf ensembl_rb refgene_anno feat_annotation
PEPPRO also requires a bowtie2_index for any pre-alignment genomes:
refgenie pull -g human_rDNA -a bowtie2_index
That's it! Everything we need to run PEPPRO to its full potential should be installed.
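As a final sanity check, a short loop like the following can report which executables the shell can currently find. The executable names listed are assumptions based on the tools installed in this guide (for example, that FLASH installs as flash and fastq-pair as fastq_pair); adjust the list to match your installation.

```shell
# Count how many expected executables cannot be found on PATH.
missing=0

# Assumed executable names for the tools installed in this guide.
for tool in bedtools bowtie2 samtools fastq_pair flash preseq seqkit wigToBigWig bigWigCat pigz; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found:   $tool"
    else
        echo "MISSING: $tool"
        missing=$((missing + 1))
    fi
done

echo "$missing tool(s) missing from PATH"
```

Any tool reported as missing usually means its directory was not added to PATH, or that the change was made in a different shell session.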