Genome assets
PEPPRO
can use either manually constructed or refgenie
managed assets. Refgenie
streamlines sample processing, where once assets are built by refgenie
there is minimal argument calls to PEPPRO
to use all assets. Pipeline assets include:
Required
PEPPRO argument |
refgenie asset name |
Description |
---|---|---|
--genome-index |
bowtie2_index |
A genome index file constructed from bowtie2-build |
--chrom-sizes |
With refgenie , this asset is built automatically when you build/pull the fasta asset. |
A text file containing "chr" and "size" columns. |
Optional
PEPPRO argument |
refgenie asset name |
Description |
---|---|---|
--prealignment-names |
Human readable genome alias(es) for refgenie managed bowtie2_index asset(s). |
A space-delimited list of genome names. e.g. ["rCRSd", "human_repeats"] |
--prealignment-index |
bowtie2_index |
A genome index file constructed from bowtie2-build . Used for manually pointing to prealignment genome indices when using bowtie2 (default) for alignment. |
--TSS-name |
refgene_anno . refgenie build/pull the TSS annotation file with this asset. |
Transcription start site (TSS) annotations. e.g. refGene.txt.gz |
--anno-name |
feat_annotation |
A BED-style file with "chr", "start", "end", "genomic feature name", "score" and "strand" columns. |
--pi-tss |
ensembl_gtf.ensembl_tss |
A derived asset from an Ensembl GTF file. Represents all possible TSSs. |
--pi-body |
ensembl_gtf.ensembl_gene_body |
A derived asset from an Ensembl GTF file. Represents all possible gene body coordinates. |
--pre-name |
refgene_anno.refgene_pre_mRNA |
Asset derived from a refGene annotation file. Represents premature mRNA coordinates. |
--exon-name |
refgene_anno.refgene_exon |
Asset derived from a refGene annotation file. Represents all exon coordinates. |
--intron-name |
refgene_anno.refgene_intron |
Asset derived from a refGene annotation file. Represents all intron coordinates. |
--fasta |
fasta The fasta asset. |
A genome fasta file. Required for --sob argument. |
--search-file |
tallymer_index The search_file is built from this refgenie asset. |
File used to search an index of k-mers in the genome of the same size as input read lengths. Only required for --sob argument |
Using refgenie
managed assets
PEPPRO
can utilize refgenie
assets. Because assets are user-dependent, these files must be available natively. Therefore, you need to install and initialize a refgenie config file.. For example:
pip install refgenie
export REFGENIE=/path/to/your_genome_folder/genome_config.yaml
refgenie init -c $REFGENIE
Add the export REFGENIE
line to your .bashrc
or .profile
to ensure it persists.
Next, pull the assets you need. Replace hg38
in the example below if you need to use a different genome assembly. If these assets are not available automatically for your genome of interest, then you'll need to build them. Download all standard assets for hg38
like so:
refgenie pull hg38/fasta hg38/bowtie2_index hg38/refgene_anno hg38/ensembl_gtf hg38/ensembl_rb
refgenie build hg38/feat_annotation
PEPPRO
also requires a fasta
and bowtie2_index
asset for any prealignment genomes:
refgenie pull human_rDNA/fasta human_rDNA/bowtie2_index
Furthermore, you can learn more about using seqOutBias
and the required tallymer_index
here.
Example using refgenie
managed assets
When using refgenie
, you only need to provide the --genome
and --prealignment-names
argument to provide the pipeline with every required index and optional annotation file that exists for those genomes. This means, the TSS file, feature annotation file, and blacklist will all be used without needing to directly specify the paths to these files.
From the peppro/
repository directory:
looper run examples/meta/peppro_test_refgenie.yaml
Using manually managed assets
Assets may also be managed manually and specified directly to the pipeline. While this frees you from needing refgenie
installed and initialized, it does require a few more arguments to be specified.
The TSS annotation file may be specified using --TSS-name </path/to/your_TSS_annotations.bed>
. This file is a BED6
(e.g. chr, start, end, name, score, strand) formatted file.
The feat_annotation
asset may also be directly specified using --anno-name </path/to/your_custom_feature_annotations.bed.gz>
. Read more about using custom reference data.
The pi_tss
asset, representing all possible TSSs for calculating the pause index, may be directly specified using --pi-tss
. This file is a BED6
(e.g. chr, start, end, name, score, strand) formatted file.
The pi_body
asset, representing all possible gene bodies for calculating the pause index, may be directly specified using --pi-body
. This file is a BED6
(e.g. chr, start, end, name, score, strand) formatted file.
The pre_name
asset, representing premature mRNA sequence coordinates, may be directly specified using --pre-name
. This file is a BED6
(e.g. chr, start, end, name (a gene name), score, strand) formatted file.
The exon_name
asset, representing gene exon coordinates, may be directly specified using --exon-name
. This file is a BED6
(e.g. chr, start, end, name (the name of the gene the exon is from), score, strand) formatted file.
The intron_name
asset, representing gene intron coordinates, may be directly specified using --intron-name
. This file is a BED6
(e.g. chr, start, end, name (the name of the gene the intron is from), score, strand) formatted file.
Example using manually managed assets
Even if you are not using refgenie
, you can still grab these assets for all required and optional assets from the refgenie
servers. Refgenie
uses algorithmically derived genome digests under-the-hood to unambiguously define genomes. That's what you'll see being used in the example below when we manually download these assets. Therefore, 2230c535660fb4774114bfa966a62f823fdb6d21acf138d4
is the digest for the human readable alias, "hg38", and b769bcf2deaf9d061d94f2007a0e956249905c64653cb5c8
is the digest for "human_rDNA."
From within the peppro/
repository:
wget -O hg38.fasta.tgz http://refgenomes.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/fasta?tag=default
wget -O hg38.bowtie2_index.tgz http://refgenomes.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/bowtie2_index?tag=default
wget -O hg38.ensembl_gtf.tgz http://refgenomes.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/ensembl_gtf?tag=default
wget -O hg38.ensembl_rb.tgz http://refgenomes.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/ensembl_rb?tag=default
wget -O hg38.refgene_anno.tgz http://refgenomes.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/refgene_anno?tag=default
wget -O hg38.feat_annotation.gz http://big.databio.org/peppro/hg38_annotations.bed.gz
wget -O human_rDNA.fasta.tgz http://refgenomes.databio.org/v3/assets/archive/b769bcf2deaf9d061d94f2007a0e956249905c64653cb5c8/fasta?tag=default
wget -O human_rDNA.bowtie2_index.tgz http://refgenomes.databio.org/v3/assets/archive/b769bcf2deaf9d061d94f2007a0e956249905c64653cb5c8/bowtie2_index?tag=default
Then, extract these files to the peppro/
parent directory:
tar xvf hg38.fasta.tgz
tar xvf hg38.bowtie2_index.tgz
mv hg38.feat_annotation.gz default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.annotation.bed.gz
tar xvf hg38.refgene_anno.tgz
tar xvf hg38.ensembl_rb.tgz
tar xvf hg38.ensembl_gtf.tgz
tar xvf human_rDNA.fasta.tgz
tar xvf human_rDNA.bowtie2_index.tgz
From the peppro/
repository folder (using the manually downloaded genome assets):
looper run examples/meta/peppro_test.yaml