Configure seqOutBias assets

seqOutBias is an optional tool that corrects for enzymatic bias and generates stranded bigWigs for visualization. This correction can be important because bias from the T4 RNA Ligase used in PRO-seq protocols can lead to erroneous conclusions [^Martins2018]. The bias is corrected by taking the ratio of genome-wide observed read counts to the expected sequence-based counts for each k-mer [^Martins2018]. The k-mer counts take into account mappability at a given read length using GenomeTools’ Tallymer program [^Kurtz2008].

Using seqOutBias therefore requires two additional refgenie assets: the tallymer_index and the suffixerator_index. Generating the indexed k-mers (tallymer_index) for a sample requires an enhanced suffix array (suffixerator_index) for your primary alignment genome. The k-mer index is then produced from this enhanced suffix array for the specific read length of your library. This means you need a separate k-mer index for each distinct read length among your samples of interest.
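Because each distinct read length needs its own tallymer_index, it can help to check which read lengths a library actually contains before building anything. The following is a minimal sketch, not part of the pipeline: `read_lengths` is a hypothetical helper name, `sample.fastq.gz` is a placeholder file name, and it assumes a standard four-line-per-record FASTQ.

```shell
# Hypothetical helper: tabulate read lengths from an uncompressed FASTQ
# stream on stdin. Assumes four lines per record, so the sequence is
# always the second line of each group of four.
read_lengths() {
  awk 'NR % 4 == 2 { print length($0) }' | sort -n | uniq -c
}

# Usage (sample.fastq.gz is a placeholder; 40000 lines = first 10000 reads):
#   gzip -dc sample.fastq.gz | head -n 40000 | read_lengths
```

Each output line shows a count followed by a read length. If adapter or quality trimming has already been applied, the lengths will vary, so this check is most informative on raw reads.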

For example, suppose a sample of interest has a read length of 75 and will be aligned to the human genome, hg38. The tallymer_index must be tagged with the read length, because the pipeline looks for the tallymer_index tagged with that specific read length when using seqOutBias.

First, we'll create the enhanced suffix array. This asset only needs to be produced once per genome. For larger genomes, the build can require a significant amount of memory to complete in a reasonable time, likely around one to two hours.

refgenie build hg38/suffixerator_index --params memlimit=24GB

Once the suffixerator_index is complete, we can generate the tallymer_index for read length 75 and tag the asset with that read length. This build is much faster than the suffixerator_index build.

refgenie build hg38/tallymer_index:75 --params mersize=75
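If your samples span several read lengths, the same build command can be repeated once per length. The sketch below only prints the commands so they can be reviewed first. The read lengths listed are illustrative placeholders, and `build_tallymer_cmds` is a hypothetical helper name. It assumes the suffixerator_index for the genome has already been built.

```shell
# Sketch: emit one tallymer_index build command per distinct read length.
GENOME="hg38"              # genome used in the example above
READ_LENGTHS="36 50 75"    # placeholder values; substitute your own

build_tallymer_cmds() {
  # The tag and the mersize parameter are both set to the read length,
  # mirroring the single-length command shown above.
  for len in $READ_LENGTHS; do
    echo "refgenie build ${GENOME}/tallymer_index:${len} --params mersize=${len}"
  done
}

# Print the commands for review.
build_tallymer_cmds
```

After checking the printed commands, they could be executed by piping the output to a shell, for example `build_tallymer_cmds | sh`.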

You are now all set to run the pipeline with seqOutBias (argument: --sob) to correct for enzymatic biases.

[^Martins2018]: Martins AL, Walavalkar NM, Anderson WD, Zang C, Guertin MJ. Universal correction of enzymatic sequence bias reveals molecular signatures of protein/DNA interactions. Nucleic Acids Res. 2018;46(2):e9. doi:10.1093/nar/gkx1053

[^Kurtz2008]: Kurtz S, Narechania A, Stein JC, Ware D. A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics. 2008;9:517.