Configure UMI settings

By default, the pipeline assumes your reads do not contain a UMI (unique molecular identifier); that is, the umi_len parameter is set to 0. See the pipeline usage documentation for additional parameter settings.

Specify a UMI length

There are three approaches for specifying the umi_len parameter for your samples.

1: Pass the --umi-len parameter at the command line

If you're running PEPPRO at the command line for a single sample, you may specify the UMI length using the --umi-len argument.
For example:

./pipelines/peppro.py \
  --sample-name test \
  --genome hg38 \
  --input examples/data/test_r1.fq.gz \
  --single-or-paired single \
  --umi-len 8 \
  -O $HOME/peppro_example/
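
To illustrate how a hyphenated flag such as --umi-len relates to a parameter named umi_len, here is a minimal sketch using Python's argparse. It is not PEPPRO's actual argument parser: the parser name peppro_sketch and the help text are assumptions made for illustration, and only the default of 0 comes from the documentation above.

```python
import argparse

# Hypothetical stand-in parser; PEPPRO's real parser defines many more options.
parser = argparse.ArgumentParser(prog="peppro_sketch")
parser.add_argument(
    "--umi-len",
    type=int,
    default=0,  # mirrors the documented default: no UMI
    help="UMI length in bases; 0 means no UMI",
)

# argparse stores the value of "--umi-len" under the attribute "umi_len".
no_umi = parser.parse_args([])
with_umi = parser.parse_args(["--umi-len", "8"])

print(no_umi.umi_len)
print(with_umi.umi_len)
```

Omitting the flag leaves the parameter at its default, which is why a run without --umi-len is treated as having no UMI.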

2: Pass the --umi-len parameter to the pipeline using looper

If you're running PEPPRO with looper, you can also pass any number of additional arguments to looper that will be automatically passed to the pipeline.
For example:

looper run examples/meta/peppro_test.yaml -d \
  --package slurm \
  --umi-len 8

In this case, looper will automatically pass the --umi-len 8 argument to the pipeline for every sample in the peppro_test.yaml project.

3: Specify a --umi-len argument in the project configuration file

If you're using looper and want to set --umi-len for individual samples, you can do so by customizing the project configuration and annotation files. For a real-life example, see the peppro_paper.yaml and peppro_paper.csv project files.

Below we'll go over two examples of customization in the project configuration files.

1: Set a universal --umi-len in the project configuration file

name: test

pep_version: 2.0.0
sample_table: "peppro_test.csv"

looper:
  output_dir: "$PROCESSED/peppro/peppro_test/"
  pipeline_interfaces: ["$CODE/peppro/project_pipeline_interface.yaml"]

sample_modifiers:
  append:
    pipeline_interfaces: ["$CODE/peppro/sample_pipeline_interface.yaml"]
    #prioritize: null # Default is FALSE. Pass flag to prioritize features by the order they appear in the feat_annotation asset when calculating FRiF/PRiF
    #sob: null        # Default is FALSE. Pass flag to use seqOutBias for signal track generation and to incorporate mappability
    #no_scale: null   # Default is FALSE. Pass flag to not scale signal tracks
    #coverage: null   # Default is FALSE. Pass flag to use coverage when producing library complexity plots.
    #keep: null       # Default is FALSE. Pass flag to keep prealignment BAM files.
    #noFIFO: null     # Default is FALSE. Pass flag to NOT use named pipes during prealignments.
    #complexity: null # Default is TRUE.  Pass flag to disable library complexity calculation. Faster.
  derive:
    attributes: [read1]
    sources:
        R1: "$CODE/peppro/examples/data/{sample_name}_r1.fq.gz"
  imply:
    - if:
        organism: ["human", "Homo sapiens", "Human", "Homo_sapiens"]
      then:
        genome: hg38
        prealignments: human_rDNA
        adapter: cutadapt  # Default
        dedup: seqkit      # Default
        trimmer: seqtk     # Default
        protocol: pro      # Default
        umi_len: 8         # Custom --umi-len that will be passed to **all** samples
        max_len: -1        # Default
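
The idea behind the imply block above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used by looper or the PEP tooling: the function apply_imply and the dictionary layout are hypothetical, and the real tools may handle edge cases differently.

```python
# Simplified illustration of an "imply" rule; not the PEP tooling's own code.
IMPLY_RULES = [
    {
        "if": {"organism": ["human", "Homo sapiens", "Human", "Homo_sapiens"]},
        "then": {"genome": "hg38", "prealignments": "human_rDNA", "umi_len": 8},
    }
]


def apply_imply(sample, rules):
    """Return a copy of sample with the attributes of every matching rule set."""
    result = dict(sample)
    for rule in rules:
        # A rule matches when each listed attribute has one of the allowed values.
        matches = all(
            result.get(attr) in allowed for attr, allowed in rule["if"].items()
        )
        if matches:
            result.update(rule["then"])
    return result


human = apply_imply({"sample_name": "test", "organism": "human"}, IMPLY_RULES)
mouse = apply_imply({"sample_name": "other", "organism": "mouse"}, IMPLY_RULES)

print(human.get("umi_len"))
print(mouse.get("umi_len"))
```

In this sketch, every sample whose organism matches the rule receives the same umi_len, which is why this approach suits projects where all samples share one UMI length.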

2: Set custom --umi-len arguments for individual samples

What if your samples don't all use the same UMI length? This takes a little more setup. The samples from our paper do exactly this, and you can consult those configuration files for complete detail. Here we'll highlight the relevant components.

In our paper's project annotation file, we can include any column we choose. In this case, we simply name a column umi_len and set the UMI length on a sample-by-sample basis. Note that the column name contains an underscore (umi_len), whereas the command-line flag uses a hyphen (--umi-len).

Here's a snippet of the relevant portion of the annotation file (the umi_len column):

umi_len
6
6
8
8
0
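
As a rough sketch of how a per-sample umi_len column could translate into a per-sample --umi-len argument, the Python below reads a small annotation table and builds the flag for each row. In practice looper does this translation for you; the sample names and the helper umi_args are hypothetical, and only the umi_len values are taken from the snippet above.

```python
import csv
import io

# Hypothetical annotation table; the sample names are invented for illustration.
ANNOTATION_CSV = """sample_name,umi_len
sampleA,6
sampleB,6
sampleC,8
sampleD,8
sampleE,0
"""


def umi_args(row):
    """Build the command-line fragment for one annotation row."""
    # The column is named umi_len (underscore); the flag is --umi-len (hyphen).
    return ["--umi-len", str(int(row["umi_len"]))]


rows = list(csv.DictReader(io.StringIO(ANNOTATION_CSV)))
per_sample = {row["sample_name"]: umi_args(row) for row in rows}

for name, args in per_sample.items():
    print(name, " ".join(args))
```

A sample with umi_len set to 0 is treated the same as one run with the default, that is, as having no UMI.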