Masking Low Complexity Regions – Step-by-Step

Bioinformatics

This tutorial teaches how to mask low complexity regions in a FASTA file using BBMap.

1. Why Is It Important to Mask Low Complexity Regions?

Masking low complexity regions is important in bioinformatics and computational biology because these regions can cause false positives in sequence alignment, annotation, and homology searches. Low complexity regions are stretches of DNA, RNA, or protein sequence made up of short repeats or a heavily biased composition of nucleotides or amino acids, such as homopolymer runs (AAAAAAAA) or simple repeats (ATATATAT). Because such stretches occur in many unrelated sequences, they align readily by chance, which produces spurious hits and over-represents the aligned region in the final output.

Masking these regions reduces their impact and improves the accuracy of sequence analysis tools. Masked regions are typically written as N characters in nucleotide sequences or X characters in protein sequences, which tells downstream tools to ignore them. This reduces noise in the data, so researchers get more accurate results and can draw better inferences about the underlying biology.
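To make the representation concrete, here is a minimal sketch that does not use BBMap at all. It assumes a hypothetical file, example_softmasked.fa, in which a region has already been flagged in lowercase (some tools mark masked bases this way instead of replacing them). The helper names hard_mask and masked_fraction are made up for illustration: the first rewrites lowercase bases as N, and the second reports what fraction of the sequence is masked.

```shell
# Hypothetical example file: the lowercase run stands in for a flagged low complexity region.
printf '>seq1\nACGTACGTaaaaaaaaACGT\n' > example_softmasked.fa

# Replace lowercase bases with N on sequence lines only (header lines start with '>').
hard_mask() {
  sed '/^>/!s/[acgtn]/N/g' "$1"
}

# Print the fraction of N bases across all sequence lines of a FASTA file.
masked_fraction() {
  awk '!/^>/ { total += length($0); masked += gsub(/[Nn]/, "") }
       END { if (total > 0) printf "%.3f\n", masked / total; else print "0.000" }' "$1"
}

hard_mask example_softmasked.fa > example_hardmasked.fa
cat example_hardmasked.fa                # sequence line becomes ACGTACGTNNNNNNNNACGT
masked_fraction example_hardmasked.fa    # prints 0.400 (8 of 20 bases)
```

The same masked_fraction helper can be pointed at any nucleotide FASTA file to check how much of it a masking tool has replaced with N.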

2. How to Mask Low Complexity Regions in a FASTA File

First, we need to install BBMap, which includes bbmask.sh, the script that masks low complexity regions.

I recommend installing it with Bioconda using the command below:

$ conda install -c bioconda bbmap

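Now that we have bbmask.sh, we can use it to mask a FASTA file with an entropy threshold of 0.7. Entropy is measured on a scale from 0 to 1, and regions that score below the threshold are masked: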

$ bbmask.sh in=INPUT out=MASKED_OUTPUT entropy=0.7
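To build some intuition for what an entropy threshold means, the sketch below computes a normalized Shannon entropy over the k-mers of a single sequence. It is an illustration under stated assumptions, not BBMask's actual implementation: the function name kmer_entropy, the k-mer length of 5, and the normalization by the number of k-mers are all choices made here for the example.

```shell
# kmer_entropy SEQ [K]: normalized Shannon entropy of the K-mers in SEQ.
# 0 means a single repeated k-mer; 1 means every k-mer is different.
# Illustrative only; K=5 and the normalization are assumptions, not BBMask's exact method.
kmer_entropy() {
  awk -v seq="$1" -v k="${2:-5}" 'BEGIN {
    n = length(seq) - k + 1                  # number of k-mers in the sequence
    if (n < 2) { print "0.000"; exit }       # too short to measure
    for (i = 1; i <= n; i++) count[substr(seq, i, k)]++
    h = 0
    for (kmer in count) { p = count[kmer] / n; h -= p * log(p) }
    printf "%.3f\n", h / log(n)              # divide by the maximum possible entropy
  }'
}

kmer_entropy AAAAAAAAAAAAAAAAAAAA   # homopolymer run: prints 0.000
kmer_entropy ACGTTGCAAGCT           # all 5-mers distinct: prints 1.000
```

Under this simplified measure, a homopolymer or short tandem repeat scores near 0 and would fall under a 0.7 threshold, while a sequence with varied k-mers scores near 1 and would be left alone.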

3. More Resources