Remove Poor Reads in FASTQ/A

Have you ever needed to clean a FASTA/FASTQ file? I know it is a common task. Analyzing poor data takes CPU time, and interpreting the results from poor data takes people time, so preprocessing is always worthwhile. Back in the day, I found a Python 2 script that removes duplicate sequences, short sequences, and sequences with too many N’s.

The script was not very well architected and depended on Biopython. Due to numerous requests, the creator of the original script decided to rewrite the tool, replace Biopython with Pysam, move to Python 3, and add new features.

This blog post explains how to use the new version of the script: what the new features are, how to install it, and how to run a small example.

New Features

Pysam rather than Biopython and FASTQ now allowed as Input

I love Pysam's FASTA and FASTQ reading functions. Luckily, the new version of the script drops Biopython and uses Pysam to read both FASTQ and FASTA files.

Remove reverse complement duplicates

The new version of the program also treats reverse complement matches as duplicates. For example, the two sequences in the FASTA below are combined into one, because TTTT is the reverse complement of AAAA.

>sequence_1
AAAA
>sequence_2
TTTT
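
To make the idea concrete, here is a minimal sketch of a reverse complement check. The names `reverse_complement` and `is_duplicate` are hypothetical and only for illustration; this is not necessarily how the tool implements it, and the sketch assumes plain A/C/G/T/N bases.

```python
# Translation table for the complement of each base.
# Assumption: only A, C, G, T and N occur (upper or lower case).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    # Complement every base, then reverse the string.
    return sequence.translate(COMPLEMENT)[::-1]


def is_duplicate(candidate: str, seen: set) -> bool:
    """True if the candidate or its reverse complement was already seen."""
    return candidate in seen or reverse_complement(candidate) in seen


seen_sequences = {"AAAA"}
print(reverse_complement("TTTT"))            # AAAA
print(is_duplicate("TTTT", seen_sequences))  # True
```

Keeping the sequences already seen in a set makes each lookup cheap, so the check costs one extra lookup per read.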

GitHub Hosting

The program code now lives on GitHub, which makes it easy to clone and install. It also makes it easier to report bugs, add new features, and manage versions.

Cleaning Statistics

For each file, the script reports basic cleaning statistics: number of sequences processed, number of repeated sequences, number of repeated sequences (reverse complement), number of short sequences, and number of high-N sequences.
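
As a rough illustration of how such counters could be collected, here is a small sketch. The function `collect_stats`, the dictionary keys, and the order of the checks are all assumptions made for this example; the tool's real log format and logic may differ.

```python
# Assumption: sequences contain only upper-case A, C, G, T and N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def collect_stats(sequences, minimum_length=0, percentage_n=100.0):
    """Tally why each sequence would be kept or dropped (hypothetical helper)."""
    stats = {
        "processed": 0,
        "repeated": 0,
        "repeated_reverse_complement": 0,
        "short": 0,
        "high_n": 0,
    }
    seen = set()
    for sequence in sequences:
        stats["processed"] += 1
        # Percentage of N bases; an empty sequence counts as 0%.
        n_percent = 100.0 * sequence.count("N") / len(sequence) if sequence else 0.0
        if len(sequence) < minimum_length:
            stats["short"] += 1
        elif n_percent > percentage_n:
            stats["high_n"] += 1
        elif sequence in seen:
            stats["repeated"] += 1
        elif sequence.translate(COMPLEMENT)[::-1] in seen:
            stats["repeated_reverse_complement"] += 1
        else:
            seen.add(sequence)
    return stats


print(collect_stats(["AAAA", "TTTT", "AAAA", "A"], minimum_length=2))
```

Each sequence is counted in at most one removal category here, so the number of sequences kept is `processed` minus the other four counters.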

Installing sequence_cleaner

Assuming that you have Git configured for GitHub on your computer (the clone URL below uses SSH), execute the following steps, which should also install the dependencies:

# clone Sequence-Cleaner
git clone git@github.com:metageni/Sequence-Cleaner.git

# install Sequence-Cleaner
cd Sequence-Cleaner && python setup.py install

Usage

    usage: sequence_cleaner [-h] [-v] -q QUERY -o OUTPUT_DIRECTORY
                            [-ml MINIMUM_LENGTH] [-mn PERCENTAGE_N] [-l LOG]
    
    Sequence Cleaner: Remove Duplicate Sequences, etc
    
    optional arguments:
      -h, --help            show this help message and exit
      -v, --version         show program's version number and exit
      -q QUERY, --query QUERY
                            Path to directory with FAST(A/Q) files
      -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                            Path to output files
      -ml MINIMUM_LENGTH, --minimum_length MINIMUM_LENGTH
                            Minimum length allowed (default=0 - allows all the
                            lengths)
      -mn PERCENTAGE_N, --percentage_n PERCENTAGE_N
                            Percentage of N is allowed (default=100)
      -l LOG, --log LOG     Path to log file (Default: STDOUT).
    
    example > sequence_cleaner -q INPUT/ -o OUTPUT/
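
To show what the -ml and -mn thresholds mean in practice, here is a minimal sketch of the two filters. The helper names `n_percentage` and `passes_filters` are hypothetical, and the sketch assumes a sequence is kept when its N percentage is less than or equal to the threshold; the tool itself may treat the boundary differently.

```python
def n_percentage(sequence: str) -> float:
    """Percentage of bases in the sequence that are N (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    return 100.0 * sequence.upper().count("N") / len(sequence)


def passes_filters(sequence: str, minimum_length: int = 0, percentage_n: float = 100.0) -> bool:
    """Mimic -ml / -mn: keep sequences that are long enough and not too N-rich."""
    if len(sequence) < minimum_length:
        return False
    # Assumption: the threshold is inclusive.
    return n_percentage(sequence) <= percentage_n


print(passes_filters("A", minimum_length=2))     # False, too short
print(passes_filters("ANNT", percentage_n=25))   # False, 50% N
print(passes_filters("ANNT"))                    # True with the defaults
```

With the defaults (minimum length 0 and 100 percent N allowed), neither filter removes anything, which matches the defaults listed in the usage text.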

Running the tool

Now that you know everything about the tool, we can run it on the FASTA below:

>sequence_1
AAAA
>sequence_2
ATGATG
>sequence_3
TGATGATGA
>sequence_4
TTTT
>sequence_5
AAAA
>sequence_6
A

In the command below, {INPUT/} is the directory that contains the FASTA above and {OUTPUT_DIR/} is the directory where the cleaned FASTA will be written. The -ml 2 option sets the minimum allowed length to 2.

sequence_cleaner -q {INPUT/} -o {OUTPUT_DIR/} -ml 2

Last but not least, we get the following output file:

>sequence_1__sequence_4_RC__sequence_5
AAAA
>sequence_2
ATGATG
>sequence_3
TGATGATGA

As we can see:

  • sequence_6 was removed because of its length
  • sequence_1, sequence_4, and sequence_5 were clustered as one because they are identical; sequence_4 is the reverse complement of sequence_1 and sequence_5, which is why it appears as sequence_4_RC in the header
  • sequence_2 and sequence_3 are unique sequences
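
The clustering in this example can be reproduced with a short standalone sketch. It is a simplified re-implementation for illustration only, not the tool's actual code: `parse_fasta` and `clean` are hypothetical names, the N filter is left out, and the `__` separator and `_RC` suffix are taken from the output shown in this post.

```python
EXAMPLE_FASTA = """\
>sequence_1
AAAA
>sequence_2
ATGATG
>sequence_3
TGATGATGA
>sequence_4
TTTT
>sequence_5
AAAA
>sequence_6
A
"""

# Assumption: only upper-case A, C, G, T and N occur.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def parse_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def clean(records, minimum_length=0):
    """Drop short reads and merge exact and reverse complement duplicates."""
    clusters = {}  # first-seen sequence -> header names (insertion order kept)
    for name, sequence in records:
        if len(sequence) < minimum_length:
            continue
        rc = sequence.translate(COMPLEMENT)[::-1]
        if sequence in clusters:
            clusters[sequence].append(name)
        elif rc in clusters:
            clusters[rc].append(name + "_RC")
        else:
            clusters[sequence] = [name]
    return [("__".join(names), seq) for seq, names in clusters.items()]


cleaned = clean(parse_fasta(EXAMPLE_FASTA), minimum_length=2)
for header, seq in cleaned:
    print(">" + header)
    print(seq)
```

The first occurrence of a sequence becomes the cluster representative, so later duplicates only extend the header and never change the stored sequence.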

I hope you enjoyed this tutorial and that it helps you clean your FASTA/FASTQ files. If there are features you would like to see in the tool, please leave a comment below.
