Have you ever needed to clean a FASTA/FASTQ file? It is a common task. Back in the day, I found a Python 2 script that removes duplicate sequences, short sequences, and sequences with too many N’s. Analyzing poor data costs CPU time, and interpreting the results from poor data costs people time, so preprocessing is always worthwhile.
The script was not very well architected and depended on Biopython. Due to numerous requests, the creator of the original script decided to rewrite the tool: drop Biopython for Pysam, move to Python 3, and add new features.
This blog post explains how to use the new version of the script: understanding the new features, installing it, and running a small example.
New Features
Pysam rather than Biopython and FASTQ now allowed as Input
I love Pysam's FASTA and FASTQ reading functions. Luckily, the new version of the script drops Biopython and uses Pysam to read FASTQ and FASTA files.
Remove reverse complement duplicates
The new version of the program also removes duplicate sequences that match as reverse complements. For example, the two sequences in the FASTA below will be combined into one, because TTTT is the reverse complement of AAAA.
>sequence_1
AAAA
>sequence_2
TTTT
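To see why these two records collapse, here is a minimal sketch of reverse-complement-aware matching. The helper names `reverse_complement` and `canonical_key` are hypothetical and are not taken from the tool's source; the idea is only that a sequence and its reverse complement map to the same key.

```python
# Hypothetical helpers for illustration; not the tool's actual code.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def canonical_key(sequence: str) -> str:
    """Return one shared key for a sequence and its reverse complement."""
    forward = sequence.upper()
    # Picking the lexicographically smaller strand is an arbitrary but
    # deterministic choice, so both strands end up with the same key.
    return min(forward, reverse_complement(forward))


records = {"sequence_1": "AAAA", "sequence_2": "TTTT"}
keys = {name: canonical_key(seq) for name, seq in records.items()}
print(keys)  # both records share the key 'AAAA'
```

Two records with the same key can then be treated as duplicates, whichever strand they were written on.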
GitHub Hosting
The program code now lives on GitHub. This should make it easy to clone and install, and it should also help with reporting bugs, adding new features, versioning, etc.
Cleaning Statistics
The script reports some basic cleaning statistics for each file: number of sequences processed, number of repeated sequences, number of repeated sequences (reverse complement), number of short sequences, and number of high-N sequences.
Installing sequence_cleaner
Assuming that you have Git and SSH access to GitHub configured on your computer, execute the following steps, which should also install the dependencies:
# clone Sequence-Cleaner
git clone git@github.com:metageni/Sequence-Cleaner.git
# install Sequence-Cleaner
cd Sequence-Cleaner && python setup.py install
Usage
usage: sequence_cleaner [-h] [-v] -q QUERY -o OUTPUT_DIRECTORY
[-ml MINIMUM_LENGTH] [-mn PERCENTAGE_N] [-l LOG]
Sequence Cleaner: Remove Duplicate Sequences, etc
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-q QUERY, --query QUERY
Path to directory with FAST(A/Q) files
-o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
Path to output files
-ml MINIMUM_LENGTH, --minimum_length MINIMUM_LENGTH
Minimum length allowed (default=0 - allows all the
lengths)
-mn PERCENTAGE_N, --percentage_n PERCENTAGE_N
Percentage of N is allowed (default=100)
-l LOG, --log LOG Path to log file (Default: STDOUT).
example > sequence_cleaner -q INPUT/ -o OUTPUT/
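The two filtering options can be pictured with a small sketch. This is not the tool's implementation: the function names are hypothetical, and whether the real thresholds are inclusive or exclusive is an assumption here (a sequence is kept when its length is at least `minimum_length` and its percentage of N is at most `percentage_n`).

```python
def percentage_of_n(sequence: str) -> float:
    """Return the percentage (0-100) of N bases in a sequence."""
    if not sequence:
        return 0.0
    return 100.0 * sequence.upper().count("N") / len(sequence)


def passes_filters(sequence: str, minimum_length: int = 0,
                   percentage_n: float = 100.0) -> bool:
    """Assumed semantics: keep if long enough and not too N-rich."""
    if len(sequence) < minimum_length:
        return False
    return percentage_of_n(sequence) <= percentage_n


print(passes_filters("A", minimum_length=2))      # False: too short
print(passes_filters("ATGNNN", percentage_n=10))  # False: 50% N
print(passes_filters("ATGATG", minimum_length=2, percentage_n=10))  # True
```

With the documented defaults (minimum length 0, percentage of N 100), every sequence passes both checks, so only duplicate removal has any effect.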
Running the tool
Now that you know everything about the tool, we can run it on the FASTA below:
>sequence_1
AAAA
>sequence_2
ATGATG
>sequence_3
TGATGATGA
>sequence_4
TTTT
>sequence_5
AAAA
>sequence_6
A
In the command below, {INPUT/} is a directory that contains the FASTA above, and {OUTPUT_DIR/} is the directory where the cleaned FASTA will be written.
sequence_cleaner -q {INPUT/} -o {OUTPUT_DIR/} -ml 2
Last but not least, we get the output file below. As we can see:
- sequence_6 was removed because of its length
- sequence_1, sequence_4_RC, and sequence_5 were clustered as one because they are identical. sequence_4 was a reverse-complement repeat of sequence_1 and sequence_5
- sequence_2 and sequence_3 are unique sequences
>sequence_1__sequence_4_RC__sequence_5
AAAA
>sequence_2
ATGATG
>sequence_3
TGATGATGA
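The result above can be reproduced with a short sketch. It is an illustration under assumptions, not the tool's code: `clean` is a hypothetical function, the `_RC` suffix and `__` separator are copied from the output shown above, and the rule that a cluster keeps the first-seen sequence is inferred from this one example.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence):
    """Return the reverse complement of an upper-case DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def clean(records, minimum_length=0):
    """Cluster exact and reverse-complement duplicates.

    records: list of (name, sequence) tuples.
    Returns (cleaned, stats), where cleaned is a list of (name, sequence).
    """
    stats = Counter()
    clusters = {}  # kept sequence -> names of its members, in input order
    for name, sequence in records:
        stats["processed"] += 1
        sequence = sequence.upper()
        if len(sequence) < minimum_length:
            stats["short"] += 1
        elif sequence in clusters:
            stats["repeated"] += 1
            clusters[sequence].append(name)
        elif reverse_complement(sequence) in clusters:
            stats["repeated_reverse_complement"] += 1
            clusters[reverse_complement(sequence)].append(name + "_RC")
        else:
            clusters[sequence] = [name]
    cleaned = [("__".join(names), seq) for seq, names in clusters.items()]
    return cleaned, dict(stats)


records = [
    ("sequence_1", "AAAA"), ("sequence_2", "ATGATG"),
    ("sequence_3", "TGATGATGA"), ("sequence_4", "TTTT"),
    ("sequence_5", "AAAA"), ("sequence_6", "A"),
]
cleaned, stats = clean(records, minimum_length=2)
for name, sequence in cleaned:
    print(f">{name}\n{sequence}")
print(stats)
```

Run on the six example records with a minimum length of 2, the sketch prints the same three records as the output file, and `stats` holds one count for each category the tool reports except the high-N count, which this sketch does not track.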
I hope you enjoyed this tutorial and that it helps you clean your FASTA/FASTQ files. If there are features you would like to see in the tool, please leave a comment below.
More Resources
Here are three of my favorite Python Bioinformatics Books in case you want to learn more about it.
- Python for the Life Sciences: A Gentle Introduction to Python for Life Scientists by Alexander Lancaster
- Bioinformatics with Python Cookbook by Tiago Antao
- Bioinformatics Programming Using Python: Practical Programming for Biological Data by Mitchell L. Model