One of the most common tasks in bioinformatics data analysis is converting a FASTQ file into FASTA. Unfortunately, some programs only accept input in FASTA format (e.g., BLAST).
Stop and think about this: how many times have you needed to convert a FASTQ file into FASTA format? And how many times did the solution you found require a third-party tool? Sadly, the answer to both questions is probably “many times”.
Here, I present a single-line solution that does not require any third-party tool to accomplish the file conversion. Furthermore, I present a simple way to handle FASTA and FASTQ files using Pysam.
1. Why Would You Want to Convert FASTQ to FASTA?
There are several reasons why one might want to convert FASTQ to FASTA format:
- Compatibility: Some bioinformatics tools and software only accept sequences in FASTA format, and do not support FASTQ.
- File size: FASTA files are usually smaller in size compared to FASTQ files, making them easier to store and transfer.
- Sequence analysis: FASTA files only contain the sequence information, while FASTQ files contain additional quality scores. If the quality scores are not needed for a particular analysis, converting to FASTA can simplify the process.
- Ease of use: FASTA format is easier to read and interpret, especially for someone who is new to bioinformatics.
In summary, converting FASTQ to FASTA is a common step in bioinformatics workflows to ensure compatibility with other tools, reduce file size, simplify sequence analysis, and make the data easier to use.
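To make the difference between the two formats concrete, here is a minimal Python sketch. The record and the helper name `fastq_record_to_fasta` are made up for illustration, and the sketch assumes a strict four-line FASTQ record (header, sequence, separator, quality string).

```python
# A made-up FASTQ record: header, sequence, separator, quality string.
fastq_record = "@read_1 example\nACGTACGTAC\n+\nIIIIIHHHGG\n"


def fastq_record_to_fasta(record: str) -> str:
    """Convert one four-line FASTQ record into a two-line FASTA record."""
    header, sequence, _separator, _quality = record.rstrip("\n").split("\n")
    if not header.startswith("@"):
        raise ValueError("a FASTQ header must start with '@'")
    # FASTA keeps only the header (with '>' instead of '@') and the sequence.
    return f">{header[1:]}\n{sequence}\n"


fasta_record = fastq_record_to_fasta(fastq_record)
print(fasta_record, end="")
print(len(fastq_record), "characters as FASTQ,", len(fasta_record), "as FASTA")
```

Dropping the separator and quality lines is what makes the FASTA version shorter and easier to read.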
2. FASTQ to FASTA Conversion (also, fastq.gz to fasta)
First, here are two ways to convert a FASTQ file (plain or gzip-compressed) into FASTA using bash:
2.1. FASTQ Compressed in gz to FASTA
gunzip decompresses the FASTQ file, and awk keeps the header and sequence lines of each four-line record to produce FASTA:
$ gunzip -c INPUT.fastq.gz | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > OUTPUT.fasta
2.2. FASTQ to FASTA
cat streams the FASTQ file to awk, which parses the FASTQ into FASTA:
$ cat INPUT.fastq | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > OUTPUT.fasta
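If you would rather stay in Python, the same logic as the awk one-liners can be sketched with the standard library only. This is a minimal sketch rather than a definitive implementation: the function names are my own, and, like the awk version, it assumes every record spans exactly four lines.

```python
import gzip
from typing import Iterable, Iterator, TextIO


def fastq_lines_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines from FASTQ lines, assuming strict four-line records."""
    for index, line in enumerate(lines):
        position = index % 4
        if position == 0:
            # Header line: drop the first character ('@') and prepend '>'.
            yield ">" + line.rstrip("\n")[1:] + "\n"
        elif position == 1:
            # Sequence line: copy it through unchanged.
            yield line.rstrip("\n") + "\n"
        # Positions 2 and 3 (separator and quality string) are skipped.


def open_text(path: str) -> TextIO:
    """Open a plain or gzip-compressed file in text mode, based on its name."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def convert_fastq_to_fasta(fastq_path: str, fasta_path: str) -> None:
    """Convert a FASTQ file (optionally .gz) into an uncompressed FASTA file."""
    with open_text(fastq_path) as source, open(fasta_path, "w") as target:
        target.writelines(fastq_lines_to_fasta(source))
```

For example, `convert_fastq_to_fasta("INPUT.fastq.gz", "OUTPUT.fasta")` should give the same result as the first command above.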
3. Fast FASTQ to FASTA Conversion
We know that FASTQ files can get very large, even after compression. If your FASTQ file is large (anything above 500 MB), you should consider installing seqtk, which is available through bioconda, and running the conversion with it:
$ seqtk seq -a INPUT.fq.gz > OUTPUT.fa
4. Using Pysam to Read FASTA and FASTQ files
Next, I would like to share some code used to read a FASTQ or FASTA file in Python. Sometimes we need to process these files within a Python script. Here, I use a third-party tool (sorry!) – Pysam can handle it for you. See the Python function below:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from pysam import FastxFile


def read_fasta_q_file(fasta_q_file):
    """Parse FASTA/Q file using `pysam.FastxFile`.

    Args:
        fasta_q_file (str): Path to FASTA/Q file.
    """
    with FastxFile(fasta_q_file) as fh:
        for entry in fh:
            sequence_id = entry.name  # identifier of the current record
            sequence = entry.sequence  # DNA/protein sequence of the record
            # Process sequence_id and sequence here.
In the code above, the variables sequence_id and sequence hold the ID and the DNA/protein sequence of each record in your FASTA/FASTQ file, so you can process the target file one record at a time.
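As one example of what that processing could look like, the sketch below writes each record back out as FASTA. The helper `write_fasta` is hypothetical (it is not part of Pysam); it only assumes that each entry exposes `.name` and `.sequence`, as the entries yielded by `FastxFile` do.

```python
import sys
from typing import Iterable, TextIO


def write_fasta(entries: Iterable, handle: TextIO = sys.stdout) -> int:
    """Write entries exposing .name and .sequence as FASTA; return the count."""
    count = 0
    for entry in entries:
        handle.write(f">{entry.name}\n{entry.sequence}\n")
        count += 1
    return count


# Intended use with Pysam (not executed here, requires pysam to be installed):
#
#     from pysam import FastxFile
#     with FastxFile("INPUT.fastq.gz") as fh, open("OUTPUT.fasta", "w") as out:
#         write_fasta(fh, out)
```

Note that `entry.name` holds only the identifier; Pysam stores any text after the first whitespace of the header separately in `entry.comment`, so the headers written here can be shorter than those produced by the awk one-liners.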
In conclusion, I hope this single-line solution to convert FASTQ to FASTA was helpful to you, and that you consider using Pysam to read FASTA/FASTQ files.
Please don’t hesitate to ask questions below!
5. More Resources
Here are three of my favorite Python bioinformatics books in case you want to learn more.
- Python for the Life Sciences: A Gentle Introduction to Python for Life Scientists by Alexander Lancaster
- Bioinformatics with Python Cookbook by Tiago Antao
- Bioinformatics Programming Using Python: Practical Programming for Biological Data by Mitchell L. Model