Easy Conversion FASTQ to FASTA

Bioinformatics

One of the most common tasks in bioinformatics data analysis is converting a FASTQ file into FASTA, because some programs (e.g., BLAST) only accept input in FASTA format.

Stop and think about this: How many times have you tried to convert a FASTQ file into FASTA format? How many times have you found a solution, only to see that it requires a third-party tool? Sadly, the answer to both questions is probably “many times”.

Here, I present a single-line solution that does not require any third-party tool to accomplish the file conversion. Furthermore, I present a simple way to handle FASTA and FASTQ files using Pysam.

1. FASTQ to FASTA Conversion

First and foremost, here are two ways to convert your FASTQ (or gzip-compressed FASTQ) into FASTA using bash:

1.1. FASTQ Compressed in gz to FASTA

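gunzip -c decompresses the FASTQ file to standard output, and awk turns each FASTQ record into a FASTA record. The one-liner assumes every record spans exactly four lines (header, sequence, +, qualities): line 1 becomes the FASTA header with the leading @ replaced by >, line 2 is printed as the sequence, and lines 3 and 4 are dropped: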

$ gunzip -c INPUT.fastq.gz | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > OUTPUT.fasta

1.2. FASTQ to FASTA

cat streams the uncompressed FASTQ file to awk, which parses it into FASTA with the same program:

$ cat INPUT.fastq | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > OUTPUT.fasta
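For readers who want to see the awk logic spelled out, here is a minimal Python sketch of the same idea that uses only the standard library. The helper names `open_text` and `fastq_to_fasta` and the two example reads are made up for illustration, and, like the one-liners above, the sketch assumes every FASTQ record spans exactly four lines.

```python
import gzip
import io


def open_text(path):
    """Open a plain or gzip-compressed file in text mode (hypothetical helper)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def fastq_to_fasta(fastq_lines, fasta_out):
    """Convert FASTQ lines to FASTA, assuming exactly four lines per record.

    Mirrors the awk program: the first line of each record becomes the
    FASTA header ('@' replaced by '>'), the second line is the sequence,
    and the '+' and quality lines are skipped.
    Returns the number of records written.
    """
    records = 0
    for index, line in enumerate(fastq_lines):
        position = index % 4  # 0: header, 1: sequence, 2: '+', 3: qualities
        line = line.rstrip("\r\n")
        if position == 0:
            fasta_out.write(">" + line[1:] + "\n")
            records += 1
        elif position == 1:
            fasta_out.write(line + "\n")
    return records


# Small in-memory demonstration with two made-up reads.
example = "@read1 sample\nACGT\n+\nIIII\n@read2\nGGCC\n+\n!!!!\n"
output = io.StringIO()
n_records = fastq_to_fasta(io.StringIO(example), output)
print(output.getvalue(), end="")
```

To convert a real file, something like `with open_text("INPUT.fastq.gz") as src, open("OUTPUT.fasta", "w") as dst: fastq_to_fasta(src, dst)` should work for both compressed and uncompressed input.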

2. Using Pysam to Read FASTA and FASTQ files

Next, I would like to share some code for reading a FASTQ or FASTA file in Python, since sometimes we need to process these files within a Python script. Here, I do use a third-party tool (sorry!): Pysam can handle the parsing for you. See the Python function below:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from pysam import FastxFile


def read_fasta_q_file(fasta_q_file):
    """Parse a FASTA/Q file using `pysam.FastxFile`.

    Args:
        fasta_q_file (str): Path to the FASTA/Q file.

    Yields:
        tuple: (sequence_id, sequence) for each record.
    """
    with FastxFile(fasta_q_file) as fh:
        for entry in fh:
            sequence_id = entry.name
            sequence = entry.sequence
            # Hand each record back to the caller for processing.
            yield sequence_id, sequence

In the code above, the variables sequence_id and sequence hold the ID and the DNA/protein sequence of each record in your FASTA/FASTQ file, so you can process the target file one record at a time.
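As a sketch of how those two attributes might be put to work, the snippet below redoes the FASTQ to FASTA conversion from section 1 with Pysam. The helper names `write_fasta` and `fastx_to_fasta` are hypothetical, and the demo records are made-up stand-ins that only mimic the `name` and `sequence` attributes of Pysam entries. The Pysam import sits inside the function so the writer can be tried on in-memory records even without Pysam installed.

```python
import io
from types import SimpleNamespace


def write_fasta(entries, fasta_out):
    """Write records exposing `.name` and `.sequence` as FASTA; return the count."""
    count = 0
    for entry in entries:
        fasta_out.write(f">{entry.name}\n{entry.sequence}\n")
        count += 1
    return count


def fastx_to_fasta(in_path, out_path):
    """Convert a FASTA/FASTQ file to FASTA using `pysam.FastxFile`."""
    # Imported here so write_fasta can be tried without Pysam installed.
    from pysam import FastxFile

    with FastxFile(in_path) as fh, open(out_path, "w") as fasta_out:
        return write_fasta(fh, fasta_out)


# Stand-in records (made up) that mimic the attributes Pysam entries expose.
demo_records = [
    SimpleNamespace(name="read1", sequence="ACGT"),
    SimpleNamespace(name="read2", sequence="GGCC"),
]
demo_out = io.StringIO()
demo_count = write_fasta(demo_records, demo_out)
print(demo_out.getvalue(), end="")
```

With Pysam installed, `fastx_to_fasta("INPUT.fastq", "OUTPUT.fasta")` would run the same loop on a real file. One difference from the awk version is worth knowing: `entry.name` holds only the part of the header before the first whitespace (the rest is in `entry.comment`), so any description after the read ID is dropped unless you add it back.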

In conclusion, I hope this single-line solution to convert FASTQ to FASTA was helpful to you, and that you consider using Pysam to read FASTA/FASTQ files.

Please don’t hesitate to ask questions below!

