The Easiest Way to Calculate N50 for Genome Assembly

This short tutorial demonstrates how you can use Python to compute N50 for a genome assembly. What is N50? How do you calculate N50? Is a larger N50 better? These questions are answered here.

1. What is N50?

N50 is a statistical measure that is commonly used in genomics, transcriptomics, and other related fields. It is used to describe the size and quality of a set of sequences or contigs, which are typically generated by sequencing technologies.

In genomics, sequencing is the process of determining the exact order of nucleotides (adenine, guanine, cytosine, and thymine) in a DNA molecule. However, most sequencing technologies generate short reads or fragments of DNA, which need to be assembled into longer sequences or contigs.

The process of assembling short reads into longer sequences can be challenging and requires sophisticated algorithms and computational resources. During this process, errors and gaps may occur, resulting in a fragmented set of sequences or contigs. Therefore, researchers need a way to evaluate the quality of the assembly and determine the most informative sequences or contigs.

This is where the N50 value comes in. N50 is defined as the contig length such that contigs of that length or longer contain at least half of the total length of all contigs in the set. In other words, it is a length-weighted median: the median contig length when each contig is counted once for every base it contains.

For example, if a set of contigs has an N50 of 10,000 base pairs, it means that half of the total length of all the contigs is contained in contigs of 10,000 base pairs or longer. The remaining half of the length is spread across shorter contigs. Therefore, a higher N50 value generally indicates a more contiguous dataset with longer and more informative sequences or contigs.
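
To make the definition concrete, here is a minimal sketch that walks it step by step: sort the contigs from longest to shortest, add up their lengths, and stop at the first contig where the running total reaches half of the overall length. The function name `n50_by_cumulative_sum` and the toy lengths are made up for illustration only.

```python
def n50_by_cumulative_sum(contig_lengths):
    """Return the N50 of a list of positive contig lengths (illustrative sketch)."""
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    ordered = sorted(contig_lengths, reverse=True)  # longest contig first
    half_total = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        # The first contig that pushes the running total to half of the
        # overall length is the N50 contig.
        if running_total >= half_total:
            return length


toy_contigs = [100, 80, 60, 40, 20]  # made-up lengths, 300 bp in total
print(n50_by_cumulative_sum(toy_contigs))  # 100 + 80 = 180 >= 150, so this prints 80
```

This sketch never builds more than the sorted list of lengths, so it stays cheap even for large assemblies. Note that it can disagree with a strict median calculation when the running total lands exactly on half of the overall length, because a median may then average two neighbouring contig lengths.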

Researchers often use N50 to compare and evaluate different sequencing runs, assemblies, or datasets. It can help them identify the most informative and useful sequences or contigs, and optimize the sequencing or assembly process. However, N50 should not be the only metric used to assess the quality of a dataset, as other factors such as sequencing coverage, accuracy, and completeness also need to be considered.

In summary, N50 is a statistical measure used to describe the size and quality of a set of sequences or contigs in genomics and related fields. It provides a useful summary statistic for evaluating the most informative sequences or contigs and optimizing the sequencing or assembly process.

2. How to calculate N50

Below is a simple Python script that computes the N50 for a list of contig lengths.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

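def calculate_N50(list_of_lengths):
    """Calculate N50 for a list of contig lengths.

    Args:
        list_of_lengths (list): Contig lengths as positive integers.

    Returns:
        int or float: N50 value (a float when two middle values are averaged).

    """
    # Repeat each length once per base it contributes, so the median of the
    # expanded list is the length-weighted median, i.e. the N50.
    # Note: the expanded list has one entry per base in the assembly, so this
    # is only practical for small inputs.
    tmp = []
    for tmp_number in set(list_of_lengths):
        tmp += [tmp_number] * list_of_lengths.count(tmp_number) * tmp_number
    tmp.sort()

    middle = len(tmp) // 2
    if len(tmp) % 2 == 0:
        median = (tmp[middle - 1] + tmp[middle]) / 2
    else:
        median = tmp[middle]

    return median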
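
In practice the contig lengths usually come from a FASTA file rather than a hand-written list. The sketch below shows one way to collect them using only the standard library. It assumes a plain FASTA layout (a `>` header line followed by one or more sequence lines); the helper name `contig_lengths_from_fasta` and the toy records are hypothetical.

```python
import io


def contig_lengths_from_fasta(handle):
    """Return the length of every record in an open FASTA file handle."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)  # sequence may be wrapped over several lines
    if current is not None:
        lengths.append(current)
    return lengths


# Toy in-memory FASTA with two made-up contigs.
toy_fasta = io.StringIO(">contig_1\nACGTACGT\nACGT\n>contig_2\nACG\n")
print(contig_lengths_from_fasta(toy_fasta))  # prints [12, 3]
```

With a real assembly you would open the file instead of using `io.StringIO` and pass the resulting list to `calculate_N50`.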

In Scaffold_builder, a tool I published in graduate school, N50 was used as one of the metrics to compare the assembled contigs with the scaffolded contigs.

3. Is a larger N50 better?

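Generally, yes. When comparing different assemblers or k-mer settings in SPAdes, you usually want the settings that generate the highest N50, as long as the other quality factors mentioned above (coverage, accuracy, and completeness) do not get worse.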
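
As a rough illustration of such a comparison, the sketch below computes the N50 of several candidate assemblies and reports the one with the highest value. Everything here is hypothetical: the labels `k21`, `k33`, and `k55` merely stand in for different k-mer settings, the lengths are toy numbers rather than real SPAdes output, and `n50` and `best_by_n50` are illustrative helper names.

```python
from itertools import accumulate


def n50(lengths):
    """N50 via a running total over contigs sorted longest first."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    for length, running in zip(ordered, accumulate(ordered)):
        if running >= half:
            return length
    raise ValueError("no contig lengths given")


def best_by_n50(assemblies):
    """Return the name of the assembly with the highest N50."""
    return max(assemblies, key=lambda name: n50(assemblies[name]))


# Made-up contig lengths for three hypothetical k-mer settings.
toy_assemblies = {
    "k21": [50, 40, 30, 20, 10],
    "k33": [90, 30, 20, 10],
    "k55": [70, 60, 20],
}

for name, lengths in toy_assemblies.items():
    print(name, n50(lengths))      # k21 40, k33 90, k55 60
print(best_by_n50(toy_assemblies))  # prints k33
```

All three toy assemblies have the same total length, which keeps the comparison fair; assemblies with very different total lengths are harder to compare on N50 alone.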

Conclusion

In summary, in this tutorial you learned what N50 is and how to use Python to compute it.
