The Easiest Way to Calculate N50 for a Genome Assembly

This short tutorial demonstrates how to use Python to compute N50 for a genome assembly. What is N50? How do you calculate it? Is a larger N50 better? These questions are answered below.

1. What is N50?

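In genomics, N50 is a metric that summarizes the contiguity of an assembled genome (its contigs or scaffolds). It is the length of the shortest contig among the longest contigs that together contain at least half of the total assembled bases, so half of the assembly lies in contigs of that length or longer.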
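
As a worked illustration, the usual way to find N50 by hand is to sort the contigs from longest to shortest and add up their lengths until the running sum reaches half of the total; the length of the contig at which that happens is the N50. The sketch below follows those steps on a made-up list of five contig lengths. The function name `n50_by_cumulative_sum` and the numbers are assumptions chosen for illustration, not part of any tool.

```python
def n50_by_cumulative_sum(lengths):
    """Return N50 by walking the contigs from longest to shortest."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if running * 2 >= total:
            return length
    raise ValueError("lengths must not be empty")


# Made-up contig lengths: total = 20, so half = 10.
toy_lengths = [2, 3, 4, 5, 6]
# 6 covers 6 bases; 6 + 5 covers 11, which reaches 10, so N50 is 5.
toy_n50 = n50_by_cumulative_sum(toy_lengths)
print(toy_n50)  # 5
```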

2. How to calculate N50

Below is a simple Python script that computes the N50 for a list of contig lengths.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

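def calculate_N50(list_of_lengths):
    """Calculate N50 for a sequence of contig lengths.

    N50 is computed as the median of a list in which every contig
    length appears once per base of that contig.

    Args:
        list_of_lengths (list): Contig lengths as positive integers.

    Returns:
        int or float: N50 value (a float when the two middle values
        of an even-sized list are averaged).

    """
    # Repeat each length L once per base: count(L) * L copies in total.
    # Memory grows with the total assembly size, so this suits small inputs.
    tmp = []
    for tmp_number in set(list_of_lengths):
        tmp += [tmp_number] * list_of_lengths.count(tmp_number) * tmp_number
    tmp.sort()

    # The median of the expanded list is the N50.
    middle = len(tmp) // 2
    if len(tmp) % 2 == 0:
        median = (tmp[middle - 1] + tmp[middle]) / 2
    else:
        median = tmp[middle]

    return median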
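
The function expects a plain list of contig lengths. One way to build that list from an assembly is sketched below, assuming the contigs are stored in a FASTA file. The helper name `contig_lengths_from_fasta` and the file name `contigs.fasta` are hypothetical, and the parser only handles basic FASTA (header lines starting with `>`, sequences possibly wrapped over several lines).

```python
def contig_lengths_from_fasta(path):
    """Return the length of every sequence in a FASTA file, in file order."""
    lengths = []
    current = None  # bases counted for the current record; None before the first header
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if current is not None:
                    lengths.append(current)
                current = 0
            elif line and current is not None:
                current += len(line)
    # Close the last record in the file.
    if current is not None:
        lengths.append(current)
    return lengths


# Example call (hypothetical file name):
# lengths = contig_lengths_from_fasta("contigs.fasta")
```

The returned list can be passed straight to `calculate_N50`.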

In Scaffold_builder, a tool that I published in graduate school, N50 was used as one of the metrics to compare the assembled contigs with the scaffolded contigs.

3. Is a larger N50 better?

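Generally, yes: for assemblies of the same genome, a larger N50 means a more contiguous assembly. When comparing different assemblers or k-mer settings in SPAdes, you usually want the settings that produce the highest N50. Keep in mind that N50 measures contiguity only, so incorrectly joined contigs can inflate it.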
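
A comparison like this is easy to script. The sketch below uses made-up contig lengths for three hypothetical assembly runs with different k-mer sizes and picks the run with the highest N50. The run names and numbers are invented for illustration and are not real results; `n50` is a compact helper written for this example.

```python
def n50(lengths):
    """Contig length at which the longest-first running sum reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must not be empty")


# Hypothetical contig lengths per run; real values would come from each assembly.
runs = {
    "k21": [100, 200, 300, 400],
    "k33": [150, 250, 600],
    "k55": [50, 50, 900],
}

n50_per_run = {name: n50(lengths) for name, lengths in runs.items()}
best_run = max(n50_per_run, key=n50_per_run.get)
print(n50_per_run, best_run)
```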

Conclusion

In summary, in this tutorial you learned what N50 is and how to use Python to compute it.
