Metagenomic Taxonomic Profile in Seconds



Would you believe me if I told you that you can get a metagenomic taxonomic profile in seconds on a regular computer? If you don’t believe it, I hope to show here that it is possible with a tool called FOCUS.

1. What is FOCUS?

FOCUS (Find Organisms by Composition USage) is a fast approach that reconstructs a taxonomic profile using the ensemble k-mer composition of the entire metagenome.

It computes the optimal set of organism abundances using non-negative least squares (NNLS), matching the metagenome's k-mer composition to organisms in a reference database. It then reports the focal organisms present in the metagenomic sample and profiles their abundances.

FOCUS was tested with simulated metagenomes and with over 250 GB of real metagenomes, and the results show that our approach accurately predicts the organisms present in microbial communities in seconds.
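To make the NNLS step concrete, here is a minimal sketch using scipy.optimize.nnls. The reference matrix, the organism names, and the simulated 70/30 mixture are toy values invented for illustration. They are not taken from FOCUS or its database, and the real tool works with far more k-mers and genomes.

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical reference database (toy values): rows are k-mers,
# columns are organisms, each column holding k-mer frequencies.
organisms = ["organism_A", "organism_B", "organism_C"]
reference = np.array([
    [0.4, 0.1, 0.1],
    [0.3, 0.2, 0.4],
    [0.2, 0.3, 0.4],
    [0.1, 0.4, 0.1],
])

# Simulate a metagenome made of 70% organism_A and 30% organism_B.
true_mix = np.array([0.7, 0.3, 0.0])
metagenome = reference @ true_mix

# Solve: minimize ||reference @ x - metagenome|| subject to x >= 0.
weights, residual = nnls(reference, metagenome)

# Rescale the non-negative weights so they sum to 1 (relative abundances).
profile = weights / weights.sum()

for name, abundance in zip(organisms, profile):
    print(f"{name}: {abundance:.2f}")
```

The non-negativity constraint is what makes the result readable as abundances: an organism can be absent (weight 0) but never present in a negative amount.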

2. Binning vs Profiling

In the metagenomics classification world, some tools do binning and others do profiling. Binning tools classify every read in the metagenomic data, putting each one into a bin (of course). Profiling tools, on the other hand, classify the metagenome as a whole community: they report how much of each taxon is present without having to classify every single read.

It is important to point out that if you have a binning result, you can get the community profile by computing the relative abundance of every bin. However, the other way around is not true: if you only have a profile, you can’t recover the binning.

Robert Edwards, my Ph.D. advisor, talks more about binning here:
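As a small illustration of the binning-to-profile direction, the sketch below turns per-read bin assignments into relative abundances. The read IDs, the taxon labels, and the function name profile_from_bins are made up for the example and do not correspond to the output format of any particular binning tool.

```python
from collections import Counter

def profile_from_bins(read_bins):
    """Turn {read_id: taxon} bin assignments into relative abundances."""
    counts = Counter(read_bins.values())
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}

# Hypothetical binning result: every read has been assigned to a taxon.
read_bins = {
    "read_1": "Escherichia",
    "read_2": "Escherichia",
    "read_3": "Bacteroides",
    "read_4": "Escherichia",
}

profile = profile_from_bins(read_bins)
print(profile)
```

The reverse is not possible because the profile only keeps the totals per taxon. Which read went into which bin is lost.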

3. Why is FOCUS so fast?

Now that you know what binning and profiling mean, I can tell you why FOCUS is so fast. FOCUS is a profiling tool which means we don’t need to classify every single read in the sample.

The tool is fast for two main reasons:

  • Counting k-mers in the metagenome is very fast with tools such as Jellyfish
  • NNLS in SciPy was written in Fortran, which makes the minimization step (finding the set of organisms in the database that best represents the query) very efficient.

These two things combined make the tool really fast because it does not need to bin read by read. Instead, all we need to do is find the best-represented genomes from the database in the query.
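To show what "counting k-mers" means for the whole sample, here is a pure-Python sketch. It is only a stand-in to illustrate the idea: FOCUS relies on Jellyfish for this step, which is much faster, and the reads, the value of k, and the function names below are invented for the example.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer across all reads, ignoring k-mers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts

def kmer_frequencies(counts):
    """Normalize raw counts so that the frequencies sum to 1."""
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

# Toy reads: the counts are pooled over the sample, not kept per read.
reads = ["ACGTAC", "GTACGT"]
counts = count_kmers(reads, k=3)
freqs = kmer_frequencies(counts)
print(freqs)
```

Note that the result is a single composition vector for the entire metagenome. No read is ever classified on its own, which is why the per-read cost of binning disappears.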

4. Installing and Running FOCUS

Installing FOCUS is very simple because it is available through pip. All you need to do is run pip3 install metagenomics-focus.

Running FOCUS is also straightforward. The tool has a great README file in its Git repository, so I will redirect you to it.

5. FOCUS performance

Three hundred metagenomic datasets (254 GB in total) from the Human Microbiome Project were analyzed using FOCUS in about 1 h and 20 min. To show that it is indeed fast, we profiled the real dataset in 1 min and 45 s using an Intel(R) Core(TM) i3 @2.53 GHz and 1 GB RAM. Now imagine how much faster it would be on a computer with more RAM and more processors.

The paper also compares the profiling efficiency of FOCUS against other tools.



In summary, I hope I have shown that it is possible to profile a metagenomic dataset in seconds using FOCUS, a profiling tool.
