Easy NCBI Genome Download

Downloading genomes from NCBI by hand can be tedious. This page shows how to use ncbi-genome-download to fetch NCBI genomes with a single command.

1. What is the NCBI Genome Database?

The NCBI (National Center for Biotechnology Information) Genome is a database that provides access to the DNA sequences and related annotations for a variety of organisms, including viruses, bacteria, fungi, plants, and animals. The NCBI Genome database contains the complete genomes of thousands of organisms, as well as many incomplete or partially annotated genomes.

In terms of size, the NCBI Genome database is very large, and it keeps growing as new genomes are sequenced and added, so any precise figure quickly goes out of date. It is safe to say, however, that it is one of the largest collections of genomic data in the world.

The NCBI Genome database is updated regularly, with new genomes being added and existing genomes being updated as new data becomes available. The frequency of updates can vary, but the NCBI works to ensure that the data in the NCBI Genome database is as up-to-date and accurate as possible.

The NCBI Genome database provides access to high-quality, reference genome sequences, as well as other related information, such as gene models, functional annotations, and pathway data. The database is widely used by researchers and scientists around the world and is an essential resource for studying evolution, comparative genomics, and the molecular basis of biological processes.

2. Installing ncbi-genome-download

Installing the tool is simple: its developers have published it on both Bioconda and PyPI, so you can install it with either conda or pip.

# install ncbi-genome-download using bioconda
$ conda install -c bioconda ncbi-genome-download

# install ncbi-genome-download using pip
$ pip install ncbi-genome-download

You can use the following command to confirm that the tool was installed and is on your PATH.

$ which ncbi-genome-download
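
If you want the same check inside a script, a small helper like the one below may be handy. It is only a sketch: the function name check_tool is my own, and it relies on the POSIX "command -v" builtin rather than "which".

```shell
# Report whether a command is available on PATH.
# Prints "found" or "missing" and never aborts the calling script.
# check_tool is a hypothetical helper name, not part of the tool.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found"
    else
        echo "missing"
    fi
}

echo "ncbi-genome-download: $(check_tool ncbi-genome-download)"
```

If the last line prints "missing", repeat one of the install commands above before going further.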

3. Download RefSeq-NCBI Genomes by Kingdom

Downloading RefSeq genomes by kingdom with ncbi-genome-download is simple. The command below downloads the genomes in FASTA format, writes them to the “output_dir” directory, and runs 16 downloads in parallel.

$ ncbi-genome-download bacteria -F fasta -o output_dir/ --parallel 16

More than one kingdom can be passed as a comma-separated list. The example below downloads all the bacterial and fungal genomes in the RefSeq database.

$ ncbi-genome-download bacteria,fungi -F fasta -o output_dir/ --parallel 16
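
A whole kingdom is a lot of data, so it can be worth previewing what would be fetched first. The sketch below wraps the tool's --dry-run option, which lists the matching genomes without downloading them. The function name preview_download and the NGD variable are my own inventions for illustration.

```shell
# NGD names the program to run. Override it (for example NGD=echo) to
# print the arguments instead of contacting NCBI. NGD is a hypothetical
# variable used only in this sketch.
NGD=${NGD:-ncbi-genome-download}

# Run the given download arguments with --dry-run appended, so that
# matching genomes are listed but nothing is downloaded.
preview_download() {
    "$NGD" "$@" --dry-run
}
```

For example, "preview_download bacteria,fungi -F fasta -o output_dir/ --parallel 16" should list the bacterial and fungal genomes the real command would fetch.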

If you need a format other than FASTA, the -F option also accepts the following values:

'genbank' (default), 'rm', 'features', 'gff', 'protein-fasta', 'genpept', 'wgs', 'cds-fasta', 'rna-fna', 'rna-fasta', 'assembly-report', 'assembly-stats', 'all'
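
After a download finishes, a quick file count is a cheap sanity check. The sketch below assumes that only the 'fasta' format was downloaded and that the genome files end in "_genomic.fna.gz", which is what I see in my own downloads. Other formats, such as 'cds-fasta', may produce files that match the same pattern. The function name count_fasta is hypothetical.

```shell
# Count compressed genomic FASTA files anywhere below a directory.
# Assumption: the files end in "_genomic.fna.gz". Adjust the pattern
# if your download uses different formats or file names.
count_fasta() {
    find "$1" -type f -name '*_genomic.fna.gz' | wc -l | tr -d ' '
}
```

Running "count_fasta output_dir/" prints the number of genome files found, which you can compare with the number of genomes you expected.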

4. Download RefSeq-NCBI Genomes by Genus

Downloading RefSeq genomes by genus is simple: all you need to do is add the flag “--genera” and the genus name to the previous command.

The example below downloads all the Lactobacillus RefSeq genomes in FASTA format using 16 parallel downloads.

$ ncbi-genome-download bacteria --genera lactobacillus -F fasta -o output_dir/ --parallel 16

5. Download RefSeq-NCBI Genomes by Taxid

You can use the flag “--species-taxids” to download all the genomes that belong to a species taxid. The example below downloads all the genomes for Lactobacillus iners, which has taxid 147802.

$ ncbi-genome-download bacteria --species-taxids 147802 -F fasta -o output_dir/ --parallel 16

However, if you want one specific genome based on its own taxid, you can use the flag “--taxids” instead.

Here we download the Lactobacillus iners AB-1 genome, which has taxid 713605.

$ ncbi-genome-download bacteria --taxids 713605 -F fasta -o output_dir/ --parallel 16
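
According to the tool's help text, both taxid flags also accept a comma-separated list of IDs. If your IDs live in a text file with one per line, a sketch like the following can build that list. The function name taxids_from_file and the file name taxids.txt are made up for illustration.

```shell
# Join the non-empty lines of a file into one comma-separated string,
# so a file with one taxid per line becomes a single list argument.
# taxids_from_file is a hypothetical helper, not part of the tool.
taxids_from_file() {
    grep -v '^[[:space:]]*$' "$1" | paste -sd, -
}
```

You could then pass the result to the tool, for example with --species-taxids "$(taxids_from_file taxids.txt)" in place of the single ID used above.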

6. Conclusion

I hope you appreciate, as much as I did, how easy ncbi-genome-download makes it to download RefSeq genomes.

For more information on the tool and other parameters, please check out the tool documentation.

7. More Resources