How to Simulate NGS reads – Step-by-Step

onestop_databy:

Bioinformatics

This tutorial shows how to use a read simulator in Next-Generation Sequencing (NGS) sequence for single-cell genomes and metagenomics. It covers how to simulate Illumina (NovaSeq, HiSeq or MiSeq), Pacific Biosciences (PacBio) SMRT, and Oxford Nanopore reads.

1. Simulating Single-Cell Genomes

First and foremost, we show to simulate a single-cell genome using microbial data, but this tutorial can be also applied to other types of data such as when trying to simulate human reads.

We first show to simulate reads for the Illumina instruments, and then show how this can be done for Pacbio and Nanopore.

1.1 Simulating Illumina NovaSeq, HiSeq or MiSeq reads

Now, for the Illumina reads simulation, we are going to use InSilicoSeq. The tools can be simply installed using pip (pip3 install InSilicoSeq) or Bioconda (conda install -c bioconda insilicoseq).

Using InSilicoSeq is easy:

For example, below we generate 100,000 pairs of reads (FASTQ format) for Escherichia coli 042 using the NovaSeq Model. The FASTQs for the output are stored with the name novaseq_reads_ecoli and they are generated using 16 CPUs.

$ iss generate --n_reads 100000 --genomes genomes/Escherichia_coli_042.fasta --model novaseq --output novaseq_reads_ecoli --cpus 16

Next, we generate 100,000 pairs of reads for Escherichia coli 042 using the Hiseq Model.

$ iss generate --n_reads 100000 --genomes genomes/Escherichia_coli_042.fasta --model hiseq --output hiseq_reads_ecoli --cpus 16

And Finally, we generate 100,000 pairs of reads for Escherichia coli 042 using the Miseq Model.

$ iss generate --n_reads 100000 --genomes genomes/Escherichia_coli_042.fasta --model miseq --output miseq_reads_ecoli --cpus 16

1.2 Simulating Pacbio and Nanopore Reads

Here, we use SimLoRD for Pacbio reads simulation. Luckily it lives in Bioconda and has an easy install (conda install -c bioconda simlord).

The line below generates 100 Pacbio reads for the Escherichia coli 042 genome and outputs it on a FASTQ file named ecoli_pacbio_reads.fastq.

$ simlord -n 100 --read-reference genomes/Escherichia_coli_042.fasta ecoli_pacbio_reads

Now, we use DeepSimulator to simulate Nanopore reads. The tool requires you to have conda installed in your machine. If you don’t know how to install conda, please read this post.

Once you have conda set up, the next step is to install DeepSimulator by following these steps:

# clone tool from git
git clone https://github.com/lykaust15/DeepSimulator.git

# change directory to DeepSimulator cloned
cd ./DeepSimulator/

# install DeepSimulator
./install.sh

Now that we got the tool installed, it is easy to run it.

$ ./deep_simulator.sh -n 100 -i ../genomes/Escherichia_coli_042.fasta -c 16  -o ecoli_nanopore

The command should generate 100 Nanopore reads for Escherichia coli 042 using 16 CPUs, and they will be stored at directory ecoli_nanopore/

2. Simulating Metagenomes

Next, using a read simulator on a metagenomic dataset is also simple.

First, we need to define a tabular file with the target sequences in the first column and on the second column the sequence abundance. The sum of abundance has to sum up to 1 (100%).

For example, here we are simulating a microbial community with E.coli, Staphylococcus aureus, and Streptococcus pyogenes. When creating the tabular file, please ignore the header from the table below.

Sequence IDAbundance
FN554766.1_ecoli0.5
AP017922.1_staphylococcus_aureus_JP0800.3
AE014074.1_streptococcus_pyogenes_MGAS3150.2

Now that we have the abundance file, we can run InSilicoSeq to generate some NovaSeq reads, but it could also have been Hiseq or Miseq as done above.

Below we are generating 100,000 paired reads for the genomes in folder genomes/ and desired abundance for these sequences are described in metagenome_abundance.

$ iss generate --n_reads 100000 --genomes genomes/*.fasta --abundance_file metagenome_abundance.txt --model novaseq --output novaseq_reads_metagenome --cpus 16  

This should generate 25,000 pairs of Ecoli reads (50%), 15,000 pairs of Staphylococcus Aureus reads (30%), and 10,000 pairs of Streptococcus Pyogenes reads (20%).

More Resources

Here are three of my favorite Python Bioinformatics Books in case you want to learn more about it.

Conclusion

In this tutorial, you learned how to use a read simulator on single-cell and metagenomic reads with Illumina NovaSeq, HiSeq, and Miseq, Pacific Biosciences (PacBio) SMRT, and Oxford Nanopore models.

Related Posts