Everything You Need to Know About the CRAM Format

This tutorial covers the CRAM format: how it compares with SAM and BAM, the typical BAM-to-CRAM compression ratio, how to convert files with samtools, and what cramtools offers.

1. What are the BAM, SAM, and CRAM formats?

BAM, SAM, and CRAM are file formats used to store and exchange alignment data in bioinformatics.

BAM (Binary Alignment/Map) format is a compact binary representation of the data in the Sequence Alignment/Map (SAM) format. BAM files are already compressed (in BGZF blocks, a gzip-compatible scheme), so they take far less disk space than the equivalent SAM text and are faster to process. A coordinate-sorted BAM file can also be indexed for fast access to specific regions. BAM is widely used in next-generation sequencing applications.

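SAM (Sequence Alignment/Map) format is a text-based, tab-delimited format that represents the mapping of reads to a reference genome. After an optional header, each line represents a single alignment record with 11 mandatory fields, including the read name, flag, reference name, 1-based leftmost alignment position, mapping quality, CIGAR string, sequence, and base qualities. The alignment end position is not stored; it follows from the start position and the CIGAR string. An alignment score, when present, is an optional tag such as AS. SAM is human-readable but larger and slower to process than BAM for large-scale data.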
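
As a rough illustration of the record layout described above, the sketch below pulls a few of the mandatory columns out of one alignment line with awk. The record itself is made up for this example, and the function name describe_alignment is hypothetical; the column order follows the SAM specification.

```shell
# A made-up, tab-separated SAM alignment line (11 mandatory fields).
sam_line=$(printf 'read1\t0\tchr1\t10468\t60\t4M\t*\t0\t0\tACGT\tFFFF')

# Print read name (col 1), reference name (col 3), 1-based start (col 4),
# mapping quality (col 5) and CIGAR string (col 6) of one SAM line.
describe_alignment() {
    printf '%s\n' "$1" | awk -F '\t' '{
        printf "read=%s ref=%s pos=%s mapq=%s cigar=%s\n", $1, $3, $4, $5, $6
    }'
}

describe_alignment "$sam_line"
```

For the sample line this prints read=read1 ref=chr1 pos=10468 mapq=60 cigar=4M. On a real file the same columns can be read from the output of samtools view.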

CRAM (often expanded as Compressed Reference-oriented Alignment Map) is a binary format that holds the same information as SAM and BAM but compresses it more aggressively. CRAM uses reference-based compression: instead of storing every read sequence in full, it stores only the differences between the reads and the reference genome. This gives smaller files than BAM while preserving the information in the SAM format, and CRAM files can be indexed for random access to specific regions.

In summary, the three formats provide different trade-offs between file size, processing speed, and accessibility, and the choice of format depends on the specific use case and requirements.

2. Pros and Cons of using CRAM format

The CRAM format has several advantages and disadvantages compared to other file formats used in bioinformatics:

Pros:

  1. Compression: CRAM uses reference-based compression, which reduces the amount of data stored by only storing the differences between the reads and the reference genome. This leads to more efficient storage and reduced disk space requirements.
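  2. Random Access: CRAM files can be indexed (the index has a .crai extension) for fast access to specific regions, which is useful for tasks such as region-by-region variant calling or viewing alignments in a genome browser.
  3. Data Integrity: CRAM version 3 stores CRC32 checksums for its containers and blocks, and each slice records an MD5 checksum of the reference region it was encoded against. Together these help detect corruption during storage and transfer, as well as decoding against the wrong reference.
  4. Compatibility: CRAM represents the same data as SAM and BAM, so tools built on htslib (such as samtools and bcftools) read it directly. Other pipelines may need a reference path configured or a conversion step.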
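
The random-access point can be sketched as a small shell function. The function name cram_region and the file names in the example call are hypothetical; samtools index and samtools view with a region argument are standard samtools usage, and indexing assumes the CRAM is coordinate-sorted.

```shell
# Fetch alignments overlapping a region from a coordinate-sorted CRAM.
# Arguments: reference FASTA, CRAM file, region such as chr1:100000-200000.
cram_region() {
    ref=$1
    cram=$2
    region=$3
    # Build the .crai index once if it is not there yet.
    [ -e "$cram.crai" ] || samtools index "$cram" || return 1
    # -T names the reference FASTA needed to decode the reads.
    samtools view -T "$ref" "$cram" "$region"
}

# Example call (hypothetical file names):
# cram_region reference.fa output.cram chr1:100000-200000
```

Only the containers overlapping the requested region are decoded, so a small query does not require reading the whole file.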

Cons:

  1. Processing Time: Compressing and decompressing the data in CRAM format can take more time than simply reading the data from a BAM file. This can result in slower processing times and increased computational requirements.
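  2. Reference Genome Dependence: CRAM files depend on the reference genome used for compression, so the same reference must be available whenever the data is read. This can be a limitation for some applications, such as metagenomics, where the reference genome may not be known or available. samtools can optionally embed the reference in the file or write CRAM without reference-based compression, but both choices give up part of the size advantage.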
  3. Complexity: CRAM is a more complex file format than BAM or SAM, and requires specialized tools and knowledge to use effectively.
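
One cheap guard against the reference-dependence problem is to check, before decoding, that every sequence named in the file header exists in the FASTA you plan to use. The sketch below is only a name-level check under stated assumptions: the function name missing_contigs is hypothetical, the header is assumed to have been dumped to a text file (for example with samtools view -H), and the FASTA is assumed to have a .fai index from samtools faidx. It does not compare MD5 checksums, which is what CRAM decoding ultimately relies on.

```shell
# List sequence names that appear in @SQ header lines but not in a FASTA index.
# Arguments: header text file, .fai index file.
missing_contigs() {
    header=$1
    fai=$2
    awk -F '\t' '
        # First file (.fai): remember every sequence name (column 1).
        FILENAME == ARGV[1] { known[$1] = 1; next }
        # Second file (header): look at @SQ lines and extract the SN: tag.
        $1 == "@SQ" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SN:/) {
                    name = substr($i, 4)
                    if (!(name in known)) print name
                }
        }
    ' "$fai" "$header"
}

# Example call (hypothetical file names):
# samtools view -H output.cram > header.txt
# missing_contigs header.txt reference.fa.fai
```

Empty output means every header sequence name was found in the index; any printed name points to a reference mismatch worth investigating before processing the file.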

In summary, the choice of format will depend on the specific use case and requirements. CRAM is a good choice for applications where disk space is a concern, or when fast access to specific regions is required, but may not be the best choice for applications where processing speed is a priority or where the reference genome is not known or available.

3. BAM to CRAM Compression Ratio

The compression ratio of a BAM file converted to CRAM format will depend on several factors, including the size and complexity of the data, the reference genome used for compression, and the specific implementation of the CRAM format.

CRAM files are usually smaller than the corresponding BAM files, often by a factor of 2-3 or more, because of the reference-based compression used in the CRAM format. The exact ratio varies with the data and the reference genome.

For example, for human genomic data a compression ratio of 2-3 is common, meaning that the CRAM file is roughly one half to one third the size of the corresponding BAM file. For datasets with a large number of insertions and deletions, or with many reads in highly repetitive regions, the compression ratio may be lower, because more differences from the reference have to be stored.
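
To measure the ratio on your own data rather than rely on typical figures, you can compare the two file sizes directly. The sketch below divides the BAM size by the CRAM size and also reports the space saved; the function name compression_report and the file names are hypothetical.

```shell
# Report the BAM/CRAM size ratio and the percentage of space saved.
# Arguments: BAM file, CRAM file.
compression_report() {
    bam_bytes=$(wc -c < "$1") || return 1
    cram_bytes=$(wc -c < "$2") || return 1
    awk -v b="$bam_bytes" -v c="$cram_bytes" 'BEGIN {
        # Guard against empty files to avoid dividing by zero.
        if (b == 0 || c == 0) { print "empty input file"; exit 1 }
        printf "ratio=%.2f saved=%.1f%%\n", b / c, 100 * (1 - c / b)
    }'
}

# Example call (hypothetical file names):
# compression_report input.bam output.cram
```

A 300 MB BAM that becomes a 100 MB CRAM gives ratio=3.00 and saved=66.7%, while a ratio of 2 corresponds to 50% saved.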

It’s worth noting that while the CRAM format can result in smaller file sizes, it may also result in slower processing times, as the compression and decompression of the data can take more time than simply reading the data from a BAM file. The trade-off between file size and processing time will depend on the specific use case and the size of the data.

4. How to convert from BAM to CRAM?

To convert a BAM file to a CRAM file, you can use the samtools toolkit, which is a widely used set of utilities for handling alignment data in bioinformatics.

Here’s an example of how to convert a BAM file to CRAM using samtools:

samtools view -@ NUMBER_OF_THREADS -T reference.fa -C -o output.cram input.bam

The -@ option sets the number of additional threads to use. The -T option specifies the reference genome (a FASTA file) that the reads in the BAM file were aligned to, and the -C option tells samtools to write the output in CRAM format. The -o option specifies the output file name.

Note that to convert a BAM file to CRAM you need the reference genome that was used for the original alignment. The reference is used to encode the reads when the CRAM file is written, and the same reference is needed again to decode them when the file is read.
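
The conversion can be wrapped in a small function that fails early when the reference or the input is missing and indexes the result afterwards. This is a minimal sketch, not a complete pipeline step: the function name bam_to_cram and the file names are hypothetical, and the indexing step assumes the input BAM is coordinate-sorted.

```shell
# Convert a BAM file to CRAM and index the result.
# Arguments: reference FASTA, input BAM, output CRAM, optional thread count.
bam_to_cram() {
    ref=$1
    bam=$2
    cram=$3
    threads=${4:-1}
    # Stop before calling samtools if an input file is not readable.
    [ -r "$ref" ] || { echo "reference not readable: $ref" >&2; return 1; }
    [ -r "$bam" ] || { echo "input not readable: $bam" >&2; return 1; }
    # Same command as in the tutorial text, with the arguments filled in.
    samtools view -@ "$threads" -T "$ref" -C -o "$cram" "$bam" || return 1
    # Write output.cram.crai so regions can be fetched later.
    samtools index "$cram"
}

# Example call (hypothetical file names):
# bam_to_cram reference.fa input.bam output.cram 4
```

Keeping the original BAM until the CRAM has been read back successfully is a sensible precaution, since a CRAM written against the wrong reference cannot be decoded correctly.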

The trade-off between file size and processing time described in the previous section applies here as well: the CRAM file is smaller, but writing and reading it can take more time than working with the BAM file.

5. Is it possible to convert from SAM to CRAM?

Yes. Use the command from the previous section and pass a SAM file instead of a BAM file; samtools detects the input format automatically.
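
SAM output straight from an aligner is often not coordinate-sorted, and an unsorted file cannot be indexed. The sketch below therefore shows two variants: a direct conversion that mirrors the BAM case, and one that sorts and writes CRAM in a single step with samtools sort. The function names and file names are hypothetical, and whether you need the sorting variant depends on how the SAM file was produced.

```shell
# Direct conversion: same command as for BAM, only the input file changes.
# Arguments: reference FASTA, input SAM, output CRAM.
sam_to_cram() {
    samtools view -T "$1" -C -o "$3" "$2"
}

# Sort by coordinate and write CRAM in one step, then index the result.
# Arguments: reference FASTA, input SAM, output CRAM, optional thread count.
sam_to_sorted_cram() {
    ref=$1
    sam=$2
    cram=$3
    threads=${4:-1}
    samtools sort -@ "$threads" --reference "$ref" -O cram -o "$cram" "$sam" || return 1
    samtools index "$cram"
}

# Example calls (hypothetical file names):
# sam_to_cram reference.fa input.sam output.cram
# sam_to_sorted_cram reference.fa input.sam sorted.cram 4
```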

6. What is cramtools and what are its benefits over samtools for BAM to CRAM conversion?

CRAMtools is a Java-based suite of tools for working with CRAM files in bioinformatics. CRAMtools provides functionality for reading, writing, and manipulating CRAM files, as well as for working with reference genomes, quality scores, and alignment statistics.

Compared to samtools, CRAMtools provides a number of benefits for working with CRAM files, including:

  1. Java Implementation: CRAMtools is written in Java, so it runs anywhere a Java runtime is available and fits naturally into Java-based pipelines.
  2. Rich Feature Set: CRAMtools provides a comprehensive set of features for working with CRAM files, including tools for reading and writing CRAM files, working with reference genomes, and calculating alignment statistics.
  3. High-Level APIs: CRAMtools provides high-level APIs for working with CRAM files, which makes it easier to work with the data in a programmatic way.
  4. Integration with Other Tools: CRAMtools integrates well with other bioinformatics tools and pipelines, allowing for seamless integration with existing workflows.
  5. Performance: CRAMtools may be faster than samtools for certain operations, but this depends on the tool versions, the data, and the hardware, so benchmark both on your own files before relying on it.

In summary, CRAMtools is an alternative to samtools for working with CRAM files that is mainly attractive in Java-based environments. Whether it is the better choice for a particular use case depends on the specific requirements and preferences of the user, and it is worth checking the project's current maintenance status before adopting it.

7. Conclusions

This tutorial introduced the CRAM format and compared it with SAM and BAM. CRAM's reference-based compression typically makes files two to three times smaller than BAM, at the cost of extra processing time and a dependence on the reference genome used for compression.

We also showed how to convert BAM or SAM files to CRAM with samtools and outlined what cramtools offers as a Java-based alternative. For bioinformaticians and researchers working with large genomic datasets, choosing CRAM where these trade-offs are acceptable can substantially reduce storage requirements.