Fast Conversion of Lowercase Sequences to Uppercase in FASTA Format

Bioinformatics

This tutorial shows two ways to convert lowercase sequences to uppercase in a FASTA file. In bioinformatics, lowercase bases in a FASTA file usually indicate soft-masked regions, such as repeats or low-complexity sequence. This can be a problem because some bioinformatics software ignores the lowercase regions.

1. Why Should I Convert Lowercase Sequences to Uppercase in FASTA Format?

In FASTA format, each record starts with a header line beginning with the “>” symbol, followed by one or more lines of sequence data. The format itself does not prescribe a letter case for the sequence, but by a widely followed convention lowercase letters mark soft-masked regions (repeats or low-complexity sequence), while uppercase letters mark unmasked sequence. Some tools skip lowercase bases or treat them differently, so a soft-masked file can silently give different results from an unmasked one. Converting the sequences to uppercase removes the soft-masking, so every base is treated the same way and the data is easier to process consistently.
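
Before converting anything, it can help to check whether a file contains lowercase bases at all. The following is a minimal sketch, assuming a small toy file named example.fa (a hypothetical name used only for illustration); it counts lowercase letters on sequence lines and ignores header lines.

```shell
# Build a tiny toy FASTA (hypothetical file name) with some lowercase stretches.
printf '>seq1 toy record\nACGTacgtACGT\n>seq2\nggCCtt\n' > example.fa

# Drop header lines (those starting with ">"), keep only lowercase letters,
# and count them. The final tr strips padding that some wc versions add.
lower_count=$(grep -v '^>' example.fa | tr -cd 'a-z' | wc -c | tr -d ' ')

echo "lowercase bases: $lower_count"
```

A count of zero suggests the file is already fully uppercase and needs no conversion.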

2. Using BBMap to Convert Lowercase Sequences to Uppercase

BBMap’s reformat.sh script was the fastest way I found to convert lowercase bases to uppercase – it took 20 seconds to process the human genome build 38.

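Here is the script’s syntax. The trd flag trims each header at the first whitespace, tuc converts the bases to uppercase, and -Xmx10g lets Java use up to 10 GB of memory: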

$ reformat.sh in=INPUT_FASTA out=OUTPUT_FASTA trd tuc -Xmx10g

3. Using awk to Convert Lowercase Sequences to Uppercase

An alternative to BBMap is awk. Unfortunately, it is not as fast as BBMap – it took ~3 minutes to get the job done.

If you are working with small files and don’t want to install BBMap, this is the way to go.

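Here is the syntax. Header lines (those starting with “>”) are cut at the first space, and every other line is converted to uppercase:

$ awk '/^>/ {print $1; next} {print toupper($0)}' INPUT_FASTA > OUTPUT_FASTA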
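
To see what the awk approach does, here is a small worked example on a toy file. It is only a sketch: toy_input.fa and toy_output.fa are hypothetical file names, and the command uses the same awk logic with the header test written as /^>/ so that only lines starting with “>” count as headers.

```shell
# Toy input (hypothetical file names): one soft-masked record with a long header.
printf '>chr1 test description\nACGTacgtnnNN\nttgaCC\n' > toy_input.fa

# Header lines are reduced to their first whitespace-separated field;
# every other line is converted to uppercase.
awk '/^>/ {print $1; next} {print toupper($0)}' toy_input.fa > toy_output.fa

cat toy_output.fa
# >chr1
# ACGTACGTNNNN
# TTGACC
```

Note that the header is shortened to its first word, which mirrors what the trd flag does in the BBMap command.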

There you go: BBMap or awk. I hope this was useful to you.

4. More Resources