Fast Conversion of Lowercase Sequences to Uppercase in FASTA Format



This tutorial shows two ways to convert lowercase sequences to uppercase in FASTA files. In bioinformatics, lowercase bases in a FASTA file usually mark soft-masked regions, such as repeats or low-complexity sequence. This can be a problem because some bioinformatics software ignores the lowercase regions.
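Before converting anything, it can help to check whether a file contains lowercase bases at all. The following is only a minimal sketch and not part of the original workflow: the helper name count_lowercase and the made-up two-record input are assumptions for illustration.

```shell
# Hypothetical helper: count lowercase letters on the sequence lines of a FASTA file.
count_lowercase() {
    # Header lines (starting with ">") are skipped. gsub() returns the number
    # of substitutions, so replacing each lowercase letter with itself ("&")
    # counts the letters without changing the line.
    LC_ALL=C awk '!/^>/ { n += gsub(/[a-z]/, "&") } END { print n + 0 }' "$1"
}

# Tiny illustrative input (made-up sequences, not real data).
fasta=$(mktemp)
printf '>chr1 test\nACGTacgtNNnn\n>chr2\nACGT\n' > "$fasta"

count_lowercase "$fasta"   # prints 6
```

A result of 0 means the file is already fully uppercase and no conversion is needed.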

1. Using BBMap to Convert Lowercase Sequences to Uppercase

A script from BBMap was the fastest way I found to convert lowercase bases to uppercase: it took 20 seconds to process the human genome (build 38).

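Here is the script's syntax, where tuc converts bases to uppercase, trd trims each header at the first whitespace, and -Xmx10g lets Java use up to 10 GB of memory: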

$ in=INPUT_FASTA out=OUTPUT_FASTA trd tuc -Xmx10g

2. Using awk to Convert Lowercase Sequences to Uppercase

An alternative to BBMap is awk. Unfortunately, it is not as fast as BBMap: it took ~3 minutes to do the same job.

If you are working with small files and don’t want to install BBMap, this is the way to go.

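Here is the syntax. Header lines are printed up to the first whitespace, and every other line is converted to uppercase: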

$ awk '/^>/{print $1; next}{print toupper($0)}' INPUT_FASTA > OUTPUT_FASTA
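To see what this does in practice, here is a small self-contained run on a made-up two-record FASTA. It is a sketch only: the wrapper name to_upper_fasta and the temporary files are hypothetical, and the awk program applies the same logic as the command above (headers cut at the first whitespace, sequence lines uppercased).

```shell
# Hypothetical wrapper applying the same logic as the awk command above.
to_upper_fasta() {
    # Header lines: keep only the first whitespace-separated field.
    # All other lines: convert to uppercase.
    awk '/^>/ { print $1; next } { print toupper($0) }' "$1"
}

# Made-up input with mixed-case sequence and a header description.
input=$(mktemp)
output=$(mktemp)
printf '>seq1 soft-masked example\nACGTacgtnn\n>seq2\nttgaCC\n' > "$input"

to_upper_fasta "$input" > "$output"
cat "$output"

# Sanity check: report whether any lowercase letter is left on a sequence line.
if grep -v '^>' "$output" | LC_ALL=C grep -q '[a-z]'; then
    echo "lowercase bases remain"
else
    echo "all bases are uppercase"
fi
```

Note that the description after the sequence name is dropped, so ">seq1 soft-masked example" becomes ">seq1". If you need the full header, print $0 instead of $1 on header lines.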

There you go: BBMap or awk. I hope this was useful to you.

3. More Resources