Fast Conversion of Lowercase Sequences to Uppercase in FASTA Format

onestop_databy:

Bioinformatics

This tutorial shows two ways to convert lowercase sequences to uppercase in FASTA files. In bioinformatics, lowercase bases in a FASTA file conventionally mark soft-masked regions, typically repeats and low-complexity sequence. This can be a problem because some bioinformatics software ignores the lowercase regions.

Using BBMap to Convert Lowercase Sequences to Uppercase

BBMap's reformat.sh script was the fastest way I found to convert lowercase bases to uppercase: it took 20 seconds to process the human genome (build 38).

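Here is the script's syntax. The tuc flag converts bases to uppercase, trd trims each header at the first whitespace, and -Xmx10g lets Java use up to 10 GB of memory: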

$ reformat.sh in=INPUT_FASTA out=OUTPUT_FASTA trd tuc -Xmx10g
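After the conversion, it can be worth confirming that no lowercase bases remain. The snippet below is a minimal sketch of such a check, not part of BBMap: the helper name count_lowercase_lines and the file check_demo.fa are hypothetical and used only for illustration.

```shell
# Hypothetical helper: count sequence lines that still contain lowercase letters.
count_lowercase_lines() {
    # Skip header lines (starting with ">"); a line differs from its
    # uppercased form only if it contains at least one lowercase letter.
    awk '!/^>/ && $0 != toupper($0) {n++} END {print n+0}' "$1"
}

# Tiny made-up FASTA for demonstration
printf '>seq1\nACGTacgt\nACGT\n>seq2\nnnnn\n' > check_demo.fa
count_lowercase_lines check_demo.fa   # prints 2
```

On real data, run count_lowercase_lines OUTPUT_FASTA. A result of 0 means every sequence line is fully uppercase.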

Using awk to Convert Lowercase Sequences to Uppercase

An alternative to BBMap is to simply use awk. Unfortunately, it is not as fast as BBMap: it took ~3 minutes to process the same genome.

If you are working with small files and/or don't want to install BBMap, this is the way to go.

Here is the syntax:

$ awk 'BEGIN{FS=" "}{if(!/>/){print toupper($0)}else{print $1}}' INPUT_FASTA > OUTPUT_FASTA
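To see what this command does, here is a small sketch on a made-up FASTA file (the file names demo.fa and demo_upper.fa and their contents are hypothetical). Sequence lines are uppercased, while header lines are cut at the first whitespace, so only the sequence ID is kept.

```shell
# Create a tiny example FASTA (made-up content, for illustration only)
printf '>chr1 test description\nacgtNNacgt\nACGTacgt\n>chr2\nnnnnacgt\n' > demo.fa

# Same awk command as above: uppercase sequence lines, keep only the
# first whitespace-separated field of header lines
awk 'BEGIN{FS=" "}{if(!/>/){print toupper($0)}else{print $1}}' demo.fa > demo_upper.fa

cat demo_upper.fa
# >chr1
# ACGTNNACGT
# ACGTACGT
# >chr2
# NNNNACGT
```

Note that the description after the ID ("test description" here) is dropped, which mirrors the header trimming done by the BBMap command.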

There you go: BBMap or awk. I hope this was useful to you.
