How to count the length of fasta sequences?

04-21-2019

Registered User

2,288, 480

Join Date: Apr 2007

Last Activity: 3 May 2020, 8:28 AM EDT

Location: Saint Paul, MN USA / BSD, CentOS, Debian, OS X, Solaris

Posts: 2,288

Thanks Given: 430

Thanked 480 Times in 395 Posts

Hi.

A few thoughts on this from a person who does not know much about FASTA-format files. My impression is that the sequence after the >ID line often spans many lines. There are some utilities that are designed to work with such files (and even the more complex formats : FASTA format - Wikipedia )

One such is fastagrep which in one part of a group of perl utilities at: GitHub - rec3141/rec-genome-tools . These appear to be somewhat old, but fastagrep worked for me as shown below.

For those wishing to do more in perl, see the search for fasta at Search for "fasta" - metacpan.org , seeing the BIO and FASTA collections.

However, one may often want to use awk, so I put together a small template that could be used for elementary processing, like getting the length of a sequence. Functions can do some of the work, are easily added, and two are included to see how they can work.

The last example illustrates the modularity of using fastagrep to filter the desired IDs which are then piped into the awk code:

Code:

!/usr/bin/env bash

# @(#) s1       Demonstrate awk template for reading FASTA files.

LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C awk

FILE=${1-data1}

pl " Input file $FILE:"
head -25 $FILE

pl " Results, $FG extracted lines from $FILE:"
fastagrep 'seq1|seq2' $FILE

pl " Results, awk print from $FILE, awk FASTA template:"
awk '
function fformat(id, sequence)
{
  printf ">%s\n", id
  while (length(sequence) >= 60) {
    print substr(sequence, 1, 60)
    gsub(/^.{60}/, "", sequence)
  }
  if (length(sequence) > 0) { print sequence }
}
function operate(id, sequence)
{
  print id, length(sequence)
}
BEGIN {
  getline
  id = substr($1, 2)
  sequence = ""
}
$0 ~ /^>/ {
  fformat(id, sequence)
  # operate(id, sequence)
  id = substr($1, 2)
  sequence = ""
  next
}
{
  sequence = sequence $0
}
END { fformat(id, sequence) }
# END { operate(id, sequence) }

' $FILE

pl " Results, fastagrep and print length with \"operate\" function:"
fastagrep 'seq1|seq2' $FILE |
./s0

exit 0

producing:

Code:

$ ./s1 data3

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-7-amd64, x86_64
Distribution        : Debian 8.11 (jessie) 
bash GNU bash 4.3.30
awk GNU Awk 4.1.1, API: 1.1 (GNU MPFR 3.1.2-p3, GNU MP 6.0.0)

-----
 Input file data3:
>seq1
ATGCTA
>seq2
CGAAACGCAAACAGCCCACGACAGTTAGCCTAGATTGGAACCTCCGGGAAATATGAGACT
GGGTTTAAGGGATGGCAGCCATAGCCACGACATCACGTAG
>seq3
TAGC


-----
 Results,  extracted lines from data3:
>seq1
ATGCTA
>seq2
CGAAACGCAAACAGCCCACGACAGTTAGCCTAGATTGGAACCTCCGGGAAATATGAGACT
GGGTTTAAGGGATGGCAGCCATAGCCACGACATCACGTAG

-----
 Results, awk print from data3, awk FASTA template:
>seq1
ATGCTA
>seq2
CGAAACGCAAACAGCCCACGACAGTTAGCCTAGATTGGAACCTCCGGGAAATATGAGACT
GGGTTTAAGGGATGGCAGCCATAGCCACGACATCACGTAG
>seq3
TAGC

-----
 Results, fastagrep and print length with "operate" function:
seq1 6
seq2 100

The ./s0 above is essentially the awk-template using the operate function.

More details on fastagrep:

Code:

fastagrep       extract sequences from a multi-FASTA file by regex. (what)
Path    : ~/bin/fastagrep
Version : fastagrep version 0.3
Length  : 392 lines
Type    : Perl script, ASCII text executable
Shebang : #!/usr/bin/env perl
Home    : GitHub - rec3141/rec-genome-tools (doc)
Modules : (for perl codes)
 strict 1.08
 warnings       1.23
 Getopt::Std    1.10
 IO::File       1.16
 Data::Dumper   2.151_01

Best wishes ... cheers, drl

This User Gave Thanks to drl For This Post:

drl

View Public Profile for drl

Find all posts by drl

UNIX for Beginners Questions & Answers

How to count the length of fasta sequences?

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

How to add specific bases at the beginning and ending of all the fasta sequences?

Discussion started by: dineshkumarsrk

2. Shell Programming and Scripting

Shorten header of protein sequences in fasta file to only organism name

Discussion started by: jerrild

3. Shell Programming and Scripting

Outputting sequences based on length with sed

Discussion started by: Xterra

4. Shell Programming and Scripting

Getting unique sequences from multiple fasta file

Discussion started by: Ibk

5. UNIX for Dummies Questions & Answers

Select distinct sequences from fasta file and list

Discussion started by: Marion MPI

6. Shell Programming and Scripting

Shorten header of protein sequences in fasta file

Discussion started by: alexypaul

7. Shell Programming and Scripting

Count and search by sequence in multiple fasta file

Discussion started by: empyrean

8. Shell Programming and Scripting

Extract sequences from a FASTA file based on another file

Discussion started by: nelsonfrans

9. Shell Programming and Scripting

Shell script for changing the accession number of DNA sequences in a FASTA file

Discussion started by: margarita

10. Shell Programming and Scripting

Extract length wise sequences from fastq file

Discussion started by: empyrean