grep FASTA files

06-16-2010


I would like to extract the sequences larger than 10 bases but shorter than 18 along with the identifier from a FASTA file that looks like this:

> Seq 1
ACGACTAGACGATAGACGATAGA
> Seq 2
ACGATGACGTAGCAGT
> Seq 3
ACGATACGAT

I know I can extract the IDs alone with the following code:

Code:

grep ">" inputfile > Outputfile

and the sequences that are at least 10 bases long with the following one:

Code:

grep "[[:alpha:]]\{10\}" inputfile > outputfile

However, the IDs are lost because the header lines are too short to match, and the output file still contains the reads that are longer than 18 bases. What I need is a file containing each sequence ID followed by the actual sequence. For the example above, the only sequence present in the output file would be sequence 2:

Quote:

> Seq 2
ACGATGACGTAGCAGT

Any ideas?
Thanks!

Last edited by Xterra; 06-16-2010 at 11:58 PM..

Xterra


06-17-2010


Maybe this?

Code:

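$ perl -nle '# $a is 0 on a header line and 1 on a sequence line (strict two-line records assumed)
             if ($a == 0) { push @x, $_; $a=1; next; }
             # print header and sequence only when the sequence has 11 to 17 bases
             if ($a == 1 && length($_) >= 11 && length($_) <= 17)
             { print @x; print $_; $a=0; @x=(); }
             else { $a=0; @x=(); }' fasta-file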
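
For comparison, the same filter can be sketched as a plain shell loop, which may be easier to follow than the Perl one-liner. It makes the same assumption: every record is exactly two lines, a header starting with `>` followed by a single sequence line. The function name `filter_fasta` and the sample records are only placeholders for illustration.

```shell
#!/usr/bin/env bash
# Sketch: keep records whose sequence is 11 to 17 bases long.
# Assumes each record is a ">" header line followed by one sequence line.
# Reads FASTA text from standard input.
filter_fasta() {
    while IFS= read -r header && IFS= read -r seq; do
        # ${#seq} expands to the length of the sequence line
        if [ "${#seq}" -ge 11 ] && [ "${#seq}" -le 17 ]; then
            printf '%s\n%s\n' "$header" "$seq"
        fi
    done
}

# Small demo input (hypothetical records in the style of the question)
filter_fasta <<'EOF'
> Seq 1
ACGACTAGACGATAGACGATAGA
> Seq 2
ACGATGACGTAGCAGT
> Seq 3
ACGATACGAT
EOF
```

To run it on a real file, redirect the file into the function, for example `filter_fasta < inputfile > outputfile`. Sequences wrapped over several lines would need to be joined first, since the loop reads exactly one sequence line per header.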


pseudocoder


06-17-2010


pseudocoder,

Would there be a way to do the same with bash? It would be easier for me to understand.
I was also wondering if there is any way to calculate the frequency of each sequence. In other words, let's assume that after 'trimming' the sequences several of them are identical; would it be possible to determine the frequency and include it as part of the ID line? Something like this:

Quote:

> Seq A Freq 50
AGAGATAGATAGAGCTGAT
> Seq B Freq 25
AGAGATAGATAGAGCTGAT
> Seq C Freq 25
AGAGATAGATAGAGCTGAT

Thanks

Last edited by Xterra; 06-17-2010 at 05:00 AM..

Xterra


06-17-2010


Not sure if I got you right, but maybe this is what you want.
Yes, I know the data gets processed twice, but it worked for me:

Code:

$ cat fasta
> Seq 1
ACGACTAGACGATAGACGATAGA
> Seq 2
ACGATGACGTAGCAGT
> Seq 3
ACGATGACGTAGCAGT
> Seq 4
ACGATGACGTAGCAGT
> Seq 5
ACGATACGAT
> Seq 6
ASDFASKFALKJSDFKJASLDFJL
> Seq 7
ASDFASDFASDF
> Seq 8
ASAFJASDFASDFF
> Seq 9
ASDFAS
> Seq 10
ASAFJASDFASDFF
> Seq 11
ACGATGACGTAGCAGT

Code:

$ perl -nle 'if ($a == 0) { push @x, $_; $a=1; next; }
             if ($a == 1 && length($_) >=11 && length($_) <= 17)
             { push @x, ",$_"; print @x; $a=0; @x=(); }
             else { $a=0; @x=(); }' fasta | cut -c7- | sort -t, -k2 -k1n |\
  perl -F, -lane 'if ($a == 0) { push @x, @F; $a=1; $c=1; next; }
                  if ($a == 1 && $F[1] eq $x[1]) { ++$c; }
                  else { push @x, $c; print "> Seq $x[0]", " / freq $x[2]\n", $x[1]; @x=(); push @x, @F; $a=1; $c=1; }
                  END { push @x, $c; print "> Seq $x[0]", " / freq $x[2]\n", $x[1]; }'
> Seq 2 / freq 4
ACGATGACGTAGCAGT
> Seq 8 / freq 2
ASAFJASDFASDFF
> Seq 7 / freq 1
ASDFASDFASDF
$

The lines are sorted by sequence, and the lowest Seq number found for each distinct sequence is used as the ID for that sequence.

PS: Sorry, I can't tell you how it's done in bash.
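
As a rough shell-side take on the frequency question, here is a sketch that leans on awk, which is normally available wherever bash is. It assumes two-line records (a `>` header followed by one sequence line) and reuses the `/ freq N` header format from the output above. The function name `count_fasta` and the demo records are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: keep sequences of 11 to 17 bases, collapse identical ones, and count them.
# Assumes two-line records (">" header, then one sequence line).
# Reads the files given as arguments, or standard input when none are given.
count_fasta() {
    awk '
        /^>/ { header = $0; next }              # remember the most recent header
        length($0) >= 11 && length($0) <= 17 {
            if (!($0 in freq)) {                # first time this sequence is seen
                first[$0] = header              # keep the earliest header as its ID
                order[++n] = $0                 # remember first-seen order
            }
            freq[$0]++
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%s / freq %d\n%s\n", first[order[i]], freq[order[i]], order[i]
        }
    ' "$@"
}

# Small demo input (hypothetical records)
printf '%s\n' '> Seq 1' 'ACGATGACGTAGCAGT' \
              '> Seq 2' 'ACGATACGAT' \
              '> Seq 3' 'ACGATGACGTAGCAGT' \
              '> Seq 4' 'ASDFASDFASDF' | count_fasta
```

This reads the data only once and reports each distinct sequence in the order it first appears, rather than sorted by sequence as in the Perl pipeline. On the sample file it would be called as `count_fasta fasta`.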

pseudocoder

