Select distinct sequences from fasta file and list

09-24-2014

Hi
How can I extract sequences from a FASTA file that meet a certain criterion? The beginning of my file (which contains more than 1000 sequences in total) looks like this:

Code:

>H8V34IS02I59VP 
SDACNDLTIALLQIAREVRVCNPTFSFRWHPQVKDEVMRECFDCIRQGLG
YPSMRNDPILIANCMNWHGHPLEEARQWVHQACMSPCPSTKHGFQPFRMA
SATANCAKIIEYALSNGYDPVVNMQMGPKIGDARDFT*L*TVIRGLGTTR
WSG
>H8V34IS02IRUQO 
SDACNDLTTAYMQAARLVRVCNPTFSFRYHPQVKDEVMREAFGCIRHGLG
YPNIKNDSVLIPNAMYWHGHPLEEARQWVNQACMSPCPPXKYGCQPNRMA
SAANCAKMIEYTLHTGMIM**TCRVGTEGRVIRAYFKDFGGVLTRYGVKQ
MEWLDVVLIVRFT
>H8V34IS02HTVT3 
SDACNGMTIALMQAARLVRTPNPTFAFRWHPKVKDEVMREIFECIRHGLG
YPAMRNDPILISNAMHWHRHPIEEARTWVHQACMSPCPTTKHGTQPMRMA
HATANCAKIMEYALWNGYDHVVNMQMGPRTGDARKFTDFEQLFDAWVKQX
DGC
>H8V34IS02HI4PS 
SDACNALTDCYLEAALVSRVSDPTFGFRYHSKVRTETLRRVFECIRHGLG
YPSIRNDDVLIPNIMHWFGHPLKEARRWLHQACMAPAPDTKWGAPSLRYP
QASIITGSKAISLAMFDGFDPLTGMQTGIKTGDCSKFETFDEFYDAWYEQ
PKAGFKQATGMEH
>H8V34IS02F9NL0 
SDACNDLTIALLQIAREVRVCNPTFSFRWHPQVKDEVMRECFDCIRQGLG
YPSMRNDPILIANCMNWHGHPLEEARQWVHQACMSPCPSTKHGFQPFRMA
SATANCAKIIEYALSNGYDPVVNMQMGPKIGDAGISDFEQLFEAWVQTDG
VA*WIL

I want to extract the sequences containing the motif FDCIR. Can it be done with grep, or do I need a Perl script?
As a next step: how could I extract sequences that fulfill two or more criteria?

Looking forward to your suggestions.

Cheers, Marion.

Last edited by jim mcnamara; 09-24-2014 at 04:55 PM..

Marion MPI

09-24-2014

Hi Marion, welcome to the forums. Could you post the expected output as well, please?

Akshay Hegde

09-24-2014

Have you tried

Code:

grep "FDCIR" file

?
To match any one of several motifs, use one of the following equivalent forms:

Code:

grep -e "FDCIR" -e "ALSNG" file

Code:

grep "\
FDCIR
ALSNG" file

Code:

grep -E "FDCIR|ALSNG" file
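
A small check of the forms above, as a sketch only: the file name demo.fa and its three made-up records are assumptions for illustration, and grep -E is used as the current spelling of egrep. All three commands select the same lines. Note that grep reports matching lines only, so the > header lines and the remaining lines of a record are not part of the output.

```shell
# Hypothetical three-record FASTA file, for illustration only
cat > demo.fa <<'EOF'
>seq1
AAAFDCIRAAA
CCCCCCCCCCC
>seq2
GGGALSNGGGG
>seq3
TTTTTTTTTTT
EOF

# Three ways to ask for lines containing either motif
a=$(grep -e "FDCIR" -e "ALSNG" demo.fa)   # one -e option per pattern
b=$(grep "FDCIR
ALSNG" demo.fa)                           # newline-separated pattern list
c=$(grep -E "FDCIR|ALSNG" demo.fa)        # extended regex alternation

# Only the two matching sequence lines are printed, without their headers
printf '%s\n' "$a"
```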

MadeInGermany

09-24-2014

Code:

#  Try awk (or nawk on Solaris)
awk ' />/ {arr[$0]=""; i=$0; next}
        {arr[i]=arr[i] $0}
        END {for( p in arr) { if ( index(arr[p], "FCDIR")>0 ) {print p, arr[p] }}}
       '   fasta_file > newfile
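
Note that the script above searches for "FCDIR", whereas the motif asked for is "FDCIR", so as posted it selects nothing from the sample data shown. It also prints the header and the sequence on one line, in whatever order awk walks the array. A minimal sketch of the same idea with the motif corrected, the input order kept, and the header and sequence on separate lines; the names demo2.fa, selected.fa, hdr, seq and order are made up for illustration.

```shell
# Hypothetical two-record FASTA file; the motif in seq1 is wrapped across two lines
cat > demo2.fa <<'EOF'
>seq1
AAAFD
CIRAAA
>seq2
GGGGGG
EOF

awk -v motif="FDCIR" '
  /^>/ { hdr = $0; order[++n] = hdr; seq[hdr] = ""; next }  # header: start a record
       { seq[hdr] = seq[hdr] $0 }                           # join wrapped sequence lines
  END  {
    for (i = 1; i <= n; i++)                                # walk records in input order
      if (index(seq[order[i]], motif) > 0)                  # plain substring test
        print order[i] "\n" seq[order[i]]
  }
' demo2.fa > selected.fa

cat selected.fa
```

Because the sequence lines are joined before the test, a motif that is split by a line break is still found; the selected sequences are written unwrapped, on a single line each.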

jim mcnamara

09-25-2014

Hi
Many thanks for the quick reply.
I tried the grep commands, but the output looks like

Code:

SDACNDLTDAILEASLNIRTPEPSLSFRYSPKINEKTRHLVFDNIAQGFGFPSIKHEEKN
GRVQRSDRRHTRGILEYQDAGAVSCIQILAQNQRKTRHLVFDNIAQGFGFPSIKHEEKTR
SDACNALTDAILEASLNIRTPEPSLSFRYSPKINEKTRHLVFDNIAQGFGFPSIKHEEKN
GRVQRSDRRHTRSILEYQNAGAVPFIQVLTQDQRKTRHLVFDNIAQGFGFPSIKHDEKNT
GRVQRSDRRHTRSILEYQDAGAVSCVQIFS*NQRKTRHLVFDNIAQGFGFPSIKHEEKTR
GRVQRSDRRHTRGILEYQDAGAVSCIQILAQNQRKTRHLVFDNIAQGFGFPSIKHEEKNT
GRVQRSDRRHTRSILEYQDTGAVPFIQVLTQDQRKTRHLVFDNIAQGFGFPSIKHEEKNT
GRVQRSDRRHTRSILEYQDTGAVPFIQVLTQDQRKTRHLVFDNIAQGFGFPSIKHEEKNH
SDACNDLTDAILEASLNIRTPEPSLAFRYSPKINEKTRHLVFDNIAQGFGFPSIKHEEKN
SDACNDLTDAILEASLNIRTPEPSLSFRYSPKINEKTRHLVFDNIAQGFGFPSIKHEEKN
SDACNDLTDAILEASLNIRTPEPSLSFRYSPKINEKTRHLVFDNIAQGFGFPSIKHEEKN
SDACNALTDAILEASLNIRTPEPSLAFRYSPKINEKTRHLVFDNIAQGFGFPSIKHEEKN
SDACNDLTDAILEASLNIRTPEPSLSFRYSPKINEKTRHLVFDNIAQGFGFPSIKHDEKN
SDACNALTDAILEASLNIRTPEPSLSFRYSPKINEKTRHLVFDNIAQGFGFRP*SMKKKT
SDACNALTDVILEASLNIRTPEPSLGFRYSPKINEKTRHLVFDNIAQGFGFPSIKRDEKN
SDACNDLTDAILEASLNIRTPEPSLSFRYSPKINEKTRHLVFDNIAQGFGFPSIKHDEKN
GRVQRSDRRHTRGILEYQDAGAVSCIQILAQNQRKTRHLVFDNIAQGFGFPSIKHEEKTR
GRVQRLTDAILEASLNIRTPEPSLAFRYSPKINEKTRHLVFDNIAQGFGFPSIKHEEKNT

instead of

Code:

>H8V34IS02HC5PK_rframe3
GRVQRSDRRHTRGILEYQDAGAVSCIQILAQNQRKTRHLVFDNIAQGF
GFPSIKHEEKTRR***TISIFHRTRPPTGRLFFAWRRA*INAGEPRQERKGRGVALQNPWNLLWETVFDYS
LTNIQMGPKQATLRSSKTSRTYGTHCGTGQIGNSLHFRNQGCMP*GQ
>H8V34IS02FU9GO_rframe1
SDACNALTDAILEASLNIRTPEPSLSFRYSPKINEKTRHLVFDNIAQGFGFPSIKHEEKNTKMLIDYFHIPPDEAAHWALVLCMAPGVNKRRGTQKSRTEGGGALCVAKPIELAMSDGF
DYSLTNAQMGLKTGDPTQFKDFEDVWNAFVEQLKFGVALHFRNRDVCRRAEIR
>H8V34IS02FTRDI_rframe3
GRVQRSDRRHTRSILEYQNAGAVPFIQVLTQDQRKTRHLVFDNIAQGFGFPSIKHDEKNT
KMMIDYFNIPPDEAAHWALVLCMAPGVNKRRGTQKSRTEGGGGFCVGKPMELAMGDGFD
YSLTNTQIGPKTGDPTQFNSFEDVWNAFEEQVKFAAALHFRNRDVCRRAEIKY*

Ideally, I would get a FASTA file containing the selected sequences.
Cheers, Marion.

Moderator's Comments:

Please use code tags next time for your code and data. Thanks

---------- Post updated at 04:19 AM ---------- Previous update was at 03:47 AM ----------

Hi
Thank you for the quick reply.
The awk script did not work; the new file is empty.

Any other suggestions?
Cheers, Marion.

Marion MPI

09-25-2014

This one, if a line matches, prints the whole record, from one > up to the next >

Code:

awk -v search="FDCIR|ALSNG" '$1~/^>/ {buf=sep=""; found=0} found==1 {print; next} {buf=buf sep $0; sep=RS} $0~search {print buf; found=1}' file
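
Two things to keep in mind with the one-liner above: the "|" in the search string means either motif, and each line is tested on its own, so a motif that is wrapped across a line break would not be seen. For the case where a record should contain both motifs, a possible sketch follows. It is only an illustration under assumptions: the names demo3.fa, both.fa, m1, m2 and emit are made up, and the motifs are treated as plain substrings rather than regular expressions.

```shell
# Hypothetical FASTA file: seq1 has both motifs, seq2 only one,
# seq3 has both, with ALSNG split across a line break
cat > demo3.fa <<'EOF'
>seq1
AAAFDCIRAAA
CCCALSNGCCC
>seq2
AAAFDCIRAAA
CCCCCCCCCCC
>seq3
GGGALS
NGGFDCIRGG
EOF

awk -v m1="FDCIR" -v m2="ALSNG" '
  function emit() {                            # print the buffered record if both motifs occur
    if (rec != "" && index(seq, m1) && index(seq, m2)) print rec
  }
  /^>/ { emit(); rec = $0; seq = ""; next }    # header: decide on the previous record, start a new one
       { rec = rec "\n" $0; seq = seq $0 }     # keep the original lines, plus a joined copy for matching
  END  { emit() }                              # do not forget the last record
' demo3.fa > both.fa

cat both.fa
```

The record is stored twice on purpose: rec keeps the original line wrapping for output, while seq is the joined sequence used only for the test. Adding a third criterion would mean one more -v variable and one more index() test in emit.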

MadeInGermany

09-25-2014

Perfect, exactly what I needed: Dankeschön!
Cheers, Marion.

Marion MPI
