Select distinct sequences from fasta file and list


 
# 1  
Old 09-24-2014

Hi
How can I extract sequences from a FASTA file according to a certain criterion? The beginning of my file (which contains more than 1000 sequences in total) looks like this:
Code:
>H8V34IS02I59VP 
SDACNDLTIALLQIAREVRVCNPTFSFRWHPQVKDEVMRECFDCIRQGLG
YPSMRNDPILIANCMNWHGHPLEEARQWVHQACMSPCPSTKHGFQPFRMA
SATANCAKIIEYALSNGYDPVVNMQMGPKIGDARDFT*L*TVIRGLGTTR
WSG
>H8V34IS02IRUQO 
SDACNDLTTAYMQAARLVRVCNPTFSFRYHPQVKDEVMREAFGCIRHGLG
YPNIKNDSVLIPNAMYWHGHPLEEARQWVNQACMSPCPPXKYGCQPNRMA
SAANCAKMIEYTLHTGMIM**TCRVGTEGRVIRAYFKDFGGVLTRYGVKQ
MEWLDVVLIVRFT
>H8V34IS02HTVT3 
SDACNGMTIALMQAARLVRTPNPTFAFRWHPKVKDEVMREIFECIRHGLG
YPAMRNDPILISNAMHWHRHPIEEARTWVHQACMSPCPTTKHGTQPMRMA
HATANCAKIMEYALWNGYDHVVNMQMGPRTGDARKFTDFEQLFDAWVKQX
DGC
>H8V34IS02HI4PS 
SDACNALTDCYLEAALVSRVSDPTFGFRYHSKVRTETLRRVFECIRHGLG
YPSIRNDDVLIPNIMHWFGHPLKEARRWLHQACMAPAPDTKWGAPSLRYP
QASIITGSKAISLAMFDGFDPLTGMQTGIKTGDCSKFETFDEFYDAWYEQ
PKAGFKQATGMEH
>H8V34IS02F9NL0 
SDACNDLTIALLQIAREVRVCNPTFSFRWHPQVKDEVMRECFDCIRQGLG
YPSMRNDPILIANCMNWHGHPLEEARQWVHQACMSPCPSTKHGFQPFRMA
SATANCAKIIEYALSNGYDPVVNMQMGPKIGDAGISDFEQLFEAWVQTDG
VA*WIL

I want to extract the sequences containing the motif FDCIR. Can it be done with grep, or do I need a Perl script?
As a next step: how could I extract sequences that fulfil two or more criteria?

Looking forward to your suggestions.

Cheers, Marion.

Last edited by jim mcnamara; 09-24-2014 at 04:55 PM..
# 2  
Old 09-24-2014
Hi Marion, welcome to the forums. Could you post the expected output as well, please?
# 3  
Old 09-24-2014
Have you tried
Code:
grep "FDCIR" file

?
Multiple matches:
Code:
grep -e "FDCIR" -e "ALSNG" file

or
Code:
grep "\
FDCIR
ALSNG" file

or
Code:
grep -E "FDCIR|ALSNG" file

# 4  
Old 09-24-2014
Code:
#  Try awk (or nawk on Solaris)
awk ' />/ {arr[$0]=""; i=$0; next}
        {arr[i]=arr[i] $0}
        END {for( p in arr) { if ( index(arr[p], "FCDIR")>0 ) {print p, arr[p] }}}
       '   fasta_file > newfile

# 5  
Old 09-25-2014
Hi
Many thanks for the quick reply.
I tried the grep commands, but the output looks like this:
Code:
SDACNDLTDAILEASLNIRTPEPSLSFRYSPKINEKTRHLVFDNIAQGFGFPSIKHEEKN
GRVQRSDRRHTRGILEYQDAGAVSCIQILAQNQRKTRHLVFDNIAQGFGFPSIKHEEKTR
SDACNALTDAILEASLNIRTPEPSLSFRYSPKINEKTRHLVFDNIAQGFGFPSIKHEEKN
GRVQRSDRRHTRSILEYQNAGAVPFIQVLTQDQRKTRHLVFDNIAQGFGFPSIKHDEKNT
GRVQRSDRRHTRSILEYQDAGAVSCVQIFS*NQRKTRHLVFDNIAQGFGFPSIKHEEKTR
GRVQRSDRRHTRGILEYQDAGAVSCIQILAQNQRKTRHLVFDNIAQGFGFPSIKHEEKNT
GRVQRSDRRHTRSILEYQDTGAVPFIQVLTQDQRKTRHLVFDNIAQGFGFPSIKHEEKNT
GRVQRSDRRHTRSILEYQDTGAVPFIQVLTQDQRKTRHLVFDNIAQGFGFPSIKHEEKNH
SDACNDLTDAILEASLNIRTPEPSLAFRYSPKINEKTRHLVFDNIAQGFGFPSIKHEEKN
SDACNDLTDAILEASLNIRTPEPSLSFRYSPKINEKTRHLVFDNIAQGFGFPSIKHEEKN
SDACNDLTDAILEASLNIRTPEPSLSFRYSPKINEKTRHLVFDNIAQGFGFPSIKHEEKN
SDACNALTDAILEASLNIRTPEPSLAFRYSPKINEKTRHLVFDNIAQGFGFPSIKHEEKN
SDACNDLTDAILEASLNIRTPEPSLSFRYSPKINEKTRHLVFDNIAQGFGFPSIKHDEKN
SDACNALTDAILEASLNIRTPEPSLSFRYSPKINEKTRHLVFDNIAQGFGFRP*SMKKKT
SDACNALTDVILEASLNIRTPEPSLGFRYSPKINEKTRHLVFDNIAQGFGFPSIKRDEKN
SDACNDLTDAILEASLNIRTPEPSLSFRYSPKINEKTRHLVFDNIAQGFGFPSIKHDEKN
GRVQRSDRRHTRGILEYQDAGAVSCIQILAQNQRKTRHLVFDNIAQGFGFPSIKHEEKTR
GRVQRLTDAILEASLNIRTPEPSLAFRYSPKINEKTRHLVFDNIAQGFGFPSIKHEEKNT

instead of
Code:
>H8V34IS02HC5PK_rframe3
GRVQRSDRRHTRGILEYQDAGAVSCIQILAQNQRKTRHLVFDNIAQGF
GFPSIKHEEKTRR***TISIFHRTRPPTGRLFFAWRRA*INAGEPRQERKGRGVALQNPWNLLWETVFDYS
LTNIQMGPKQATLRSSKTSRTYGTHCGTGQIGNSLHFRNQGCMP*GQ
>H8V34IS02FU9GO_rframe1
SDACNALTDAILEASLNIRTPEPSLSFRYSPKINEKTRHLVFDNIAQGFGFPSIKHEEKNTKMLIDYFHIPPDEAAHWALVLCMAPGVNKRRGTQKSRTEGGGALCVAKPIELAMSDGF
DYSLTNAQMGLKTGDPTQFKDFEDVWNAFVEQLKFGVALHFRNRDVCRRAEIR
>H8V34IS02FTRDI_rframe3
GRVQRSDRRHTRSILEYQNAGAVPFIQVLTQDQRKTRHLVFDNIAQGFGFPSIKHDEKNT
KMMIDYFNIPPDEAAHWALVLCMAPGVNKRRGTQKSRTEGGGGFCVGKPMELAMGDGFD
YSLTNTQIGPKTGDPTQFNSFEDVWNAFEEQVKFAAALHFRNRDVCRRAEIKY*

Ideally, I would get a FASTA file containing the selected sequences.
Cheers, Marion.
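
A note on why this happens: grep treats every line as a separate record, so it prints only the individual lines that contain the motif and drops the header and the rest of the sequence. One possible way around this, sketched here under some assumptions, is to join each FASTA record onto a single line first and then test the whole sequence. The function name fasta_grep is made up for illustration, headers are assumed not to contain TAB characters, and matching sequences come out unwrapped on one line. A side effect is that a motif split across a line break is also found.

```shell
#!/bin/sh
# fasta_grep is a hypothetical helper name, not a standard tool.
# Usage: fasta_grep 'REGEX' [file ...]   (reads stdin when no file is given)
fasta_grep() {
    pattern=$1
    shift
    # First awk: join each record into "header<TAB>whole sequence" on one line.
    # Second awk: test only the sequence field, then restore the FASTA layout.
    awk '/^>/ { if (NR > 1) printf "\n"; printf "%s\t", $0; next }
         { printf "%s", $0 }
         END { if (NR > 0) printf "\n" }' "$@" |
    awk -F '\t' -v re="$pattern" '$2 ~ re { print $1; print $2 }'
}

# Demo on made-up data: the motif FDCIR is split across two lines in seq1.
printf '>seq1\nAAAFD\nCIRAAA\n>seq2\nGGGGGG\n' | fasta_grep 'FDCIR'
# prints:
# >seq1
# AAAFDCIRAAA
```

On a real file the call would look like fasta_grep 'FDCIR' file > selected.fasta, where selected.fasta is just an example output name.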






Moderator's Comments:
Please use code tags next time for your code and data. Thanks.


---------- Post updated at 04:19 AM ---------- Previous update was at 03:47 AM ----------

Hi
Thank you for the quick reply.
The awk script did not work; the new file is empty.

Any other suggestions?
Cheers, Marion.
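
A likely reason for the empty file: the awk script in post # 4 searches for "FCDIR", while the motif in the question is "FDCIR" (the C and D are swapped), so index() never finds anything. That script would also print the header and the sequence on the same line. A minimal corrected sketch of the same array-based idea follows. The function name fasta_with_motif and the order array are illustrative additions; order is there because the visiting order of for (p in arr) is unspecified in awk.

```shell
#!/bin/sh
# fasta_with_motif is a hypothetical helper name used for illustration.
# Usage: fasta_with_motif MOTIF [file ...]   (MOTIF is a literal string)
fasta_with_motif() {
    motif=$1
    shift
    awk -v motif="$motif" '
        # Header line: start a new record and remember the input order.
        /^>/ { hdr = $0; order[++n] = hdr; seq[hdr] = ""; next }
        # Sequence line: append it to the current record.
        hdr != "" { seq[hdr] = seq[hdr] $0 }
        END {
            for (i = 1; i <= n; i++)
                if (index(seq[order[i]], motif) > 0) {
                    print order[i]        # header on its own line
                    print seq[order[i]]   # joined sequence
                }
        }' "$@"
}

# Demo on made-up data: only record a contains FDCIR.
printf '>a\nSDACFDC\nIRQG\n>b\nSDACFCDIR\n' | fasta_with_motif FDCIR
# prints:
# >a
# SDACFDCIRQG
```

The sketch assumes every header in the file is unique, because the header text is used as the array key.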



# 6  
Old 09-25-2014
This one, if it matches, prints the whole record from one > up to the next >:
Code:
awk -v search="FDCIR|ALSNG" '$1~/^>/ {buf=sep=""; found=0} found==1 {print; next} {buf=buf sep $0; sep=RS} $0~search {print buf; found=1}' file

# 7  
Old 09-25-2014
Perfect, exactly what I needed: Dankeschön!
Cheers, Marion.
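
On the follow-up question about two or more criteria: the search string "FDCIR|ALSNG" in post # 6 is an alternation, so it selects records that contain either motif, and it tests one line at a time. If every motif has to be present in the same sequence, something along these lines might work. This is only a sketch: the function name fasta_all_motifs is hypothetical, and the motifs are assumed to be literal strings separated by spaces.

```shell
#!/bin/sh
# fasta_all_motifs is a hypothetical helper name used for illustration.
# Usage: fasta_all_motifs "MOTIF1 MOTIF2 ..." [file ...]
fasta_all_motifs() {
    motifs=$1
    shift
    awk -v motifs="$motifs" '
        # Print the buffered record only if no motif is missing from it.
        function emit(   i) {
            if (hdr == "") return
            for (i = 1; i <= n; i++)
                if (index(seq, want[i]) == 0) return
            print hdr
            print seq
        }
        BEGIN { n = split(motifs, want, " ") }
        # Header line: finish the previous record, then start a new one.
        /^>/ { emit(); hdr = $0; seq = ""; next }
        # Sequence line: append to the joined sequence.
        { seq = seq $0 }
        END { emit() }' "$@"
}

# Demo on made-up data: s1 has both motifs, s2 has only one.
printf '>s1\nAAFDCIRAA\nBBALSNGBB\n>s2\nAAFDCIRAA\n' | fasta_all_motifs "FDCIR ALSNG"
# prints:
# >s1
# AAFDCIRAABBALSNGBB
```

Because the sequence lines are joined before testing, the motifs may sit on different lines of the record, or even straddle a line break.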




 