Select distinct sequences from fasta file and list Post: 302918635

Post 302918635 by Marion MPI on Wednesday 24th of September 2014 01:44:05 PM

09-24-2014

Select distinct sequences from fasta file and list

Hi,
How can I extract sequences from a FASTA file according to a given criterion? The beginning of my file (which contains more than 1000 sequences in total) looks like this:

Code:

>H8V34IS02I59VP 
SDACNDLTIALLQIAREVRVCNPTFSFRWHPQVKDEVMRECFDCIRQGLG
YPSMRNDPILIANCMNWHGHPLEEARQWVHQACMSPCPSTKHGFQPFRMA
SATANCAKIIEYALSNGYDPVVNMQMGPKIGDARDFT*L*TVIRGLGTTR
WSG
>H8V34IS02IRUQO 
SDACNDLTTAYMQAARLVRVCNPTFSFRYHPQVKDEVMREAFGCIRHGLG
YPNIKNDSVLIPNAMYWHGHPLEEARQWVNQACMSPCPPXKYGCQPNRMA
SAANCAKMIEYTLHTGMIM**TCRVGTEGRVIRAYFKDFGGVLTRYGVKQ
MEWLDVVLIVRFT
>H8V34IS02HTVT3 
SDACNGMTIALMQAARLVRTPNPTFAFRWHPKVKDEVMREIFECIRHGLG
YPAMRNDPILISNAMHWHRHPIEEARTWVHQACMSPCPTTKHGTQPMRMA
HATANCAKIMEYALWNGYDHVVNMQMGPRTGDARKFTDFEQLFDAWVKQX
DGC
>H8V34IS02HI4PS 
SDACNALTDCYLEAALVSRVSDPTFGFRYHSKVRTETLRRVFECIRHGLG
YPSIRNDDVLIPNIMHWFGHPLKEARRWLHQACMAPAPDTKWGAPSLRYP
QASIITGSKAISLAMFDGFDPLTGMQTGIKTGDCSKFETFDEFYDAWYEQ
PKAGFKQATGMEH
>H8V34IS02F9NL0 
SDACNDLTIALLQIAREVRVCNPTFSFRWHPQVKDEVMRECFDCIRQGLG
YPSMRNDPILIANCMNWHGHPLEEARQWVHQACMSPCPSTKHGFQPFRMA
SATANCAKIIEYALSNGYDPVVNMQMGPKIGDAGISDFEQLFEAWVQTDG
VA*WIL

I want to extract the sequences containing the motif FDCIR. Can it be done with grep, or do I need a Perl script?
As a next step: how could I extract sequences that fulfil two or more criteria?

Looking forward to your suggestions.

Cheers, Marion.
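
One possible approach, sketched in Python rather than grep or Perl: plain grep works line by line, so on a wrapped FASTA file it prints only the matching sequence line without its header, and it misses a motif that happens to be split across two lines. The sketch below joins the wrapped lines of each record first and then tests the whole sequence. The function names `read_fasta` and `select`, the `require_all` switch, and the second motif `GLGYPS` are illustrative assumptions, not part of the original question; the sample data is the first two records from the post.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# First two records from the post, used only as demo input.
SAMPLE = """\
>H8V34IS02I59VP
SDACNDLTIALLQIAREVRVCNPTFSFRWHPQVKDEVMRECFDCIRQGLG
YPSMRNDPILIANCMNWHGHPLEEARQWVHQACMSPCPSTKHGFQPFRMA
SATANCAKIIEYALSNGYDPVVNMQMGPKIGDARDFT*L*TVIRGLGTTR
WSG
>H8V34IS02IRUQO
SDACNDLTTAYMQAARLVRVCNPTFSFRYHPQVKDEVMREAFGCIRHGLG
YPNIKNDSVLIPNAMYWHGHPLEEARQWVNQACMSPCPPXKYGCQPNRMA
SAANCAKMIEYTLHTGMIM**TCRVGTEGRVIRAYFKDFGGVLTRYGVKQ
MEWLDVVLIVRFT
"""


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select(records, patterns, require_all=True):
    """Keep records whose sequence matches all (or any) of the regex patterns."""
    regexes = [re.compile(p) for p in patterns]
    combine = all if require_all else any
    return [(h, s) for h, s in records if combine(r.search(s) for r in regexes)]


if __name__ == "__main__":
    records = list(read_fasta(SAMPLE.splitlines()))
    # Single criterion: the motif FDCIR.
    for header, seq in select(records, ["FDCIR"]):
        print(">" + header)
        print(seq)
    # Two criteria: FDCIR and a second, hypothetical motif.
    for header, seq in select(records, ["FDCIR", "GLGYPS"]):
        print(">" + header)
        print(seq)
```

For a real file, `read_fasta(open("my_file.fasta"))` (file name assumed) would replace `SAMPLE.splitlines()`. Because each pattern is a regular expression, a literal `*` in a sequence has to be written as `\*`.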

