Help Parsing Sequence File


 
Thread Tools Search this Thread
Top Forums UNIX for Dummies Questions & Answers Help Parsing Sequence File
# 1  
Old 07-13-2010
Help Parsing Sequence File

Hi Everyone,

I am new in the world of UNIX and Shell scripting.
I am working with a sequence file that looks like this:
HTML Code:
>contig00001  length=128   numreads=2
aTGTGCTGGgTGGGTGCCTGTTgCCccATGCTCCAGTtCAGGATTtCAGGCAttCTCATG
TCCAGCATTTCTATTTAATCCTGCTGCTGGACTTGGGTGGtCTCAGTCtGGGAAGTGAGC
tGTCTGTG
>contig00002  length=142   numreads=1
cttcttaaatctttattctcaggaaaaatacatgtcaaaatacatgttataggACGCGtt
tCGGCAcgtttaaGccttCCTCaaccATACATGATATAtagtCaaaCTaaacaGACTTAT
ATAGCAATATCCTGTGACGTAC
>contig00003  length=144   numreads=3
TGTCTTCTTAAATCTTTATTctCAGtAACAATACATGTTAAAaTACATGTTATAGGACGC
GTTTCGGCCcGTCTAAGCCTTCCTCAACCATACATAAtATATAATCAaCTcaCAGACTTT
TATAGCACgAATCCTGTGACGTAA
>contig00004  length=272   numreads=1
ACACTATCTTAGATaaaTCTcccTAGTCTCCtttGGCtttATATCTGGcccTCATGACCT
TGATAcccACAGTAACCAGTGACAGTACAGCtttCAttttttCAATCCATGGATGAATGT
ATGaaaaGTGtttggacCAcTTCATCTGCTTCTTGataagaTCTCTACGCTGGTACTGAg
gggaTCAGAAgggAGTAAGGAAGctactTCAggagaaggtggaccaggatAAGCAGCAgg
gCAGCTACCATGGAGGaaaatgtaatgtatat
>contig00005  length=153   numreads=1
GGAATCGTACTCACTTGACCAGTGaaaaGAGGCAgggACaacACaaaCTTCACCaaaGTC
CTTGtttAtttGGCtttCTCAACAATACAGCCAGTACTCCAGATCGCAGCCgggTCAAGT
aaaTGCAGCACACACACAGTCCGTGTTGCCTCA
As one can see, length of each sequence (string) is mentioned in the header (length=153 etc.). what I need is a piece of code (Shell/Perl/Python etc.) to select all those sequences with the size greater than a particular length. For example, if I choose the length >= 150, the output file should look like:
HTML Code:
>contig00004  length=272   numreads=1
ACACTATCTTAGATaaaTCTcccTAGTCTCCtttGGCtttATATCTGGcccTCATGACCT
TGATAcccACAGTAACCAGTGACAGTACAGCtttCAttttttCAATCCATGGATGAATGT
ATGaaaaGTGtttggacCAcTTCATCTGCTTCTTGataagaTCTCTACGCTGGTACTGAg
gggaTCAGAAgggAGTAAGGAAGctactTCAggagaaggtggaccaggatAAGCAGCAgg
gCAGCTACCATGGAGGaaaatgtaatgtatat
>contig00005  length=153   numreads=1
GGAATCGTACTCACTTGACCAGTGaaaaGAGGCAgggACaacACaaaCTTCACCaaaGTC
CTTGtttAtttGGCtttCTCAACAATACAGCCAGTACTCCAGATCGCAGCCgggTCAAGT
aaaTGCAGCACACACACAGTCCGTGTTGCCTCA
Once I've this file I would to extract a fixed length of characters from the beginning and end of each of the sequences of the file. For example, if I choose to extract 10 characters from both ends of each sequence it should look like:

HTML Code:
>contig00004  length=272   numreads=1
ACACTATCTTtaatgtatat
>contig00005  length=153   numreads=1
GGAATCGTACTGTTGCCTCA
Thanks in advance for your help.

Last edited by Fahmida; 07-13-2010 at 07:57 AM..
# 2  
Old 07-13-2010
Use gawk, nawk or /usr/xpg4/bin/awk on Solaris:
Code:
awk 'END {
  if (rec && len <= t[2])
    print rec
  }
/^>/ {
  if (len <= t[2]) print rec
  rec = null; split($2, t, "="); t[2]  
    }
{ 
  rec = rec ? rec RS $0 : $0 
  }' len=153 infile

This User Gave Thanks to radoulov For This Post:
# 3  
Old 07-13-2010
Thanks for your prompt reply. It's working fine. Would appreciate if the second part is solved (i.e. parsing the first and last 10 characters of each sequence). Thanks once again.
# 4  
Old 07-13-2010
If I'm not missing something:

Code:
awk 'END {
  if (rec && len <= t[2]) {
    n = split(rec, tt, RS)
    print tt[1]
    print substr(tt[2], 1, limit) substr(tt[n], length(tt[n]) - limit)
    }
  }
/^>/ {
  if (len <= t[2]) {
    n = split(rec, tt, RS)
    print tt[1]
    print substr(tt[2], 1, limit) substr(tt[n], length(tt[n]) - limit)
      }
  rec = null; split($2, t, "="); t[2]  
    }
{ 
  rec = rec ? rec RS $0 : $0 
  }' len=153 limit=10 infile

# 5  
Old 07-13-2010
It's working fine. Just wonder whether it will work for a slightly different sequence header:

HTML Code:
>1639 length 24 cvg_0.0_tip_0
AAAAAAAAAAAAAAAAAAAAACTA
>1641 length 24 cvg_0.0_tip_0
AAAAAAAAAAAAAAAAAAAAACTC
>1630 length 144 cvg_0.0_tip_0
TGTCTTCTTAAATCTTTATTctCAGtAACAATACATGTTAAAaTACATGTTATAGGACGC
GTTTCGGCCcGTCTAAGCCTTCCTCAACCATACATAAtATATAATCAaCTcaCAGACTTT
TATAGCACgAATCCTGTGACGTAA
>1643 length 24 cvg_0.0_tip_0
AAAAAAAAAAAAAAAAAAAAACTT
>1645 length 24 cvg_0.0_tip_0
AAAAAAAAAAAAAAAAAAAAACTG
>1647 length 24 cvg_0.0_tip_0
AGTTTTTTTTTTTTTTTTTTTTTT
>1649 length 24 cvg_0.0_tip_0
AGTTTTTTTTTTTTTTTTTTTTTG
>1655 length 128 cvg_0.0_tip_0
aTGTGCTGGgTGGGTGCCTGTTgCCccATGCTCCAGTtCAGGATTtCAGGCAttCTCATG
TCCAGCATTTCTATTTAATCCTGCTGCTGGACTTGGGTGGtCTCAGTCtGGGAAGTGAGC
tGTCTGTG
>1662 length 142 cvg_0.0_tip_0
cttcttaaatctttattctcaggaaaaatacatgtcaaaatacatgttataggACGCGtt
tCGGCAcgtttaaGccttCCTCaaccATACATGATATAtagtCaaaCTaaacaGACTTAT
ATAGCAATATCCTGTGACGTAC
# 6  
Old 07-13-2010
Code:
awk 'END {
  if (rec && len <= rec_len) {
    n = split(rec, tt, RS)
    print tt[1]
    print substr(tt[2], 1, limit) substr(tt[n], length(tt[n]) - limit)
    }
  }
/^>/ {
  if (len <= rec_len) {
    n = split(rec, tt, RS)
    print tt[1]
    print substr(tt[2], 1, limit) substr(tt[n], length(tt[n]) - limit)
      }
  rec = null; rec_len = $3  
    }
{ 
  rec = rec ? rec RS $0 : $0 
  }' len=<length> limit=<limit> infile

# 7  
Old 07-13-2010
Many Thanks for your reply. However, for input parameters "len=100 limit=20 testseq.txt", I get the following output:
HTML Code:
>1630 length 144 cvg_0.0_tip_0
TGTCTTCTTAAATCTTTATTAGCACgAATCCTGTGACGTAA
>1655 length 128 cvg_0.0_tip_0
aTGTGCTGGgTGGGTGCCTGtGTCTGTG
>1662 length 142 cvg_0.0_tip_0
cttcttaaatctttattctcTAGCAATATCCTGTGACGTAC
Length of the extracted sequences are 41, 28, and 41 respectively, which should be 40 for all of them. The expected output, considering 20 characters each taken from the start and end of each sequence is:

HTML Code:
>1630 length 144 cvg_0.0_tip_0
TGTCTTCTTAAATCTTTATTGCACgAATCCTGTGACGTAA
>1655 length 128 cvg_0.0_tip_0
aTGTGCTGGgTGGGTGCCTGtGGGAAGTGAGCtGTCTGTG
>1662 length 142 cvg_0.0_tip_0
cttcttaaatctttattctcAGCAATATCCTGTGACGTAC
All of them are 040 characters long. Thanks.
 
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Inserting IDs from a text file into a sequence alignment file

Hi, I have one file with one column and several hundred entries File1: NA1 NA2 NA3And now I need to run a command within a mapping aligner tool to insert these sample names into a sequence alignment file (SAM) such that they look like this @RG ID:Library1 SM:NA1 PL:Illumina ... (7 Replies)
Discussion started by: nans
7 Replies

2. Shell Programming and Scripting

To search duplicate sequence in file

Hi, I want to search only duplicate sequence number in file e.g 4757610 4757610 should display only duplicate sequence number in file. file contain is: 4757610 6zE:EXPNL ORDER_PRIORITY='30600022004757610' ORDER_IDENTIFIER='4257771056' MM_ASK_VOLUME='273' MM_ASK_PRICE='1033.0000' m='GBX'... (5 Replies)
Discussion started by: ashfaque
5 Replies

3. Shell Programming and Scripting

Identifying Missing File Sequence

Hi, I have a file which contains few columns and the first column has the file names, and I would like to identify the missing file sequence number form the file and would copy to another file. My files has data in below format. APKRISPSIN320131231201319_0983,1,54,125,... (5 Replies)
Discussion started by: rramkrishnas
5 Replies

4. Shell Programming and Scripting

Get string of sequence from other file

Hi guys, Does anyone know how to get a string of sequence from other file? Should I use awk? Please see below. Thanks! LIST_FILE: >NAME1 >NAME3 >NAME5 >NAME7 >NAME8 SEQ_FILE: >NAME1 LEN75 100100101001010001010 >NAME2 LEN90 111010101010101101101 >NAME3 LEN27 101000101001010010101... (5 Replies)
Discussion started by: narachaid
5 Replies

5. Shell Programming and Scripting

find common entries and match the number with long sequence and cut that sequence in output

Hi all, I have a file like this ID 3BP5L_HUMAN Reviewed; 393 AA. AC Q7L8J4; Q96FI5; Q9BQH8; Q9C0E3; DT 05-FEB-2008, integrated into UniProtKB/Swiss-Prot. DT 05-JUL-2004, sequence version 1. DT 05-SEP-2012, entry version 71. FT COILED 59 140 ... (1 Reply)
Discussion started by: manigrover
1 Replies

6. Shell Programming and Scripting

Parsing a fasta sequence with start and end coordinates

Hi.. I have a seperate chromosome sequences and i wanted to parse some regions of chromosome based on start site and end site.. how can i achieve this? For Example Chr 1 is in following format I need regions from 2 - 10 should give me AATTCCAAA and in a similar way 15- 25 should give... (8 Replies)
Discussion started by: empyrean
8 Replies

7. Shell Programming and Scripting

Adding sequence to the file

How do I add the sequence number to the file? I have a file seperated by commas. appusage,243,jsdgh,798 appusage,876,0989,900 . . appusage,82374,ajfgdh,9284 The output would be as below 1,appusage,243,jsdgh,798 2,appusage,876,0989,900 . . 100,appusage,876,0989,900 (5 Replies)
Discussion started by: smee
5 Replies

8. Shell Programming and Scripting

Renaming a file use another file as a sequence calling a shl

have this shl that will FTP a file from the a directory in windows to UNIX, It get the name of the file stored in this variable $UpLoadFileName then put in the local directory LocalDir="${MPATH}/xxxxx/dat_files" that part seems to be working, but then I need to take that file and rename, I am using... (3 Replies)
Discussion started by: rechever
3 Replies

9. Shell Programming and Scripting

Parsing of file for Report Generation (String parsing and splitting)

Hey guys, I have this file generated by me... i want to create some HTML output from it. The problem is that i am really confused about how do I go about reading the file. The file is in the following format: TID1 Name1 ATime=xx AResult=yyy AExpected=yyy BTime=xx BResult=yyy... (8 Replies)
Discussion started by: umar.shaikh
8 Replies

10. Shell Programming and Scripting

Adding a sequence string to a file

I have a pipe delimited file I need to add a sequence number to in the third field. The record fields will be variable length, so I have to parse for the second pipe. Another requirement is that the sequence number must be unique to all records in the file and subsequent files created, so the... (5 Replies)
Discussion started by: MrPeabody
5 Replies
Login or Register to Ask a Question