Parse and Join in a text file


 
# 1  
Old 05-12-2011

I want to parse a text file and join selected fields into a specific format. Please suggest how to get this done.


Quote:
ID US88811111-0005
OO giensis
OS giensis
SN US74811111
PT I-008, testing for the second phase
PA sandiego group, NC
PI Carozzi; Nadine (Raleigh, NC); Hargiss; Tracy (Cary, NC); Koziel; Michael G. (Raleigh, NC); Duck; Nicholas B. (Apex, NC); Carr; Brian (Raleigh, NC);
PR 20030828 US20030498518P; 20040826 US20040926819; 20070620 US20070765494;
PE US200304985AN 20070765494
P1 Compositions and methods and seeds are provided.
QAISRLEGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSEQLINQRIEEFARNQAISRLEGLSNLYVTIHEIENNTDEL KFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYT
//
ID US74811111-0005
OO giensis
OS giensis
SN US74811111
PT I-003, a gene and methods for its use
PA NIX CORPORATION RESEARCH TRIANGLE PARK, NC
PI Carozzi; Nadine (Raleigh, NC); Hargiss; Tracy (Cary, NC); Koziel; Michael G. (Raleigh, NC); Duck; Nicholas B. (Apex, NC); Carr; Brian (Raleigh, NC);
PR 20030828 US20030498518P; 20040826 US20040926819; 20070620 US20070765494;
PE US200304985AN 20070765494
P1 Compositions and methods and seeds are provided.
MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIEQLINQRIEEFARNQAISRL EGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRENPCEFNRGYRDYTPLP VGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE
//

The output should be in FASTA format: a header line built from the ID, PT and PA lines, followed by the sequence. The two slashes ("//") are the dividing lines between two different sequences.

Quote:
>US88811111-0005 ; I-008, testing for the second phase ; sandiego group, NC
QAISRLEGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSEQLINQRIEEFARNQAISRLEGLSNLYVTIHEIENNTDEL KFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYT

>US74811111-0005 ; I-003, a gene and methods for its use ; NIX CORPORATION RESEARCH TRIANGLE PARK, NC
MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIEQLINQRIEEFARNQAISRL EGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRENPCEFNRGYRDYTPLP VGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE
I have 50,000 records like this in a single file, and all of them should be converted to FASTA format.
# 2  
Old 05-12-2011
Dirty, but it works, at least with the given sample ^^
Code:
~/unix.com$ awk '/^(ID|PT|PA)/{y=$1;$1="";x[y]=$0;next}{p=z;z=$0}/^\/\//{print x["ID"]" ;"x["PT"]" ;"x["PA"]"\n"p}' infile

# 3  
Old 05-12-2011
A lazy one:
Code:
nawk '/^ID|^P[AT]/{sub(".*"$2,$2);print}' yourfile | paste -d ';' - - -

# 4  
Old 05-12-2011
Thank you for the replies, but the above code is not printing the sequences.

Is there any Perl solution?
# 5  
Old 05-12-2011
Not a perl solution, but what about printing sequences?
Code:
~/unix.com$ awk '/^(ID|PT|PA)/{y=$1;$1="";x[y]=$0;next}{p=z;z=$0}/^\/\//{print x["ID"]" ;"x["PT"]" ;"x["PA"]"\n"p}' infile 
 US88811111-0005 ; I-008, testing for the second phase ; sandiego group, NC
QAISRLEGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSEQLINQRIEEFARNQAISRLEGLSNLYVTIHEIENNTDEL KFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYT
 US74811111-0005 ; I-003, a gene and methods for its use ; NIX CORPORATION RESEARCH TRIANGLE PARK, NC
MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIEQLINQRIEEFARNQAISRL EGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRENPCEFNRGYRDYTPLP VGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE

# 6  
Old 05-12-2011
Give this a try:
Code:
nawk '{x=y;y=$0}/^ID/{if (w) print w;sub(".*"$2,">"$2);w=$0}/^P[AT]/{sub(".*"$2,$2);w=w?w" ; "$0:$0}/^\/\//{print w"\n" x;w=z}' infile

or
Code:
nawk '{x=y;y=$0}/^ID|^P[AT]/{sub(".*"$2,$2);v=$0;w=w?w" ; "v:">"v}/^\/\//{print w"\n" x;w=z}' infile

Code:
# cat tst
ID US88811111-0005
OO giensis
OS giensis
SN US74811111
PT I-008, testing for the second phase
PA sandiego group, NC
PI Carozzi; Nadine (Raleigh, NC); Hargiss; Tracy (Cary, NC); Koziel; Michael G. (Raleigh, NC); Duck; Nicholas B. (Apex, NC); Carr; Brian (Raleigh, NC);
PR 20030828 US20030498518P; 20040826 US20040926819; 20070620 US20070765494;
PE US200304985AN 20070765494
P1 Compositions and methods and seeds are provided.
QAISRLEGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSEQLINQRIEEFARNQAISRLEGLSNLYVTIHEIENNTDEL KFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYT
//
ID US74811111-0005
OO giensis
OS giensis
SN US74811111
PT I-003, a gene and methods for its use
PA NIX CORPORATION RESEARCH TRIANGLE PARK, NC
PI Carozzi; Nadine (Raleigh, NC); Hargiss; Tracy (Cary, NC); Koziel; Michael G. (Raleigh, NC); Duck; Nicholas B. (Apex, NC); Carr; Brian (Raleigh, NC);
PR 20030828 US20030498518P; 20040826 US20040926819; 20070620 US20070765494;
PE US200304985AN 20070765494
P1 Compositions and methods and seeds are provided.
MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIEQLINQRIEEFARNQAISRL EGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRENPCEFNRGYRDYTPLP VGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE
//

Code:
# nawk '{x=y;y=$0}/^ID/{if (w) print w;sub(".*"$2,">"$2);w=$0}/^P[AT]/{sub(".*"$2,$2);w=w?w" ; "$0:$0}/^\/\//{print w"\n" x ;w=z}' tst
>US88811111-0005 ; I-008, testing for the second phase ; sandiego group, NC
QAISRLEGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSEQLINQRIEEFARNQAISRLEGLSNLYVTIHEIENNTDEL KFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYT
>US74811111-0005 ; I-003, a gene and methods for its use ; NIX CORPORATION RESEARCH TRIANGLE PARK, NC
MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIEQLINQRIEEFARNQAISRL EGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRENPCEFNRGYRDYTPLP VGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE

Code:
# nawk '{x=y;y=$0}/^ID|^P[AT]/{sub(".*"$2,$2);v=$0;w=w?w" ; "v:">"v}/^\/\//{print w"\n"x;w=z}' tst
>US88811111-0005 ; I-008, testing for the second phase ; sandiego group, NC
QAISRLEGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSEQLINQRIEEFARNQAISRLEGLSNLYVTIHEIENNTDEL KFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYT
>US74811111-0005 ; I-003, a gene and methods for its use ; NIX CORPORATION RESEARCH TRIANGLE PARK, NC
MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIEQLINQRIEEFARNQAISRL EGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRENPCEFNRGYRDYTPLP VGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE
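
For readability, the same idea can be unrolled into a commented script. This is only a sketch, not a replacement for the one-liners: the function name to_fasta_single is made up for illustration, and it assumes, as in the sample, that each record holds exactly one sequence line directly before the "//" line and that ID, PT and PA appear in that order.

```shell
# Hypothetical helper; assumes one sequence line directly before each "//".
to_fasta_single() {
  awk '
    # Remember the previous line so that, at "//", prev holds the sequence.
    { prev = cur; cur = $0 }

    # Header fields: strip the two-letter tag and the blanks that follow it,
    # then append the value to the header in the order the lines arrive.
    /^(ID|PT|PA)[ \t]/ {
      val = $0
      sub(/^[A-Z][A-Z][ \t]+/, "", val)
      hdr = (hdr == "") ? ">" val : hdr " ; " val
    }

    # Record separator: print header and sequence, then reset the header.
    /^\/\// {
      print hdr
      print prev
      hdr = ""
    }
  ' "$@"
}
```

It would be called as: to_fasta_single infile > outfile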


Last edited by ctsgnb; 05-12-2011 at 06:38 PM..
# 7  
Old 05-12-2011
Quote:
It works well if the sequence is on one line, but it is not working for the input below: it prints only the last line of the sequence. Any edits?

Code:
ID   013789-0068
PS   TBD
OO   huringiensis
OS   ringiensis
OX
SI   68
RA
RL   2010. OKAYAMA UNIVERSITY,JAPAN LAMB CO LTD
FT   source          1..1176
MT
AC   67106
SV
CT
PN   013789
PT   PROTEIN PRODUCTION METHOD, FUSION PROTEIN, AND ANTISERUM
PA   AMA UNIVERSITY,JAPAN LAMB CO LTD.
PI   HAYAKAWA TORU (JP) SAKAI, HIROSHI, HAYAKAWA, TORU
P8
P4   10013789
P5   0
PC   International Classification: \nUS Classification: \nEuropean Classification: C12N15/62; C07K14/47A25
PR   80199166;
PE   199166
AN   09JP63603
KC   1
P1   ng the DNA into a host bacterium to transform the host bacterium; and (c) causing the exprey further comprise a step of removing the peptide chain (B) from the fusion protein. \n \n
P7
P9   112
PO
PM   10013789;
PB   10013789
PQ   10013789;
EM   esentative
W1   PRT
D1   0204
D2   0217
D3   0730
D4   0801
D5   0204
HL   [L[P9_GQ;0;3,WO2010013789,45,67]] [L[PM_PN_GQNUC;0;12,WO2010013789]] [L[PQ_PN_GQNUC;0;12,WO2010013789]]
CC   mer C1-1-f FH   Key             Location/Qualifiers Copyright (c)Inc. 2011
LS   Application
L2   Publ. Of int. appl. w4

  MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQI
  EQLINQRIEEFARNQAISRLEGLSNLYQIYAESFREWEADPTNPALREEMRIQFNDMNSALTTAIPLLAVQNYQVPLLSV
  YVQAANLHLSVLRDVSVFGQRWGFDAATINSRYNDLTRLIGNYTDYAVRWYNTGLERVWGPDSRDWVRYNQFRRELTLTV
  LDIVALFSNYDSRRYPIRTVSQLTREIYTNPVLENFDGSFRGMAQRIEQNIRQPHLMDILNSITIYTDVHRGFNYWSGHQ
  ITASPVGFSGPEFAFPLFGNAGNAAPPVLVSLTGLGIFRTLSSPLYRRIILGSGPNNQELFVLDGTEFSFASLTTNLPST
  IYRQRGTVDSLDVIPPQDNSVPPRAGFSHRLSHVTMLSQAAGAVYTLRAPTFSWQHRSAEFNNIIPSSQITQIPLTKSTN
  LGSGTSVVKGPGFTGGDILRRTSPGQISTLRVNITAPLSQRYRVRIRYASTTNLQFHTSIDGRPINQGNFSATMSSGSNL
  QSGSFRTVGFTTPFNFSNGSSVFTLSAHVFNSGNEVYIDRIEFVPAEVTFEAEYDLERAQKAVNELFTSSNQIGLKTDVT


Last edited by Scott; 05-13-2011 at 08:43 PM.. Reason: Added code tags
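
One possible way to handle a sequence that spans several lines is to collect every line that is not a tagged line, instead of keeping only the line before "//". The following is a sketch under assumptions, not a tested solution for the full 50,000-record file: the function name to_fasta_multi is made up, a tagged line is assumed to start with a two-character code followed by a blank or end of line, records are assumed to end with "//" (or end of file), and blanks inside the sequence are removed so that it is printed as one line. It needs an awk with user-defined functions (nawk, gawk or mawk).

```shell
# Hypothetical helper; assumes tagged lines look like "XX value" or a bare "XX".
to_fasta_multi() {
  awk '
    # Print the finished record, if it has an ID, and reset the state.
    function emit() {
      if (id != "") {
        print ">" id " ; " pt " ; " pa
        print seq
      }
      id = pt = pa = seq = ""
    }

    # "//" closes a record.
    /^\/\// { emit(); next }

    # Skip empty lines, such as the one before the sequence.
    /^[ \t]*$/ { next }

    # Tagged line: keep the value only for ID, PT and PA.
    /^[A-Z][A-Z0-9]([ \t]|$)/ {
      tag = substr($0, 1, 2)
      val = substr($0, 3)
      sub(/^[ \t]+/, "", val)
      if (tag == "ID") id = val
      else if (tag == "PT") pt = val
      else if (tag == "PA") pa = val
      next
    }

    # Anything else is treated as sequence: drop blanks and append.
    { gsub(/[ \t]/, ""); seq = seq $0 }

    # Handle a last record that is not followed by "//".
    END { emit() }
  ' "$@"
}
```

It would be called as: to_fasta_multi infile > outfile. To keep the blanks inside the sequence, as in the expected output of the first post, the gsub call can be dropped, at the cost of also keeping the leading indentation.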