Parse and Join in a text file


 
# 8  
Old 05-12-2011
Code:
nawk 'NF==1&&(length($1)>3){print $0;next}{x=y;y=$0}/^ID|^P[AT]/{sub(".*"$2,$2);v=$0;w=w?w" \; "v:">"v}/^\/\//{print w"\n"x;w=z}' tst

# 9  
Old 05-13-2011
Any help?

---------- Post updated at 11:29 AM ---------- Previous update was at 10:16 AM ----------

Can anyone explain how the logic of the code below works? I need to either modify it or find a new solution.
Quote:
nawk 'NF==1&&(length($1)>3){print $0;next}{x=y;y=$0}/^ID|^P[AT]/{sub(".*"$2,$2);v=$0;w=w?w" \; "v:">"v}/^\/\//{print w"\n"x;w=z}' tst
---------- Post updated at 05:38 PM ---------- Previous update was at 11:29 AM ----------

I am fairly new to awk. The above code runs fine, but it does not print the ID numbers. Can someone correct it for this input?

Code:
ID   WO2010013789-0068
PS   TBD
OO   Bacillus thuringiensis
OS   Bacillus thuringiensis
OX   
SI   68
RA  
RL   Patent: WO 2010013789-A/68 04-FEB-2010. OKAYAMA UNIVERSITY,JAPAN LAMB CO LTD
FT   source          1..1176
MT  
AC   DD967106
SV  
CT  
PN   WO2010013789
PT   PROTEIN PRODUCTION METHOD, FUSION PROTEIN, AND ANTISERUM
PA   OKAYAMA UNIVERSITY,JAPAN LAMB CO LTD.
PI   SAKAI HIROSHI (JP); HAYAKAWA TORU (JP) SAKAI, HIROSHI, HAYAKAWA, TORU
P8  
P4   WO2010013789
P5   0
PC   International Classification: \nUS Classification: \nEuropean Classification: C12N15/62; C07K14/47A25
PR   20080801 JP20080199166;
PE   JP20080199166
AN   WO2009JP63603
KC   A1
P1   Disclosed is a high-throues the following steps (a) to (c): (a) producing DNA encoding a fusion protein which is composed of a peptide chain constituting the protein (A) and a C-terminal peptide chain of Cry protein produced by Bacillus thuringiensis or a fragment of the C-terminal peptep of removing the peptide chain (B) from the fusion protein. \n \n 
P7   
P9   112
PO   
PM   WO2010013789; 
PB   WO2010013789
PQ   WO2010013789; 
EM   representative
W1   GQPAT.PRT
D1   20100204
D2   20100217
D3   20090730
D4   20080801
D5   20100204
HL   [L[PN_PN_WIPOPAT;0;12,WO2010013789,A1]] [L[P9_GQ;0;3,WO2010013789,45,67]] [L[PM_PN_GQNUC;0;12,WO2010013789]] [L[PQ_PN_GQNUC;0;12,WO2010013789]]
CC   OS   Artificial CC   Primer C1-1-f FH   Key             Location/Qualifiers Copyright (c) GenomeQuest, Inc. 2011
LS   Application
L2   Publ. Of int. appl. with int. search rep.; 2010-02-04

  MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQI
  EQLINQRIEEFARNQAISRLEGLSNLYQIYAESFREWEADPTNPALREEMRIQFNDMNSALTTAIPLLAVQNYQVPLLSV
  YVQAANLHLSVLRDVSVFGQRWGFDAATINSRYNDLTRLIGNYTDYAVRWYNTGLERVWGPDSRDWVRYNQFRRELTLTV
  LDIVALFSNYDSRRYPIRTVSQLTREIYTNPVLENFDGSFRGMAQRIEQNIRQPHLMDILNSITIYTDVHRGFNYWSGHQ
  ITASPVGFSGPEFAFPLFGNAGNAAPPVLVSLTGLGIFRTLSSPLYRRIILGSGPNNQELFVLDGTEFSFASLTTNLPST
  IYRQRGTVDSLDVIPPQDNSVPPRAGFSHRLSHVTMLSQAAGAVYTLRAPTFSWQHRSAEFNNIIPSSQITQIPLTKSTN
  LGSGTSVVKGPGFTGGDILRRTSPGQISTLRVNITAPLSQRYRVRIRYASTTNLQFHTSIDGRPINQGNFSATMSSGSNL
  QSGSFRTVGFTTPFNFSNGSSVFTLSAHVFNSGNEVYIDRIEFVPAEVTFEAEYDLERAQKAVNELFTSSNQIGLKTDVT
  DYHIDQVSNLVECLSDEFCLDEKQELSEKVKHAKRLSDERNLLQDPNFRGINRQLDRGWRGSTDITIQGGDDVFKENYVT
  LLGTFDECYPTYLYQKIDESKLKAYTRYQLRGYIEDSQDLEIYLIRYNAKHETVNVPGTGSLWPLSAQSPIGKCGEPNRC

//
ID  US7482432-0005
PS  disclosure
OO  Bacillus thuringiensis
OS  Bacillus thuringiensis
OX  
SI  5
RA  
RL  
FT  
MT  
AC  
SV  
CT  \n That which is claimed: \n 1. An isolated polypheterologous \n  amino acid sequences. \n 3. A composition comprising the polypeptide of claim <centrifugation, sedimentation, or concentration of a \n  culture of <I>ling a Lepidopteran or Coleopteran pest, \n  comprising contacting said pest with, or feeding to said pest, a \n  pesticidally-effective amount of the polypeptide of claim <B>1</B>. \n 
PN  US7482432
PT  AXMI-003, a delta-endotoxin gene and methods for its use
PA  ATHENIX CORPORATION RESEARCH TRIANGLE PARK, NC 
PI  Carozzi; Nadine (Raleigh, NC); Hargiss; Tracy (Cary, NC); Koziel; Michael G. (Raleigh, NC); Duck; Nicholas B. (Apex, NC); Carr; Brian (Raleigh, NC);
P8  
P4  
P5  0
PC  International Classification: C07K 014/00; C07H 021/04;\nUS Classification: 530/350; 536/023.71; 435/320.1; 435/069.1;\nEuropean Classification: C07H21/04; C07K14/325
PR  20030828 US20030498518P; 20040826 US20040926819; 20070620 US20070765494;
PE  US20030498518P
AN  20070765494
KC  B2
P1  Compositions and methods for conferring pesticidal activity to bacteria, plants, plant cells, tissues and seeds are provided. Compositions comprising a coding sequence for a delta-endotoxin polypeptide are provided. The coding sequences can be used in DNA constructs or expression cassettes for transformation and expression in plants and bacteria. Compositions also comprise transformed bacteria, plants, plant cells, tissues, and seeds. In particular, isolated delta-endotoxin nucleic acid molecules are provided. w Pat. No. 7,253,343. Provisional application No. 60/498,518, filed on 2003/08/28. Previously published as US 2007/0238646 A1, 2007/10/11.
P7  8
P9  31
PO  1-4; 
PM  US20050049410; US20070238646; US20070240239; US7253343; US7482432; WO2005021585; 
PB  US20050049410
PQ  US20050049410; US20070238646; US7253343; US7482432; WO2005021585; 
EM  member
W1  GQPAT.PRT
D1  20090127
D2  20090206
D3  20070620
D4  20030828
D5  20071011
HL  [L[PN_PN_USPTOPAT;0;9,US7482432,B2]] [L[P9_GQ;0;2,US7482432,2,29]] [L[PO_GQ;0;5,US7482432]] [L[PM_PN_GQNUC;0;13,US20050049410]] [L[PM_PN_GQNUC;15;13,US20070238646]] [L[PM_PN_GQNUC;45;9,US7253343]] [L[PM_PN_GQNUC;56;9,US7482432]] [L[PM_PN_GQNUC;67;12,WO2005021585]] [L[PQ_PN_GQNUC;0;13,US20050049410]] [L[PQ_PN_GQNUC;15;13,US20070238646]] [L[PQ_PN_GQNUC;30;9,US7253343]] [L[PQ_PN_GQNUC;41;9,US7482432]] [L[PQ_PN_GQNUC;52;12,WO2005021585]]
CC  Copyright (c) GenomeQuest, Inc. 2011
LS  Granted
L2  Granted patent as second publication / Reexam. certif. 2nd reexam.; 2009-01-27

  MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQI
  EQLINQRIEEFARNQAISRLEGLSNLYQIYAESFREWEADPTNPALREEMRIQFNDMNSALTTAIPLLAVQNYQVPLLSV
  YVQAANLHLSVLRDVSVFGQRWGFDAATINSRYNDLTRLIGNYTDYAVRWYNTGLERVWGPDSRDWVRYNQFRRELTLTV
  LDIVALFSNYDSRRYPIRTVSQLTREIYTNPVLENFDGSFRGMAQRIEQNIRQPHLMDILNSITIYTDVHRGFNYWSGHQ
  ITASPVGFSGPEFAFPLFGNAGNAAPPVLVSLTGLGIFRTLSSPLYRRIILGSGPNNQELFVLDGTEFSFASLTTNLPST
  IYRQRGTVDSLDVIPPQDNSVPPRAGFSHRLSHVTMLSQAAGAVYTLRAPTFSWQHRSAEFNNIIPSSQITQIPLTKSTN
  LGSGTSVVKGPGFTGGDILRRTSPGQISTLRVNITAPLSQRYRVRIRYASTTNLQFHTSIDGRPINQGNFSATMSSGSNL
  QSGSFRTVGFTTPFNFSNGSSVFTLSAHVFNSGNEVYIDRIEFVPAEVTFEAEYDLERAQKAVNELFTSSNQIGLKTDVT
  DYHIDQVSNLVECLSDEFCLDEKQELSEKVKHAKRLSDERNLLQDPNFRGINRQLDRGWRGSTDITIQGGDDVFKENYVT
  LLGTFDECYPTYLYQKIDESKLKAYTRYQLRGYIEDSQDLEIYLIRYNAKHETVNVPGTGSLWPLSAQSPIGKCGEPNRC
  APHLEWNPDLDCSCRDGEKCAHHSHHFSLDIDVGCTDLNEDLGVWVIFKIKTQDGHARLGNLEFLEEKPLVGEALARVKR
//

---------- Post updated at 07:11 PM ---------- Previous update was at 05:38 PM ----------

Anyone please help...
# 10  
Old 05-14-2011
Code:
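#!/bin/sh

# A real newline, used to join sequence lines without relying on echo
# expanding "\n" (dash does that, bash's echo does not by default).
nl='
'

# Squeeze the first run of spaces so cut sees one delimiter between tag and text.
# read -r keeps backslashes in the data (e.g. the literal \n in some fields).
sed -r 's/\ +/\ /' "$1" | while read -r l; do
    f1=$(printf '%s\n' "$l" | cut -d\  -f1)     # tag, or a whole sequence line
    f2=$(printf '%s\n' "$l" | cut -d\  -f2-)    # text after the tag
    case "$f1" in
        'ID')
            s1=">$f2"                # start a new header
        ;;
        'PA' | 'PT')
            s1="$s1 ; $f2"           # append to the header
        ;;
        '//')
            printf '%s\n' "$s1$s2"   # end of record: header, then sequence
            s1=; s2=
        ;;
    esac
    # Tags are two characters long; anything longer than 3 is a sequence line.
    [ "${#f1}" -gt 3 ] && s2="$s2$nl$f1"
done

exit 0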

Try this; it should also be easier to follow.
Usage: sh scriptname infile > outfile

Last edited by tukuyomi; 05-14-2011 at 11:43 AM.. Reason: Added -r option to sed & '>' sign before ID
# 11  
Old 05-14-2011
Code:
awk '{n=NF}/^ID|^P[AT]/{sub(".*"$2,$2);v=$0;w=w?w" \; "v:">"v}n==1&&(length($1)>3){a[c++]=$0}/^\/\//{print w;w=z;for(i in a){print a[i];delete a[i]}}' tst
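
A note on the array version above: POSIX leaves the traversal order of for (i in a) unspecified, so the buffered sequence lines are not guaranteed to come out in input order on every awk. Below is a minimal sketch of the same buffering with a counted loop instead. It only covers the sequence part, not the header; the names seq and cnt are mine, and the input is cut from the first sample record.

```shell
#!/bin/sh
# Sketch: buffer sequence lines per record and print them with a counted
# loop, so the order does not depend on how awk walks "for (i in a)".

out=$(awk '
# Sequence lines: a single field longer than 3 characters.
NF == 1 && length($1) > 3 { seq[++cnt] = $0 }

# Record terminator "//": flush the buffer in arrival order.
/^\/\// {
    for (i = 1; i <= cnt; i++) print seq[i]
    cnt = 0                      # old slots are simply overwritten next time
}
' <<'EOF'
ID   WO2010013789-0068

  MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQI
  EQLINQRIEEFARNQAISRLEGLSNLYQIYAESFREWEADPTNPALREEMRIQFNDMNSALTTAIPLLAVQNYQVPLLSV
  YVQAANLHLSVLRDVSVFGQRWGFDAATINSRYNDLTRLIGNYTDYAVRWYNTGLERVWGPDSRDWVRYNQFRRELTLTV

//
EOF
)

printf '%s\n' "$out"
```

Indexing from 1 to cnt makes the order explicit, which is the same effect the string-accumulating version gets by appending each line as it arrives.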

---------- Post updated at 03:01 PM ---------- Previous update was at 02:54 PM ----------

Code:
awk '{n=NF}/^ID|^P[AT]/{sub(".*"$2,$2);v=$0;w=w?w" \; "v:">"v}n==1&&(length($1)>3){g=$0;l=l?l"\n"g:g}/^\/\//{print w"\n"l;w=l=z}' tst

---------- Post updated at 03:04 PM ---------- Previous update was at 03:01 PM ----------

Code:
awk '{n=NF}/^ID|^P[AT]/{sub(".*"$2,$2);v=$0;w=w?w" \; "v:">"v}n==1&&(length($1)>3){if(w){print w;w=z};print}' tst
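
Since the logic of these one-liners was asked about earlier in the thread, here is the last one unpacked into a commented script. Treat it as a sketch rather than the poster's exact code, because it differs in two places. First, it strips the tag with an anchored pattern instead of sub(".*"$2,$2). That dynamic pattern is greedy, so on the sample title "PROTEIN PRODUCTION METHOD, FUSION PROTEIN, AND ANTISERUM" it would match up to the second PROTEIN and leave only "PROTEIN, AND ANTISERUM". Second, the separator is written as " ; " because \; is not a defined escape in awk strings and implementations handle it differently. The temporary file and the name header are my own; the input is abridged from the first sample record.

```shell
#!/bin/sh
# Sketch: multi-line equivalent of the last one-liner, with safer tag stripping.

tmp=$(mktemp)                      # hypothetical scratch file for the sample
cat > "$tmp" <<'EOF'
ID   WO2010013789-0068
PS   TBD
PT   PROTEIN PRODUCTION METHOD, FUSION PROTEIN, AND ANTISERUM
PA   OKAYAMA UNIVERSITY,JAPAN LAMB CO LTD.

  MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQI
  EQLINQRIEEFARNQAISRLEGLSNLYQIYAESFREWEADPTNPALREEMRIQFNDMNSALTTAIPLLAVQNYQVPLLSV

//
EOF

out=$(awk '
# ID, PA and PT lines feed the header; everything after the tag is kept.
/^(ID|PA|PT)[ \t]/ {
    sub(/^[A-Z][A-Z][ \t]+/, "")          # remove the tag and its padding only
    header = (header == "") ? (">" $0) : (header " ; " $0)
    next                                  # never treat a header line as sequence
}

# Sequence lines: exactly one field, longer than 3 characters.
NF == 1 && length($1) > 3 {
    if (header != "") {                   # first sequence line of the record:
        print header                      # emit the header once,
        header = ""                       # then reset it for the next record
    }
    print
}
' "$tmp")
rm -f "$tmp"

printf '%s\n' "$out"
```

The one-liner stores NF in n before calling sub() because sub() rewrites $0 and awk then re-splits the fields; using next in the header rule sidesteps the same problem here.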
