Text File Manipulation Help


 
# 1  
Old 03-24-2011

Hi, I have two text files, FILE_1 and FILE_2, as shown below:

FILE_1.txt

Code:
CO Contig1 342 12 11 U
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGG*AGAGAAGTCATTTTCTTGTTTAG

BQ
35 35 35 50 50 50 50 50 50 50 50 60 65 65 65 65 65 65 65 65 65 65 65 65 50 

AF GP5UOVN01AOPE0 U 1
AF GP5UOVN01AT8W3 U 1

Details of RD GP5UOVN01AOPE0
QA 1 50 1 50
DS rank=0053942 x=164.0 y=1050.0 length=84

Details of RD GP5UOVN01AT8W3
QA 1 123 1 123
DS rank=0000612 x=227.0 y=1557.0 length=210

CO Contig2 217 2 1 U
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGGAGAGAGTTCCATAGTTTCACTGC

BQ
10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 

AF GP5UOVN01AOVX3 U 100
AF GP5UOVN01ASTEM U 1

Details of GP5UOVN01AOVX3
QA 12 60 12 60
DS rank=0003594 x=166.0 y=1321.5 length=81

Details of GP5UOVN01ASTEM
QA 1 217 1 217
DS rank=0001573 x=211.5 y=332.0 length=217

This file contains two groups of text, each starting with a header line ("CO Contig1 342 12 11 U" and "CO Contig2 217 2 1 U"); a real file can contain many such groups. Under each header, I want to pick one word (a name starting with "GP") from every row that starts with "AF". Let us call these names Contig_Entries. In this case, the Contig_Entries are:

CO Contig1 342 12 11 U
GP5UOVN01AOPE0
GP5UOVN01AT8W3

CO Contig2 217 2 1 U
GP5UOVN01AOVX3
GP5UOVN01ASTEM
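
A listing like the one above could be produced with a single awk pass over FILE_1. This is only a sketch: it assumes every contig header has "CO" as its first field and every read row has "AF" as its first field with the name in the second field. The function name list_contig_entries is made up for illustration.

```shell
# Hypothetical helper: print each "CO" header followed by the read names
# taken from the "AF" rows beneath it.
list_contig_entries() {
    awk '
        $1 == "CO" { if (seen++) print ""; print; next }   # header line; blank line between groups
        $1 == "AF" { print $2 }                            # second field is the GP... name
    ' "$1"
}
```

It would be called as `list_contig_entries FILE_1.txt`.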

The purpose of doing this is to use them to extract data from FILE_2, which looks like:

FILE_2.txt

Code:
>GP5UOVN01AOPE0 rank=0053942 x=164.0 y=1050.0 length=84
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGAGAGAAGTCATTTCTTGTTAGAGT
>GP5UOVN01AT8W3 rank=0000612 x=227.0 y=1557.0 length=210
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGAGAGAAGTCATTTTCTTGTTTAGA
>GP5UOVN01AOVX3 rank=0003594 x=166.0 y=1321.5 length=81
GGGCTGACGTGTAGTCTCAGTGCTCTTACAGTAAAGAGTCCATAGTCTCAGTGCTCTTAC
>GP5UOVN01ASTEM rank=0001573 x=211.5 y=332.0 length=217
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGGAGAGAGTTCCATAGTTTCACTGC

In this file, the first part of each header matches a name extracted from FILE_1. For each entry, the length of the data that follows is given in the header as "length=xx". For each "CO Contig" in FILE_1, I would like to compare the lengths of its Contig_Entries and extract the longest one. The output should have the corresponding "CO Contig" name added to its header. For the sample above, the output file would be:

OUTFILE.txt
Code:
>CO Contig1 GP5UOVN01AT8W3 rank=0000612 x=227.0 y=1557.0 length=210
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGAGAGAAGTCATTTTCTTGTTTAGA
>CO Contig2 GP5UOVN01ASTEM rank=0001573 x=211.5 y=332.0 length=217
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGGAGAGAGTTCCATAGTTTCACTGC
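
The value to compare is already in each FILE_2 header, so it can be read without measuring the data itself. A minimal sketch, assuming every header starts with ">" and carries a length=NN token somewhere after the name (read_lengths is a hypothetical name):

```shell
# Hypothetical helper: print "NAME LENGTH" for every header line in FILE_2.
read_lengths() {
    awk '
        /^>/ {
            id = substr($1, 2)                  # drop the leading ">"
            len = ""
            for (i = 2; i <= NF; i++)           # look for the length=NN token
                if ($i ~ /^length=/) len = substr($i, 8)
            print id, len
        }
    ' "$1"
}
```

It would be called as `read_lengths FILE_2.txt`.
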
Thanks for your help.
# 2  
Old 03-24-2011
Code:
awk '{x=(NR==FNR)?1:0}x&&/^CO Contig/{NF=2;$1=$1;c=$0}x&&/^Det/{g=$NF;C[g]=c}x&&/^DS/{sub(".*"$2,$2,$0);$1=$1;++d;S[g]=$0;split($NF,G,"=");L[g]=G[2];if(m==""||L[g]+0>L[m]+0)m=g;if (!(d%2)){R[m]=C[m]":"m":"S[m]":"L[m];m=c=g=d=z;delete C;delete S;delete G; delete L}}!x&&/^>GP/{g=$1;sub("^.","",g)}!x&&/^[^>]/{if (R[g]) print R[g] RS $0}' f1 f2

Code:
# cat f1
CO Contig1 342 12 11 U
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGG*AGAGAAGTCATTTTCTTGTTTAG

BQ
35 35 35 50 50 50 50 50 50 50 50 60 65 65 65 65 65 65 65 65 65 65 65 65 50

AF GP5UOVN01AOPE0 U 1
AF GP5UOVN01AT8W3 U 1

Details of RD GP5UOVN01AOPE0
QA 1 50 1 50
DS rank=0053942 x=164.0 y=1050.0 length=84

Details of RD GP5UOVN01AT8W3
QA 1 123 1 123
DS rank=0000612 x=227.0 y=1557.0 length=210

CO Contig2 217 2 1 U
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGGAGAGAGTTCCATAGTTTCACTGC

BQ
10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10

AF GP5UOVN01AOVX3 U 100
AF GP5UOVN01ASTEM U 1

Details of GP5UOVN01AOVX3
QA 12 60 12 60
DS rank=0003594 x=166.0 y=1321.5 length=81

Details of GP5UOVN01ASTEM
QA 1 217 1 217
DS rank=0001573 x=211.5 y=332.0 length=217

Code:
# cat f2
>GP5UOVN01AOPE0 rank=0053942 x=164.0 y=1050.0 length=84
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGAGAGAAGTCATTTCTTGTTAGAGT
>GP5UOVN01AT8W3 rank=0000612 x=227.0 y=1557.0 length=210
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGAGAGAAGTCATTTTCTTGTTTAGA
>GP5UOVN01AOVX3 rank=0003594 x=166.0 y=1321.5 length=81
GGGCTGACGTGTAGTCTCAGTGCTCTTACAGTAAAGAGTCCATAGTCTCAGTGCTCTTAC
>GP5UOVN01ASTEM rank=0001573 x=211.5 y=332.0 length=217
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGGAGAGAGTTCCATAGTTTCACTGC

Code:
# awk '{x=(NR==FNR)?1:0}x&&/^CO Contig/{NF=2;$1=$1;c=$0}x&&/^Det/{g=$NF;C[g]=c}x&&/^DS/{sub(".*"$2,$2,$0);$1=$1;++d;S[g]=$0;split($NF,G,"=");L[g]=G[2];if(m==""||L[g]+0>L[m]+0)m=g;if (!(d%2)){R[m]=C[m]":"m":"S[m]":"L[m];m=c=g=d=z;delete C;delete S;delete G; delete L}}!x&&/^>GP/{g=$1;sub("^.","",g)}!x&&/^[^>]/{if (R[g]) print R[g] RS $0}' f1 f2
CO Contig1:GP5UOVN01AT8W3:rank=0000612 x=227.0 y=1557.0 length=210:210
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGAGAGAAGTCATTTTCTTGTTTAGA
CO Contig2:GP5UOVN01ASTEM:rank=0001573 x=211.5 y=332.0 length=217:217
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGGAGAGAGTTCCATAGTTTCACTGC
#

 