parsing data and incorporating it into another file


 
# 1  
Old 07-07-2010

Hi All

I have two files:

file 1

Code:
>AB_1
MLKKPIIIGVTGGSGGGKTSVSRAILDSFPNARIAMIQHDSYYKDQSHMSFEERVKTNYDHPLAFDTDFM
IQQLKELLAGRPVDIPIYDYKKHTRSNTTFRQDPQDVIIVEGILVLEDERLRDLMDIKLFVDTDDDIRII
RRIKRDMMERGRSLESIIDQYTSVVKPMYHQFIEPSKRYADIVIPEGVSNVVAIDVINSKIASILGEV
>AB_2
MRARLIYNPTSGQELMRKSVPEVLDILEGFGYETSAFQTTAKKNSALNEARRAAKAGFDLLIAAGGDGTI
NEVVNGIAPLKKRPKMAIIPTGTTNDFARALKVPRGNPSQAAKLIGKNQTIQMDIGRAKKDTYFINIAAA
GSLTELTYSVPSQLKTMFGYLAYLAKGVELLPRVSNVPVKITHDKGVFEGQVSMIFAAITNSVGGFEMIA
PDAKLDDGMFTLILIKTANLFEIVHLLRLILDGGKHITDRRVEYIKTSKIVIEPQCGKRMMINLDGEYGG
DAPITLENLKNHITFFADTDLISDDALVLDQDELEIEEIVKKFAHEVEDLEQELEE
>AB_3
MTGYDDFNYALSALKLGADDYLLKPFSKADVEDMLGKLRKKLELSKKTETIQELVEQPQKEVSAIAMAIH
ERLADSDLTLKSLAQQLGFSPNYLSVLIKKELGMPFQDYLVQERLKKAKLFLLTSNLKIYEIAEQVGFED
MNYFSQRFKQLVGVTPSQYKKGGQA

file 2

Code:
AB_1      gi|229194403|ref|ZP_04321208.1| 
AB_2      gi|229194404|ref|ZP_04321209.1|
AB_3      gi|229194405|ref|ZP_04321210.1|
AB_4      gi|229194406|ref|ZP_04321211.1| 
AB_5      gi|229194407|ref|ZP_04321212.1|

The IDs common to both files are AB_1, AB_2, and AB_3.

I would like to modify file 1 using awk or sed by comparing it with file 2, so that I get an output file like this:


Code:
>AB_1 gi|229194403|ref|ZP_04321208.1|
MLKKPIIIGVTGGSGGGKTSVSRAILDSFPNARIAMIQHDSYYKDQSHMSFEERVKTNYDHPLAFDTDFM
IQQLKELLAGRPVDIPIYDYKKHTRSNTTFRQDPQDVIIVEGILVLEDERLRDLMDIKLFVDTDDDIRII
RRIKRDMMERGRSLESIIDQYTSVVKPMYHQFIEPSKRYADIVIPEGVSNVVAIDVINSKIASILGEV
>AB_2 gi|229194404|ref|ZP_04321209.1|
MRARLIYNPTSGQELMRKSVPEVLDILEGFGYETSAFQTTAKKNSALNEARRAAKAGFDLLIAAGGDGTI
NEVVNGIAPLKKRPKMAIIPTGTTNDFARALKVPRGNPSQAAKLIGKNQTIQMDIGRAKKDTYFINIAAA
GSLTELTYSVPSQLKTMFGYLAYLAKGVELLPRVSNVPVKITHDKGVFEGQVSMIFAAITNSVGGFEMIA
PDAKLDDGMFTLILIKTANLFEIVHLLRLILDGGKHITDRRVEYIKTSKIVIEPQCGKRMMINLDGEYGG
DAPITLENLKNHITFFADTDLISDDALVLDQDELEIEEIVKKFAHEVEDLEQELEE
>AB_3 gi|229194405|ref|ZP_04321210.1|
MTGYDDFNYALSALKLGADDYLLKPFSKADVEDMLGKLRKKLELSKKTETIQELVEQPQKEVSAIAMAIH
ERLADSDLTLKSLAQQLGFSPNYLSVLIKKELGMPFQDYLVQERLKKAKLFLLTSNLKIYEIAEQVGFED
MNYFSQRFKQLVGVTPSQYKKGGQA

Please let me know the best way to do it.
# 2  
Old 07-07-2010
Try:
Code:
awk 'NR==FNR{a[">"$1]=$2;next}a[$1]{$0=$0 FS a[$1]}1' file2 file1

Use nawk or /usr/xpg4/bin/awk on Solaris if you get errors.
# 3  
Old 07-07-2010
Hi,

I tried it, but it is not giving me what I wanted.

This is the result I got:

Code:
awk 'NR==FNR{a[">"$1]=$2;next}a[$1]{$0=$0 FS a[$1]}1' test1.fasta test2.fasta 
AB_1      gi|229194403|ref|ZP_04321208.1| 
AB_2      gi|229194404|ref|ZP_04321209.1|
AB_3      gi|229194405|ref|ZP_04321210.1|
AB_4      gi|229194406|ref|ZP_04321211.1| 
AB_5      gi|229194407|ref|ZP_04321212.1|

which is exactly the same as file 2.

I want the info in file 2 to be added to file 1 so that I get a file like the one below:

Code:
>AB_1 gi|229194403|ref|ZP_04321208.1|
MLKKPIIIGVTGGSGGGKTSVSRAILDSFPNARIAMIQHDSYYKDQSHMSFEERVKTNYDHPLAFDTDFM
IQQLKELLAGRPVDIPIYDYKKHTRSNTTFRQDPQDVIIVEGILVLEDERLRDLMDIKLFVDTDDDIRII
RRIKRDMMERGRSLESIIDQYTSVVKPMYHQFIEPSKRYADIVIPEGVSNVVAIDVINSKIASILGEV
>AB_2 gi|229194404|ref|ZP_04321209.1|
MRARLIYNPTSGQELMRKSVPEVLDILEGFGYETSAFQTTAKKNSALNEARRAAKAGFDLLIAAGGDGTI
NEVVNGIAPLKKRPKMAIIPTGTTNDFARALKVPRGNPSQAAKLIGKNQTIQMDIGRAKKDTYFINIAAA
GSLTELTYSVPSQLKTMFGYLAYLAKGVELLPRVSNVPVKITHDKGVFEGQVSMIFAAITNSVGGFEMIA
PDAKLDDGMFTLILIKTANLFEIVHLLRLILDGGKHITDRRVEYIKTSKIVIEPQCGKRMMINLDGEYGG
DAPITLENLKNHITFFADTDLISDDALVLDQDELEIEEIVKKFAHEVEDLEQELEE
>AB_3 gi|229194405|ref|ZP_04321210.1|
MTGYDDFNYALSALKLGADDYLLKPFSKADVEDMLGKLRKKLELSKKTETIQELVEQPQKEVSAIAMAIH
ERLADSDLTLKSLAQQLGFSPNYLSVLIKKELGMPFQDYLVQERLKKAKLFLLTSNLKIYEIAEQVGFED
MNYFSQRFKQLVGVTPSQYKKGGQA

Please let me know
# 4  
Old 07-07-2010
You are reading the files in the wrong order.
First you have to fill the array with the data from file2; then you
can combine the data in the array with the data in file1. So it has
to be:

Code:
awk 'NR==FNR{a[">"$1]=$2;next}a[$1]{$0=$0 FS a[$1]}1' test2.fasta test1.fasta

HTH Chris
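
The point about argument order can be checked with a small, self-contained run. This is only a sketch: the file names ids.txt and seqs.fasta, the temporary directory, and the shortened sequences are made up for illustration, and the one-liner is spread over several lines with comments. It also tests membership with `($1 in id)` instead of `a[$1]`, which is a swapped-in variation that avoids creating empty array entries for every sequence line.

```shell
# Build small sample inputs (hypothetical names, shortened sequences).
workdir=$(mktemp -d)
cat > "$workdir/ids.txt" <<'EOF'
AB_1      gi|229194403|ref|ZP_04321208.1|
AB_2      gi|229194404|ref|ZP_04321209.1|
AB_4      gi|229194406|ref|ZP_04321211.1|
EOF
cat > "$workdir/seqs.fasta" <<'EOF'
>AB_1
MLKKPIIIGV
>AB_2
MRARLIYNPT
>AB_3
MTGYDDFNYA
EOF

# The lookup file is named first so the array is filled before any header is read.
awk '
    # First file only: NR (overall line count) still equals FNR (per-file count).
    NR == FNR { id[">" $1] = $2; next }
    # Second file: append the stored value to headers that have an entry.
    ($1 in id) { $0 = $0 FS id[$1] }
    # Print every line, changed or not.
    { print }
' "$workdir/ids.txt" "$workdir/seqs.fasta" > "$workdir/annotated.fasta"

cat "$workdir/annotated.fasta"
```

Headers without a match in the lookup file (here `>AB_3`) pass through unchanged, and lookup entries without a matching header (here AB_4) are simply ignored.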
# 5  
Old 07-09-2010
Thanks Christoph Spohr for correcting me.

It worked for me.

I guess the awk command works when the entries in file2 contain no spaces:

Code:
AB_1      gi|229194403|ref|ZP_04321208.1| 
AB_2      gi|229194404|ref|ZP_04321209.1|
AB_3      gi|229194405|ref|ZP_04321210.1|
AB_4      gi|229194406|ref|ZP_04321211.1| 
AB_5      gi|229194407|ref|ZP_04321212.1|

Suppose the above file has some more data, separated by one or two spaces, such as:

Code:
AB_1      gi|229194403|ref|ZP_04321208.1| group II intron reverse transcriptase/maturase [asd ert 456]
AB_2      gi|229194404|ref|ZP_04321209.1|
AB_3      gi|229194405|ref|ZP_04321210.1|
AB_4      gi|229194406|ref|ZP_04321211.1| alkylphosphonate uptake protein [Bder ce 33L]
AB_5      gi|229194407|ref|ZP_04321212.1| hypothetical protein pE33L466_0459 [Badfr cereus 33L]

How do I modify the above awk command to parse the complete data?

Please let me know.

---------- Post updated at 04:49 PM ---------- Previous update was at 03:44 PM ----------

Hi All,
This is just to make my question clearer.

It is a continuation of the awk code previously suggested by Christoph Spohr and Franklin:

Code:
awk 'NR==FNR{a[">"$1]=$2;next}a[$1]{$0=$0 FS a[$1]}1' test2.fasta test1.fasta

I found that if file 2 has more characters separated by spaces, the above awk code only grabs the characters up to the first space and moves them to file1.

How would I modify the above awk command so that all the characters in file 2 are parsed and placed in file1?

Please let me know.

LA
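
The behaviour described here follows from awk's default field splitting: a line is cut at every run of spaces or tabs, so `$2` holds only the second whitespace-separated token. A minimal sketch using the first sample line from this post (the variable names are illustrative only):

```shell
# One line from the extended file2 sample.
line='AB_1      gi|229194403|ref|ZP_04321208.1| group II intron reverse transcriptase/maturase [asd ert 456]'

# $2 is only the token between the first and second run of whitespace.
second_field=$(printf '%s\n' "$line" | awk '{ print $2 }')

# NF shows how many fields the description has been split into.
field_count=$(printf '%s\n' "$line" | awk '{ print NF }')

echo "second field: $second_field"
echo "number of fields: $field_count"
```

Storing `$2` therefore drops everything after the gi identifier; the rest of the description sits in `$3` and later fields.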
# 6  
Old 07-09-2010
How about:
Code:
awk 'NR==FNR{$1=">"$1;A[$1]=$0;next}A[$1]{$0=A[$1]}1' file2 file1
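
Assigning to `$1` in the command above makes awk rebuild the line with single spaces between fields, so runs of spaces inside the description are collapsed. If the original spacing matters, one possible variation (a sketch, not the method suggested in this thread) strips the ID from a copy of the line with `sub()` instead. The file names ids.txt and seqs.fasta and the shortened sequences are made up for illustration.

```shell
# Sample inputs (hypothetical names, shortened sequences).
workdir=$(mktemp -d)
cat > "$workdir/ids.txt" <<'EOF'
AB_1      gi|229194403|ref|ZP_04321208.1| group II intron reverse transcriptase/maturase [asd ert 456]
AB_2      gi|229194404|ref|ZP_04321209.1|
EOF
cat > "$workdir/seqs.fasta" <<'EOF'
>AB_1
MLKKPIIIGV
>AB_2
MRARLIYNPT
EOF

awk '
    # First file: keep everything after the ID, with its spacing untouched.
    NR == FNR {
        if (NF >= 2) {
            desc = $0
            sub(/^[ \t]*[^ \t]+[ \t]+/, "", desc)   # drop the ID and the gap after it
            sub(/[ \t]+$/, "", desc)                # drop trailing whitespace
            info[">" $1] = desc
        }
        next
    }
    # Second file: rewrite matching headers as ID, one space, description.
    ($1 in info) { $0 = $1 " " info[$1] }
    { print }
' "$workdir/ids.txt" "$workdir/seqs.fasta" > "$workdir/annotated.fasta"

cat "$workdir/annotated.fasta"
```

Lines in the lookup file that contain only an ID are skipped, so the corresponding headers stay as they are.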

This User Gave Thanks to Scrutinizer For This Post: