Match files based on either of the two columns awk


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Match files based on either of the two columns awk
# 1  
Old 06-28-2013
Match files based on either of the two columns awk

Dear Shell experts,
I have 2 files with structure:
File 1: ID and count
Code:
head test_GI_count1.txt 
1000094 2
10039307 1
10039641 1
10047177 11
10047359 1
1008555 2
10120302 1
10120672 13
10121776 1
10121865 32

And 2nd file:
Code:
head Protein_gi_GeneID_symbol.txt
protein_gi GeneID Symbol
10954455 1246500 repA1
10954457 1246501 repA2
10954458 1246502 leuA
10954459 1246503 leuB
10954460 1246504 leuC
10954461 1246505 leuD
31982990 1246509 ibp
31982991 1246510 repA1
10954456 3722426 pLeuDn_02


I need a final file with structure:e.g. (only consisting IDs present in file 1)
Code:
10954458 1246502 leuA 3
10954459 1246503 leuB 24
10954460 1246504 leuC 13

etc

That means I want to subset from the 2nd file which IDs are unique and copy all the cols of 2nd file and add $2 (count) from file 1


To do this first I did:
Code:
smitra:File_editing smitra$ awk 'NR == FNR {
>   k[$1]
>   next
>   }
> ($1) in k
>   ' test_GI_count1.txt  Protein_gi_GeneID_symbol_1.txt > merged_file.txt

And added $2 of 1st file separately later.
This workes perfectly unless I have some problem with new file, where first file IDs ($1) is often protein_gi, but sometimes GeneID.

e.g. in 1st file
Code:
1000094 2

where as in 2nd file
Code:
77747945 1000094 treA


That means I need to search them from $1 OR $2 of 2nd file.
Can anybody please suggest me how can I do that?
Thanks a lot,
Mitra

---------- Post updated at 09:09 AM ---------- Previous update was at 09:05 AM ----------

And also I want to add $2 (count) from file 1, together in same script, so that I will not have mismatch in col length when some IDs are absent from 2nd file.

Any suggestion will be really great.
Thanks a lot,
Mitra

Last edited by smitra; 06-28-2013 at 05:07 AM.. Reason: more description
# 2  
Old 06-28-2013
I do not see a corresponding field/key in file1 and file2?
# 3  
Old 06-28-2013
Sorry Zaxxon,
I don't understand what you mean. My file one do not have a heading.
But fist col is ID and second col is count

Now I want to match this ID from either 1st or 2nd col of File 2

Then want to print the corresponding whole row of file 2 added with Col2 of first file.
# 4  
Old 06-28-2013
In file1, you say the 1st column is the ID. Where is that ID to be found in the 2nd file so that it can be counted?
# 5  
Old 06-28-2013
Dear zaxxon,
that is I tried to explain for example in first file:
Code:
1000094 2
10039307 1
10039641 1
10047177 11
10047359 1

format is ist col:ID, 2nd col:count

Now I need to search each number (e.g. 1000094) from 2nd file...wither from first col or from 2nd col

I have only given the head of both the files.

And I
# 6  
Old 06-28-2013
Quote:
I have only given the head of both the files.
There we go. It makes it much easier to help when relevant examples are given Smilie
So I made an example of your given input and arranged some matched in file1 and file2.

Try:
Code:
$ cat file1
1000094 2
10039307 1
10039641 1
10954458 11
10047359 1
10954460 23
10954459 2
10120302 1
10954460 13
10121776 1
10121865 32
$ cat file2
protein_gi GeneID Symbol
10954455 1246500 repA1
10954457 1246501 repA2
10954458 1246502 leuA
10954459 1246503 leuB
10954460 1246504 leuC
10954461 1246505 leuD
31982990 1246509 ibp
31982991 1246510 repA1
10954456 3722426 pLeuDn_02
$ awk 'NR == FNR {k[$1]+=$2; next} $1 in k {print $0,k[$1]}' file1 file2
10954458 1246502 leuA 11
10954459 1246503 leuB 2
10954460 1246504 leuC 36

This User Gave Thanks to zaxxon For This Post:
# 7  
Old 06-28-2013
Dear zaxxon,
Thank you very much. But it still producec the similar output which I already got.
But I have the file bit more complicated. I am editing your example. Thanks for creating the example.
Code:
$ cat file1
1000094 2
10039307 1
10039641 1
10954458 11
10047359 1
10954459 2
10120302 1
10954460 13
10121776 1
10121865 32
$ cat file2
protein_gi GeneID Symbol
10954455 1246500 repA1
10954457 1246501 repA2
10954458 1246502 leuA
10954459 1246503 leuB
10954460 1246504 leuC
10954461 1246505 leuD
31982990 10121776 ibp
31982991 1246510 repA1
10954456 1000094 pLeuDn_02

Now I want this output
Code:
10954458 1246502 leuA 11
10954459 1246503 leuB 2
10954460 1246504 leuC 13
31982990 10121776 ibp 1
10954456 1000094 pLeuDn_02 2


Your advice will be really helpful..
Thanks in advance,
Best wishes,
Mitra

Last edited by smitra; 06-28-2013 at 11:56 AM..
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Data match 2 files based on first 2 columns matching only and join if match

Hi, i have 2 files , the data i need to match is in masterfile and i need to pull out column 3 from master if column 1 and 2 match and output entire row to new file I have tried with join and awk and i keep getting blank outputs or same file is there an easier way than what i am... (4 Replies)
Discussion started by: axis88
4 Replies

2. Shell Programming and Scripting

awk to create subdirectory based on match between two files

In the below awk I am trying to mkdir based of an exact match between file2 line starting with R_2019.... and file1 line starting with R_2019. When a match is found there is a folder located at /home/cmccabe/run with the same name as the match where each $2 in file1 is a new subdirectory in that... (2 Replies)
Discussion started by: cmccabe
2 Replies

3. Shell Programming and Scripting

Comparing two columns in two files and printing a third based on a match

Hello all, First post here. I did not notice a previous post to help me down the right path. I am looking to compare a column in a CSV file against another file (which is not a column match one for one) but more or less when a match is made, I would like to append a third column that contains a... (17 Replies)
Discussion started by: dis0wned
17 Replies

4. Shell Programming and Scripting

awk to update file based on partial match in field1 and exact match in field2

I am trying to create a cronjob that will run on startup that will look at a list.txt file to see if there is a later version of a database using database.txt as the source. The matching lines are written to output. $1 in database.txt will be in list.txt as a partial match. $2 of database.txt... (2 Replies)
Discussion started by: cmccabe
2 Replies

5. Shell Programming and Scripting

New files based off match or no match

Trying to match $2 in original_targets with $2 of new_targets . If the two numbers match exactly then a match.txt file is outputted using the information in the new_targets in the beginning 4 fields $1, $2, $3, $4 and value of $4 in the original_targets . If there is "No Match" then a no... (2 Replies)
Discussion started by: cmccabe
2 Replies

6. UNIX for Advanced & Expert Users

Match and print based on columns

HI, I have 2 different questions in this thread. Consider 2 files as input (input file have different line count ) File 1 1 1 625 56 1 12 657 34 1 9 25 45 1 2 20 54 67 3 25 35 27 4 45 73 36 5 125 56 45 File2 1 1 878 76 1 9 83 67 2 20 73 78 4 47 22 17 3 25 67 99 (4 Replies)
Discussion started by: rossi
4 Replies

7. Shell Programming and Scripting

awk script to split file into multiple files based on many columns

So I have a space delimited file that I'd like to split into multiple files based on multiple column values. This is what my data looks like 1bc9A02 1 10 1000 FTDLNLVQALRQFLWSFRLPGEAQKIDRMMEAFAQRYCQCNNGVFQSTDTCYVLSFAIIMLNTSLHNPNVKDKPTVERFIAMNRGINDGGDLPEELLRNLYESIKNEPFKIPELEHHHHHH 1ku1A02 1 10... (9 Replies)
Discussion started by: viored
9 Replies

8. Shell Programming and Scripting

match two key columns in two files and print output (awk)

I have two files... file1 and file2. Where columns 1 and 2 of file1 match columns 1 and 2 of file2 I want to create a new file that is all file1 + columns 3 and 4 of file2 :b: Many thanks if you know how to do this.... :b: file1 31-101 106 0 92 31-101 106 29 ... (2 Replies)
Discussion started by: pelhabuan
2 Replies

9. Shell Programming and Scripting

match columns using awk

Hi All, I need some help in writing a small script using Awk. My input file has following deatils A,B,C,D 8239359,8239359,8388125,8388125 8239359,8239359,8388125,8388125 7165981,7165981,8363138,8363138 8283830,8283830,8382987,8382987 8209964,8209964,8367098,8367098 ... (8 Replies)
Discussion started by: pistachio
8 Replies

10. Shell Programming and Scripting

awk 3 files to one based on multiple columns

Hi all, I have three files, one is a navigation file, one is a depth file and one is a file containing the measured field of gravity. The formats of the files are; navigation file: 2006 320 17 39 0 0 *nav 21.31542 -157.887 2006 320 17 39 10 0 *nav 21.31542 -157.887 2006 320 17 39 20 0... (2 Replies)
Discussion started by: andrealphus
2 Replies
Login or Register to Ask a Question