Match files based on either of the two columns awk | Unix Linux Forums | Shell Programming and Scripting

  Go Back    


Shell Programming and Scripting Post questions about KSH, CSH, SH, BASH, PERL, PHP, SED, AWK and OTHER shell scripts and shell scripting languages here.

Match files based on either of the two columns awk

Shell Programming and Scripting


Closed Thread    
 
Thread Tools Search this Thread Display Modes
    #1  
Old 06-28-2013
smitra smitra is offline
Registered User
 
Join Date: Aug 2012
Last Activity: 30 June 2013, 6:51 AM EDT
Posts: 28
Thanks: 15
Thanked 0 Times in 0 Posts
Match files based on either of the two columns awk

Dear Shell experts,
I have 2 files with structure:
File 1: ID and count

Code:
head test_GI_count1.txt 
1000094 2
10039307 1
10039641 1
10047177 11
10047359 1
1008555 2
10120302 1
10120672 13
10121776 1
10121865 32

And 2nd file:

Code:
head Protein_gi_GeneID_symbol.txt
protein_gi GeneID Symbol
10954455 1246500 repA1
10954457 1246501 repA2
10954458 1246502 leuA
10954459 1246503 leuB
10954460 1246504 leuC
10954461 1246505 leuD
31982990 1246509 ibp
31982991 1246510 repA1
10954456 3722426 pLeuDn_02


I need a final file with structure:e.g. (only consisting IDs present in file 1)

Code:
10954458 1246502 leuA 3
10954459 1246503 leuB 24
10954460 1246504 leuC 13

etc

That means I want to subset from the 2nd file which IDs are unique and copy all the cols of 2nd file and add $2 (count) from file 1


To do this first I did:

Code:
smitra:File_editing smitra$ awk 'NR == FNR {
>   k[$1]
>   next
>   }
> ($1) in k
>   ' test_GI_count1.txt  Protein_gi_GeneID_symbol_1.txt > merged_file.txt

And added $2 of 1st file separately later.
This workes perfectly unless I have some problem with new file, where first file IDs ($1) is often protein_gi, but sometimes GeneID.

e.g. in 1st file

Code:
1000094 2

where as in 2nd file

Code:
77747945 1000094 treA


That means I need to search them from $1 OR $2 of 2nd file.
Can anybody please suggest me how can I do that?
Thanks a lot,
Mitra

---------- Post updated at 09:09 AM ---------- Previous update was at 09:05 AM ----------

And also I want to add $2 (count) from file 1, together in same script, so that I will not have mismatch in col length when some IDs are absent from 2nd file.

Any suggestion will be really great.
Thanks a lot,
Mitra

Last edited by smitra; 06-28-2013 at 04:07 AM.. Reason: more description
Sponsored Links
    #2  
Old 06-28-2013
zaxxon's Avatar
zaxxon zaxxon is offline Forum Staff  
code tag tagger
 
Join Date: Sep 2007
Last Activity: 19 September 2014, 7:22 AM EDT
Location: St. Gallen, Switzerland
Posts: 6,228
Thanks: 121
Thanked 451 Times in 411 Posts
I do not see a corresponding field/key in file1 and file2?
Sponsored Links
    #3  
Old 06-28-2013
smitra smitra is offline
Registered User
 
Join Date: Aug 2012
Last Activity: 30 June 2013, 6:51 AM EDT
Posts: 28
Thanks: 15
Thanked 0 Times in 0 Posts
Sorry Zaxxon,
I don't understand what you mean. My file one do not have a heading.
But fist col is ID and second col is count

Now I want to match this ID from either 1st or 2nd col of File 2

Then want to print the corresponding whole row of file 2 added with Col2 of first file.
    #4  
Old 06-28-2013
zaxxon's Avatar
zaxxon zaxxon is offline Forum Staff  
code tag tagger
 
Join Date: Sep 2007
Last Activity: 19 September 2014, 7:22 AM EDT
Location: St. Gallen, Switzerland
Posts: 6,228
Thanks: 121
Thanked 451 Times in 411 Posts
In file1, you say the 1st column is the ID. Where is that ID to be found in the 2nd file so that it can be counted?
Sponsored Links
    #5  
Old 06-28-2013
smitra smitra is offline
Registered User
 
Join Date: Aug 2012
Last Activity: 30 June 2013, 6:51 AM EDT
Posts: 28
Thanks: 15
Thanked 0 Times in 0 Posts
Dear zaxxon,
that is I tried to explain for example in first file:

Code:
1000094 2
10039307 1
10039641 1
10047177 11
10047359 1

format is ist col:ID, 2nd col:count

Now I need to search each number (e.g. 1000094) from 2nd file...wither from first col or from 2nd col

I have only given the head of both the files.

And I
Sponsored Links
    #6  
Old 06-28-2013
zaxxon's Avatar
zaxxon zaxxon is offline Forum Staff  
code tag tagger
 
Join Date: Sep 2007
Last Activity: 19 September 2014, 7:22 AM EDT
Location: St. Gallen, Switzerland
Posts: 6,228
Thanks: 121
Thanked 451 Times in 411 Posts
Quote:
I have only given the head of both the files.
There we go. It makes it much easier to help when relevant examples are given
So I made an example of your given input and arranged some matched in file1 and file2.

Try:

Code:
$ cat file1
1000094 2
10039307 1
10039641 1
10954458 11
10047359 1
10954460 23
10954459 2
10120302 1
10954460 13
10121776 1
10121865 32
$ cat file2
protein_gi GeneID Symbol
10954455 1246500 repA1
10954457 1246501 repA2
10954458 1246502 leuA
10954459 1246503 leuB
10954460 1246504 leuC
10954461 1246505 leuD
31982990 1246509 ibp
31982991 1246510 repA1
10954456 3722426 pLeuDn_02
$ awk 'NR == FNR {k[$1]+=$2; next} $1 in k {print $0,k[$1]}' file1 file2
10954458 1246502 leuA 11
10954459 1246503 leuB 2
10954460 1246504 leuC 36

The Following User Says Thank You to zaxxon For This Useful Post:
smitra (06-28-2013)
Sponsored Links
    #7  
Old 06-28-2013
smitra smitra is offline
Registered User
 
Join Date: Aug 2012
Last Activity: 30 June 2013, 6:51 AM EDT
Posts: 28
Thanks: 15
Thanked 0 Times in 0 Posts
Dear zaxxon,
Thank you very much. But it still producec the similar output which I already got.
But I have the file bit more complicated. I am editing your example. Thanks for creating the example.
Code:
$ cat file1
1000094 2
10039307 1
10039641 1
10954458 11
10047359 1
10954459 2
10120302 1
10954460 13
10121776 1
10121865 32
$ cat file2
protein_gi GeneID Symbol
10954455 1246500 repA1
10954457 1246501 repA2
10954458 1246502 leuA
10954459 1246503 leuB
10954460 1246504 leuC
10954461 1246505 leuD
31982990 10121776 ibp
31982991 1246510 repA1
10954456 1000094 pLeuDn_02

Now I want this output

Code:
10954458 1246502 leuA 11
10954459 1246503 leuB 2
10954460 1246504 leuC 13
31982990 10121776 ibp 1
10954456 1000094 pLeuDn_02 2


Your advice will be really helpful..
Thanks in advance,
Best wishes,
Mitra

Last edited by smitra; 06-28-2013 at 10:56 AM..
Sponsored Links
Closed Thread

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

More UNIX and Linux Forum Topics You Might Find Helpful
Thread Thread Starter Forum Replies Last Post
Match columns several files @man Shell Programming and Scripting 5 04-09-2013 06:57 AM
awk script to split file into multiple files based on many columns viored Shell Programming and Scripting 9 04-03-2013 10:42 PM
match two key columns in two files and print output (awk) pelhabuan Shell Programming and Scripting 2 09-25-2012 09:35 AM
match columns using awk pistachio Shell Programming and Scripting 8 02-18-2009 01:44 AM
awk 3 files to one based on multiple columns andrealphus Shell Programming and Scripting 2 12-16-2008 10:31 PM



All times are GMT -4. The time now is 05:56 AM.