Comparing specific columns between two files


 
# 1  
Old 09-29-2015

Dear All,

I have two files: file-A has 5 columns and file-B has 2 columns.
I want to match the 4th column of file-A against both columns of file-B and print the full file-A line plus the matching line(s) of file-B as output.

file-A
Code:
30.00   12      gi|49483390|ref|YP_040614.1|    DIP-29721N|refseq:NP_683750|uniprot:Q8R418      2e-08
30.00   13      gi|49484704|ref|YP_041928.1|    DIP-33449N|uniprot:Q8WZ42       3e-09
30.00   16      gi|49483425|ref|YP_040649.1|    DIP-23879N|refseq:NP_650366|uniprot:Q9VFJ3      4e-06
30.00   17      gi|49484107|ref|YP_041331.1|    DIP-46805N|uniprot:P70388       1e-06
30.00   21      gi|49482259|ref|YP_039483.1|    DIP-25107N|refseq:NP_495440     2e-15
30.00   22      gi|49482976|ref|YP_040200.1|    DIP-22713N|refseq:NP_524108     1e-06
30.00   26      gi|49483184|ref|YP_040408.1|    DIP-17056N|refseq:NP_651605     1e-09
30.00   31      gi|49484099|ref|YP_041323.1|    DIP-29200N|refseq:NP_005436|uniprot:Q9UQE7      6e-12

file-B
Code:
DIP-10000N|refseq:NP_417192|uniprotkb:P30131    DIP-31848N|uniprotkb:P0A9B2
DIP-10000N|refseq:NP_417192|uniprotkb:P30131    DIP-36429N|uniprotkb:P0AAM7
DIP-10001N|refseq:NP_418748|uniprotkb:P39377    DIP-10001N|refseq:NP_418748|uniprotkb:P39377
DIP-10003N|refseq:NP_290325|uniprotkb:P29209    DIP-10003N|refseq:NP_290325|uniprotkb:P29209
DIP-10003N|refseq:NP_290325|uniprotkb:P29209    DIP-10149N|refseq:NP_417877|uniprotkb:P06993
DIP-10003N|refseq:NP_290325|uniprotkb:P29209    DIP-10397N|refseq:NP_416719|uniprotkb:P06996
DIP-10003N|refseq:NP_290325|uniprotkb:P29209    DIP-10467N|refseq:NP_415423|uniprotkb:P09373
DIP-10003N|refseq:NP_290325|uniprotkb:P29209    DIP-10557N|refseq:NP_416344|uniprotkb:P23865
DIP-10003N|refseq:NP_290325|uniprotkb:P29209    DIP-10573N|refseq:NP_414736|uniprotkb:P16659
DIP-10003N|refseq:NP_290325|uniprotkb:P29209    DIP-10783N|refseq:NP_417800|uniprotkb:P02359
DIP-10003N|refseq:NP_290325|uniprotkb:P29209    DIP-11097N|refseq:NP_290066|uniprotkb:P28242
DIP-10003N|refseq:NP_290325|uniprotkb:P29209    DIP-11354N|refseq:NP_415140|uniprotkb:P39177

Is it possible? I'd be highly thankful if someone can help me.

Last edited by Syeda Sumayya; 09-29-2015 at 03:18 AM..
# 2  
Old 09-29-2015
The idea of matching fields in one file against a field in another file is easy with awk. But, given that there are only two lines in file-B where both columns are the same, and no line in file-A contains the value that appears in either of those lines, there is no output matching your request. Or did I misunderstand what you're trying to do?

And, if there were lines in your input files that met your criteria, your description of the output you want is not clear.

Please describe more clearly what you are trying to do and show us the sample output you are trying to produce from your sample inputs.
# 3  
Old 09-29-2015
Any attempts from your side?

---------- Post updated at 11:13 ---------- Previous update was at 11:06 ----------

Based on wild guesses, taking into account what Don Cragun said (no matches!), and having removed the DOS <CR> line terminators from file-A, this seemed to do something like what you wanted:
Code:
awk 'FNR==NR {T[$1]=$0; T[$2]=$0; next} {print $0, T[$4]}' file-B file-A

# 4  
Old 09-29-2015
Actually, the two columns of file-B are interacting protein partners.
I want to match column 4 of file-A against both columns of file-B, to see which proteins in column 4 of file-A also appear in either column of file-B, along with the corresponding protein partner.

This is how I want the output to look:

Code:
30.00  17  gi|49484107|ref|YP_041331.1|   DIP-46805N|uniprot:P70388  1e-06 DIP-44775N|refseq:NP_006210|uniprotkb:P42338    DIP-46805N|uniprotkb:P70388

i.e. all columns of file-A + the matching line (both columns) of file-B, if either column contains the same value as column 4 of file-A.
In the output above, note that column 2 of file-B has the same value as column 4 of file-A.

Hope I was able to explain my question better.
# 5  
Old 09-29-2015
Your description makes it sound like the first five fields of the output you say you want should appear as a line in file-A (and they do, as the 4th line in your sample) and the last two fields should appear as a line in file-B (but they do not). There is no match in the 1st field nor in the 2nd field on any line in file-B for the 4th field on any line in file-A in your sample.

And even if that line did appear in file-B, there would still be no match. The 4th field in file-A:
Code:
DIP-46805N|uniprot:P70388

and the last field you have shown in your desired output:
Code:
DIP-46805N|uniprotkb:P70388

do NOT match.
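
If the intent is to treat these as the same protein, one option is to compare only the DIP identifier, i.e. the text before the first "|". That is an assumption about the data, not something confirmed in this thread. A rough sketch along the lines of the awk one-liner in post #3 (match_by_dip and dip are hypothetical names):

```shell
# Hypothetical helper (not from the thread): match_by_dip FILE_B FILE_A
match_by_dip() {
    awk '
        # Return the text before the first "|" (assumed to be the DIP identifier).
        function dip(s,    parts) { split(s, parts, "|"); return parts[1] }

        { sub(/\r$/, "") }      # drop a trailing DOS carriage return, if any

        # First file (file-B): remember the whole line under the DIP id of each column.
        # As in the original one-liner, a later line overwrites an earlier one.
        NR == FNR { T[dip($1)] = $0; T[dip($2)] = $0; next }

        # Second file (file-A): print the line plus the remembered file-B line, if any.
        { print $0, T[dip($4)] }
    ' "$1" "$2"
}
```

With match_by_dip file-B file-A, a file-A line whose 4th field is DIP-46805N|uniprot:P70388 would pick up a file-B line containing DIP-46805N|uniprotkb:P70388, because only DIP-46805N is compared.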
# 6  
Old 09-30-2015
Yep, that's a mistake on my part. They sure don't match exactly.
The contents of file-A and file-B that I gave in my original question were just an example (and not a very good one) of a very large data set (I forgot to mention that).

I will try to be more detailed and exact next time.

Anyhow, the code RudiC has suggested works fine, just the way I wanted.

Thanks anyway.

Last edited by Syeda Sumayya; 09-30-2015 at 01:20 AM..