Compare 2 CSV files by columns, then extract certain columns of matching rows


 
# 1  
Old 06-01-2014

Hi all, I'm pretty much a newbie to UNIX. I would appreciate any help with a UNIX script that compares two large CSV files (greater than 10 GB in size) and outputs a file with the matching rows.

I want to compare file1 and file2 by the 'id' and 'chain' columns, then combine the remaining columns of each pair of exactly matching rows and drop the rows that have no match. If several rows of file2 match one row of file1, each match should become its own output row.

For example:

Code:
$ head file1
id,chain,offer,market,repeattrips,repeater,offerdate
86246,205,1208251,34,5,t,2013-04-24
86252,205,1197502,34,16,t,2013-03-27
12682470,18,1197502,11,0,f,2013-03-28
12996040,15,1197502,9,0,f,2013-03-25
13089312,15,1204821,9,0,f,2013-04-01


Code:
$ head file2
id,chain,dept,category,company,brand,date,productsize,productmeasure,purchasequantity,purchaseamount
86246,205,7,707,1078778070,12564,2012-03-02,12,OZ,1,7.59
86246,205,63,6319,107654575,17876,2012-03-02,64,OZ,1,1.59
86246,205,97,9753,1022027929,0,2012-03-02,1,CT,1,5.99
86976,205,25,2509,107996777,31373,2012-03-02,16,OZ,1,1.99
97646,206,55,5555,107684070,32094,2012-03-02,16,OZ,2,10.38



and the desired output would be:
Code:
id,chain,dept,category,company,brand,date,productsize,productmeasure,purchasequantity,purchaseamount,offer,market,repeattrips,repeater,offerdate
86246,205,7,707,1078778070,12564,2012-03-02,12,OZ,1,7.59,1208251,34,5,t,2013-04-24
86246,205,63,6319,107654575,17876,2012-03-02,64,OZ,1,1.59,1208251,34,5,t,2013-04-24
86246,205,97,9753,1022027929,0,2012-03-02,1,CT,1,5.99,1208251,34,5,t,2013-04-24

If you post code, please explain it a little.

Thanks

Last edited by Scrutinizer; 06-01-2014 at 02:07 PM.. Reason: code tags
# 2  
Old 06-01-2014
Welcome to the forums.
Code:
$ cat file1
id,chain,offer,market,repeattrips,repeater,offerdate
86246,205,1208251,34,5,t,2013-04-24
86252,205,1197502,34,16,t,2013-03-27
12682470,18,1197502,11,0,f,2013-03-28
12996040,15,1197502,9,0,f,2013-03-25
13089312,15,1204821,9,0,f,2013-04-01

Code:
$ cat file2
id,chain,dept,category,company,brand,date,productsize,productmeasure,purchasequantity,purchaseamount
86246,205,7,707,1078778070,12564,2012-03-02,12,OZ,1,7.59
86246,205,63,6319,107654575,17876,2012-03-02,64,OZ,1,1.59
86246,205,97,9753,1022027929,0,2012-03-02,1,CT,1,5.99
86976,205,25,2509,107996777,31373,2012-03-02,16,OZ,1,1.99
97646,206,55,5555,107684070,32094,2012-03-02,16,OZ,2,10.38

Code:
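# file1 is read first: key = id,chain; sub() strips the two key fields so A[] holds only the remaining columns
$ awk -F, 'FNR==NR{s=$1 FS $2; sub(/^[^,]*,[^,]*,/,""); A[s]=$0; next}(($1 FS $2) in A){print $0,A[$1 FS $2]}' OFS=',' file1 file2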

Resulting
Code:
id,chain,dept,category,company,brand,date,productsize,productmeasure,purchasequantity,purchaseamount,offer,market,repeattrips,repeater,offerdate
86246,205,7,707,1078778070,12564,2012-03-02,12,OZ,1,7.59,1208251,34,5,t,2013-04-24
86246,205,63,6319,107654575,17876,2012-03-02,64,OZ,1,1.59,1208251,34,5,t,2013-04-24
86246,205,97,9753,1022027929,0,2012-03-02,1,CT,1,5.99,1208251,34,5,t,2013-04-24
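
Since the original post asked for an explanation, here is a commented sketch of the same idea: a lookup table built from file1 and probed while file2 is streamed. It assumes plain CSV without quoted commas and that each (id, chain) pair occurs only once in file1. The scratch directory, the shortened sample data and the variable names (tmp, joined, key, rest) are only for illustration. The key fields are removed with sub(), which is a choice made for this sketch so that the stored remainder contains nothing but the non-key columns.

```shell
#!/bin/sh
# Shortened sample data from the thread, written to a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
id,chain,offer,market,repeattrips,repeater,offerdate
86246,205,1208251,34,5,t,2013-04-24
86252,205,1197502,34,16,t,2013-03-27
EOF
cat > "$tmp/file2" <<'EOF'
id,chain,dept,category,company,brand,date,productsize,productmeasure,purchasequantity,purchaseamount
86246,205,7,707,1078778070,12564,2012-03-02,12,OZ,1,7.59
86246,205,63,6319,107654575,17876,2012-03-02,64,OZ,1,1.59
86976,205,25,2509,107996777,31373,2012-03-02,16,OZ,1,1.99
EOF

joined=$(awk -F, -v OFS=, '
    # FNR == NR is true only while the first file (file1) is being read.
    FNR == NR {
        key = $1 FS $2                   # id,chain
        rest = $0
        sub(/^[^,]*,[^,]*,/, "", rest)   # drop the two key fields
        A[key] = rest                    # remember the remaining columns
        next                             # skip the rule below for file1
    }
    # Second file (file2): print only rows whose key was stored from file1.
    (($1 FS $2) in A) { print $0, A[$1 FS $2] }
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$joined"
rm -rf "$tmp"
```

Every file2 row is tested on its own, so several file2 rows with the same id and chain each produce an output row, and rows without a partner in file1 are simply not printed.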

This User Gave Thanks to Akshay Hegde For This Post:
# 3  
Old 06-03-2014
Thank you Akshay. That worked.

I have another, similar matching problem. I want to compare the previous result file (I will call it file3) with file4 (below) by 'offer', 'category', 'brand' and 'company'. As before, I want to take the remaining columns ('quantity', 'offervalue') of the exactly matching rows in file4, add them to file3's columns, and remove the file3 rows that have no match.

Code:
$ head file4
offer,category,quantity,company,offervalue,brand
1190530,9115,1,108500080,5,93904
1194044,9909,1,107127979,1,6732
1197502,3203,1,106414464,0.75,13474
1198271,5558,1,107120272,1.5,5072
1198272,5558,1,107120272,1.5,5072
1198273,5558,1,107120272,1.5,5072
1198274,5558,1,107120272,1.5,5072
1198275,5558,1,107120272,1.5,5072
1199256,4401,1,105100050,2,13791

Thanks
# 4  
Old 06-03-2014
In the current example nothing matches except the header. Try this:

Code:
awk -F, '
    FNR == NR {
        # Read file3: build the key from offer, category, brand, company
        key = $12 FS $4 FS $6 FS $5

        # Store every column except 4, 5, 6 and 12, joined with OFS (a comma).
        # Note: file3 rows that share a key are appended to the same element.
        for (i = 1; i <= NF; i++)
            if (i != 4 && i != 5 && i != 6 && i != 12)
                A[key] = (key in A) ? A[key] OFS $i : $i
        next
    }

    # Read file4: on an exact key match, print the file4 row plus the stored columns
    (($1 FS $2 FS $6 FS $4) in A) {
        print $0, A[$1 FS $2 FS $6 FS $4]
    }
' OFS="," file3 file4


Last edited by Akshay Hegde; 06-03-2014 at 05:53 AM..
# 5  
Old 06-04-2014
Thanks, this works with the example files below. However, it has been 4 hours since I started running it on my actual files, and it still has not finished. The actual file3 is about 16 GB, and the actual file4 is only 2 KB. Is there a way to make the program faster, maybe by writing it without the loop?

Code:
$ cat file3.csv
id,chain,dept,category,company,brand,date,productsize,productmeasure,purchasequantity,purchaseamount,offer,market,repeattrips,repeater,offerdate
86246,205,7,707,1078778070,12564,2012-03-02,12,OZ,1,7.59,1208251,34,5,t,2013-04-24
86246,205,63,6319,107654575,17876,2012-03-02,64,OZ,1,1.59,1208251,34,5,t,2013-04-24
86246,205,97,9753,1022027929,0,2012-03-02,1,CT,1,5.99,1208251,34,5,t,2013-04-24
86246,205,25,2509,107996777,31373,2012-03-02,16,OZ,1,1.99,1208251,34,5,t,2013-04-24
86246,205,55,5558,107684070,32094,2012-03-02,16,OZ,2,10.38,1208251,34,5,t,2013-04-24
86246,205,97,9753,1021015020,0,2012-03-02,1,CT,1,7.8,1208251,34,5,t,2013-04-24
86246,205,99,9909,107127979,6732,2012-03-02,16,OZ,1,2.49,1194044,34,5,t,2013-04-24
86246,205,59,5907,102900020,2012,2012-03-02,16,OZ,1,1.39,1208251,34,5,t,2013-04-24
86246,205,9,9909,107127979,9209,2012-03-02,4,OZ,2,1.5,1194044,34,5,t,2013-04-24

Code:
$ cat file4.csv
offer,category,quantity,company,offervalue,brand
1190530,9115,1,108500080,5,93904
1194044,9909,1,107127979,1,6732
1197502,3203,1,106414464,0.75,13474
1198271,5558,1,107120272,1.5,5072
1198272,5558,1,107120272,1.5,5072
1198273,5558,1,107120272,1.5,5072
1198274,5558,1,107120272,1.5,5072
1198275,5558,1,107120272,1.5,5072
1199256,4401,1,105100050,2,13791

and the result:
Code:
offer,category,quantity,company,offervalue,brand,id,chain,dept,date,productsize,productmeasure,purchasequantity,purchaseamount,market,repeattrips,repeater,offerdate
1194044,9909,1,107127979,1,6732,86246,205,99,2012-03-02,16,OZ,1,2.49,34,5,t,2013-04-24

# 6  
Old 06-04-2014
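Okay, try this; it will be faster. I did not know about your file sizes. It loads the small file4 into the array and reads the large file3 line by line, so nothing from file3 is kept in memory.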

Code:
$ awk -F, 'FNR==NR{A[$1 FS $2 FS  $6 FS $4] = $3 OFS $5;next}(($12 FS $4 FS $6 FS $5) in A){print $0,A[$12 FS $4 FS $6 FS $5]}' OFS="," file4 file3

id,chain,dept,category,company,brand,date,productsize,productmeasure,purchasequantity,purchaseamount,offer,market,repeattrips,repeater,offerdate,quantity,offervalue
86246,205,99,9909,107127979,6732,2012-03-02,16,OZ,1,2.49,1194044,34,5,t,2013-04-24,1,1
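
The gain comes from which file ends up in the array. Below is a commented sketch of the same command run on shortened sample data. It assumes plain CSV without quoted commas; the scratch directory and the variable names (tmp, fast, key) are only for illustration.

```shell
#!/bin/sh
# Shortened sample data from the thread, written to a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file3" <<'EOF'
id,chain,dept,category,company,brand,date,productsize,productmeasure,purchasequantity,purchaseamount,offer,market,repeattrips,repeater,offerdate
86246,205,7,707,1078778070,12564,2012-03-02,12,OZ,1,7.59,1208251,34,5,t,2013-04-24
86246,205,99,9909,107127979,6732,2012-03-02,16,OZ,1,2.49,1194044,34,5,t,2013-04-24
86246,205,9,9909,107127979,9209,2012-03-02,4,OZ,2,1.5,1194044,34,5,t,2013-04-24
EOF
cat > "$tmp/file4" <<'EOF'
offer,category,quantity,company,offervalue,brand
1194044,9909,1,107127979,1,6732
1197502,3203,1,106414464,0.75,13474
EOF

fast=$(awk -F, -v OFS=, '
    # file4 is read first: one array element per offer row, so memory
    # use depends on the small file only.
    FNR == NR {
        A[$1 FS $2 FS $6 FS $4] = $3 OFS $5   # key: offer,category,brand,company
        next
    }
    # file3 is streamed one line at a time and never stored.
    {
        key = $12 FS $4 FS $6 FS $5           # same four columns at their file3 positions
        if (key in A) print $0, A[key]
    }
' "$tmp/file4" "$tmp/file3")

printf '%s\n' "$fast"
rm -rf "$tmp"
```

Each file3 line costs one key concatenation and one array lookup, with no per-field loop, and the array never grows beyond the number of rows in file4.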

---------- Post updated at 02:59 PM ---------- Previous update was at 02:53 PM ----------

Adjust {print $0, A[...]} according to your needs, where:

$0 -> full line
$1 -> column 1
$2 -> column 2, and so on
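
As an illustration of adjusting the print statement, the sketch below keeps only id, chain and offer from file3 and appends the stored quantity and offervalue. The choice of columns is just an example, and the scratch directory and the variable names (tmp, selected) are hypothetical.

```shell
#!/bin/sh
# Shortened sample data from the thread, written to a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file3" <<'EOF'
id,chain,dept,category,company,brand,date,productsize,productmeasure,purchasequantity,purchaseamount,offer,market,repeattrips,repeater,offerdate
86246,205,99,9909,107127979,6732,2012-03-02,16,OZ,1,2.49,1194044,34,5,t,2013-04-24
86246,205,59,5907,102900020,2012,2012-03-02,16,OZ,1,1.39,1208251,34,5,t,2013-04-24
EOF
cat > "$tmp/file4" <<'EOF'
offer,category,quantity,company,offervalue,brand
1194044,9909,1,107127979,1,6732
EOF

selected=$(awk -F, -v OFS=, '
    FNR == NR { A[$1 FS $2 FS $6 FS $4] = $3 OFS $5; next }
    (($12 FS $4 FS $6 FS $5) in A) {
        # $1 = id, $2 = chain, $12 = offer, then the stored quantity,offervalue
        print $1, $2, $12, A[$12 FS $4 FS $6 FS $5]
    }
' "$tmp/file4" "$tmp/file3")

printf '%s\n' "$selected"
rm -rf "$tmp"
```

Because OFS is a comma, the comma-separated arguments to print are written as comma-separated fields, so the output stays valid CSV.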