Comparing two CSV files

06-01-2016

file1:

Code:

ID,Zip,Address,Parent,Country
9874125,43232,"493 Marietta St",21152,'United States'
4845622,85489,"434 Beach St",21542,'United States'
9874126,43234,"368 John's Creek Way",21122,'United States'
9874122,43233,"345 Cherry Place",21152,'United States'

file2:

Code:

Zip, Parent
43232,21152
43234,21122

Desired output:

Code:

ID,Zip,Address,Parent,Country
9874125,43232,"493 Marietta St",21152,'United States'
9874126,43234,"368 John's Creek Way",21122,'United States'

This is just a sample of the data; there are thousands more lines.

When I run:

Code:

awk 'FNR==NR {zip[$1]; next} $2 in zip' file2 file1

I don't get anything in the output.

Also these files are separated by commas because they are CSV files.

---------- Post updated at 05:41 PM ---------- Previous update was at 05:25 PM ----------

Never mind, figured it out! Just needed to add a -F ', ' delimiter! Thanks. Final command:

Code:

awk -F ',' 'FNR==NR {zip[$1]; next} $2 in zip' file2 file1

Last edited by dan139; 06-01-2016 at 08:08 PM..

dan139

06-01-2016

If you're trying to match on both of the fields that are present in file2, you might want to try something more like:

Code:

awk -F ',' 'FNR==NR {zip_parent[$1,$2]; next} ($2,$4) in zip_parent' file2 file1

And, note that with a <comma> character as your field separator, you MUST use -F ',' with no space between the <comma> character and the closing single-quote character. (With -F ', ', you are specifying a field separator that is a <comma> character followed by a <space> character.)
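
To see the difference for yourself, here is a small sketch using the sample data from the first post. The scratch directory and the variable names (tmp, good, bad, both) are only my choices for illustration, and it assumes no field contains an embedded comma:

```shell
# Scratch copies of the sample files from the first post (hypothetical paths).
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
ID,Zip,Address,Parent,Country
9874125,43232,"493 Marietta St",21152,'United States'
4845622,85489,"434 Beach St",21542,'United States'
9874126,43234,"368 John's Creek Way",21122,'United States'
9874122,43233,"345 Cherry Place",21152,'United States'
EOF
cat > "$tmp/file2" <<'EOF'
Zip, Parent
43232,21152
43234,21122
EOF

# Separator is exactly one comma: $2 of file1 is the zip code.
good=$(awk -F ',' 'FNR==NR {zip[$1]; next} $2 in zip' "$tmp/file2" "$tmp/file1")

# Separator is comma followed by space: the data lines contain no such
# sequence, so each line is a single field and $2 is always empty.
bad=$(awk -F ', ' 'FNR==NR {zip[$1]; next} $2 in zip' "$tmp/file2" "$tmp/file1")

# Matching on both zip and parent with the correct separator.
both=$(awk -F ',' 'FNR==NR {zip_parent[$1,$2]; next} ($2,$4) in zip_parent' "$tmp/file2" "$tmp/file1")

printf 'with a comma only:\n%s\n' "$good"
if [ -z "$bad" ]; then
	printf 'with comma and space: no output\n'
fi
printf 'matching zip and parent:\n%s\n' "$both"

rm -rf "$tmp"
```

With the single-field key the header line of file1 is printed as well, because "Zip" from the header of file2 ends up in the array. With the two-field key it is not, because the header of file2 has a space before "Parent".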

Don Cragun

06-02-2016

Hi again -

So after reexamining the output using the following command:

Code:

awk -F ',' 'FNR==NR {zip[$1]; next} $2 in zip' file2 file1

the output has fewer rows than file2, which should be impossible. Every row in file2 exists in file1, but with added data. For example:

file1 has 1000 rows with zip codes to search through.
file2 has 500 zip codes we are looking for.

The output of this command (file3) has only 350 zip codes when it should have 500. I know for a fact that every zip code in file2 exists in file1.

Does anyone know what the problem could be?

dan139

06-02-2016

Perhaps, you can run the opposite to diagnose the problem:

Code:

awk -F ',' 'FNR==NR {zip[$1]; next} !($2 in zip)' file2 file1

That will show the lines from file1 that did not make it into the output. After that, you can look at why they differ from what you expect.

---------- Post updated at 04:15 PM ---------- Previous update was at 04:01 PM ----------

After re-reading your post, I think a better test would be:

Code:

awk -F ',' 'FNR==NR {zip[$2]; next} !($1 in zip)' file1 file2

That will show the lines in file2 whose zip has no match in file1, meaning those are the zips that will not produce a result when you run the real program.
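
As a rough illustration of that reverse check, here is a sketch with the sample file1 from the first post and a made-up file2 that contains one zip (99999) that is not in file1. The file contents, the scratch directory and the variable name unmatched are assumptions for the example only:

```shell
# Scratch files; file2 here is hypothetical and contains one unknown zip.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
ID,Zip,Address,Parent,Country
9874125,43232,"493 Marietta St",21152,'United States'
4845622,85489,"434 Beach St",21542,'United States'
9874126,43234,"368 John's Creek Way",21122,'United States'
9874122,43233,"345 Cherry Place",21152,'United States'
EOF
cat > "$tmp/file2" <<'EOF'
Zip, Parent
43232,21152
99999,00000
EOF

# Load the zip column ($2) of file1 first, then print every line of file2
# whose zip ($1) was never seen in file1.
unmatched=$(awk -F ',' 'FNR==NR {zip[$2]; next} !($1 in zip)' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$unmatched"

rm -rf "$tmp"
```

If this prints nothing on the real files, every zip in file2 does have a match in file1, and the missing rows have to be explained some other way.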

Aia

06-02-2016

Please also show us what output you get from the following awk script:

Code:

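awk -F, '
# Skip the header line.
NR == 1 {
	next
}
# Count each zip code the first time it is seen.
!($1 in z) {
	c++
}
# Count how many lines each zip code appears on.
{	z[$1]++
}
END {	printf("%d data lines read\n", NR - 1)
	printf("%d unique zip codes read\n", c)
	# Report zip codes that appear on more than one line.
	for(i in z)
		if(z[i] > 1)
			printf("zip:%s appears %d times\n", i, z[i])
}' file2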

As has been said many times before, if you are running this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.

With what you have told us so far, I would expect the 2nd line of output to be about 350 and I would expect several lines following that listing zip codes that appear on more than one line in file2 (i.e., some zip codes have more than one parent).
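
Here is a small sketch of the situation I suspect. It uses the sample file1 from the first post and a made-up file2 in which zip 43232 is listed with two different parents (the second parent, 21999, is invented for the example, as are the variable names):

```shell
# Scratch files; file2 is hypothetical and repeats one zip code.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
ID,Zip,Address,Parent,Country
9874125,43232,"493 Marietta St",21152,'United States'
4845622,85489,"434 Beach St",21542,'United States'
9874126,43234,"368 John's Creek Way",21122,'United States'
9874122,43233,"345 Cherry Place",21152,'United States'
EOF
cat > "$tmp/file2" <<'EOF'
Zip, Parent
43232,21152
43234,21122
43232,21999
EOF

# Number of data lines in file2 (header excluded).
lines=$(awk 'END {print NR - 1}' "$tmp/file2")

# Number of distinct zip codes in file2 (header excluded).
unique=$(awk -F ',' 'NR > 1 && !($1 in z) {z[$1]; c++} END {print c + 0}' "$tmp/file2")

# Number of file1 data lines selected by the zip-only match.
matched=$(awk -F ',' 'FNR==NR {zip[$1]; next} FNR > 1 && ($2 in zip) {n++} END {print n + 0}' "$tmp/file2" "$tmp/file1")

printf 'file2 data lines: %s\n' "$lines"
printf 'unique zips in file2: %s\n' "$unique"
printf 'file1 lines matched: %s\n' "$matched"

rm -rf "$tmp"
```

The zip-only match can never select more file1 lines than there are distinct zips in file2 when each zip appears once in file1, so a file2 with repeated zips gives fewer output lines than file2 has rows.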

Don Cragun
