Find common numbers from two very large files using awk or the like

04-26-2013

Code:

$ cat file1
111111111111111
123456000017214
123456000017255
123456000018300
123456000100123
123456000100253
223456000001212
223456000013212

Code:

$ cat file2
123456000017214
123456000017255
123456000018300
123456000100123
123456000100253
223456000001212
223456000013212
999999999999999

Code:

$ cat scottie.sh
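# Prefix every line with a label naming its source file.
sed "s/.*/1 &/" file1 > file1.lbl
sed "s/.*/2 &/" file2 > file2.lbl
# Sort on the number (field 2) so that equal numbers become adjacent.
cat file1.lbl file2.lbl | sort -n -k 2 > all.lbl
# Ignore the label (-f 1) and print one copy of each repeated number.
# Note: a number repeated within a single file is reported as well.
uniq -d -f 1 all.lbl | cut -f 2 -d " "
rm file1.lbl file2.lbl all.lbl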

Code:

$ ./scottie.sh
123456000017214
123456000017255
123456000018300
123456000100123
123456000100253
223456000001212
223456000013212

hanson44

04-26-2013

Quote:

Originally Posted by hanson44

Code:

cat file1.lbl file2.lbl | sort -n -k 2 > all.lbl
uniq -d -f 1 all.lbl | cut -f 2 -d " "

There is no point in specifying a numeric sort because uniq only understands lexicographic sorts. This approach will only work when a data set's numeric sort is identical to its lexicographic sort. This is true in this case because all numbers have the same number of digits and consist of nothing but digits (no signs, no radix point).
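For anyone who wants to check their own data against the condition described above, here is a minimal sketch. The `same_order` helper and the scratch file names are hypothetical and not taken from the posts; the helper sorts a file numerically and then asks `sort -c` whether that result is also in lexicographic (C locale) order.

```shell
#!/bin/sh
# Sketch: does a file's numeric order match its lexicographic order?
tmp=$(mktemp -d)

# Made-up sample with varying digit counts.
printf '%s\n' 9 10 100 > "$tmp/varied"
# Equal-length values taken from the sample data in this thread.
printf '%s\n' 123456000018300 123456000017214 223456000001212 > "$tmp/fixed"

# same_order FILE: succeed if the numeric sort of FILE is also
# lexicographically sorted (hypothetical helper, for illustration only).
same_order() {
    sort -n "$1" > "$tmp/numeric"
    LC_ALL=C sort -c "$tmp/numeric" 2>/dev/null
}

if same_order "$tmp/varied"; then varied=same; else varied=different; fi
if same_order "$tmp/fixed"; then fixed=same; else fixed=different; fi

echo "varied-length numbers: $varied"
echo "equal-length numbers:  $fixed"

rm -rf "$tmp"
```

With the made-up values, 9 sorts before 10 numerically but after it lexicographically, so only the equal-length sample passes the check.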

Regards,
Alister

---------- Post updated at 07:10 PM ---------- Previous update was at 07:09 PM ----------

Quote:

Originally Posted by Scottie1954

I've got two files that each contain a 16-digit number in positions 1-16. The first file has 63,120 entries all sorted numerically. The second file has 142,479 entries, also sorted numerically.

I want to read through each file and output the entries that appear in both. So far I've had no success with comm -12.

Is there something in positions beyond 16? Is there trailing whitespace in either file? I ask because the lexicographic sort of the data sample in post #4 is identical to its numeric sort, so comm -12 should work well.
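One way to answer those questions is sketched below. It is only an illustration under assumptions: the 16-digit values are made up, and the trailing carriage return in the second file is a hypothetical cause, not something the OP reported. The sketch counts lines that are not exactly 16 digits, then compares only positions 1-16.

```shell
#!/bin/sh
# Sketch: look for anything other than exactly 16 digits per line,
# then compare only positions 1-16. All values below are made up.
LC_ALL=C; export LC_ALL
tmp=$(mktemp -d)

printf '1234560000172140\n2234560000132120\n' > "$tmp/file1"
# Hypothetical problem: file2 carries DOS line endings (a trailing CR).
printf '1234560000172140\r\n2234560000132120\r\n' > "$tmp/file2"

# Count lines that are not exactly 16 digits.
bad1=$(grep -c -v -x '[0-9]\{16\}' "$tmp/file1")
bad2=$(grep -c -v -x '[0-9]\{16\}' "$tmp/file2")
echo "suspect lines: file1=$bad1 file2=$bad2"

# The invisible CR makes every line of file2 differ from file1.
raw=$(comm -12 "$tmp/file1" "$tmp/file2" 2>/dev/null)

# Keep positions 1-16 only, re-sort lexicographically, compare again.
cut -c 1-16 "$tmp/file1" | sort > "$tmp/clean1"
cut -c 1-16 "$tmp/file2" | sort > "$tmp/clean2"
clean=$(comm -12 "$tmp/clean1" "$tmp/clean2")

echo "common before cleanup: [$raw]"
echo "common after cleanup:"
printf '%s\n' "$clean"

rm -rf "$tmp"
```

If the suspect-line count is nonzero for either real file, that would be a likely reason for comm -12 coming up empty.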

Regards,
Alister

Last edited by alister; 04-26-2013 at 08:21 PM..

alister

04-26-2013

Quote:

There is no point in specifying a numeric sort

You're right. In this case, the numeric sort has no effect for good or ill, but it is superfluous and should not be used:

Code:

$ cat scottie.sh
sed "s/.*/1 &/" file1 > file1.lbl
sed "s/.*/2 &/" file2 > file2.lbl
cat file1.lbl file2.lbl | sort -k 2 > all.lbl
uniq -d -f 1 all.lbl | cut -f 2 -d " "
rm file1.lbl file2.lbl all.lbl

Code:

$ ./scottie.sh
123456000017214
123456000017255
123456000018300
123456000100123
123456000100253
223456000001212
223456000013212

-------------------------------
Or, as suggested by alister:

Code:

$ comm -1 -2 file1 file2
123456000017214
123456000017255
123456000018300
123456000100123
123456000100253
223456000001212
223456000013212

But the OP said there was some problem with this.

hanson44

04-26-2013

Quote:

Originally Posted by hanson44

You're right. In this case, the numeric sort has no effect for good or ill, but it is superfluous and should not be used

To make sure I made my point, please allow me to reiterate: In every case, it is a mistake to feed a numerically sorted file to a tool which only understands lexicographic sorting. In some cases, such as this one, it may not hurt, but it is never the right thing to do.

Tools which require lexicographic sorting include comm, join, and uniq.
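A small sketch of how this can go wrong with comm, using made-up values (2, 10, 11) that are not from the thread. Both inputs are in numeric order, yet the shared value 10 is not reported until the files are re-sorted lexicographically in the same locale comm runs in.

```shell
#!/bin/sh
# Sketch: comm misses a common value when its inputs are in numeric order.
LC_ALL=C; export LC_ALL
tmp=$(mktemp -d)

# Made-up data, each file in numeric order.
printf '%s\n' 2 10 > "$tmp/a"
printf '%s\n' 10 11 > "$tmp/b"

# Numeric order is not what comm expects; GNU comm may also warn on stderr.
wrong=$(comm -12 "$tmp/a" "$tmp/b" 2>/dev/null)

# Re-sort lexicographically before comparing.
sort "$tmp/a" > "$tmp/a.sorted"
sort "$tmp/b" > "$tmp/b.sorted"
right=$(comm -12 "$tmp/a.sorted" "$tmp/b.sorted")

echo "numeric input:       [$wrong]"
echo "lexicographic input: [$right]"

rm -rf "$tmp"
```

The failure comes from the merge step: comm sees "2" as greater than "10" and walks past the matching line in the second file.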

join requires special attention because by default it requires sort -b, but if join's -t option is used, sort's -b must not be.
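As a rough illustration of the default pairing mentioned above, the sketch below sorts with `sort -b` and then runs a plain `join` on a few values from the thread's sample data. The scratch file names are assumptions, and since these lines have no leading blanks, `-b` makes no practical difference here.

```shell
#!/bin/sh
# Sketch: common numbers via join, with inputs sorted the way default
# join expects (sort -b, same locale for both tools).
LC_ALL=C; export LC_ALL
tmp=$(mktemp -d)

# A few values from the sample files in this thread.
printf '%s\n' 111111111111111 123456000017214 223456000013212 > "$tmp/file1"
printf '%s\n' 123456000017214 223456000013212 999999999999999 > "$tmp/file2"

sort -b "$tmp/file1" > "$tmp/f1.sorted"
sort -b "$tmp/file2" > "$tmp/f2.sorted"

# With one field per line, join prints just the matching key.
common=$(join "$tmp/f1.sorted" "$tmp/f2.sorted")
printf '%s\n' "$common"

rm -rf "$tmp"
```

For this single-column data, comm -12 on the same sorted files would give the same answer; join becomes more useful when each line carries extra fields.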

Quote:

Originally Posted by hanson44

Code:

$ comm -1 -2 file1 file2

But the OP said there was some problem with this.

And that piqued my curiosity, because it should work if the actual data does not deviate from the form of the sample data provided in post #4.

Regards,
Alister

Last edited by alister; 04-26-2013 at 10:17 PM..

alister

04-26-2013

Quote:

To make sure I made my point

Yes, you made your point, and I understood it the first time. I just put in the -n flag by habit. In this case there was no difference, but -n is superfluous. In other cases there could be a difference, depending on the data. Unlike sort -n, uniq never tries to equate "08" with "8"; it just looks for identical adjacent lines. I appreciate your trying to ensure that I really got the point, because it is important.
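Here is a sketch of one such case, with made-up labelled lines in the style of scottie.sh (file 1 holds 8; file 2 holds 08 and 8). It assumes a sort that breaks numeric ties by comparing whole lines, as GNU sort does; other implementations may order the ties differently.

```shell
#!/bin/sh
# Sketch: "8" and "08" are equal to sort -n but different to uniq.
LC_ALL=C; export LC_ALL
tmp=$(mktemp -d)

# Made-up labelled lines: label, then number.
printf '%s\n' '1 8' '2 08' '2 8' > "$tmp/all.lbl"

# Numeric sort: all three keys compare equal, so "2 08" can end up
# between the two identical values and uniq -d finds no adjacent pair.
with_n=$(sort -n -k 2 "$tmp/all.lbl" | uniq -d -f 1 | cut -f 2 -d " ")

# Lexicographic sort: the two "8" lines become adjacent.
without_n=$(sort -k 2 "$tmp/all.lbl" | uniq -d -f 1 | cut -f 2 -d " ")

echo "with -n:    [$with_n]"
echo "without -n: [$without_n]"

rm -rf "$tmp"
```

So the -n flag is harmless only while no two different spellings of the same value appear in the data.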

hanson44

04-27-2013

Code:

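awk '
# First file (NR==FNR): remember each number as an array key.
NR==FNR {a[$1]=1; next}
# Second file: print lines whose number was seen in the first file.
($1 in a)
' shortfile longfile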

A $1+0 cast is not needed because all numbers have equal length.
HP-UX awk is very similar to nawk.
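To try the approach end to end, the sketch below runs the same awk logic on the sample data from earlier in the thread. The scratch directory and the array name `seen` are assumptions made for illustration. Keys are compared as strings, which is enough when every number has the same length; it also avoids relying on how a given awk stores long digit strings as numbers.

```shell
#!/bin/sh
# Sketch: hash the first file's numbers, then filter the second file.
tmp=$(mktemp -d)

# Sample data as posted earlier in this thread.
printf '%s\n' 111111111111111 123456000017214 123456000017255 \
    123456000018300 123456000100123 123456000100253 \
    223456000001212 223456000013212 > "$tmp/file1"
printf '%s\n' 123456000017214 123456000017255 123456000018300 \
    123456000100123 123456000100253 223456000001212 \
    223456000013212 999999999999999 > "$tmp/file2"

common=$(awk '
    # First file: remember every number as an array key.
    NR == FNR { seen[$1] = 1; next }
    # Second file: print each line whose number was remembered.
    $1 in seen
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$common"

rm -rf "$tmp"
```

Neither input needs to be sorted for this to work; the output simply follows the order of the second file. Passing the shorter file first keeps the array small.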

Last edited by MadeInGermany; 04-28-2013 at 01:46 PM..

MadeInGermany

05-02-2013

Thanks to all who replied. The reason sdiff wouldn't work is that the bigger file had many more 16-digit entries in between the matches in the smaller file, so a line-to-line comparison between them wasn't successful: common entries were on very different line numbers in each file.

Instead of shell scripting I found a solution using a database reporting tool called Visimage, which read in both files as flat databases and then found the matches between them. Just for my own knowledge, I'm going to try the awk solution posted above in #13. Thanks, all.

---------- Post updated 05-02-13 at 11:36 AM ---------- Previous update was 05-01-13 at 12:00 PM ----------

Thanks to MIG for the awk code below. It worked.

Quote:

Originally Posted by MadeInGermany

Code:

awk '
NR==FNR {a[$1]=1; next}
($1 in a)
' shortfile longfile

A $1+0 cast is not needed because all numbers have equal length.
HP-UX awk is very similar to nawk.

Scottie1954
