Find common numbers from two very large files using awk or the like


 
# 8  
Old 04-26-2013
Code:
$ cat file1
111111111111111
123456000017214
123456000017255
123456000018300
123456000100123
123456000100253
223456000001212
223456000013212

Code:
$ cat file2
123456000017214
123456000017255
123456000018300
123456000100123
123456000100253
223456000001212
223456000013212
999999999999999

Code:
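$ cat scottie.sh
# Prefix every line with the number of the file it came from (1 or 2).
sed "s/.*/1 &/" file1 > file1.lbl
sed "s/.*/2 &/" file2 > file2.lbl
# Merge the labelled files and sort on the number in field 2.
cat file1.lbl file2.lbl | sort -n -k 2 > all.lbl
# Skipping the label, print one copy of each repeated number, then drop the label.
uniq -d -f 1 all.lbl | cut -f 2 -d " "
rm file1.lbl file2.lbl all.lbl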

Code:
$ ./scottie.sh
123456000017214
123456000017255
123456000018300
123456000100123
123456000100253
223456000001212
223456000013212

# 9  
Old 04-26-2013
Quote:
Originally Posted by hanson44
Code:
cat file1.lbl file2.lbl | sort -n -k 2 > all.lbl
uniq -d -f 1 all.lbl | cut -f 2 -d " "

There is no point in specifying a numeric sort because uniq only understands lexicographic sorts. This approach will only work when a data set's numeric sort is identical to its lexicographic sort. This is true in this case because all numbers have the same number of digits and consist of nothing but digits (no signs, no radix point).
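For anyone who wants to see the difference, here is a minimal sketch with made-up numbers of unequal length (not the OP's data). Setting LC_ALL=C is my own assumption, used only to pin the collation order:

```shell
# Hypothetical sample: numbers with different digit counts.
nums=$(printf '%s\n' 9 10 100 20)

# Numeric order compares values; lexicographic order compares characters.
numeric=$(printf '%s\n' "$nums" | LC_ALL=C sort -n)
lexical=$(printf '%s\n' "$nums" | LC_ALL=C sort)

echo "numeric:" $numeric
echo "lexical:" $lexical
```

The two orders only coincide when every number has the same number of digits, as in the sample files.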

Regards,
Alister

---------- Post updated at 07:10 PM ---------- Previous update was at 07:09 PM ----------

Quote:
Originally Posted by Scottie1954
I've got two files that each contain a 16-digit number in positions 1-16. The first file has 63,120 entries all sorted numerically. The second file has 142,479 entries, also sorted numerically.

I want to read through each file and output the entries that appear in both. So far I've had no success with comm -12
Is there something in positions beyond 16? No trailing whitespace in either file? Because, since the lexicographic sort of the data sample in post #4 is identical to its numeric sort, comm -12 should work well.
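A quick way to check, sketched with a made-up demo file (the temporary directory and the file contents are hypothetical, not the OP's data): list every line that contains anything other than digits, then strip blanks, tabs and carriage returns and check again.

```shell
# Hypothetical demo file: line 2 has a trailing space, line 3 a carriage return.
tmp=$(mktemp -d)
printf '123456000017214\n123456000017255 \n123456000018300\r\n' > "$tmp/file1"

# Line numbers of every line that is not made up solely of digits.
bad=$(grep -n -v -x -E '[0-9]+' "$tmp/file1" | cut -d: -f1)
echo "suspect lines:" $bad

# Strip blanks, tabs and carriage returns, then count the suspect lines again.
tr -d ' \t\r' < "$tmp/file1" > "$tmp/file1.clean"
remaining=$(grep -c -v -x -E '[0-9]+' "$tmp/file1.clean" || true)
echo "suspect lines after cleaning: $remaining"

rm -r "$tmp"
```

If the real files show suspect lines, running comm -12 on the cleaned copies would be the next thing to try.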

Regards,
Alister

Last edited by alister; 04-26-2013 at 08:21 PM..
# 10  
Old 04-26-2013
Quote:
There is no point in specifying a numeric sort
You're right. In this case the numeric sort makes no difference for good or ill, but it is superfluous and should not be used:
Code:
$ cat scottie.sh
sed "s/.*/1 &/" file1 > file1.lbl
sed "s/.*/2 &/" file2 > file2.lbl
cat file1.lbl file2.lbl | sort -k 2 > all.lbl
uniq -d -f 1 all.lbl | cut -f 2 -d " "
rm file1.lbl file2.lbl all.lbl

Code:
$ ./scottie.sh
123456000017214
123456000017255
123456000018300
123456000100123
123456000100253
223456000001212
223456000013212

-------------------------------
Or, as suggested by alister:
Code:
$ comm -1 -2 file1 file2
123456000017214
123456000017255
123456000018300
123456000100123
123456000100253
223456000001212
223456000013212

But the OP said there was some problem with this.
# 11  
Old 04-26-2013
Quote:
Originally Posted by hanson44
You're right. In this case the numeric sort makes no difference for good or ill, but it is superfluous and should not be used
To make sure I made my point, please allow me to reiterate: In every case, it is a mistake to feed a numerically sorted file to a tool which only understands lexicographic sorting. In some cases, such as this one, it may not hurt, but it is never the right thing to do.

Tools which require lexicographic sorting include comm, join, and uniq.
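As a minimal sketch of the safe pattern (the input numbers and temporary file names are made up for illustration, and LC_ALL=C is my assumption to keep sort and comm in the same collation order): re-sort numerically ordered data lexicographically before handing it to comm.

```shell
tmp=$(mktemp -d)
# Hypothetical inputs in numeric order, with differing digit counts.
printf '%s\n' 9 10 100 > "$tmp/a"
printf '%s\n' 10 20 100 > "$tmp/b"

# Re-sort lexicographically, in the same locale comm will run in.
LC_ALL=C sort "$tmp/a" > "$tmp/a.lex"
LC_ALL=C sort "$tmp/b" > "$tmp/b.lex"

# Lines present in both files.
common=$(LC_ALL=C comm -12 "$tmp/a.lex" "$tmp/b.lex")
printf '%s\n' "$common"

rm -r "$tmp"
```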

join requires special attention because by default it requires sort -b, but if join's -t option is used, sort's -b must not be.
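A small sketch of the -t case, using hypothetical comma-separated files (the names and contents are invented for illustration): sort on the join field with the same separator and without -b, then join.

```shell
tmp=$(mktemp -d)
# Hypothetical comma-separated inputs, deliberately unsorted.
printf '%s\n' 'k2,beta' 'k1,alpha' > "$tmp/left"
printf '%s\n' 'k1,one' 'k3,three' 'k2,two' > "$tmp/right"

# Sort on field 1 with the same separator join will use; no -b here.
LC_ALL=C sort -t, -k1,1 "$tmp/left" > "$tmp/left.s"
LC_ALL=C sort -t, -k1,1 "$tmp/right" > "$tmp/right.s"

# Join on the first field of each file.
joined=$(LC_ALL=C join -t, "$tmp/left.s" "$tmp/right.s")
printf '%s\n' "$joined"

rm -r "$tmp"
```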

Quote:
Originally Posted by hanson44
Code:
$ comm -1 -2 file1 file2

But the OP said there was some problem with this.
And that piqued my curiosity, because it should work if the actual data does not deviate from the form of the sample data provided in post #4.

Regards,
Alister

Last edited by alister; 04-26-2013 at 10:17 PM..
# 12  
Old 04-26-2013
Quote:
To make sure I made my point
Yes, you made your point, and I had already understood it. I put in the -n flag out of habit. In this case it made no difference, but -n is superfluous. In other cases it could make a difference. Unlike sort -n, uniq never tries to equate "08" with "8"; it just looks for identical adjacent lines. I appreciate your making sure I really got the point, because it is important.
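To illustrate with two made-up lines (a sketch only, not the OP's data): a numeric sort with -u treats 08 and 8 as the same key, while uniq compares the text and keeps both.

```shell
# Hypothetical input: the same value written two ways.
lines=$(printf '%s\n' 08 8)

# sort -n -u compares numerically, so 08 and 8 count as one key.
numeric_count=$(printf '%s\n' "$lines" | sort -n -u | grep -c .)
# uniq compares the text of adjacent lines, so both survive.
uniq_count=$(printf '%s\n' "$lines" | uniq | grep -c .)

echo "sort -n -u keeps $numeric_count line(s), uniq keeps $uniq_count"
```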
# 13  
Old 04-27-2013
Code:
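awk '
NR==FNR {a[$1]=1; next}   # first file: remember each number as an array key
($1 in a)                 # second file: print lines whose number was remembered
' shortfile longfile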

A $1+0 cast is not needed because all numbers have equal length.
HP-UX awk is very similar to nawk.
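For comparison, a sketch of when the cast would matter, using made-up short numbers with leading zeros (the file names and contents are hypothetical). Note that in awks that store numbers as double-precision floats, integers above 2^53 (about 9.0e15) are not exact, so for long numbers like the OP's the plain string comparison is the safer choice anyway.

```shell
tmp=$(mktemp -d)
# Hypothetical inputs: the same values written with and without leading zeros.
printf '%s\n' 007 42 > "$tmp/short"
printf '%s\n' 7 042 99 > "$tmp/long"

# String keys: "007" and "7" are different, so nothing matches.
string_match=$(awk 'NR==FNR {a[$1]=1; next} ($1 in a)' "$tmp/short" "$tmp/long")

# Numeric keys: $1+0 normalises both spellings to the same value.
numeric_match=$(awk 'NR==FNR {a[$1+0]=1; next} (($1+0) in a)' "$tmp/short" "$tmp/long")

echo "string match: [$(echo $string_match)]"
echo "numeric match: [$(echo $numeric_match)]"

rm -r "$tmp"
```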

Last edited by MadeInGermany; 04-28-2013 at 01:46 PM..
# 14  
Old 05-02-2013
Thanks to all who replied. The reason sdiff wouldn't work is that the bigger file had many more 16-digit entries in between the matches in the smaller file, so a line-by-line comparison between them wasn't successful: common entries were on very different line numbers in each file.

Instead of shell scripting I found a solution using a database reporting tool called Visimage, which read in both files as flat databases and then found the matches between them. Just for my own knowledge, I'm going to try the awk solution posted above in #13. Thanks, all.

---------- Post updated 05-02-13 at 11:36 AM ---------- Previous update was 05-01-13 at 12:00 PM ----------

Thanks to MIG for the awk code below. It worked.

Quote:
Originally Posted by MadeInGermany
Code:
awk '
NR==FNR {a[$1]=1; next}
($1 in a)
' shortfile longfile

A $1+0 cast is not needed because all numbers have equal length.
HP-UX awk is very similar to nawk.
 