Find common numbers from two very large files using awk or the like


 
# 1  
Old 04-26-2013

I've got two files that each contain a 16-digit number in positions 1-16. The first file has 63,120 entries all sorted numerically. The second file has 142,479 entries, also sorted numerically.

I want to read through each file and output the entries that appear in both. So far I've had no success with comm -12, nor with grep -f. I've had some success with sdiff, but it's not entirely accurate, as it's missing some matches.

What I need is a script that loops through one file to see if an entry corresponds to the other file, but this is beyond my skills.

I am using sh on hp-ux 11.31, so I can't use nawk or gawk, etc.

Thank you for your assistance.

Last edited by Scottie1954; 04-26-2013 at 05:52 PM..
# 2  
Old 04-26-2013
Using code tags, could you send a few actual lines from each file? Is the format of the lines consistent through the file, or does it vary?
# 3  
Old 04-26-2013
How long are the lines in these input files?
# 4  
Old 04-26-2013
These are 16 digit numbers, sorted numerically:


Code:
123456000017214
123456000017255
123456000018300
123456000100123
123456000100253
223456000001212
223456000013212

# 5  
Old 04-26-2013
Grepping with a large pattern file is not very efficient, but you could at least try this approach:
Code:
$ cut -c1-16 file1 > patternfile
$ grep -f patternfile file2


BTW - why should sdiff miss some matches? This is difficult to believe!
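
The same cut step may also make comm -12 usable. comm compares whole lines, so if anything follows the number in columns 1-16, no line will match exactly; it also expects its inputs sorted in the collation order it uses itself. Here is a minimal sketch, assuming the key really is in columns 1-16. The sample rows and the names keys1, keys2 and common_keys.txt are made up for illustration:

```shell
# Hypothetical sample data standing in for the real file1 and file2.
printf '%s\n' '1234560000172140 A' '1234560000172550 B' '2234560000012120 C' > file1
printf '%s\n' '1234560000172550 X' '1234560000183000 Y' '2234560000012120 Z' > file2

# Keep only the key and sort it bytewise, so sort and comm agree on the order.
cut -c1-16 file1 | LC_ALL=C sort > keys1
cut -c1-16 file2 | LC_ALL=C sort > keys2

# -12 suppresses the lines unique to either file, leaving the common keys.
LC_ALL=C comm -12 keys1 keys2 > common_keys.txt
cat common_keys.txt
```

This prints only the keys, not the rest of each line.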

Last edited by RudiC; 04-26-2013 at 06:46 PM..
# 6  
Old 04-26-2013
Quote:
Originally Posted by Scottie1954
I am using sh on hp-ux 11.31, so I can't use nawk or gawk, etc.
How about awk? I do have it in HP-UX B.11.23
Code:
$ what /usr/bin/awk
/usr/bin/awk:
        $Revision: 92453-07 linker linker crt0.o B.11.16.01 030415 $
         main.c $Date: 2008/05/19 14:40:42 $Revision: r11.23/3 PATCH_11.23 (PHCO_38267)
         lib.c $Date: 2007/02/23 16:15:06 $Revision: r11.23/2 PATCH_11.23 (PHCO_36053)
         run.c $Date: 2008/05/19 14:40:53 $Revision: r11.23/1 PATCH_11.23 (PHCO_38267)
         $Revision: @(#) awk R11.23_BL2008_0602_1 PATCH_11.23 PHCO_38267

# 7  
Old 04-26-2013
Yes, I have awk. GNU utilities like nawk or gawk aren't installed on my OS. Thank you.
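
With plain awk available, one common approach is a two-pass lookup: load the keys of the first file into an array, then print the lines of the second file whose key is in that array. The following is only a sketch, assuming the key sits in columns 1-16 and that /usr/bin/awk supports the POSIX `in` operator; the sample rows and the output name common.txt are hypothetical:

```shell
# Hypothetical sample data standing in for the real file1 and file2.
printf '%s\n' '1234560000172140 A' '1234560000172550 B' '2234560000012120 C' > file1
printf '%s\n' '1234560000172550 X' '1234560000183000 Y' '2234560000012120 Z' > file2

# Pass 1 (NR == FNR, i.e. still reading file1): remember each 16-character key.
# Pass 2 (file2): print every line whose key was seen in file1.
awk 'NR == FNR { seen[substr($0, 1, 16)]; next }
     substr($0, 1, 16) in seen' file1 file2 > common.txt
cat common.txt
```

Neither file needs to be sorted for this, and 63,120 keys should fit comfortably in memory. The output is the matching lines from file2, in file2's order.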
 