Find common numbers from two very large files using awk or the like


 
# 1  

I've got two files that each contain a 16-digit number in positions 1-16. The first file has 63,120 entries all sorted numerically. The second file has 142,479 entries, also sorted numerically.

I want to read through each file and output the entries that appear in both. So far I've had no success with comm -12, nor with grep -f. I've had some success with sdiff, but it's not entirely accurate, as it misses some matches.

What I need is a script that loops through one file to see if an entry corresponds to the other file, but this is beyond my skills.

I am using sh on hp-ux 11.31, so I can't use nawk or gawk, etc.

Thank you for your assistance.

Last edited by Scottie1954; 04-26-2013 at 06:52 PM..
# 2  
Using code tags, could you send a few actual lines from each file? Is the format of the lines consistent through the file, or does it vary?
# 3  
How long are the lines in these input files?
# 4  
These are 16-digit numbers, sorted numerically:


Code:
123456000017214
123456000017255
123456000018300
123456000100123
123456000100253
223456000001212
223456000013212

# 5  
Grepping with a large pattern file is NOT very efficient, but you could at least try this approach:
Code:
$ cut -c1-16 file1 > patternfile
$ grep -f patternfile file2


BTW - why should sdiff miss some matches? This is difficult to believe!
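
A possible refinement of the grep -f idea above, offered as a sketch rather than a tested fix for this system: it assumes your grep accepts the POSIX options -F (fixed strings) and -x (whole-line match). Trimming both files to the 16-character key first means only positions 1-16 are compared. The sample numbers and the temporary directory name are hypothetical.

```shell
# Hypothetical sample data standing in for the real file1 and file2.
tmpdir=${TMPDIR:-/tmp}/common_grep.$$
mkdir -p "$tmpdir"
printf '%s\n' 1234560000172140 1234560000172550 \
              1234560000183000 2234560000012120 > "$tmpdir/file1"
printf '%s\n' '1234560000172550 extra' 1234560000999990 \
              2234560000012120 2234560000132120 > "$tmpdir/file2"

# Build the pattern file from the 16-character keys, dropping duplicates.
cut -c1-16 "$tmpdir/file1" | sort -u > "$tmpdir/patternfile"

# -F: treat each pattern as a fixed string, not a regular expression.
# -x: a pattern must match the whole (already trimmed) line.
common=$(cut -c1-16 "$tmpdir/file2" | grep -F -x -f "$tmpdir/patternfile")

printf '%s\n' "$common"
rm -rf "$tmpdir"
```

Fixed-string matching is usually much cheaper than treating tens of thousands of patterns as regular expressions, and -x avoids a key matching somewhere in the middle of a longer line.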

Last edited by RudiC; 04-26-2013 at 07:46 PM..
# 6  
Quote:
Originally Posted by Scottie1954
I am using sh on hp-ux 11.31, so I can't use nawk or gawk, etc.
How about awk? I do have it in HP-UX B.11.23
Code:
$ what /usr/bin/awk
/usr/bin/awk:
        $Revision: 92453-07 linker linker crt0.o B.11.16.01 030415 $
         main.c $Date: 2008/05/19 14:40:42 $Revision: r11.23/3 PATCH_11.23 (PHCO_38267)
         lib.c $Date: 2007/02/23 16:15:06 $Revision: r11.23/2 PATCH_11.23 (PHCO_36053)
         run.c $Date: 2008/05/19 14:40:53 $Revision: r11.23/1 PATCH_11.23 (PHCO_38267)
         $Revision: @(#) awk R11.23_BL2008_0602_1 PATCH_11.23 PHCO_38267

# 7  
Yes, I have awk. GNU utilities like nawk or gawk aren't installed on my OS. Thank you.
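
Since plain awk is available, one way this could be done is sketched below. It assumes the stock awk supports FNR and the `in` operator (both are part of POSIX awk), and that the key always sits in positions 1-16. The sample numbers, file names and temporary directory are hypothetical.

```shell
# Hypothetical sample data standing in for the real file1 and file2.
tmpdir=${TMPDIR:-/tmp}/common_awk.$$
mkdir -p "$tmpdir"
printf '%s\n' 1234560000172140 1234560000172550 \
              1234560000183000 2234560000012120 > "$tmpdir/file1"
printf '%s\n' '1234560000172550 extra' 1234560000999990 \
              2234560000012120 2234560000132120 > "$tmpdir/file2"

# First file (NR == FNR): remember every 16-character key.
# Second file: print the line when its key was seen in the first file.
common=$(awk 'NR == FNR { seen[substr($0, 1, 16)] = 1; next }
              substr($0, 1, 16) in seen' "$tmpdir/file1" "$tmpdir/file2")

printf '%s\n' "$common"
rm -rf "$tmpdir"
```

Listing the smaller file first keeps the in-memory array small (63,120 keys here), and the larger file is then read only once. The output is the matching lines from the second file, in that file's order.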
 
