Please suggest alternative to grep

06-07-2012


Hi Experts,
Please find my requirement below:
I have a file (named file1) containing numbers like:

Code:

372846078543002
372846078543003
372846078543004
372846078543005
372846078543006

I have another file (named file2) whose lines contain these numbers (the ones present in file1), e.g.:

Code:

lppza087; [2012-06-05 03:00:01,090] <PERFB  > JMSId :ID:414d51204c50505a41303837202020204f657ff1299e7bb7 SvcName :realtime.get.relationship Port :Port1 LobId :AMCSGCDERTUSUSD Card :372846078543002 SrcCd :16 versionNum :3.0 OO [MessageListenerThreadPool : 11  ] OO dao.CustDAO                       OO                 getRelnDetails() OO Entry : getRelnDetails
lppza087; [2012-06-05 03:00:01,100] <PERFB  > JMSId :ID:414d51204c50505a41303837202020204f657ff1299e7bb7 SvcName :realtime.get.relationship Port :Port1 LobId :AMCSGCDERTUSUSD Card :372846078543003 SrcCd :16 versionNum :3.0 OO [MessageListenerThreadPool : 11  ] OO dao.CustDAO                       OO                 getRelnDetails() OO Exit  : getRelnDetails

I need to extract from file2 every line that contains one of the numbers listed in file1.
One way would be to run a for loop over file1 and grep for each number in file2, but my data volume is very high and it's taking 5-6 hours.
Can you please suggest the fastest way to achieve this (maybe using awk/sed)?

Last edited by Franklin52; 06-07-2012 at 04:01 AM.. Reason: Please use code tags

niladri29


06-07-2012


Hi

Code:

grep -f file1 file2

Guru.

guruprasadpr


06-07-2012


Thanks, Guru, for your prompt response.

But my second file is 15 GB and the first file is 5 GB, so I just wanted to know whether this process can be made faster.
I was also wondering whether the lines obtained from file2 can be arranged in the order of the search lines present in file1.

niladri29


06-07-2012


grep, sed, and awk run in a loop would all do the same thing: read the first file line by line and scan the second file for occurrences each time, so the whole 15 GB of file2 gets reread once for every number in file1.

One way it could be done faster would be a script/program that reads the second file (where the wanted information appears to be in the same place on every line), builds a hash/list of the numbers and their corresponding line numbers, and then only has to go through the first file once.

pludi


06-07-2012


Hi.

First question: does this absolutely need to be faster? How many times are you going to run it? If it's a single shot, then perhaps just letting it run to completion is the best solution.

Secondly, the first file looks like it is a sequence. If so, then perhaps a regular expression could be used rather than holding 5 GB of patterns in memory. If not a regular expression, then possibly code that determines whether a piece of the line matches the base + the sequence -- an arithmetic operation, which might be faster than string comparisons (for example, some mainframes & supercomputers had multiple units for arithmetic).

Thirdly, if you have sufficient IO throughput as well as multiple cores, then one could write a program that internally divides the main file into pieces by keeping track of start-stop line positions, and then uses processes or threads to process one segment each. A less elegant solution along the same lines would be to split the file into n sections, each in its own file, and then run n instances of grep.

Fourthly, you could split the task up among a network of machines that share the disk, or take the easiest (but not cheapest) solution: get a faster box.
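
The "less elegant" variant of the third point could look roughly like the sketch below. It is an illustration under assumptions, not a tuned solution: the function name parallel_fgrep and the chunk size argument are invented here, mktemp -d is assumed to be available, and every grep instance loads all of file1, so memory use multiplies with the number of pieces. Pick the chunk size so that the number of pieces stays close to the number of cores.

```shell
# Run one fixed-string grep per piece of file2, all in parallel.
# Usage: parallel_fgrep file1 file2 lines_per_chunk
parallel_fgrep() {
    workdir=$(mktemp -d) || return 1
    # Cut file2 into pieces of at most $3 lines each: chunk.aa, chunk.ab, ...
    split -l "$3" "$2" "$workdir/chunk."
    for chunk in "$workdir"/chunk.*; do
        # -F treats every line of file1 as a fixed string, not a regex.
        grep -F -f "$1" "$chunk" > "$chunk.out" &
    done
    wait    # block until every background grep has finished
    # Concatenating in suffix order keeps the original line order of file2.
    cat "$workdir"/chunk.*.out
    rm -rf "$workdir"
}
```

Whether this beats a single grep depends on whether the disk can feed several readers at once; if the job is IO-bound, the extra processes will mostly wait.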

Best wishes ... cheers, drl

drl


06-07-2012


Assuming both files are sorted, maybe you can use "join".
If all 300 million numbers in file1 start with 372846 (if not, multiple passes may be needed), then you can treat them as integers (minus the prefix). That way you can store them in a bitmap and look up the numbers from file2 in it (checking the prefix separately first). The first chapter of Jon Bentley's book "Programming Pearls" discusses exactly this kind of problem.
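
A rough sketch of the "join" route, in case neither file is already sorted on the number. It assumes the number follows the literal text "Card :" as in the sample lines and that the log lines contain no tab characters, since a tab is used here as the field separator. The function name join_on_card is made up for illustration, and sorting 15 GB needs plenty of temporary disk space.

```shell
# Print the file2 lines whose card number appears in file1, sorted by card number.
# Usage: join_on_card file1 file2
join_on_card() {
    tab=$(printf '\t')
    keyed=$(mktemp) || return 1
    # Prefix each log line with its card number and a tab, then sort on that key.
    awk 'match($0, /Card :[0-9]+/) {
             print substr($0, RSTART + 6, RLENGTH - 6) "\t" $0
         }' "$2" | LC_ALL=C sort -t "$tab" -k 1,1 > "$keyed"
    # Join the sorted numbers against the keyed lines; -o 2.2 prints only the log line.
    LC_ALL=C sort -u "$1" | LC_ALL=C join -t "$tab" -o 2.2 - "$keyed"
    rm -f "$keyed"
}
```

The output comes out in card-number order, so if file1 is itself sorted the lines follow the order of file1.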

binlib


06-18-2012


"fgrep -f file1 file2" worked for me.

Thanks,
Niladri

niladri29
