In memory grep of a large file.


 
# 1  
Old 11-16-2015

I have two files:
1. A huge 8 GB text file (big_file.txt).
2. A huge list of approximately 8 million words (words_file.txt), one word per line.

What I intend to do is read each word "w" from words_file.txt, search for it in big_file.txt, and extract up to two words before and after "w" from big_file.txt.

A naive way is to simply run the command below for each of the words.

Code:
while read -r name
do
    grep -owP "(?:\w+\s){0,2}$name(?:\s\w+){0,2}" big_file.txt
done < words_file.txt

But the above code is too slow: big_file.txt is scanned from start to finish once for every word, which means about 8 million passes.

I then tried to hold the entire big_file.txt in memory by modifying the code as shown below, but it is still slow. Watching the "top" command, I can see memory usage rise and fall as if big_file.txt were being read again and again for each "w". I want big_file.txt to be read just once.

Code:
dump=$(<big_file.txt)
while read -r name
do
    echo "$dump" | grep -owP "(?:\w+\s){0,2}$name(?:\s\w+){0,2}"
done < words_file.txt

The big_file.txt looks something like this (posting just a small sample of the file):
Code:
in 2004 obama received national attention during his campaign to represent 
illinois in the united states senate with his victory in the march democratic 
party primary his keynote address at the democratic national convention in july 
and his election to the senate in november he began his presidential campaign
in 2007 and after a close primary campaign against hillary rodham clinton 
in 2008 he won sufficient delegates in the democratic party primaries to receive 
the presidential nomination he then defeated republican nominee john mccain 
in the general election and was inaugurated as president on january 20 2009
nine months after his inauguration obama was named the 2009 nobel peace prize laureate

The words_file.txt looks like this (just a sample):
Code:
obama
primaries
water
laureate
computer

The output that the code gives:
Code:
in 2004 obama received national
his inauguration obama was named
democratic party primaries to receive
peace prize laureate

Any suggestions on how I can speed up the search and extraction? I am using bash on Linux.

Last edited by shoaibjameel123; 11-16-2015 at 06:46 AM.. Reason: Edited based on RudiC's comments.
# 2  
Old 11-16-2015
Please post small but meaningful samples of your two files. And please use code, not icode, tags.

---------- Post updated at 13:02 ---------- Previous update was at 11:32 ----------

Would this help as a first step?
Code:
grep -C2 -fwords_file <(tr ' ' '\n' < big_file)
in
2004
obama
received
national
--
democratic
party
primaries
to
receive
--
his
inauguration
obama
was
named
--
peace
prize
laureate

One single read of each file.
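
Along the same lines, here is an untested single-pass awk sketch. It assumes the words in big_file are separated by plain whitespace with no punctuation attached, and that an array holding the 8 million search words fits in memory; it keeps the context window on one output line:
Code:
awk '
    # First file: remember every search word.
    NR == FNR { want[$1] = 1; next }
    # Second file: scan word by word and print a window of up to
    # two words on either side of every hit (windows do not span lines).
    {
        for (i = 1; i <= NF; i++)
            if ($i in want) {
                lo = (i > 2)       ? i - 2 : 1
                hi = (i + 2 <= NF) ? i + 2 : NF
                out = ""
                for (j = lo; j <= hi; j++)
                    out = out (j > lo ? " " : "") $j
                print out
            }
    }
' words_file.txt big_file.txt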

Last edited by RudiC; 11-16-2015 at 08:31 AM.. Reason: corrected file names
# 3  
Old 11-16-2015
Wouldn't it be easiest to put big_file.txt on a RAM disk and then read it from there? How to create a RAM disk depends on your system, but I'm sure there is a way. It might require a reboot, though, to create one.
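
On Linux, for instance, a tmpfs mount usually works without one (many distributions already mount such a filesystem at /dev/shm); the mount point and size below are only illustrative:
Code:
# Needs root; anything copied here is lost on unmount or reboot.
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=10g tmpfs /mnt/ramdisk
cp big_file.txt /mnt/ramdisk/
grep -owP "(?:\w+\s){0,2}obama(?:\s\w+){0,2}" /mnt/ramdisk/big_file.txt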

I hope this helps.

bakunin
# 4  
Old 11-16-2015
Thanks for your responses. When I ran RudiC's grep command, I did not see any output for five minutes and memory usage crossed 60%, so I had to stop it, but I thank RudiC for the suggestion. I have not tried a RAM disk yet. However, I did stumble upon a grep regular expression that can search for multiple words in one invocation. I am sure this will help speed things up, because logically it requires only one read of my large file. A simple grep command for just three words looks like this:

Code:
grep -w 'obama\|primaries\|water' big_file.txt

But when I try this in my command:
Code:
grep -owP "(?:\w+\s){0,2}obama\|primaries\|water(?:\s\w+){0,2}" big_file.txt

I do not see anything happening, i.e. no output and no error message. I only want to reduce the number of file reads, and I am sure that will speed up my script considerably.
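
As a side note, grep can also read all the words at once as fixed-string patterns, which gives a single pass over big_file.txt, although it prints the whole matching lines rather than the two-word window I need:
Code:
# -F: fixed strings, -w: whole-word matches, -f: read patterns from a file
LC_ALL=C grep -Fwf words_file.txt big_file.txt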
# 5  
Old 11-16-2015
Hi.

I'm not going to post everything because I'm still thinking about it, but this version of the grep pattern seems to produce the expected output. With -P the alternation is a plain | (a backslashed \| matches a literal bar), and the alternatives need to be grouped in parentheses so the surrounding word groups apply to all of them:
Code:
grep -owP "(\w+\s){0,2}(obama|primaries|water)(\s\w+){0,2}" <your_input_file>

producing:
Code:
in 2004 obama received national
democratic party primaries to receive
his inauguration obama was named

Best wishes ... cheers, drl
# 6  
Old 11-16-2015
Thanks, drl. I can see some improvement with your command, since all the words in "obama|primaries|water" are searched for in a single pass. This should greatly reduce the number of iterations needed.

I can also speed your command up a little by forcing the C locale, which avoids multibyte character handling:
Code:
LC_ALL=C grep -owP "(\w+\s){0,2}(obama|primaries|water)(\s\w+){0,2}" big_file.txt

I also found another way using GNU Parallel. I have tried several variations, but the one I am interested in is this (given in the GNU Parallel documentation):

Code:
parallel --pipepart --block 100M -a big_file.txt --fifo cat words_file.txt | parallel --pipe -L1000 --round-robin grep -f - {}

But the above code does not do what I really want, so I tried modifying it to this:
Code:
parallel --pipepart --block 100M -a big_file.txt --fifo cat  words_file.txt | parallel --pipe -L1000 --round-robin grep -fowP "(?:\w+\s){0,2}$name(?:\s\w+){0,2}" -  {}

There are still some issues with how to include the word patterns from words_file.txt in this command.
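
For now, the closest I have got is an untested sketch that builds the alternation from words_file.txt and lets parallel split only big_file.txt; with the full 8-million-word list the pattern will probably be too large for a single grep, so the word list would have to be processed in batches:
Code:
# Build one alternation from the word list, then let GNU parallel split
# big_file.txt into 100 MB chunks and grep each chunk once on stdin.
# -q quotes the command so the (), | and \ in the pattern survive the shell.
export LC_ALL=C
words=$(paste -sd '|' words_file.txt)
parallel -q --pipepart --block 100M -a big_file.txt \
    grep -owP "(\w+\s){0,2}($words)(\s\w+){0,2}"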
# 7  
Old 11-16-2015
Did you try the proposals on a reduced data set (just to prove the applicability)?

Last edited by RudiC; 11-16-2015 at 02:32 PM.. Reason: typo...