In memory grep of a large file.


 
# 1  
Old 11-16-2015

I have two files:
1. A huge 8 GB text file (big_file.txt).
2. A huge list of approximately 8 million words (words_file.txt), one word per line.

What I intend to do is read each word "w" from words_file.txt, search for it in big_file.txt, and extract up to two words before and after "w" from big_file.txt.

A naive way is to simply run the command below for each of the words.

Code:
while read -r name
do
    grep -owP "(?:\w+\s){0,2}$name(?:\s\w+){0,2}" big_file.txt
done < words_file.txt

But the above code is too slow: big_file.txt is scanned from start to finish once for every word, which means about 8 million passes.

I then tried to hold the entire big_file.txt in memory by modifying the code as shown below, but it is still slow. Watching the "top" command, I can see memory usage rise and fall as if big_file.txt were being read again and again for each "w". I want big_file.txt to be read just once.

Code:
dump=$(<big_file.txt)
while read -r name
do
    echo "$dump" | grep -owP "(?:\w+\s){0,2}$name(?:\s\w+){0,2}"
done < words_file.txt

The big_file.txt looks something like this (posting just a small sample of the file):
Code:
in 2004 obama received national attention during his campaign to represent 
illinois in the united states senate with his victory in the march democratic 
party primary his keynote address at the democratic national convention in july 
and his election to the senate in november he began his presidential campaign
in 2007 and after a close primary campaign against hillary rodham clinton 
in 2008 he won sufficient delegates in the democratic party primaries to receive 
the presidential nomination he then defeated republican nominee john mccain 
in the general election and was inaugurated as president on january 20 2009
nine months after his inauguration obama was named the 2009 nobel peace prize laureate

The words_file.txt looks like this (just a sample):
Code:
obama
primaries
water
laureate
computer

The output that the code gives:
Code:
in 2004 obama received national
his inauguration obama was named
democratic party primaries to receive
peace prize laureate

Any suggestions on how I can speed up the search and extraction? I am using bash on Linux.

Last edited by shoaibjameel123; 11-16-2015 at 06:46 AM.. Reason: Edited based on RudiC's comments.
# 2  
Old 11-16-2015
Please post small but meaningful samples of your two files. And please use code, not icode, tags.

---------- Post updated at 13:02 ---------- Previous update was at 11:32 ----------

Would this help as a first step?
Code:
grep -C2 -fwords_file <(tr ' ' '\n' < big_file)
in
2004
obama
received
national
--
democratic
party
primaries
to
receive
--
his
inauguration
obama
was
named
--
peace
prize
laureate

One single read of each file.
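
Along the same lines, here is an untested single-pass awk sketch. It assumes the words in big_file are separated by plain whitespace with no punctuation attached, and that an array holding the 8 million search words fits in memory; it keeps the context window on one output line:
Code:
awk '
    # First file: remember every search word.
    NR == FNR { want[$1] = 1; next }
    # Second file: scan word by word and print a window of up to
    # two words on either side of every hit (windows do not span lines).
    {
        for (i = 1; i <= NF; i++)
            if ($i in want) {
                lo = (i > 2)       ? i - 2 : 1
                hi = (i + 2 <= NF) ? i + 2 : NF
                out = ""
                for (j = lo; j <= hi; j++)
                    out = out (j > lo ? " " : "") $j
                print out
            }
    }
' words_file.txt big_file.txt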

Last edited by RudiC; 11-16-2015 at 08:31 AM.. Reason: corrected file names
# 3  
Old 11-16-2015
Wouldn't it be easiest to put big_file.txt on a RAM disk and then read it from there? How to create a RAM disk depends on your system, but I'm sure there is a way. It might require a reboot, though, to create one.
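
On Linux, for instance, a tmpfs mount usually works without one (many distributions already mount such a filesystem at /dev/shm); the mount point and size below are only illustrative:
Code:
# Needs root; anything copied here is lost on unmount or reboot.
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=10g tmpfs /mnt/ramdisk
cp big_file.txt /mnt/ramdisk/
grep -owP "(?:\w+\s){0,2}obama(?:\s\w+){0,2}" /mnt/ramdisk/big_file.txt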

I hope this helps.

bakunin
# 4  
Old 11-16-2015
Thanks for your responses. When I ran RudiC's grep command, I did not see any output for five minutes and memory usage crossed 60%, so I had to stop it, but I thank RudiC for the suggestion. I have not tried a RAM disk yet. However, I did stumble upon a grep regular expression that can search for multiple words in one invocation. I am sure this will help speed things up, because logically it requires only one read of my large file. A simple grep command for just three words looks like this:

Code:
grep -w 'obama\|primaries\|water' big_file.txt

But when I try this in my command:
Code:
grep -owP "(?:\w+\s){0,2}obama\|primaries\|water(?:\s\w+){0,2}" big_file.txt

I do not see anything happening, i.e. no output and no error message. I only want to reduce the number of file reads, and I am sure that will speed up my script considerably.
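
As a side note, grep can also read all the words at once as fixed-string patterns, which gives a single pass over big_file.txt, although it prints the whole matching lines rather than the two-word window I need:
Code:
# -F: fixed strings, -w: whole-word matches, -f: read patterns from a file
LC_ALL=C grep -Fwf words_file.txt big_file.txt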
# 5  
Old 11-16-2015
Hi.

I'm not going to post everything because I'm still thinking about it, but this version of the grep pattern seems to produce the expected output. With -P the alternation is a plain | (a backslashed \| matches a literal bar), and the alternatives need to be grouped in parentheses so the surrounding word groups apply to all of them:
Code:
grep -owP "(\w+\s){0,2}(obama|primaries|water)(\s\w+){0,2}" <your_input_file>

producing:
Code:
in 2004 obama received national
democratic party primaries to receive
his inauguration obama was named

Best wishes ... cheers, drl
# 6  
Old 11-16-2015
Thanks, drl. I can see some improvement with your command, since all the words in "obama|primaries|water" are searched for in a single pass. This should greatly reduce the number of iterations needed.

I can also speed your command up a little by forcing the C locale, which avoids multibyte character handling:
Code:
LC_ALL=C grep -owP "(\w+\s){0,2}(obama|primaries|water)(\s\w+){0,2}" big_file.txt

I also found another way using GNU Parallel. I have tried several variations, but the one I am interested in is this (given in the GNU Parallel documentation):

Code:
parallel --pipepart --block 100M -a big_file.txt --fifo cat words_file.txt | parallel --pipe -L1000 --round-robin grep -f - {}

But the above code does not do what I really want, so I tried modifying it to this:
Code:
parallel --pipepart --block 100M -a big_file.txt --fifo cat  words_file.txt | parallel --pipe -L1000 --round-robin grep -fowP "(?:\w+\s){0,2}$name(?:\s\w+){0,2}" -  {}

There are still some issues with how to include the word patterns from words_file.txt in this command.
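
For now, the closest I have got is an untested sketch that builds the alternation from words_file.txt and lets parallel split only big_file.txt; with the full 8-million-word list the pattern will probably be too large for a single grep, so the word list would have to be processed in batches:
Code:
# Build one alternation from the word list, then let GNU parallel split
# big_file.txt into 100 MB chunks and grep each chunk once on stdin.
# -q quotes the command so the (), | and \ in the pattern survive the shell.
export LC_ALL=C
words=$(paste -sd '|' words_file.txt)
parallel -q --pipepart --block 100M -a big_file.txt \
    grep -owP "(\w+\s){0,2}($words)(\s\w+){0,2}"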
# 7  
Old 11-16-2015
Did you try the proposals on a reduced data set (just to prove the applicability)?

Last edited by RudiC; 11-16-2015 at 02:32 PM.. Reason: typo...