In-memory grep of a large file.


 
# 8  
Old 11-16-2015
Yes, I used the split command to split big_file.txt into smaller chunks of 350MB each. I then tried your command, and others too, on one of these smaller chunks. I noticed that your suggested command takes some time to read and then run. I can post the memory and CPU statistics on these smaller chunks if you want. In fact, I am thinking of using these smaller chunks so that I can run the many greps in parallel myself instead of using GNU parallel.
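
For reference, a split that keeps whole lines together (so no line is cut across two chunks) could look like the sketch below; -C is a GNU split option, and the chunk_ prefix is just an example name:

Code:
split -C 350M big_file.txt chunk_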

But the reason I pointed to GNU parallel is that the GNU parallel webpage I mentioned above has the following note right below this command:

Code:
cat regexp.txt | parallel --pipe -L1000 --round-robin grep -f - bigfile

Quote:
If a line matches multiple regexps, the line may be duplicated. The command will start one grep per CPU and read bigfile one time per CPU, but as that is done in parallel, all reads except the first will be cached in RAM.

# 9  
Old 11-16-2015
I'm not clear on what output you hope to produce. With the sample data you provided in post #1, if you're looking for receive, are you hoping to get the output:
Code:
primaries to receive

or the output:
Code:
primaries to receive the presidential

Or, in other words, are searches limited to words on a single line, or do you want searches to cross line boundaries?
# 10  
Old 11-17-2015
Suppose w is a word. Let us take w='receive', as in your example. My objective is to extract the two words before and after 'receive'. So what I want is this:

Code:
primaries to receive the presidential

But the above is just for simplicity. Ideally, the search should not cross line boundaries, so you have a good point here. The ideal output for 'receive' should be:

Code:
primaries to receive

# 11  
Old 11-17-2015
Quote:
Originally Posted by shoaibjameel123
Suppose w is a word. Let us take w='receive', as in your example. My objective is to extract the two words before and after 'receive'. So what I want is this:

Code:
primaries to receive the presidential

But the above is just for simplicity. Ideally, the search should not cross line boundaries, so you have a good point here. The ideal output for 'receive' should be:

Code:
primaries to receive

I had a much more complicated awk script that ignores line boundaries when looking for the two words before and after the words found in words_file.txt, but this simple script seems to do what you want while only needing to hold words_file.txt in memory along with a single line of input from big_file.txt (so it doesn't need huge amounts of memory to run):
Code:
awk '
# First pass (words_file.txt): store each search word as an index of array W.
FNR == NR {
	W[$1]
	next
}
# Second pass (big_file.txt): for each field that matches a stored word,
# print up to two fields of context on either side, staying within the line.
{	for(i = 1; i <= NF; i++) {
		if($i in W) {
			low = i > 2 ? i - 2 : 1
			high = i < NF - 2 ? i + 2 : NF
			for(j = low; j <= high; j++)
				printf("%s%s", $j, (j == high ? ORS : OFS))
		}
	}
}' words_file.txt big_file.txt

With your sample input files from post #1 in this thread, the above produces the output:
Code:
in 2004 obama received national
democratic party primaries to receive
his inauguration obama was named
peace prize laureate

And, if additional words are added to words_file.txt:
Code:
obama
primaries
water
laureate
computer
peace
in
receive

it produces the output:
Code:
in 2004 obama
in 2004 obama received national
illinois in the united
his victory in the march
national convention in july
the senate in november he
in 2007 and
in 2008 he
sufficient delegates in the democratic
democratic party primaries to receive
primaries to receive
in the general
his inauguration obama was named
2009 nobel peace prize laureate
peace prize laureate

If this doesn't run fast enough, you can run multiple copies of this script with distinct subsets of words_file.txt concurrently and concatenate the resulting output files when they are all done. (This will give you the same output, but the order of lines in the output will be different.)
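
For illustration, a shell sketch of that run-in-parallel approach might look like this. It assumes the awk program above has been saved to a file named context.awk (a name I made up here) and uses GNU split's -n l/4 option to divide the word list into four pieces without splitting lines:

Code:
# split the word list into 4 roughly equal line-based pieces
split -n l/4 words_file.txt words_part_
# run one copy of the awk script per piece, in the background
for part in words_part_*; do
	awk -f context.awk "$part" big_file.txt > "out_$part" &
done
wait    # let all copies finish
cat out_words_part_* > all_matches.txt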

If someone wants to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
# 12  
Old 11-17-2015
I just tried your code, and this is perhaps the best that I have tried so far in terms of efficiency. I am currently running multiple instances of a much slower version on my machine using GNU parallel and grep, and when I compared the performance of your code with those, this one is lightning fast.
# 13  
Old 11-17-2015
The REs required to match two words of context before and after a specified word using grep are much more complex than the exact word matches I'm using in the awk field-in-array tests. So I was hoping it would be significantly faster, but without your actual data to use as a testbed there was no way to be sure. I'm glad it is working well for you.
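
For comparison, a grep-only version of the same extraction would need an expression along the lines of the sketch below (my own illustration, assuming GNU grep for the -o option and the \< \> word anchors, written here for just the single word receive). Even then it only approximates the awk script, because grep -o reports non-overlapping matches, so nearby keywords can steal each other's context:

Code:
grep -Eo '([[:alnum:]]+ ){0,2}\<receive\>( [[:alnum:]]+){0,2}' big_file.txt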