In memory grep of a large file. Post: 302960574

Post 302960574 by Don Cragun on Tuesday 17th of November 2015 05:18:20 AM

Quote:

Originally Posted by shoaibjameel123

Suppose w is a word. Let us take w='receive', as used by you. My objective is to extract two words before and after 'receive'. So what I want is this:

Code:

primaries to receive the presidential

But the above assumption is just for simplicity. However, ideally it should not search across line boundaries. Therefore, you have a good point here. The ideal output for 'receive' should be:

Code:

primaries to receive

I had a much more complicated awk script that ignores line boundaries when looking for the two words before and after words found in words_file.txt. This simpler script seems to do what you want. It only needs to hold words_file.txt in memory, plus a single line of input from big_file.txt at a time, so it doesn't need huge amounts of memory to run:

Code:

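awk '
# First file (words_file.txt): remember each word as a key of array W.
FNR == NR {
	W[$1]
	next
}
# Remaining input (big_file.txt): check every field of the current line.
{	for(i = 1; i <= NF; i++) {
		if($i in W) {
			# Clamp the window to the line so it never crosses a line boundary.
			low = i > 2 ? i - 2 : 1
			high = i < NF - 2 ? i + 2 : NF
			# Print fields low..high separated by OFS, ending with ORS.
			for(j = low; j <= high; j++)
				printf("%s%s", $j, (j == high ? ORS : OFS))
		}
	}
}' words_file.txt big_file.txt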

With your sample input files from post #1 in this thread, the above produces the output:

Code:

in 2004 obama received national
democratic party primaries to receive
his inauguration obama was named
peace prize laureate

And, if additional words are added to words_file.txt:

Code:

obama
primaries
water
laureate
computer
peace
in
receive

it produces the output:

Code:

in 2004 obama
in 2004 obama received national
illinois in the united
his victory in the march
national convention in july
the senate in november he
in 2007 and
in 2008 he
sufficient delegates in the democratic
democratic party primaries to receive
primaries to receive
in the general
his inauguration obama was named
2009 nobel peace prize laureate
peace prize laureate

If this doesn't run fast enough, you can run multiple copies of this script concurrently, each with a distinct subset of words_file.txt, and concatenate the resulting output files when they are all done. (This will give you the same output lines, but in a different order.)
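The splitting idea above could be sketched roughly as follows. This is only an illustration, not a tested recipe for a 9GB file: the function name ctx_parallel, the default of 4 chunks, the part.N temporary file names, and the use of mktemp -d (widely available, but not required by POSIX) are all assumptions made for the example. It also assumes words_file.txt is not empty.

```shell
#!/bin/sh
# Sketch only: ctx_parallel, the chunk count and the part.N names are
# illustrative assumptions, not part of the script posted above.
ctx_parallel() {
	words=$1 big=$2 chunks=${3:-4}
	tmp=$(mktemp -d) || return 1

	# Deal the words round-robin into subset files part.0, part.1, ...
	awk -v n="$chunks" -v dir="$tmp" '{ print > (dir "/part." (NR % n)) }' "$words"

	# Run one copy of the context-extraction script per subset, in the background.
	for part in "$tmp"/part.*; do
		awk '
		FNR == NR { W[$1]; next }
		{	for(i = 1; i <= NF; i++)
				if($i in W) {
					low = i > 2 ? i - 2 : 1
					high = i < NF - 2 ? i + 2 : NF
					for(j = low; j <= high; j++)
						printf("%s%s", $j, (j == high ? ORS : OFS))
				}
		}' "$part" "$big" > "$part.out" &
	done

	# Wait for every background copy, then concatenate the per-subset results.
	wait
	cat "$tmp"/part.*.out
	rm -rf "$tmp"
}
```

Calling it as ctx_parallel words_file.txt big_file.txt 4 > output.txt should give the same lines as the single-process run, grouped by subset rather than in big_file.txt order. Each copy still reads all of big_file.txt, so any speedup depends on having spare CPU cores and enough disk bandwidth.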

If someone wants to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
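If the same wrapper script has to run on both Solaris and other systems, the choice of awk could be made at run time. A minimal sketch, assuming uname -s reports SunOS on those systems; pick_awk and the AWK variable are hypothetical names used only for this example:

```shell
# Sketch only: pick_awk is a hypothetical helper, not part of the post above.
pick_awk() {
	if [ "$(uname -s)" = SunOS ]; then
		# Follow the note above: prefer the XPG4 awk, fall back to nawk.
		if [ -x /usr/xpg4/bin/awk ]; then
			echo /usr/xpg4/bin/awk
		else
			echo nawk
		fi
	else
		echo awk
	fi
}

AWK=$(pick_awk)
```

The script can then be started with "$AWK" in place of the literal awk command name.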
