Yes, I used the split command to split big_file.txt into smaller chunks of 350MB each. I then tried your command, and others too, on one of these smaller chunks. I noticed that your suggested command takes some time to read its input before it runs. I can post the memory and CPU statistics for these smaller chunks if you want. In fact, I am thinking of using these smaller chunks so that I can run the many greps in parallel myself instead of using GNU parallel.
But the reason I pointed to GNU parallel is that the GNU parallel web page I linked above has a statement below this command:
I'm not clear on what output you hope to produce. With the sample data you provided in post #1, if you're looking for receive, are you hoping to get the output:
or the output:
Or, in other words, are searches limited to words on a single line, or do you want searches to cross line boundaries?
Suppose w is a word. Let us take w='receive', as used by you. My objective is to extract two words before and after 'receive'. So what I want is this:
The above assumption is just for simplicity, though. Ideally, the search should not cross line boundaries, so you have a good point here. The ideal output for 'receive' should be:
I had a much more complicated awk script that ignores line boundaries when looking for the two words before and after each word found in words_file.txt. This simpler script seems to do what you want, and it only needs words_file.txt in memory along with a single line of input from big_file.txt, so it doesn't need huge amounts of memory to run:
With your sample input files from post #1 in this thread, the above produces the output:
And, if additional words are added to words_file.txt:
it produces the output:
If this doesn't run fast enough, you can run multiple copies of this script with distinct subsets of words_file.txt concurrently and concatenate the resulting output files when they are all done. (This will give you the same output, but the order of lines in the output will be different.)
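The idea of running several copies over distinct subsets of words_file.txt can be sketched roughly as below. This is only an illustration under assumptions: the sample data, the subset size of 2, the `subset.` file prefix and the one-line stand-in awk program (it prints each matched word with its line number rather than the two-word context) are all made up for the example.

```shell
#!/bin/sh
# Hypothetical sample data, not the files from post #1.
dir=$(mktemp -d)
printf '%s\n' receive send reply care > "$dir/words_file.txt"
printf '%s\n' \
  'we will receive the package today' \
  'receive it with care and send a reply' > "$dir/big_file.txt"

# Split the word list into subsets of 2 words each: subset.aa, subset.ab, ...
split -l 2 "$dir/words_file.txt" "$dir/subset."

# Start one background awk per subset. The awk program here is a stand-in
# for the real script: it prints "word<TAB>line number" for each exact match.
for subset in "$dir"/subset.*; do
    awk 'NR == FNR { want[$1]; next }
         { for (i = 1; i <= NF; i++) if ($i in want) print $i "\t" FNR }' \
        "$subset" "$dir/big_file.txt" > "$subset.out" &
done
wait    # block until every background awk has finished

# Concatenate the per-subset results. The order depends on how the words
# were split, so sort here only to get a stable order for comparison.
combined=$(cat "$dir"/subset.*.out | LC_ALL=C sort)
printf '%s\n' "$combined"
rm -r "$dir"
```

Each copy still reads all of big_file.txt, so the gain comes from the smaller word list per process and from using more than one CPU, not from reading less data.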
If someone wants to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
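For readers following along, here is a minimal sketch of the approach described in this post: the word list is held in an awk array, big_file.txt is read one line at a time, and the context never leaves the current line. It is not necessarily the script as originally posted. The sample files and the names `want`, `lo`, `hi` and `context` are assumptions made for illustration, not the data from post #1.

```shell
#!/bin/sh
# Hypothetical sample data, not the files from post #1.
dir=$(mktemp -d)
printf '%s\n' receive send > "$dir/words_file.txt"
printf '%s\n' \
  'we will receive the package today' \
  'please' \
  'receive it with care and send a reply' > "$dir/big_file.txt"

# First file (NR == FNR): load every search word into the array "want".
# Second file: test each field with "field in array"; on a match, print up
# to two fields before and after it, clamped to the current line.
context=$(awk '
NR == FNR { want[$1]; next }
{
    for (i = 1; i <= NF; i++) {
        if ($i in want) {
            lo = (i > 2) ? i - 2 : 1
            hi = (i + 2 <= NF) ? i + 2 : NF
            out = $lo
            for (j = lo + 1; j <= hi; j++) out = out " " $j
            print out
        }
    }
}' "$dir/words_file.txt" "$dir/big_file.txt")

printf '%s\n' "$context"
rm -r "$dir"
```

Because the test is an exact field match, a field such as "receive," with trailing punctuation would not match in this sketch. Only the word array and one input line are in memory at any time.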
I just tried your code, and it is perhaps the best I have tried so far in terms of efficiency. I am currently running multiple instances of a much slower version on my machine using GNU parallel and grep. Compared with those, your code is lightning fast.
The REs required to match two words of context before and after a specified word using grep are much more complex than the exact word matches I'm using in the awk "field in array" tests. So I was hoping it would be significantly faster, but without your actual data to use as a testbed there wasn't any way to be sure. I'm glad it is working well for you.
I have one big file of size 9GB (big_file.txt). This big file has sentences and paragraphs like any usual English document. I have another file consisting of replacement strings for sed to use. The file name is replace.sed and each entry in one line looks like this:
s/\<shout\>/shout/g
s/\<b is... (2 Replies)
Background
-------------
The Unix flavor can be any amongst Solaris, AIX, HP-UX and Linux. I have below 2 flat files.
File-1
------
Contains 50,000 rows with 2 fields in each row, separated by pipe.
Row structure is like Object_Id|Object_Name, as following:
111|XXX
222|YYY
333|ZZZ
... (6 Replies)
I was running a program and it stopped and showed "Out of Memory!". At that time, the RAM used by this process was around 4G and the free memory on the machine was around 30G. Does anybody know what the reason may be? The program is written in Perl. The OS of the machine is Solaris U8. And I... (1 Reply)
All,
I have a problem with grep/fgrep/egrep. Basically I am building a 200 times 200 correlation matrix. The entries of this matrix need to be retrieved from another very large matrix (~100G). I tried to use the grep/fgrep/egrep to locate each entry and put them into one file. It looks very... (1 Reply)
Hi, my problem:
gzgrep "^.\{376\}8301685001120" filename /dev/null
###ERROR ###
grep: RE error 11: Range endpoint too large.
What's my mistake?
Is the position 376 too large for grep?
Thanks. (2 Replies)
We just set up a system to use large pages. I want to know if there is a command to see how much of the memory is being used for large pages. For example, if we have a system with 8GB of RAM assigned and it has been set to use 4GB for large pages, is there a command to show that 4GB of the 8GB is... (1 Reply)
I am looking for a file with 'MCR0000000716214' in it. I tried the following command:
grep MCR0000000716214 *
The problem is that the folder I am searching in has over 87000 files and I am getting the following:
bash: /bin/grep: Arg list too long
Is there any command I can use that can... (6 Replies)
As the title says:
How can I get memory usage, or anything else that shows memory, from a sar file?
For example, on Solaris:
we can use sar with an option to show the memory used at the time the sar cron job ran.
On HP-UX, sar has no option to show memory used, but I think there may be some parameter or some... (1 Reply)
Hi,
I'm developing a data processing pipeline with multiple stages, with data being moved between the stages using shared memory segments. The size of the data is typically of the order of hundreds of megabytes, and there are typically a few tens of main shared memory segments each of size... (2 Replies)