Severe performance issue while 'grep'ing on large volume of data

03-07-2011

Background
-------------
The Unix flavor can be any of Solaris, AIX, HP-UX, or Linux. I have the two flat files below.

File-1
------
Contains 50,000 rows with 2 fields in each row, separated by a pipe.
The row structure is Object_Id|Object_Name, as follows:

111|XXX
222|YYY
333|ZZZ

File-2
------
Contains 5,000 rows with a single field in each row.
Each row basically represents a filename with full path, as below:

/app00/applmgr/aprod/appl/au/11.5.0/resource/XXAIMG_CUSTOM_11I.pld
/app00/applmgr/aprod/appl/xbol/11.5.0/forms/US/XXARTLONG.fmt
/app00/applmgr/aprod/appl/au/11.5.0/resource/XXINVIVCSU.pld

Task
-----
I need to search for the occurrences of each Object_Name (from each row of File-1) in all 5,000 distinct files (whose names are stored in File-2) and store the search results in a third file with the row structure below. The total number of loop iterations would therefore be 50,000 x 5,000 = 250,000,000.

File_Name|Object_Id|Occurance_Count
e.g.,
/app00/applmgr/aprod/appl/au/11.5.0/resource/XXINVIVCSU.pld|222|13

Request
---------
Please suggest a shell scripting approach that does this job in the fastest possible time.

Thanks,
Souvik.

Last edited by Souvik; 03-07-2011 at 02:45 AM..

Souvik

03-07-2011

Did you mention the format and number of records of the files listed in File-2?
What is the most powerful computer you have of the ones you mention?
You mention a performance issue. What have you tried so far?

methyl

03-07-2011

I think the amount of data is reasonable to fit in one awk like this:

Code:

# Make a temp file to hold the second column of file-1.
# We can feed the entire file into grep -f, reducing 50,000 grep calls per file to 1.
TMP=`mktemp`
awk '{ print $2}' < file-1 > "$TMP"

# feed the list of filenames into xargs, which calls grep.  Force grep to
# print filenames with -H, force it to print only the matching bit
# with -o, tell it to use the patterns as fixed strings with -F, and tell it
# to use TMP as the fixed strings with -f.
# 
# It will print a bunch of lines like filename1:oid1.
# 
# Then we tell awk to count each unique line(turning : to | ) and print totals.
xargs grep -H -o -F -f "$TMP" < file-2 |
         awk -v OFS="|" -v FS=":"        \
                '{ C[$1 "|" $2]++; } END { for(k in C) print k,C[k]; }'

# clean up the temp file.
rm -f "${TMP}"

If the number of filenames is small enough, it'll run awk only twice and grep only once; otherwise xargs will call grep as many times as necessary to open the 5,000 files and feed all of its output through the one awk. If you're concerned about awk consuming too much memory, you can run grep | awk on individual files read from file-2, like this:

Code:

while IFS= read -r FILENAME
do
        grep -H -o -F -f "$TMP" "$FILENAME" | awk ...
done < file-2

This will be less efficient, running grep and awk 5,000 times (once per file) instead of the few grep invocations xargs needs to keep each command line under ARG_MAX, but depending on the size of the files the difference may not be significant.
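To get a feel for how many batches xargs would need, you can compare the system limit with the size of the filename list. This is only a rough sketch: the list below is a made-up stand-in for file-2 built from the three sample paths in the first post, and xargs implementations usually stay some way below the reported limit.

```shell
# Upper bound, in bytes, on the arguments plus environment of one command.
arg_max=$(getconf ARG_MAX)

# Hypothetical stand-in for file-2: the three sample paths from the first post.
list=$(mktemp)
printf '%s\n' \
    '/app00/applmgr/aprod/appl/au/11.5.0/resource/XXAIMG_CUSTOM_11I.pld' \
    '/app00/applmgr/aprod/appl/xbol/11.5.0/forms/US/XXARTLONG.fmt' \
    '/app00/applmgr/aprod/appl/au/11.5.0/resource/XXINVIVCSU.pld' > "$list"

# Total size of the filename list; tr strips the padding some wc versions add.
list_bytes=$(wc -c < "$list" | tr -d ' ')

echo "ARG_MAX=$arg_max bytes, filename list=$list_bytes bytes"
rm -f "$list"
```

If the list is much smaller than the limit, a single grep invocation is likely; otherwise expect roughly list size divided by the limit, rounded up.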

Results when run on my own test data:

Code:

$ ./extract.sh | sort
a/4|obj0|2
a/4|obj2|2
a/4|obj6|1
b/0|obj1|1
b/0|obj5|1
b/0|obj6|1
b/0|obj7|1
b/0|obj8|1
c/6|obj2|2
c/6|obj5|1
...
x/6|obj6|1
x/6|obj8|1
y/5|obj1|2
y/5|obj2|1
y/5|obj5|1
y/5|obj8|1
z/7|obj3|1
z/7|obj4|2
z/7|obj6|1
z/7|obj7|1

...where I'd created directories [a-z] with files [0-9] and put random object names from file-1 in them one per line. If you don't care what order you get the results in you can forget the sort.

Last edited by Corona688; 03-07-2011 at 01:41 PM..

Corona688

03-07-2011

Dear friend Corona688,

I just saw your reply. I am at home now and will try your suggestion tomorrow. I will post feedback as soon as I have tested it.

Anyway, many thanks for such detailed advice.

You seem to be solid enough in shell scripting.

Best Regards,
Souvik.

Souvik

03-07-2011

Quote:

Originally Posted by methyl

Did you mention the format and number of records of the files listed in File-2?

This especially is germane. If file-2 consists of records with delimited fields and if the only matching to be done is field-wide (no partial matches), then this problem could be solved in a manner that's simpler and more efficient than Corona's approach (which handles generic text formats).

Regards,
Alister

---------- Post updated at 01:29 PM ---------- Previous update was at 01:08 PM ----------

Hey, Corona688:

A few observations:

Quote:

Originally Posted by Corona688

Code:

awk '{ print $2}' < file-1 > "$TMP"

An FS=\| is necessary to properly split that file.
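A minimal sketch of that extraction with the separator set, using the three sample rows from the first post. The scratch directory and file names are made up for the demo.

```shell
# Scratch copy of file-1 holding the sample rows from the original post.
work=$(mktemp -d)
printf '%s\n' '111|XXX' '222|YYY' '333|ZZZ' > "$work/file-1"

# -F'|' sets the field separator. Without it awk splits on whitespace,
# so each row is a single field and $2 is empty.
awk -F'|' '{ print $2 }' "$work/file-1" > "$work/names"

names=$(cat "$work/names")
echo "$names"
rm -rf "$work"
```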

Quote:

Originally Posted by Corona688

Code:

xargs grep -H -o -F -f "$TMP" < file-2

Testing with the only version of grep that I have that supports -o (gnu grep 2.5.1 on a disused laptop that seldom sees any action), if multiple patterns match a single line (not knowing anything about the content of the files, it's a possibility), it does not print the filename before every match (only the first), even with -H. If that output format has changed, apologies for the false alarm.
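For anyone who wants to check their own grep, here is a small test along those lines. It assumes a grep that supports -o (not a POSIX option), and the pattern and sample files are invented for the test. What it reports will depend on the grep version.

```shell
work=$(mktemp -d)
printf '%s\n' 'XXX' 'YYY' > "$work/patterns"
# One line that contains two different patterns.
printf '%s\n' 'call XXX then YYY' > "$work/sample"

out=$(grep -H -o -F -f "$work/patterns" "$work/sample")

# How many matches were printed, and how many carry the filename prefix?
total=$(printf '%s\n' "$out" | wc -l | tr -d ' ')
prefixed=$(printf '%s\n' "$out" | grep -c "^$work/sample:" || true)

echo "matches printed: $total, with filename prefix: $prefixed"
rm -rf "$work"
```

If the two numbers differ, the counting awk downstream would attribute some matches to the wrong key.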

Quote:

Originally Posted by Corona688

Code:

# It will print a bunch of lines like filename1:oid1.

Actually, assuming the splitting on file-1 is done correctly and $2 is sent to the temp file, the grepping is being done on the object names, not the oids. The result of the grep and awk count will be filename|object name|count, but the desired output is filename|oid|count.
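One possible way to get from object names back to IDs, sketched under the assumption that the grep stage really emits one filename:object_name line per match and that filenames contain no colons. The matches file here is a hand-written stand-in for that grep output, not real results.

```shell
work=$(mktemp -d)
printf '%s\n' '111|XXX' '222|YYY' '333|ZZZ' > "$work/file-1"
# Stand-in for the grep output: one "filename:object_name" line per match.
printf '%s\n' 'a.pld:YYY' 'a.pld:YYY' 'b.fmt:XXX' > "$work/matches"

# FS is switched between the two inputs with awk's var=value operands.
result=$(awk '
    NR == FNR { id[$2] = $1; next }        # file-1: object name -> ID
    { count[$1 "|" id[$2]]++ }             # matches: tally per file and ID
    END { for (k in count) print k "|" count[k] }
' FS='|' "$work/file-1" FS=':' "$work/matches" | sort)

echo "$result"
rm -rf "$work"
```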

Regards,
Alister

alister

03-07-2011

Quote:

Originally Posted by alister

An FS=\| is necessary to properly split that file.

Good point.

Quote:

Testing with the only version of grep that I have that supports -o (gnu grep 2.5.1 on a disused laptop that seldom sees any action)

Nuts, I thought that was a portable option. Throw that whole script out, then.

Quote:

if multiple patterns match a single line (not knowing anything about the content of the files, it's a possibility), it does not print the filename before every match (only the first), even with -H.

I did note it expected one pattern per line.

Quote:

Actually, assuming the splitting on file-1 is done correctly and $2 is sent to the temp file, the grepping is being done on the object names

Argh. My script doesn't work, then.

Looking less and less like there's really going to be an efficient solution if you have to parse that thing the hard way line by individual line on systems and shells with no features.

---------- Post updated at 01:31 PM ---------- Previous update was at 12:49 PM ----------

OK, here's a brute-force version in awk:

Code:

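BEGIN   {       FS="|"  ;       OFS="|"
                # Read file-1 and map each object name ($2) to its ID ($1).
                # The "> 0" test ends the loop on EOF or on a read error.
                while((getline < "file-1") > 0) n[$2]=$1;
                close("file-1")
        }

{
        # Brute force: test every object name against the current line.
        # index(s, t) looks for t inside s, so the line ($0) goes first.
        # A name is counted at most once per line.
        for(i in n)     if(index($0, i) > 0)
                o[FILENAME "|" n[i]]++;
}

END     {       for(k in o)     print k, o[k];  }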

Don't think it'll be quite as efficient as grep, but it ought to do. It ran in two seconds on 5,000 files cached, maybe 5-10 seconds uncached. I don't believe I used any GNU-specific features either.

Put that in extract.awk and run with

Code:

xargs awk -f extract.awk < file-2

You can change o[FILENAME "|" n[i]]++; to o[FILENAME "|" i]++; if I somehow got name vs ID backwards again.
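A small self-contained check of the brute-force idea on invented data, in case it helps to see the expected shape of the output. The sample files, their contents, and the ids variable are made up for this sketch. It searches for each name inside the line and counts lines containing the name, not individual occurrences.

```shell
work=$(mktemp -d)
printf '%s\n' '111|XXX' '222|YYY' '333|ZZZ' > "$work/file-1"
printf '%s\n' 'begin YYY;' 'call XXX and YYY' > "$work/a.pld"
printf '%s\n' 'nothing here' 'ZZZ' > "$work/b.fmt"
printf '%s\n' "$work/a.pld" "$work/b.fmt" > "$work/file-2"

# Same logic as a standalone awk script, inlined so the demo is one piece.
# sed strips the scratch directory so the output is easy to read.
result=$(xargs awk -v ids="$work/file-1" '
    BEGIN { FS = "|"; OFS = "|"
            while ((getline < ids) > 0) n[$2] = $1 }
    { for (i in n) if (index($0, i) > 0) o[FILENAME OFS n[i]]++ }
    END { for (k in o) print k, o[k] }
' < "$work/file-2" | sed "s|^$work/||" | sort)

echo "$result"
rm -rf "$work"
```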

---------- Post updated at 01:54 PM ---------- Previous update was at 01:31 PM ----------

I estimate that will take about 60 minutes on data similar to yours on a 'fast' machine. Use 'xargs -n 10' to prevent it from eating all your memory. Maybe not so good. If there was some sort of pattern to the OIDs/names, that'd help a lot for finding them without having to brute-force check all 50,000 individually... Or if you knew for a fact you had or could get GNU grep, and didn't have more than one object per line, grep's hardwired performance is going to be way better than awk's scripted performance...
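To make the batching concrete, here is a toy run with five throwaway files and a batch size of 2 instead of 10. The awk program is a placeholder that only reports when an invocation finishes. Since each file lands in exactly one batch, per-file counts keyed on FILENAME are not split across invocations.

```shell
work=$(mktemp -d)
for i in 1 2 3 4 5; do echo "line" > "$work/f$i"; done
printf '%s\n' "$work"/f* > "$work/file-2"

# -n 2 hands awk at most two filenames per invocation, so each awk
# process only holds the counts for its own small batch in memory.
batches=$(xargs -n 2 awk 'END { print "batch done" }' < "$work/file-2" |
          wc -l | tr -d ' ')

echo "awk was started $batches times"
rm -rf "$work"
```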

Last edited by Corona688; 03-07-2011 at 04:03 PM..

Corona688

04-04-2011

Hi Corona688,

Thank you very much. I used your tips (xargs grep) and managed to get a significant improvement in performance.

Regards,
Souvik

Souvik
