Severe performance issue while 'grep'ing on large volume of data Post: 302502267

Post 302502267 by Corona688 on Monday 7th of March 2011 12:10:44 PM

I think that amount of data is small enough to feed through a single awk, like this:

Code:

# Make a temp file to hold the second column of file-1.
# Feeding the whole list to grep -f reduces 5000 grep calls to 1.
TMP=$(mktemp)
awk '{ print $2 }' < file-1 > "$TMP"

# Feed the list of filenames into xargs, which calls grep.  Force grep to
# print filenames with -H, print only the matching part with -o, treat the
# patterns as fixed strings with -F, and read the patterns from TMP with -f.
#
# It will print a bunch of lines like filename1:oid1.
#
# Then awk counts each unique line (turning : into |) and prints the totals.
xargs grep -H -o -F -f "$TMP" < file-2 |
        awk -v OFS="|" -v FS=":" \
                '{ C[$1 "|" $2]++ } END { for (k in C) print k, C[k] }'

# Clean up the temp file.
rm -f "$TMP"

If the list of filenames fits on one command line, this runs awk only twice and grep only once. Otherwise xargs calls grep as many times as it takes to get through all 50,000 files, and all of the output still goes through the one awk. If you're concerned about awk consuming too much memory, you can instead run grep | awk on each file read from file-2, like this:

Code:

while IFS= read -r FILENAME
do
        grep -H -o -F -f "$TMP" "$FILENAME" | awk ...
done < file-2

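This will be less efficient, since it runs grep and awk 50,000 times each, where the xargs version runs grep only once per command line's worth of filenames (a size bounded by ARG_MAX). Depending on the size of the files, though, the difference may not be significant.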
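To check that the two approaches agree, here is a rough sketch that runs both on a tiny data set. The sample files and the `count` helper are made up for illustration and are not from the original thread; the awk body is simply the counting step from the batched script, reused on the assumption that the per-file loop would count the same way.

```shell
#!/bin/sh
# Sketch: batched (xargs) versus per-file pipeline on tiny made-up data.
WORK=$(mktemp -d)
cd "$WORK" || exit 1

# Hypothetical sample inputs (not from the original post).
printf '1 obj0\n2 obj1\n3 obj2\n' > file-1     # column 2 holds the object names
mkdir a b
printf 'obj0\nobj2\nobj0\n' > a/4
printf 'obj1\nnothing\n' > b/0
printf 'a/4\nb/0\n' > file-2                   # one filename per line

TMP=$(mktemp)
awk '{ print $2 }' < file-1 > "$TMP"

# The counting step from the batched script, wrapped in a function.
count() {
        awk -v OFS="|" -v FS=":" \
                '{ C[$1 "|" $2]++ } END { for (k in C) print k, C[k] }'
}

# Batched: xargs packs as many filenames as fit into each grep call.
xargs grep -H -o -F -f "$TMP" < file-2 | count | sort > batched.out

# Per-file: one grep and one awk for every filename in file-2.
while IFS= read -r FILENAME
do
        grep -H -o -F -f "$TMP" "$FILENAME" | count
done < file-2 | sort > perfile.out

# Both files should hold the same file|object|count lines.
cmp -s batched.out perfile.out && echo "same results"
rm -f "$TMP"
```

Because every count is keyed by filename, splitting the work per file does not change the totals; it only changes how many processes get started.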

Results when run on my own test data:

Code:

$ ./extract.sh | sort
a/4|obj0|2
a/4|obj2|2
a/4|obj6|1
b/0|obj1|1
b/0|obj5|1
b/0|obj6|1
b/0|obj7|1
b/0|obj8|1
c/6|obj2|2
c/6|obj5|1
...
x/6|obj6|1
x/6|obj8|1
y/5|obj1|2
y/5|obj2|1
y/5|obj5|1
y/5|obj8|1
z/7|obj3|1
z/7|obj4|2
z/7|obj6|1
z/7|obj7|1

...where I'd created directories [a-z] containing files [0-9] and put random object names from file-1 in them, one per line. If you don't care what order the results come out in, you can drop the sort.
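If you want similar data to experiment with, something like the following should produce it. This is only a sketch of one way to do it, not the script actually used for the results above; the object names obj0 through obj9, one file per directory, and five lines per file are all assumptions.

```shell
#!/bin/sh
# Sketch: generate test data shaped like the description above.
# Assumptions (not from the original post): object names obj0..obj9,
# one file per directory, five lines per file.
WORK=$(mktemp -d)
cd "$WORK" || exit 1

# file-1: the object name is in column 2, as the script expects.
i=0
while [ "$i" -lt 10 ]
do
        echo "$i obj$i"
        i=$((i + 1))
done > file-1

DIRS="a b c d e f g h i j k l m n o p q r s t u v w x y z"
mkdir $DIRS                             # unquoted on purpose: one dir per word

# One awk fills every data file and writes the filename list to file-2.
awk -v dirs="$DIRS" '
        { name[NR] = $2 }               # remember column 2 of file-1
        END {
                srand()
                n = split(dirs, d, " ")
                for (i = 1; i <= n; i++) {
                        # random file name 0-9 inside this directory
                        path = d[i] "/" int(rand() * 10)
                        for (j = 0; j < 5; j++)
                                print name[1 + int(rand() * NR)] > path
                        close(path)
                        print path > "file-2"
                }
        }' file-1

echo "test data written to $WORK"
```

Running the extraction script from inside that directory should then print file|object|count lines in the same format as above, though the counts will differ from run to run because the names are drawn at random.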

Last edited by Corona688; 03-07-2011 at 01:41 PM..
