Performance issue in Grepping large files


 
# 1  
Old 06-11-2013
Performance issue in Grepping large files

I have around 300 files (*.rdf, *.fmb, *.pll, *.ctl, *.sh, *.sql, *.prog), all of them large.
Around 8000 keywords (listed in the file $keywordfile) need to be searched for inside those files.
Whenever a keyword is found in a file, I have to insert the filename, extension, category, keyword and occurrence count into a database.
I have implemented the following code, but it takes around 10-12 hours to complete.
Could you please suggest how I can change it so that it runs faster?
I am using Solaris.

Code:
 
# the \( ... \) grouping makes -type f apply to every -name pattern
/usr/xpg4/bin/find "$tmpdir" -type f \( -name "*.rdf" -o -name "*.fmb" -o -name "*.pll" \
    -o -name "*.ctl" -o -name "*.sh" -o -name "*.sql" -o -name "*.prog" \) | while read filename
do
    while read keyword
    do
        # count the lines of $filename that exactly match $keyword (case-insensitive)
        matchCount=`/usr/xpg4/bin/grep -F -i -x "$keyword" "$filename" | wc -l`
        if [ "$matchCount" -ne 0 ]; then

            bfilename=`basename "$filename"`
            out3=`echo "$bfilename" | awk -F. '{print $NF}'`   # file extension

            case $out3 in
                'rdf')  catagoery="REPORT";;
                'fmb')  catagoery="FORM";;
                'sql')  catagoery="SQL FILE";;
                'pll')  catagoery="Library File";;
                'ctl')  catagoery="Control File";;
                'sh')   catagoery="Shell script";;
                *)      catagoery="OTHER";;
            esac

            echo "bfilename,keyword,matchCount,out3,catagoery are:- $bfilename,$keyword,$matchCount,$out3,$catagoery"
            sqlplus -s $usrname/$password@$dbSID <<SQL >> spot_fsearch.log
INSERT INTO AA_DETAIL (FILE_NAME,DEP_OBJECT_NAME,OCCURANCE,FILE_TYPE,PROGRAM_TYPE) values ('$bfilename','$keyword',$matchCount,'$out3','$catagoery');
UPDATE BB_DETAIL SET (DEP_OBJECT_TYPE,MODULE_SHORT_NAME,APPLICATION,OBJECT_STATUS,OBJ_ADDN_INFO) = (SELECT OBJECT_TYPE,MODULE_SHORT_NAME,APPLICATION,OBJECT_STATUS,OBJ_ADDN_INFO FROM CG_COMPARATIVE_MATRIX_TAB WHERE upper(OBJECT_NAME)=upper('$keyword') AND ROWNUM<2) WHERE upper(DEP_OBJECT_NAME) = upper('$keyword');
UPDATE CC_CUSTOM_FILES_SUMMARY SET IMPACTED_BY_UPGRADE='$out2' WHERE FILE_NAME='$bfilename';
quit;
SQL
        fi
    done < $keywordfile
done

# 2  
Old 06-11-2013
Searching 8000 keywords in 300 large files is quite something, but the program you show can be optimized for speed:
a) Don't open and re-read the keyword file line by line for every file matching your pattern.
b) Don't run a grep process for every single keyword/file combination (300 x 8000 = 2.4 million times!).
c) Don't pipe each of those greps into wc -l (again 2.4 million processes).
d) Don't run the sqlplus command, including the login, for every single keyword/file combination; collect the results into a file and do the inserts and updates afterwards (see the sketch below).
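As an illustration of these points, here is a rough, untested sketch of how the file loop could look. It reuses the variable names of the original script; the per-keyword counts come from a single grep -f per file, and the generated INSERT statements are collected in a hypothetical inserts.sql that is executed in one sqlplus session at the very end:
Code:
: > inserts.sql                                             # hypothetical collection file for the SQL statements
/usr/xpg4/bin/find "$tmpdir" -type f \( -name "*.rdf" -o -name "*.fmb" -o -name "*.pll" \
    -o -name "*.ctl" -o -name "*.sh" -o -name "*.sql" -o -name "*.prog" \) | while read filename
do
    bfilename=`basename "$filename"`
    out3=`echo "$bfilename" | awk -F. '{print $NF}'`        # extension = last dot-separated field
    case $out3 in
        'rdf')  catagoery="REPORT";;
        'fmb')  catagoery="FORM";;
        'sql')  catagoery="SQL FILE";;
        'pll')  catagoery="Library File";;
        'ctl')  catagoery="Control File";;
        'sh')   catagoery="Shell script";;
        *)      catagoery="OTHER";;
    esac
    # ONE grep per file: -f reads all 8000 keywords at once; because of -x every
    # matching line IS a keyword, so "sort | uniq -c" yields per-keyword counts.
    # Note: the keyword ends up lower-cased here because of -i.
    /usr/xpg4/bin/grep -F -i -x -f "$keywordfile" "$filename" |
        tr '[A-Z]' '[a-z]' | sort | uniq -c |
        while read matchCount keyword
        do
            echo "INSERT INTO AA_DETAIL (FILE_NAME,DEP_OBJECT_NAME,OCCURANCE,FILE_TYPE,PROGRAM_TYPE) VALUES ('$bfilename','$keyword',$matchCount,'$out3','$catagoery');" >> inserts.sql
            # the two UPDATE statements can be appended to inserts.sql in the same way
        done
done
# afterwards: one single sqlplus login executes everything that was collected
sqlplus -s $usrname/$password@$dbSID <<SQL >> spot_fsearch.log
@inserts.sql
COMMIT;
quit;
SQL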
# 3  
Old 06-11-2013
Yes RudiC, you have correctly pointed out these issues.

I can understand that these are the problems in this code,
but I am not able to enhance the code to avoid them.

Can you please suggest what kind of code change I can make for this?
# 4  
Old 06-11-2013
This is untested and far from complete; you need to experiment. It should replace your two while loops: it reads all the keywords and then scans all the files found by your find command. It produces output that you can capture into a file and load into your DB in one go (e.g. with sqlldr); thereafter do the inserts and updates:
Code:
awk     'BEGIN          {CAT["rdf"]="REPORT"
                         CAT["fmb"]="FORM"
                         CAT["sql"]="SQL FILE"
                         CAT["pll"]="Library File"
                         CAT["ctl"]="Control File"
                         CAT["sh"]= "Shell script"
                        }

         FNR == NR      {KY[$0]; next}                                  # read in all the keywords

         FNR == 1       {if (FN != "") {                                # a new file: report the previous one
                            EXT = FN; sub (/.*\./, "", EXT)             # obtain its extension (without the dot)
                            for (i in MCNT)                             # for all matched keywords,
                               print FN, i, MCNT[i], EXT, CAT[EXT]      # print out the counts
                            for (i in MCNT) delete MCNT[i]              # and reset the counters
                         }
                         FN = FILENAME                                  # retain FILENAME for the next round
                        }

                        {for (i in KY) if ($0 ~ i) MCNT[i]++}           # find matching keywords in each line

         END            {EXT = FN; sub (/.*\./, "", EXT)                # same as above for the last file
                         for (i in MCNT)
                           print FN, i, MCNT[i], EXT, CAT[EXT]
                        }
        ' $keywordfile $(find $tmpdir -type f -name ....)               # may blast your LINE_MAX
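If that command line does become too long, one untested alternative (not from this thread) is to save the awk program in a file, say scan.awk (a hypothetical name), and let find start awk itself. The POSIX -exec ... {} + form batches the file names, so awk may be started more than once; that is harmless because the keyword file is passed on every invocation. It also sidesteps any word-splitting of file names that contain spaces:
Code:
# find hands the file names straight to awk; no shell command substitution involved
/usr/xpg4/bin/find "$tmpdir" -type f \( -name "*.rdf" -o -name "*.fmb" -o -name "*.pll" \
    -o -name "*.ctl" -o -name "*.sh" -o -name "*.sql" -o -name "*.prog" \) \
    -exec /usr/xpg4/bin/awk -f scan.awk "$keywordfile" {} + > scan_results.txt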

# 5  
Old 06-11-2013
Hi.

For multiple CPU hardware, introducing some parallelism might be useful in decreasing real time.

I think I would split the list of files to be examined into several pieces and then run whatever scanning program is desired on each piece.

There are several utilities to help with the control of simultaneous processes: xargs, parallel, etc.

Best wishes ... cheers, drl
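A rough, untested sketch of that idea using plain Bourne shell background jobs (scan_chunk.sh is a hypothetical wrapper around whatever scanning command is chosen, e.g. the awk program from post # 4):
Code:
/usr/xpg4/bin/find "$tmpdir" -type f \( -name "*.rdf" -o -name "*.fmb" -o -name "*.pll" \
    -o -name "*.ctl" -o -name "*.sh" -o -name "*.sql" -o -name "*.prog" \) > filelist.txt

nfiles=`wc -l < filelist.txt`
chunksize=`expr \( $nfiles + 3 \) / 4`              # split the list into roughly 4 equal pieces
split -l $chunksize filelist.txt chunk.

for piece in chunk.*
do
    sh scan_chunk.sh "$piece" > "result.$piece" &   # one scanner per piece, in the background
done
wait                                                # wait until every piece is done
cat result.chunk.* > all_results.txt                # merge the results before loading them into the DB

Where a GNU-style xargs -P or the parallel utility is available, it can replace this manual split/background/wait plumbing; note that the stock Solaris xargs may not support -P.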
# 6  
Old 06-12-2013
Hi RudiC,

I ran the code, but it gave me an error near the find command:

syntax error at line 25: `(' unexpected

So I have backquoted the find command and run it as below.

Code:
 
keywordfile="keyword.txt"
/usr/xpg4/bin/awk    'BEGIN          {CAT["rdf"]="REPORT"                
                         CAT["fmb"]="FORM"                 
                         CAT["sql"]="SQL FILE"                 
                         CAT["pll"]="Library File"                 
                         CAT["ctl"]="Control File"                 
                         CAT["sh"]= "Shell script"                
                        }
         FNR == NR      {KY[$0]; next}                                  # read in all the keywords
         FNR == 1 && FN {EXT = FN; sub (/.*\./,".", EXT)                # if new file, obtain the extension
                         for (i in MCNT)                                # for all matches,
                           print FN, i, MCNT[i], EXT, CAT[EXT]          # print out the old values 
                         FN = FILENAME                                  # retain FILENAME for next loop
                        }
                        {for (i in KY) if ($0 ~ i) MCNT[i]++}           # find matching keywords in each line
         END            {EXT = FN; sub (/.*\./,".", EXT)                # same as above for last file
                         for (i in MCNT) 
                           print FN, i, MCNT[i], EXT, CAT[EXT]
                        }
        ' $keywordfile `/usr/xpg4/bin/find /usr/tmp/SB -type f -name "*.rdf" -o -name "*.fmb" -o -name "*.pll" -o -name "*.ctl" -o -name "*.sh" -o -name "*.sql" -o -name "*.prog"`

But it is giving me the error below.

Code:
/usr/xpg4/bin/awk: line 16 (NR=7758): /DR$PV_ENTY_ATTR_TEXTS_U2$R/: unknown regex error

I checked the keyword file and can see that some of the keywords contain a $ symbol, so the regex match is breaking.

Also, some of the filenames contain spaces.


Please let me know what modifications I should make here.

Thank you

# 7  
Old 06-12-2013
As I said: you need to experiment. Try printing the lines with matches. Try smaller files.
Why don't you create a, say, 10-keyword file and work on a subset of two or three sample files that contain a known set of keywords?
The error message you posted points to the END section, i.e. the problem is within the last file. That can be good news, as all the earlier files passed!
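One way to deal with the $-in-keyword problem specifically (an untested sketch, not from this thread) is to stop treating the keywords as regular expressions: replace the {for (i in KY) if ($0 ~ i) MCNT[i]++} line of the awk program with a literal match via index(), using toupper() to stay case-insensitive like the original grep -i; if a whole line has to equal the keyword (the behaviour of grep -x), compare with == instead:
Code:
                        {line = toupper($0)                             # literal, case-insensitive matching
                         for (i in KY)
                            if (index(line, toupper(i)))                # substring match; use line == toupper(i)
                               MCNT[i]++                                # to mimic grep -x (whole-line) behaviour
                        }

For file names that contain spaces, passing the files to awk via find -exec (as sketched after post # 4) avoids the word-splitting that happens with backquoted command substitution.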