Can someone please help me optimize my code (script searches subdirectories)?

03-15-2012


I timed my awk script (from post #5) against a midsized Gutenberg collection (5,354,620 lines of text in 203 documents across 20 directories), using a list of 3,939 phrases.

Processing time: 1h 43min

The original script is still running (over 17h now).

Chubler_XL


03-15-2012


Quote:

Originally Posted by Chubler_XL

I timed my awk script (from post #5) against a midsized Gutenberg collection (5,354,620 lines of text in 203 documents across 20 directories), using a list of 3,939 phrases.

Processing time: 1h 43min

The original script is still running (over 17h now).

For some reason, I had trouble running both of those scripts. The only things I really need to change are the input/output files and the path to the directory, right?

jl487


03-15-2012


The delete statement may also give you some issues on AIX, as I think it might be a GNU extension or only supported in later implementations of awk.

Try:

Code:

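S=$SECONDS
find /path/to/files -type f -print | awk '
# First input (input.txt): load each phrase, lowercased and space-padded
NR==FNR{w[" "tolower($0)" "]++ ; next}
# Remaining input (stdin): each record is a file name printed by find
{ FILE=$0;
  split("",h,",");                  # clear the per-file "already reported" array
  while((getline < FILE) > 0) {     # stop at end of file (0) and on read errors (-1)
     $0=" "tolower($0)" "
     for(l in w)
           if(!(l in h) && match($0, l)) {   # note: match() treats l as a regex
              print substr(l,2)"is found in: "FILE
              h[l]++
           }
  }
  close(FILE)
}' input.txt - > output.txt
echo "Processing time: $((SECONDS-S)) seconds"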

The split statement is a more portable way to clear an array.

Just change /path/to/files and input.txt to match your particular setup.

Chubler_XL


03-16-2012


Quote:

Originally Posted by Chubler_XL

The delete statement may also give you some issues on AIX, as I think it might be a GNU extension or only supported in later implementations of awk.

The delete statement is supported in POSIX awk for individual array elements, but not for a whole array. So delete h is an extension, but this should work:

Code:

for(i in h)delete h[i]
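
As a rough illustration of the two portable ways to empty an array mentioned in this thread, here is a small sketch. The helper names count_with_split and count_with_loop and the sample input are made up for the example. Both print the number of distinct words on each input line, which only comes out right if the array really is emptied between lines.

```shell
# Both helpers print the number of distinct words on each input line.
# The function names are hypothetical, chosen for this illustration only.
count_with_split() {
  awk '{
    split("", seen)                      # empties the array in any awk
    n = 0
    for (i = 1; i <= NF; i++)
      if (!($i in seen)) { seen[$i] = 1; n++ }
    print n
  }'
}

count_with_loop() {
  awk '{
    for (k in seen) delete seen[k]       # delete one element at a time
    n = 0
    for (i = 1; i <= NF; i++)
      if (!($i in seen)) { seen[$i] = 1; n++ }
    print n
  }'
}

split_result=$(printf 'a b a\nb c\n' | count_with_split)
loop_result=$(printf 'a b a\nb c\n' | count_with_loop)
printf '%s\n' "$split_result" "$loop_result"
```

If the array were not cleared, the second line would report 1 instead of 2, because b would still be remembered from the first line.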

Quote:

Code:

w[" "tolower($0)" "]++

This would mean that only words or phrases between spaces get matched, but there are undoubtedly cases where a phrase is preceded or followed by punctuation (, . ! ? ; :) or sits at the beginning or end of a line, no?
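
One possible way around the punctuation problem, sketched here as an assumption rather than something tested against the real collection, is to turn every run of non-alphanumeric characters into a single space before the space-padded comparison. The helper name normalize_line, the sample sentence, and the phrase are invented for the example.

```shell
# normalize_line is a hypothetical helper: it lowercases each line, replaces
# every run of non-alphanumeric characters with one space, and pads both ends.
normalize_line() {
  awk '{
    line = tolower($0)
    gsub(/[^a-z0-9]+/, " ", line)
    print " " line " "
  }'
}

# Literal lookup of a space-padded phrase in the normalized line
found=$(printf 'Call it the White Whale!\n' | normalize_line |
  awk -v p=" white whale " 'index($0, p) { print "yes" }')
echo "$found"
```

The phrases in the list would need the same treatment, and this simple character class also discards apostrophes and non-ASCII letters, so it may be too blunt for some texts.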

---------- Post updated at 13:14 ---------- Previous update was at 07:13 ----------

Given the word-matching abilities of grep, I thought the best approach would be to optimize your script. I changed the following parts:
- Replaced all those find calls with a single find whose result is stored in a variable called filelist.
- Ran grep with the -l option and set the environment variable IFS so that it contains only a linefeed; the sort, the cut, and the /dev/null argument are then no longer needed.
- Added the -F flag to grep to switch off regex matching and use literal matching, which also ensures that no unintended matches occur.

This resulted in this script you could try:

Code:

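oldIFS=$IFS
# Split only on newlines so file names containing spaces stay intact
IFS="
"
filelist=$(find /path/to/files -type f)
while read -r word
do
  # -F literal match, -w whole words, -i ignore case, -l list file names only;
  # -e keeps a phrase that starts with a dash from being read as an option
  a=$(grep -Fwil -e "$word" $filelist)
  if [ -n "$a" ]; then
    echo "$word is found in: "
    echo "$a"
  fi
  echo ""
done < input.txt > output.txt
IFS=$oldIFS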

Preliminary testing showed a 15-fold speed improvement; YMMV.
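
To show why -F matters for a phrase list, here is a small sketch with made-up sample lines. Without -F the dot in the phrase is a regex wildcard, so it also matches a line the phrase was never meant to hit.

```shell
# Two sample lines; only the first contains the literal phrase "version 1.5"
sample=$(printf 'version 1.5\nversion 105\n')

# Regex matching: "." matches any character, so both lines count
regex_hits=$(printf '%s\n' "$sample" | grep -c 'version 1.5')

# Literal matching with -F: only the exact string counts
fixed_hits=$(printf '%s\n' "$sample" | grep -Fc 'version 1.5')

echo "regex: $regex_hits, fixed: $fixed_hits"   # regex: 2, fixed: 1
```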


Scrutinizer


03-19-2012


By default, AIX 5.3 allows only 4K for argument expansion, which can result in Argument/Parameter list too long errors even when processing quite short parameter strings.

I'm almost sure that xargs would be required in the above script, depending on the actual length of /path/to/files and the current value of the ncargs OS parameter.
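
A minimal sketch of the xargs idea, run in a throwaway directory with invented file names and an invented phrase. xargs feeds the file list to grep in batches that fit within the argument-length limit, so the shell never has to expand the whole list at once.

```shell
# Throwaway demo directory; the file names and the phrase are made up
demo_dir=$(mktemp -d)
printf 'The quick brown fox\n' > "$demo_dir/a.txt"
printf 'Nothing to see here\n' > "$demo_dir/b.txt"

word="quick brown"
# xargs runs grep as many times as needed to stay under the argument limit;
# -l prints each matching file name once.
# Assumption: file names contain no spaces, quotes, or newlines, because
# plain xargs splits its input on whitespace.
matches=$(find "$demo_dir" -type f -print | xargs grep -Fwil -e "$word")
echo "$matches"
rm -rf "$demo_dir"
```

In the real script, /path/to/files would take the place of the demo directory, and the grep call would sit inside the loop that reads input.txt.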


Chubler_XL


03-19-2012


You can try this. Its limit is shell expansion, and it does not cope well when the same search string appears two or more times in the same file.
In that case it will give output like:

Code:

<search string> is present in :
/path/to/filename1 
/path/to/filename1

Code:

INPUT=$(tr "\n" "|"  < input | sed "s/|$//")
find /path/to/dir -type f -exec egrep -wi "$INPUT" {} /dev/null \; | \
awk -F":" '{ a[$2] = a[$2] "|" $1 } END  { for ( i in a ) print i " is present in :" a[i] } ' | \
tr "|" "\n"
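
One way the repeated file names might be avoided, shown as a sketch on invented grep-style output rather than on the real files, is to skip any file:text line that has already been seen before grouping.

```shell
# Simulated "file:text" lines; the paths and text are made up for the demo
grep_output=$(printf '%s\n' '/path/a.txt:foo' '/path/a.txt:foo' '/path/b.txt:foo')

# !seen[$0]++ is true only the first time a given line appears,
# so each file is appended to a key at most once
grouped=$(printf '%s\n' "$grep_output" |
  awk -F":" '!seen[$0]++ { a[$2] = a[$2] "|" $1 }
             END { for (i in a) print i " is present in :" a[i] }' |
  tr "|" "\n")
echo "$grouped"
```

This keeps the seen array in memory for every distinct line, so it trades some memory for cleaner output.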

Peasant


03-19-2012


Hi Chubler, thanks, of course. That was pretty silly.

Then it gets a bit more complicated, as we would need to chop the file list up into snack-size chunks. We could try this:

Code:

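oldIFS=$IFS
# Split only on newlines so file names containing spaces stay intact
IFS="
"
# Run grep on the current chunk of files, print any hits, then reset the chunk
write_snack()
{
  if a=$(grep -Fwil -e "$word" $snack); then
    if ! $wordfound; then
      printf "%s\n" "$word is found in: "
      wordfound=true
    fi
    printf "%s\n" "$a"
  fi
  i=0
  snack=""
}

snacksize=25          # Nr of files to feed to grep at a time
i=0 snack=""
filelist=$(find /path/to/files -type f)
while read -r word
do
  wordfound=false
  for f in $filelist
  do
    # Always add the file first, so the file that fills the chunk is not skipped
    snack=${snack}${IFS}${f}
    i=$((i+1))
    if [ "$i" -ge "$snacksize" ]; then
      write_snack
    fi
  done
  # Flush the last, partially filled chunk
  if [ "$i" -gt 0 ]; then
    write_snack
  fi
  printf "\n"
done < input.txt > output.txt
IFS=$oldIFS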


Scrutinizer

