Can someone please help me optimize my code (script searches subdirectories)?


 
# 15  
Old 03-15-2012
I timed my awk script (from post #5) against a midsized Gutenberg collection (5,354,620 lines of text in 203 documents across 20 directories), using a phrase list of 3939 phrases.

Processing time: 1h 43min

The original script is still running (over 17h now).
# 16  
Old 03-15-2012
Quote:
Originally Posted by Chubler_XL
I timed my awk script (from post #5) against a midsized Gutenberg collection (5,354,620 lines of text in 203 documents, 20 directories) I have a phrase list of 3939 phrases.

Processing time: 1h 43min

The original script is still running (Over 17h now)
For some reason, I had some trouble running both of those scripts. The only things I really need to change are the input/output files and the path to the directory, right?
# 17  
Old 03-15-2012
The delete statement may also give you some issues on AIX, as I think it might be a GNU extension or only supported in later implementations of awk.

Try:

Code:
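S=$SECONDS   # SECONDS is provided by ksh and bash
find /path/to/files -type f -print | awk '
# First input (input.txt): store each phrase, lowercased and space-padded.
NR==FNR{w[" "tolower($0)" "]++ ; next}
# Second input (stdin): each record is a file name printed by find.
{ FILE=$0
  split("",h,",")                         # empty h before each new file
  while((getline line < FILE) > 0) {      # > 0 stops at EOF and on read errors
     line=" "tolower(line)" "
     for(l in w)
           if(!(l in h) && match(line, l)) {   # match() treats l as a regex
              print substr(l,2)"is found in: "FILE
              h[l]++                      # report each phrase once per file
           }
  }
  close(FILE)
}' input.txt - > output.txt
echo "Processing time: "$((SECONDS-S))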

The split statement is a bit more of a portable way to clear an array.

Just change /path/to/files and input.txt to match your particular setup.
# 18  
Old 03-16-2012
Quote:
Originally Posted by Chubler_XL
The delete statement may also give you some issues on AIX, as I think it might be a GNU extension or only supported in later implementations of awk.
The delete statement is supported in POSIX awk, but only for individual array elements, not for a whole array. So delete h is an extension, but this should work:
Code:
for(i in h)delete h[i]
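
To compare the two portable ways of emptying an array mentioned here, the following is a minimal sketch rather than part of either posted script. The function name count_after_clear and the sample keys are made up for illustration; it fills an array, clears it with the chosen method, and prints how many elements are left.

```shell
# Hypothetical helper: fill an awk array, clear it with the method named
# in $1 ("split" or anything else for the loop), and print what remains.
count_after_clear() {
  awk -v method="$1" 'BEGIN {
    h["a"] = 1; h["b"] = 1; h["c"] = 1
    if (method == "split")
      split("", h)                  # splitting an empty string empties h
    else
      for (i in h) delete h[i]      # delete one element at a time
    n = 0
    for (i in h) n++
    print n                         # number of elements still in h
  }'
}

count_after_clear split
count_after_clear loop
```

Both calls should report that no elements remain, so either form can stand in for delete h on an awk that lacks it.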

Quote:
Code:
w[" "tolower($0)" "]++

This would mean that only words or phrases between spaces get matched, but there are undoubtedly cases with , . ! ? ; : and at the beginning or end of a line, no?
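
One possible way around the punctuation problem, sketched here as an assumption and not taken from the posted script: turn every run of characters other than letters and digits into a single space before padding the line. The helper name normalize and the sample sentence are invented, and the character range is ASCII-only, so accented letters would be treated as separators.

```shell
# Hypothetical helper: lowercase each input line, replace runs of
# characters other than a-z and 0-9 with one space, and pad with spaces.
normalize() {
  awk '{
    line = tolower($0)
    gsub(/[^a-z0-9]+/, " ", line)
    print " " line " "
  }'
}

# After normalizing, the space-padded phrase can be found with index()
# even though the original text has a comma before it and "!" after it.
printf '%s\n' 'Behold, White Whale!' | normalize |
  awk '{ print (index($0, " white whale ") ? "match" : "no match") }'
```

The phrase list would need the same treatment, so that a phrase containing punctuation is compared in its normalized form.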

---------- Post updated at 13:14 ---------- Previous update was at 07:13 ----------

Given the word-matching abilities of grep, I thought the best approach would be to optimize your script. I changed the following parts:
- Replace all those finds with a single find and store the result in a variable called filelist.
- Run grep with the -l option and set the environment variable IFS so that it only contains a linefeed. With that, the sort, the cut and the extra /dev/null argument are no longer needed.
- Add the -F flag to grep to switch off regex matching and use literal matching. This also ensures that no unintended matches occur.

This resulted in the following script, which you could try:

Code:
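oldIFS=$IFS
IFS="
"                                       # split on linefeeds only
filelist=$(find /path/to/files -type f)
while read -r word
do
  # -F literal match, -w whole words, -i ignore case, -l print file names only
  a=$(grep -Fwil "$word" $filelist)     # $filelist is left unquoted so it splits per line
  if [ -n "$a" ]; then
    echo "$word is found in: "
    echo "$a"
  fi
  echo ""
done < input.txt > output.txt
IFS=$oldIFS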

Preliminary testing showed a speed improvement of about a factor of 15, YMMV.
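
To see what the -F and -w flags change, here is a small self-contained check. The file names and contents are invented for the demo, and mktemp -d is assumed to be available, which may not hold on every system.

```shell
# Work in a throwaway directory so the demo does not touch real files.
demo=$(mktemp -d)
printf '%s\n' 'The price is 3.50 today' > "$demo/a.txt"
printf '%s\n' 'Invoice 3850 was paid'   > "$demo/b.txt"
printf '%s\n' 'the pricelist is old'    > "$demo/c.txt"

# As a regex, "." matches any character, so "3.50" also hits "3850".
regex_hits=$(cd "$demo" && grep -l '3.50' a.txt b.txt c.txt | tr '\n' ' ')
# With -F the pattern is a literal string, so only a.txt matches.
fixed_hits=$(cd "$demo" && grep -Fl '3.50' a.txt b.txt c.txt | tr '\n' ' ')
# -w requires whole words, so "price" does not match "pricelist".
word_hits=$(cd "$demo" && grep -Fwil 'PRICE' a.txt b.txt c.txt | tr '\n' ' ')

echo "regex: $regex_hits"
echo "fixed: $fixed_hits"
echo "word:  $word_hits"
rm -rf "$demo"
```

The regex search lists both a.txt and b.txt, while the -F and -Fw searches list a.txt only.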

Last edited by Scrutinizer; 03-16-2012 at 09:24 AM..
# 19  
Old 03-19-2012
AIX 5.3 by default only has 4K for argument expansion, which can result in Argument/Parameter list too long errors when processing quite short parameter strings.

I'm almost sure that xargs would be required in the above script, depending on the actual length of /path/to/files and the current value of the ncargs OS parameter.
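
As a rough sketch of the xargs idea (not tested on AIX): xargs reads the file names from find and runs grep as many times as needed, keeping each command line within the system limit. The function name files_with_word and the sample files are made up, mktemp -d is assumed to exist, and file names are assumed to contain no spaces, quotes or newlines, since plain xargs splits on those.

```shell
# Hypothetical helper: list files under directory $1 that contain the
# literal word $2. The extra /dev/null argument keeps grep from reading
# standard input if find produces no file names at all.
files_with_word() {
  find "$1" -type f -print | xargs grep -Fwil -e "$2" /dev/null
}

demo=$(mktemp -d)
printf '%s\n' 'call me Ishmael' > "$demo/one.txt"
printf '%s\n' 'nothing here'    > "$demo/two.txt"
found=$(files_with_word "$demo" 'ishmael')
echo "$found"
rm -rf "$demo"
```

Note that xargs returns a non-zero status when any grep batch finds nothing, so the output should be tested rather than the exit code.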
# 20  
Old 03-19-2012
You can try this. Its limit is shell expansion, and it doesn't cope well if the same search string appears two or more times in the same file.
In that case it will give output like:

Code:
<search string> is present in :
/path/to/filename1 
/path/to/filename1

Code:
INPUT=$(tr "\n" "|"  < input | sed "s/|$//")
find /path/to/dir -type f -exec egrep -wi "$INPUT" {} /dev/null \; | \
awk -F":" '{ a[$2] = a[$2] "|" $1 } END  { for ( i in a ) print i " is present in :" a[i] } ' | \
tr "|" "\n"
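
The repeated file names could probably be suppressed by remembering which pairs of search string and file have already been recorded. This is only a sketch under the assumption that each input line has the form file:key; the helper name group_unique and the sample lines are invented.

```shell
# Hypothetical helper: read "file:key" lines and print each key followed
# by the files it occurs in, listing every file only once per key.
group_unique() {
  awk -F: '
    !(($2, $1) in seen) {            # first time this key/file pair appears
      seen[$2, $1] = 1
      files[$2] = files[$2] "\n" $1
    }
    END { for (k in files) print k " is present in :" files[k] }'
}

printf '%s\n' '/path/to/filename1:whale' '/path/to/filename1:whale' \
  '/path/to/filename2:whale' | group_unique
```

With this input, /path/to/filename1 is listed once under whale even though it appears twice in the input.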

# 21  
Old 03-19-2012
Hi Chubler, thanks, of course. That was pretty silly.

Then it gets a bit more complicated, as we would need to chop the file list up into snack-size chunks. We could try this:

Code:
oldIFS=$IFS
IFS="
"
write_snack()
{
  if a=$(grep -Fwil "$word" $snack); then
    if ! $wordfound; then
      printf "%s\n" "$word is found in: "
      wordfound=true
    fi
    printf "%s\n" "$a"
  fi
  i=0
  snack=""
}

snacksize=25          # Nr of files to feed to grep at a time
i=0 snack="" 
filelist=$(find /path/to/files -type f)
while read -r word
do
  wordfound=false
  for f in $filelist
  do
    snack=${snack}${IFS}${f}      # add the file first so that none is skipped
    if [ $((i+=1)) -ge $snacksize ]; then
      write_snack                 # snack is full: grep it and reset
    fi
  done
  if [ $i -gt 0 ]; then
    write_snack                   # grep the last, partial snack
  fi
  printf "\n"
done < input.txt > output.txt
IFS=$oldIFS
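
The chunking logic can be checked in isolation with a few dummy items. This is a minimal sketch with a made-up function name, in_batches. It adds each item to the current batch before testing the counter, so the item that fills a batch is included, and it flushes whatever is left at the end.

```shell
# Hypothetical helper: print the arguments after $1 in groups of at most
# $1 items, one group per line.
in_batches() {
  size=$1; shift
  batch="" n=0
  for item in "$@"; do
    batch="${batch}${batch:+ }${item}"   # add first, then check the count
    n=$((n + 1))
    if [ "$n" -ge "$size" ]; then
      echo "$batch"
      batch="" n=0
    fi
  done
  if [ "$n" -gt 0 ]; then                # flush the last, partial group
    echo "$batch"
  fi
}

in_batches 2 a b c d e
```

Five items in batches of two give three lines: a b, c d, and e on its own.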


Last edited by Scrutinizer; 03-19-2012 at 07:06 AM..