Bash script search, improve performance with large files


# 1  

Hello,


For several of our scripts we use awk to search for patterns from one file in the data of another file. This works almost perfectly, except that it takes ages to run on larger files. I am wondering if there is a way to speed this process up, or whether there is something else that is quicker at the searching.


The part that I use is as follows:


Code:
awk -F";" '
NR==FNR         {id[$0]
                 next
                }
                {for (SP in id) if (tolower($0) ~ SP)    {print > "'"$PAD/removed_woord.csv"'"
                                                 next
                                                }
                }
                {print > "'"$PAD/filtered_winnaar_2.csv"'"
                }
' $PAD/prijslijst_filter.csv $PAD/lowercase_winnaar.csv



I got this piece of programming from this forum as well, but I added the tolower part myself since it did not always seem to get all results from the main file. One important point is that the matched lines need to be saved to a separate file; the filtered file then only contains the lines that were not found, of course.

# 2  
Your above awk script is so minimalistic that it's hard to dream up a dramatic improvement.
Did you try
Code:
grep -f  $PAD/prijslijst_filter.csv $PAD/lowercase_winnaar.csv

for a performance comparison?
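
(A sketch of how such a comparison could be timed, not from the original post: run the grep under time with its output discarded, so that writing a result file does not skew the numbers.)

Code:
time grep -f "$PAD/prijslijst_filter.csv" "$PAD/lowercase_winnaar.csv" > /dev/null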
# 3  
You might want to build an "alternation regex" (provided there are not too many keywords) and modify the matching slightly. Compare the performance of

Code:
time awk '
NR==FNR                 {SRCH = SRCH DL $0      # build one big "pattern1|pattern2|..." alternation
                         DL = "|"
                         next
                        }
tolower($0) ~ SRCH      {print > "'"$PAD/removed_woord.csv"'"
                         next
                        }

                        {print > "'"$PAD/filtered_winnaar_2.csv"'"
                        }
' file3 file4

real    0m2,328s
user    0m2,318s
sys    0m0,005s

to this


Code:
time awk '
NR==FNR         {id[$0]
                 next
                }
                {for (SP in id) if (tolower($0) ~ SP)    {print > "'"$PAD/removed_woord.csv"'"
                                                 next
                                                }
                }
                {print > "'"$PAD/filtered_winnaar_2.csv"'"
                }
' file3 file4
real    0m17,038s
user    0m16,995s
sys    0m0,025s

That seems to be a speedup of roughly a factor of 7, and the output seems to be identical. Please try it and report back.

# 4  
I know it is a really short script, and if I am not mistaken you even wrote it.



I just timed both of them and the results are as follows. It looks like there is a lot of improvement using the grep line, except that I don't know whether it separates out the filtered lines into another file like the awk solution does.



Code:
awk -F";"  prijslijst_filter.csv lowercase_winnaar.csv  260,73s user 0,50s system 99% cpu 4:21,84 total

Code:
grep --color=auto -f prijslijst_filter.csv lowercase_winnaar.csv  45,13s user 0,52s system 99% cpu 45,679 total





I just tested your last bit and it makes a huge difference.


Code:
awk  prijslijst_filter.csv lowercase_winnaar.csv  9,51s user 0,13s system 99% cpu 9,647 total


I will have to check the files, but this would help enormously with all the scripts.

# 5  
Just one additional note:

grep -F (grep for fixed strings, i.e. no patterns) is a lot faster than regular grep,
so you may try:

Code:
grep -F -f prijslijst_filter.csv lowercase_winnaar.csv

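One caveat worth adding here (an observation, not from the original posts): with -F the entries in prijslijst_filter.csv are matched as literal substrings, whereas the awk version treated them as regular expressions, so the results can differ if the pattern file contains regex metacharacters. A rough way to spot such lines, for example:

Code:
# list pattern lines containing . * or [ , which grep -F would take literally
grep -n '[.*[]' prijslijst_filter.csv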
# 6  
You'd need to run two greps, one for positive matches, one (with the -v option) for non-matches.
# 7  
Quote:
Originally Posted by stomp
Just one additional note:

grep -F (grep for fixed strings, i.e. no patterns) is a lot faster than regular grep,
so you may try:

Code:
grep -F -f prijslijst_filter.csv lowercase_winnaar.csv


I just tested this and it is even faster:


Code:
grep --color=auto -F -f prijslijst_filter.csv lowercase_winnaar.csv  0,19s user 0,11s system 50% cpu 0,594 total

Quote:
Originally Posted by RudiC
You'd need to run two greps, one for positive matches, one (with the -v option) for non-matches.

I believe you mean like this:


Code:
grep -v -F -f prijslijst_filter.csv lowercase_winnaar.csv > unfiltered_stuff.csv

and
Code:
grep -F -f prijslijst_filter.csv lowercase_winnaar.csv > filtered_stuff.csv




One last question about this, though. How well does this behave with capitals and such? The first awk script did not like capitals, so I had to lowercase everything. It would be best if it just ignored casing completely.
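
(Not an answer from the thread, just a sketch: grep has a -i option for case-insensitive matching that can be combined with -F and -v, so the separate lowercasing step might not be needed at all. Using the same file names as above:)

Code:
grep -i -F -f prijslijst_filter.csv lowercase_winnaar.csv > filtered_stuff.csv
grep -i -v -F -f prijslijst_filter.csv lowercase_winnaar.csv > unfiltered_stuff.csv

Note that -i can make grep -F noticeably slower in some locales, so it is worth timing it the same way as the runs above.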
