Bash script search, improve performance with large files


 
# 8  
Old 03-28-2019
grep -F -i does a fixed-string search and ignores case.
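For reference, the grep variant discussed in the thread could look something like this; the exact command from the earlier posts isn't quoted here, so the file names (taken from the timing output further down) and the two output files are just assumptions:

Code:
# -F: fixed strings, -i: ignore case, -f: read the search terms from a file
grep -F -i    -f prijslijst_filter.csv lowercase_winnaar.csv > removed_woord.csv
# -v: keep only the lines that match none of the search terms
grep -F -i -v -f prijslijst_filter.csv lowercase_winnaar.csv > filtered_winnaar.csv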
# 9  
Old 03-28-2019
Would you mind also timing the proposal in post #3?
# 10  
Old 03-28-2019
Quote:
Originally Posted by RudiC
Would you mind also timing the proposal in post #3?

I actually did, but I edited it into the post afterwards.


Code:
awk  prijslijst_filter.csv lowercase_winnaar.csv  9,51s user 0,13s system 99% cpu 9,647 total

Since the difference between the grep and this newer awk is only a few seconds, I am not sure which one I am going to use. The awk one is preferred as it is a drop-in replacement for the current one, but the grep one is still quite a lot faster.


Grep also has the advantage that it handles the ignore-case part better. I never seem to get that working properly with the awk one, even with the forced lowercase on both files.
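In case it helps, the usual trick is to lowercase the keywords as well while building the alternation regex, not just the data lines. This is only an untested sketch based on RudiC's code, and the output file names are placeholders:

Code:
awk '
NR==FNR            {SRCH = SRCH DL tolower($0)   # lowercase the keyword too
                    DL = "|"
                    next
                   }
tolower($0) ~ SRCH {print > "removed_woord.csv"; next}
                   {print > "filtered_winnaar.csv"}
' prijslijst_filter.csv lowercase_winnaar.csv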




I just tried your awk solution again, RudiC, and it seems something is wrong with it. I did not check the first time because I had to leave right after I tested it (the files got overwritten afterwards).


It seems the part you gave does not produce any files for the rest of the script to continue with.


Code:
awk '
NR==FNR                 {SRCH = SRCH DL $0       # 1st file: build one alternation regex from the keywords
                         DL = "|"                # use "|" as separator after the first keyword
                         next
                        }
tolower($0) ~ SRCH      {print > "'"$PAD/removed_woord_blaat33.csv"'"      # line matches a keyword
                         next
                        }

                        {print > "'"$PAD/filtered_winnaar_blaat33.csv"'"   # no keyword matched
                        }
' prijslijst_filter.csv lowercase_winnaar.csv


I tried with and without time to see if that caused the issue, but it did not change the outcome. Neither of the new files is created.

# 11  
Old 03-28-2019
When processing extremely large files you might consider using split first.
Then, in a multicore environment, spawn several awk or grep processes from the shell script to work on the pieces in parallel.
There are also GNU tools that offer parallelism without any shell logic.

It is a bit more work to program, but the processing time will be reduced significantly if you have the cores and the disks are fast enough to keep up.

Memory also comes into play: since split reads the files, the operating system will cache them in memory if enough is available, which makes the subsequent awk or grep processes much faster on read operations.

The limits are, of course, the free memory on the system and the file system caching configuration in general. With default settings, file system caching can use a large portion of the free memory on most Linux/UNIX systems I've seen.
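As a rough illustration of the idea (the split size, file names and keyword file are assumptions, and only the "matched" output is produced here):

Code:
#!/bin/bash
# cut the big file into pieces, filter every piece in the background,
# then merge the partial results; GNU parallel can do the same without the loop
split -l 100000 lowercase_winnaar.csv chunk_
for f in chunk_*; do
    grep -F -i -f prijslijst_filter.csv "$f" > "$f.matched" &
done
wait                                  # wait for all background greps to finish
cat chunk_*.matched > removed_woord.csv
rm -f chunk_*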

Hope that helps
Regards
Peasant.
# 12  
Old 03-29-2019
Quote:
Originally Posted by Peasant
When processing extremely large files you might consider using split first.
Then, in a multicore environment, spawn several awk or grep processes from the shell script to work on the pieces in parallel.
There are also GNU tools that offer parallelism without any shell logic.

It is a bit more work to program, but the processing time will be reduced significantly if you have the cores and the disks are fast enough to keep up.

Memory also comes into play: since split reads the files, the operating system will cache them in memory if enough is available, which makes the subsequent awk or grep processes much faster on read operations.

The limits are, of course, the free memory on the system and the file system caching configuration in general. With default settings, file system caching can use a large portion of the free memory on most Linux/UNIX systems I've seen.

Hope that helps
Regards
Peasant.

This sounds very interesting, but there are two issues.


1. I would have to split the files into smaller ones (around 5k, I guess), which isn't a big deal but is a bit annoying.
2. Since this runs inside a script, I have no idea how to call multiple instances of awk at the same time. Everything I know says the script handles each part one after the other, not in parallel. If you have an idea how to accomplish that, please let me know, since it does sound interesting/promising.


CPU and memory aren't the issue, as there is plenty of both. The only thing that can stall this script is the other scripts that are also running. I tried spreading them out as much as possible, but some just take quite long to run, and that is why I want to slim them down so they don't run at the same time.
# 13  
Old 04-02-2019
Quote:
Originally Posted by RudiC
You might want to build an "alternation regex" (provided there are not too many keywords) and modify the matching slightly. Compare the performance of

Code:
awk '
NR==FNR                 {SRCH=SRCH DL $0
                         DL = "|"
                         next
                        }
tolower($0) ~ SRCH      {print > "'"$PAD/removed_woord.csv"'"
                         next
                        }

                        {print > "'"$PAD/filtered_winnaar_2.csv"'"
                        }
' file3 file4 

real    0m2,328s
user    0m2,318s
sys    0m0,005s

to this


Code:
time awk '
NR==FNR         {id[$0]
                 next
                }
                {for (SP in id) if (tolower($0) ~ SP)   {print > "'"$PAD/removed_woord.csv"'"
                                                         next
                                                        }
                }
                {print > "'"$PAD/filtered_winnaar_2.csv"'"
                }
' file3 file4
real    0m17,038s
user    0m16,995s
sys    0m0,025s

seems to make a factor of roughly 7. The output seems to be identical. Please try and report back.



I just ran this one again and got it working. I noticed the -F";" was missing, so I added it and it worked flawlessly. The complete script now runs in about 20 seconds, where it took more than 7 minutes before.
# 14  
Old 04-02-2019
Congrats, that would be a performance gain of a factor of ~21!


I'd be surprised if the script needs the -F";", as it doesn't handle single fields but just the entire line, $0.
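For illustration, the field separator only affects how awk splits a line into $1, $2, ...; $0 itself stays untouched, so a match against $0 should behave the same with or without -F";":

Code:
echo "foo;bar" | awk -F";" '{print $0}'    # prints foo;bar
echo "foo;bar" | awk        '{print $0}'   # prints foo;bar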
