Processing too slow with loop


 
# 1  
Old 06-23-2016

I have 2 files

File 1 contains:
Code:
ALINE
ALINE BANG
B ON A
B.B.V.A.
BANG AMER CORG
BANG ON MORENA
BANG ON MORENAIC
BANG ON MORENAICA
BANG ON MORENAICA CORP
BANG ON MORENAICA N.A



File 2 is separated by the ^ delimiter and contains:

Code:
NATIO MARKET^345432534
+ COLUMBUS DISCOVERY in MORENAGO VESPUSSI^999921342
Gadappa'S F315^6716158190
+ SPEEDWAY 0533242 2332492 SPEEDWAY 0534234 352 KETNG CHQ24324435^9392493223
VILA ALINE VILLA ARR 24311605 9900961622^93294932
CHECK # 2193^99939249
online/phone xfr in fr acc 06500518267 date: 04-22-16 time: 11:14:32^45345334
mastermon bang on morena pa cucin new york ny xxxxxxxxxxxx0177^1232131
network printed workign  jean pual dum ave long beac ny xxxxxxxxxxxx0177^1232131
master Bangalore petrol bunk metro 070-mt. v washingto dc xxxxxxxxxxxx0177^1232131

I want each string from file1, which has a limited number of rows, to be matched in file2, which has millions of rows, and to get output with the count for each string.

I tried the code below, but it takes a lot of time and does not give proper values in the output:

Code:
file="/opt/sdp/.nikhil/PWD/beta.txt"
while read -r line; do
    count=`grep -wi $line /opt/sdp/.nikhil/PWD/alpha.txt|wc -l`
echo $line "|" $count >>  opfile.txt
done < "$file"

The output I'm getting is incorrect: file2 only contains "ALINE", yet the count is incremented to 1 even for "ALINE BANG", as shown below. It's a similar case with "BANG ON MORENA" as well.

Code:
ALINE | 1
ALINE BANG | 1
B ON A | 0
B.B.V.A. | 0
BANG AMER CORG | 1
BANG ON MORENA | 1
BANG ON MORENAIC | 1
BANG ON MORENAICA | 1
BANG ON MORENAICA CORP | 1
BANG ON MORENAICA N.A | 1

# 2  
Old 06-23-2016
The following fixes a few issues
Code:
file=beta.txt
while read -r line
do
   # quote "$line" so multi-word strings are passed as one pattern;
   # -c makes grep count the matching lines itself, replacing "| wc -l"
   count=`grep -wic "$line" alpha.txt`
   echo "$line | $count"
done < $file > opfile.txt

It still does PARTIAL matching of ALL fields.
That means if "ALINE BANG" matches, "ALINE" matches also.
If you could restrict the search to a fixed field, to full-field matching, to case-sensitive matching, ..., all of this can help make it faster.
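For example, if a match should cover the entire first ^-delimited field, one awk pass over the big file can replace the per-pattern grep loop. This is just a sketch of that idea, assuming POSIX awk behavior (a single-character FS is taken literally) and that case-insensitive full-field matching is what you want; note the patterns are printed lowercased here:
Code:
awk -F'^' '
NR==FNR { count[tolower($0)] = 0; next }       # load patterns from beta.txt
tolower($1) in count { count[tolower($1)]++ }  # full match on field 1 only
END { for (p in count) print p " | " count[p] }
' beta.txt alpha.txt > opfile.txt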
# 3  
Old 06-23-2016
How about
Code:
grep -oif file1 file2 | sort | uniq -c

# 4  
Old 06-23-2016
Rudi,

Thanks for that. It works fine for smaller files, but with huge files, sizes varying around 5-6 GB, performance dips gradually.
Is there any alternate approach?

MadeinGermany -- Thanks :-)
# 5  
Old 06-23-2016
Does performance actually get worse? Or does it just take 100,000x longer to process a 100,000x larger file? About how many matches are you expecting?

There are memory-heavy ways to do it faster, but they're not really applicable to massive files. You could try divide-and-conquer: run as many jobs simultaneously as your CPU and disks can easily handle, sort their output individually, then merge them in one final step.
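Something like this sketch, for instance, assuming GNU split for the line-based chunking and reusing RudiC's grep pipeline per chunk (4 jobs here; tune to your hardware):
Code:
split -n l/4 alpha.txt part_              # 4 line-based chunks, one per core
for f in part_*
do
    grep -oif beta.txt "$f" | sort > "$f.matched" &
done
wait                                      # let all the greps finish
sort -m part_*.matched | uniq -c > opfile.txt
rm part_*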
# 6  
Old 06-23-2016
If the patterns are always fixed strings, using fgrep or grep -F may result in a HUGE performance boost.

If possible, run fgrep without -i; that will get you another performance boost. Also put LANG=C before the fgrep command, which should speed things up a little too.
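Applied to RudiC's pipeline from post #3, that would look something like this sketch (the second form drops -i and is faster still, but then matching becomes case sensitive):
Code:
LANG=C grep -oiFf beta.txt alpha.txt | sort | uniq -c
LANG=C grep -oFf beta.txt alpha.txt | sort | uniq -c   # no -i: faster, case sensitive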

Sidenote

There was a scripting task request in the German Linux forum (www.linuxforen.de) here: Linuxforen.de thread regarding fgrep

The task was similar. The big file had 5,000,000 lines (300 MB); the smaller file had 100,000 lines (3 MB). The results:

  • Winner, fgrep: 7 seconds
  • extremely optimized Lua script: 8.6 seconds
  • awk script: ~97 hours (obviously the great awk hackers here would get a whole lot more out of awk)
  • regular grep: stopped after 45 minutes of runtime and 12 GB of RAM usage
I think that situation is not so far from the one here. I suppose the smaller file here is a lot smaller, so the task will not be as CPU-intensive as that one, but this task has a lot more to read (5-6 GB, as nikhil said).

# 7  
Old 06-24-2016
Stomp,

Thanks a lot for that, but it does not ignore case or do strict word matching even when the "i" and "w" options are used.
Maybe it's something to do with the "F" option; I suppose it overrides them.

Corona,

File 2 is around 6 GB, and file 1 is around 2.4K.