Processing too slow with loop


 
# 1  
Old 06-23-2016

I have 2 files

File 1 contains:
Code:
ALINE
ALINE BANG
B ON A
B.B.V.A.
BANG AMER CORG
BANG ON MORENA
BANG ON MORENAIC
BANG ON MORENAICA
BANG ON MORENAICA CORP
BANG ON MORENAICA N.A



File 2 is separated by the ^ delimiter and contains:

Code:
NATIO MARKET^345432534
+ COLUMBUS DISCOVERY in MORENAGO VESPUSSI^999921342
Gadappa'S F315^6716158190
+ SPEEDWAY 0533242 2332492 SPEEDWAY 0534234 352 KETNG CHQ24324435^9392493223
VILA ALINE VILLA ARR 24311605 9900961622^93294932
CHECK # 2193^99939249
online/phone xfr in fr acc 06500518267 date: 04-22-16 time: 11:14:32^45345334
mastermon bang on morena pa cucin new york ny xxxxxxxxxxxx0177^1232131
network printed workign  jean pual dum ave long beac ny xxxxxxxxxxxx0177^1232131
master Bangalore petrol bunk metro 070-mt. v washingto dc xxxxxxxxxxxx0177^1232131

I want each string from file1, which has a limited number of rows, to be matched in file2, which has millions of rows, and to get output with the count for each string.

I tried the code below, but it takes a lot of time and does not give proper values in the output:

Code:
file="/opt/sdp/.nikhil/PWD/beta.txt"
while read -r line; do
    count=`grep -wi $line /opt/sdp/.nikhil/PWD/alpha.txt|wc -l`
echo $line "|" $count >>  opfile.txt
done < "$file"

The output I'm getting is incorrect: file2 only contains "ALINE", yet the count is incremented to 1 even for "ALINE BANG", as shown below. It's a similar case with "BANG ON MORENA" as well.

Code:
ALINE | 1
ALINE BANG | 1
B ON A | 0
B.B.V.A. | 0
BANG AMER CORG | 1
BANG ON MORENA | 1
BANG ON MORENAIC | 1
BANG ON MORENAICA | 1
BANG ON MORENAICA CORP | 1
BANG ON MORENAICA N.A | 1

# 2  
Old 06-23-2016
The following fixes a few issues
Code:
file=beta.txt
while read -r line
do
   # quote "$line" so multi-word strings are passed as one pattern;
   # -c makes grep count the matching lines itself, replacing "| wc -l"
   count=`grep -wic "$line" alpha.txt`
   echo "$line | $count"
done < $file > opfile.txt

It still does PARTIAL matching of ALL fields.
That means if "ALINE BANG" matches, "ALINE" matches also.
If you could restrict the search to a fixed field, to full-field matching, to case-sensitive matching, ..., all of this can help make it faster.
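For example, if a match should cover the entire first ^-delimited field, one awk pass over the big file can replace the per-pattern grep loop. This is just a sketch of that idea, assuming POSIX awk behavior (a single-character FS is taken literally) and that case-insensitive full-field matching is what you want; note the patterns are printed lowercased here:
Code:
awk -F'^' '
NR==FNR { count[tolower($0)] = 0; next }       # load patterns from beta.txt
tolower($1) in count { count[tolower($1)]++ }  # full match on field 1 only
END { for (p in count) print p " | " count[p] }
' beta.txt alpha.txt > opfile.txt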
# 3  
Old 06-23-2016
How about
Code:
grep -oif file1 file2 | sort | uniq -c

# 4  
Old 06-23-2016
Rudi,

Thanks for that. It works fine for smaller files, but with huge files, sizes varying around 5-6 GB, performance dips gradually.
Is there any alternate approach?

MadeinGermany -- Thanks :-)
# 5  
Old 06-23-2016
Does performance actually get worse? Or does it just take 100,000x longer to process a 100,000x larger file? About how many matches are you expecting?

There are memory-heavy ways to do it faster, but they're not really applicable to massive files. You could try divide-and-conquer: run as many jobs simultaneously as your CPU and disks can easily handle, sort their output individually, then merge them in one final step.
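Something like this sketch, for instance, assuming GNU split for the line-based chunking and reusing RudiC's grep pipeline per chunk (4 jobs here; tune to your hardware):
Code:
split -n l/4 alpha.txt part_              # 4 line-based chunks, one per core
for f in part_*
do
    grep -oif beta.txt "$f" | sort > "$f.matched" &
done
wait                                      # let all the greps finish
sort -m part_*.matched | uniq -c > opfile.txt
rm part_*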
# 6  
Old 06-23-2016
If the patterns are always fixed strings, using fgrep or grep -F may result in a HUGE performance boost.

If possible, run fgrep without -i; that will get you another performance boost. Also put LANG=C before the fgrep command, which should speed things up a little too.
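Applied to RudiC's pipeline from post #3, that would look something like this sketch (the second form drops -i and is faster still, but then matching becomes case sensitive):
Code:
LANG=C grep -oiFf beta.txt alpha.txt | sort | uniq -c
LANG=C grep -oFf beta.txt alpha.txt | sort | uniq -c   # no -i: faster, case sensitive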

Sidenote

There was a scripting task request in the German Linux forum (www.linuxforen.de) here: Linuxforen.de thread regarding fgrep

The task was similar. The big file had 5,000,000 lines (300 MB); the smaller file had 100,000 lines (3 MB). The results:

  • Winner, fgrep: 7 seconds
  • extremely optimized Lua script: 8.6 seconds
  • awk script: ~97 hours (obviously the great awk hackers here would get a whole lot more out of awk)
  • regular grep: stopped after 45 minutes of runtime and 12 GB of RAM usage
I think that situation is not so far from the one here. I suppose the smaller file here is a lot smaller, so the task will not be as CPU-intensive as that one, but this task has a lot more to read (5-6 GB, as nikhil said).

# 7  
Old 06-24-2016
Stomp,

Thanks a lot for that, but it does not ignore case or do strict word matching even when the "i" and "w" options are used.
Maybe it's something to do with the "F" option; I suppose it overrides them.

Corona,

File 2 is around 6 GB, and file 1 is around 2.4K.