Grep, count and match two files


 
# 15  
Old 05-31-2006
That was my mistake; I assumed the first format given for the Detail file and worked from that.

Try the version below; it should work fine.

Code:
#! /usr/bin/ksh

# Pass 1: tally how many detail records exist for each partner
# (partner name is field 2 of d.txt).
index=1
tmp=1

for val in `awk '{print $2}' d.txt`
do
  if [ $tmp -eq 1 ]
  then
     # first record seeds the table
     fruit[$index]=$val
     cnt[$index]=0
     tmp=$(($tmp + 1))
  fi
  temp=1
  while [ $temp -le $index ]
  do
     if [ ${fruit[$temp]} = $val ]
     then
        cnt[$temp]=$((${cnt[$temp]} + 1))
        break
     fi
     temp=$(($temp + 1))
  done
  if [ $temp -gt $index ]
  then
     # unseen partner: append a new entry
     index=$(($index + 1))
     fruit[$index]=$val
     cnt[$index]=1
  fi
done

# Pass 2: compare against the summary file
# (partner name is field 2, expected count is field 3 of s.txt).
awk '{print $2, $3}' s.txt | while read first second
do
  temp=1
  while [ $temp -le $index ]
  do
     if [ ${fruit[$temp]} = $first ]
     then
        if [ ${cnt[$temp]} -eq $second ]
        then
           print ${fruit[$temp]} ${cnt[$temp]}
           break
        else
           exit 1    # counts disagree
        fi
     fi
     temp=$(($temp + 1))
  done
  if [ $temp -gt $index ]
  then
     exit 1          # partner appears in summary but not in detail
  fi
done

exit 0

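For anyone trying it out, here is a tiny, made-up example of the two input files and a run, based only on the fields the script actually reads (partner name in field 2 of d.txt; partner name and expected count in fields 2 and 3 of s.txt). The record numbers and the elided trailing fields are purely illustrative.

Code:
$ cat d.txt          # detail file: one line per record, partner in field 2 (sample data)
001 MIDWEST ...
002 AIRTRAN ...
003 MIDWEST ...

$ cat s.txt          # summary file: partner in field 2, expected count in field 3 (sample data)
001 MIDWEST 2
002 AIRTRAN 1

$ ./fine.ksh ; echo $?
MIDWEST 2
AIRTRAN 1
0

A non-zero exit status means either a count mismatch or a summary partner with no detail records.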

Last edited by matrixmadhan; 05-31-2006 at 03:01 AM..
# 16  
Old 05-31-2006
Quote:
Originally Posted by tmarikle
Complete analysis reporting based on both available summary and available detail information.

Presumably the test is being performed as a control mechanism for some other automated task, and certainly a break must occur once a problem is identified; however, based on my experience, at some point, more specific analysis will need to be performed regardless of how the failure affects the driving process. Since a complete analysis can simply be obtained at run-time (as is true in this case), failing the process and producing a complete analysis report simultaneously over simply failing the process may be worth the extra "effort."
I accept that a complete analysis prior to termination is needed and that it would be preferred. Your solution was certainly more generic!

Quote:
Originally Posted by tmarikle
Speed over the shell script solution.

Replicating the same D.txt lines so that 150,000 records exist in the file produced a 22 second execution time before the shell script determined that it needed to exit (more than 150,000 records caused the shell script to fail on my AIX system). The awk solution completed a total analysis both from summary and detail perspectives in less than 1 second. Now, given this, which solution processes unnecessarily? Also, if an exit code is all that is necessary, the awk solution can exit as soon as a mismatch occurs (albeit, after collecting data from the entire detail file).
It might have failed because the parsing of the Detail file was not correct; I have modified it now.
I tried with 150009 records and it completed in less than 21 seconds; maybe you could try it once and see.

Code:
time ./fine.ksh
MIDWEST 150004
AIRTRAN 3
CAESAR 2

real    0m20.84s
user    0m20.91s
sys     0m0.09s

Until then we cannot conclude which solution is actually faster for the requirement.
Quote:
In this case, missing the letter of the requirement is probably forgivable.
I always take the requirement as the higher priority, then work on optimization and speed once the former is done.
# 17  
Old 05-31-2006
For what it's worth, and in defense of thinking through the performance implications of a scripting solution before coding one:

First, sorry, but I couldn't get your modified script to work on my AIX system with 1 million records without changing the first loop as follows:
Code:
while read junk val junk junk junk junk junk junk junk junk junk junk
do
    ...
done < D.txt

This at least eliminates one additional process call to awk but still takes nearly two minutes to execute:
Code:
time ksh s2.sh

real    1m57.53s
user    1m57.26s
sys     0m0.10s

There are other methods that can significantly improve the shell script's performance, but it will always fall short of the awk approach because awk is optimized for text processing.

The awk solution ran in 10 seconds for the same 1 million records:
Code:
 time { ksh s.sh ; print $? ;  }
Partner              Summary    Detail     Count Mismatch 
==================== ========== ========== ===============
AIRTRAN                       3     314565 <==== Error ===
ORBITZ                        0          0                
FRONTIER                      0          0                
BACON                         0     104854 <==== Error ===
CAESAR                        2     209710 <==== Error ===
THOMAS                       44          0 <==== Error ===
MIDWEST                       4     419419 <==== Error ===
1

real    0m10.30s
user    0m10.07s
sys     0m0.12s

Running the test again on 150000 records:
Shell script solution:
Code:
 time { ksh s2.sh ; print $? ; }
1

real    0m16.72s
user    0m16.67s
sys     0m0.02s

Awk solution:
Code:
 time { ksh s.sh ; print $? ; } 
Partner              Summary    Detail     Count Mismatch 
==================== ========== ========== ===============
AIRTRAN                       3      45000 <==== Error ===
ORBITZ                        0          0                
FRONTIER                      0          0                
BACON                         0      15000 <==== Error ===
CAESAR                        2      30000 <==== Error ===
THOMAS                       44          0 <==== Error ===
MIDWEST                       4      60000 <==== Error ===
1

real    0m1.40s
user    0m1.37s
sys     0m0.02s

The shell script processes between 8,500 and 9,000 records per second, depending on where the first count mismatch is found. The awk solution processes between 97,000 and 107,000 records per second.

To be fair, the shell script really isn't saving much of its own potential workload: the detail file must be processed fully because a mismatch can occur anywhere in the file. The summary file only contains 4 or 5 records, so a few seconds are saved at best. A potential benefit is gained as the summary file grows but, again, it is fairly minimal.

Therefore, the letter of the requirement can still be met in the awk solution by adding a test in the END procedure, but the performance gain would be measured in milliseconds. I'll leave it out and seek forgiveness instead.
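Since s.sh itself isn't reproduced in this part of the thread, the following is only a minimal sketch of that idea, assuming the same layouts as above (partner in field 2 of both files, expected count in field 3 of the summary file). It is not the actual s.sh, and unlike the real report it only walks the summary entries, so a partner appearing only in the detail file would be missed.

Code:
#! /usr/bin/ksh
# Sketch only: one awk pass over both files, with the pass/fail test in END.
awk '
    NR == FNR { summary[$2] = $3; next }   # first file (s.txt): expected counts
    { detail[$2]++ }                       # second file (d.txt): tally actual counts
    END {
        status = 0
        for (p in summary) {
            flag = ""
            if (summary[p] != detail[p]) { flag = "<==== Error ==="; status = 1 }
            printf "%-20s %10d %10d %s\n", p, summary[p], detail[p], flag
        }
        exit status                        # the "letter of the requirement" test
    }
' s.txt d.txt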
# 18  
Old 05-31-2006
Quote:
First, sorry, but I couldn't get your modified script to work on my AIX system with 1 million records without changing the first loop as follows:
Code:

while read junk val junk junk junk junk junk junk junk junk junk junk
do
    ...
done < D.txt
I really don't understand why it's not running on AIX.
I tried it and tested it on both Solaris and HP-UX and it works fine.

Currently I don't have access to an AIX box, so I don't know why it's not working.
# 19  
Old 05-31-2006
Quote:
Originally Posted by madhunk
The Detail file is almost 15 million records

....
I tried Madan's script with the actual file and it produced an output like this
parse.ksh[22]: no space
Quote:
Originally Posted by matrixmadhan
I tried with 150009 records and it completed in less than 21 seconds
.....


I really don't understand why it's not running on AIX.
I tried it and tested it on both Solaris and HP-UX and it works fine ...
Code:
#! /usr/bin/ksh

index=1
tmp=1

for val in `awk '{print $2}' d.txt`
do

A line like:
for val in `awk '{print $2}' d.txt`
when run with 15,000,000 lines in d.txt is asking ksh to construct a command line of over 15,000,000 words. Remember that the command extends until the "done". Once ksh has the entire compound command in memory, it will compile it and execute the compiled code. The compiled code will be almost as long as the original compound command in this case. ksh is willing to attempt this feat; whether or not it succeeds depends on how the kernel was tuned, how much virtual memory is available, how much tmp space is available, etc.

Consider switching to:
awk '{print $2}' d.txt | while read val

This is almost the same thing, but you don't need two copies of 15,000,000 items in core simultaneously.

The
while read junk val junk junk junk junk junk junk junk junk junk junk
solution is better still. You no longer have an awk process pumping 15,000,000 words through a pipe. But
while read junk val junk
would do just as nicely. The end of the line accumulates in the last variable of a read regardless of IFS setting.
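As a sketch of what that looks like in practice (reusing the same fruit/cnt arrays as the earlier script, and assuming only that the partner name is the second field of d.txt), the first loop can be rewritten to read the file directly, with the trailing junk variable soaking up however many fields remain:

Code:
#! /usr/bin/ksh
# Sketch: build the per-partner counts without awk and without a backtick
# expansion, so ksh never has to hold the whole file as one command line.
index=0
while read junk val junk
do
   temp=1
   while [ $temp -le $index ]
   do
      if [ "${fruit[$temp]}" = "$val" ]
      then
         cnt[$temp]=$((${cnt[$temp]} + 1))
         break
      fi
      temp=$(($temp + 1))
   done
   if [ $temp -gt $index ]
   then
      index=$(($index + 1))
      fruit[$index]=$val
      cnt[$index]=1
   fi
done < d.txt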
# 20  
Old 05-31-2006
Very useful to know regarding ksh, and I succeeded in running a test on my Sun box with 1.1 million records. I had seen this effect while tracing, but I hadn't extended it to 15 million in my mind until you pointed it out.

Again, to beat my previous point to death further still and for what it's worth:

Original shell script (which, interestingly enough, runs faster than the "while read junk val junk" version):
Code:
time { ksh s2.sh ; print $? ; }
1

real    4m45.44s
user    4m43.56s
sys     0m1.32s

Records per second: 3,935
Estimated execution time on 15 million records: slightly longer than 1 hour
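(A rough check of that estimate using the measured rate: 15,000,000 records ÷ 3,935 records/second ≈ 3,812 seconds, or about 64 minutes.)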

Awk solution:
Code:
time { ksh s.sh ; print $? ; } 
Partner              Summary    Detail     Count Mismatch 
==================== ========== ========== ===============
AIRTRAN                   45000     336963 <==== Error ===
BACON                     15000     112321 <==== Error ===
CAESAR                    30000     224642 <==== Error ===
THOMAS                        1          0 <==== Error ===
MIDWEST                   60000     449304 <==== Error ===
1

real    0m9.25s
user    0m7.78s
sys     0m0.49s

Records per second: 121,430
Estimated execution time on 15 million records: slightly longer than 2 minutes
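(Again as a rough check: 15,000,000 records ÷ 121,430 records/second ≈ 124 seconds, just over 2 minutes.)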
