Grep, count and match two files


 
# 15  
Old 05-31-2006
That was my mistake; I assumed the first format given for the Detail file and worked from that.

Try the version below; it should work fine.

Code:
#! /usr/bin/ksh

# Pass 1: tally how many detail records exist for each partner
# (partner name is field 2 of d.txt).
index=1
tmp=1

for val in `awk '{print $2}' d.txt`
do
  if [ $tmp -eq 1 ]
  then
     # first record seeds the table
     fruit[$index]=$val
     cnt[$index]=0
     tmp=$(($tmp + 1))
  fi
  temp=1
  while [ $temp -le $index ]
  do
     if [ ${fruit[$temp]} = $val ]
     then
        cnt[$temp]=$((${cnt[$temp]} + 1))
        break
     fi
     temp=$(($temp + 1))
  done
  if [ $temp -gt $index ]
  then
     # unseen partner: append a new entry
     index=$(($index + 1))
     fruit[$index]=$val
     cnt[$index]=1
  fi
done

# Pass 2: compare against the summary file
# (partner name is field 2, expected count is field 3 of s.txt).
awk '{print $2, $3}' s.txt | while read first second
do
  temp=1
  while [ $temp -le $index ]
  do
     if [ ${fruit[$temp]} = $first ]
     then
        if [ ${cnt[$temp]} -eq $second ]
        then
           print ${fruit[$temp]} ${cnt[$temp]}
           break
        else
           exit 1    # counts disagree
        fi
     fi
     temp=$(($temp + 1))
  done
  if [ $temp -gt $index ]
  then
     exit 1          # partner appears in summary but not in detail
  fi
done

exit 0

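For anyone trying it out, here is a tiny, made-up example of the two input files and a run, based only on the fields the script actually reads (partner name in field 2 of d.txt; partner name and expected count in fields 2 and 3 of s.txt). The record numbers and the elided trailing fields are purely illustrative.

Code:
$ cat d.txt          # detail file: one line per record, partner in field 2 (sample data)
001 MIDWEST ...
002 AIRTRAN ...
003 MIDWEST ...

$ cat s.txt          # summary file: partner in field 2, expected count in field 3 (sample data)
001 MIDWEST 2
002 AIRTRAN 1

$ ./fine.ksh ; echo $?
MIDWEST 2
AIRTRAN 1
0

A non-zero exit status means either a count mismatch or a summary partner with no detail records.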

Last edited by matrixmadhan; 05-31-2006 at 03:01 AM..
# 16  
Old 05-31-2006
Quote:
Originally Posted by tmarikle
Complete analysis reporting based on both available summary and available detail information.

Presumably the test is being performed as a control mechanism for some other automated task, and certainly a break must occur once a problem is identified; however, based on my experience, at some point, more specific analysis will need to be performed regardless of how the failure affects the driving process. Since a complete analysis can simply be obtained at run-time (as is true in this case), failing the process and producing a complete analysis report simultaneously over simply failing the process may be worth the extra "effort."
I accept that a complete analysis prior to termination is needed and that it would be preferred. Your solution was certainly more generic!

Quote:
Originally Posted by tmarikle
Speed over the shell script solution.

Replicating the same D.txt lines so that 150,000 records exist in the file produced a 22 second execution time before the shell script determined that it needed to exit (more than 150,000 records caused the shell script to fail on my AIX system). The awk solution completed a total analysis both from summary and detail perspectives in less than 1 second. Now, given this, which solution processes unnecessarily? Also, if an exit code is all that is necessary, the awk solution can exit as soon as a mismatch occurs (albeit, after collecting data from the entire detail file).
It might have failed because the parsing of the Detail file was not correct; I have modified it now.
I tried with 150009 records and it completed in less than 21 seconds; maybe you could try it once and see.

Code:
time ./fine.ksh
MIDWEST 150004
AIRTRAN 3
CAESAR 2

real    0m20.84s
user    0m20.91s
sys     0m0.09s

Until then we cannot conclude which solution is actually faster for the requirement.
Quote:
In this case, missing the letter of the requirement is probably forgivable.
I always take the requirement as the higher priority, then work on optimization and speed once the former is done.
# 17  
Old 05-31-2006
For what it's worth, and in defense of thinking through the performance implications of a scripting solution before coding one:

First, sorry, but I couldn't get your modified script to work on my AIX system with 1 million records without changing the first loop as follows:
Code:
while read junk val junk junk junk junk junk junk junk junk junk junk
do
    ...
done < D.txt

This at least eliminates one additional process call to awk but still takes nearly two minutes to execute:
Code:
time ksh s2.sh

real    1m57.53s
user    1m57.26s
sys     0m0.10s

There are other methods that can significantly improve the shell script's performance, but it will always fall short of the awk approach because awk is optimized for text processing.

The awk solution ran in 10 seconds for the same 1 million records:
Code:
 time { ksh s.sh ; print $? ;  }
Partner              Summary    Detail     Count Mismatch 
==================== ========== ========== ===============
AIRTRAN                       3     314565 <==== Error ===
ORBITZ                        0          0                
FRONTIER                      0          0                
BACON                         0     104854 <==== Error ===
CAESAR                        2     209710 <==== Error ===
THOMAS                       44          0 <==== Error ===
MIDWEST                       4     419419 <==== Error ===
1

real    0m10.30s
user    0m10.07s
sys     0m0.12s

Running the test again on 150000 records:
Shell script solution:
Code:
 time { ksh s2.sh ; print $? ; }
1

real    0m16.72s
user    0m16.67s
sys     0m0.02s

Awk solution:
Code:
 time { ksh s.sh ; print $? ; } 
Partner              Summary    Detail     Count Mismatch 
==================== ========== ========== ===============
AIRTRAN                       3      45000 <==== Error ===
ORBITZ                        0          0                
FRONTIER                      0          0                
BACON                         0      15000 <==== Error ===
CAESAR                        2      30000 <==== Error ===
THOMAS                       44          0 <==== Error ===
MIDWEST                       4      60000 <==== Error ===
1

real    0m1.40s
user    0m1.37s
sys     0m0.02s

The shell script processes between 8,500 and 9,000 records per second, depending on where the first count mismatch is found. The awk solution processes between 97,000 and 107,000 records per second.

To be fair, the shell script really isn't saving much of its own potential workload: the detail file must be processed fully because a mismatch can occur anywhere in the file. The summary file only contains 4 or 5 records, so a few seconds are saved at best. A potential benefit is gained as the summary file grows but, again, it is fairly minimal.

Therefore, the letter of the requirement can still be met in the awk solution by adding a test in the END procedure, but the performance gain would be measured in milliseconds. I'll leave it out and seek forgiveness instead.
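Since s.sh itself isn't reproduced in this part of the thread, the following is only a minimal sketch of that idea, assuming the same layouts as above (partner in field 2 of both files, expected count in field 3 of the summary file). It is not the actual s.sh, and unlike the real report it only walks the summary entries, so a partner appearing only in the detail file would be missed.

Code:
#! /usr/bin/ksh
# Sketch only: one awk pass over both files, with the pass/fail test in END.
awk '
    NR == FNR { summary[$2] = $3; next }   # first file (s.txt): expected counts
    { detail[$2]++ }                       # second file (d.txt): tally actual counts
    END {
        status = 0
        for (p in summary) {
            flag = ""
            if (summary[p] != detail[p]) { flag = "<==== Error ==="; status = 1 }
            printf "%-20s %10d %10d %s\n", p, summary[p], detail[p], flag
        }
        exit status                        # the "letter of the requirement" test
    }
' s.txt d.txt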
# 18  
Old 05-31-2006
Quote:
First, sorry, but I couldn't get your modified script to work on my AIX system with 1 million records without changing the first loop as follows:
Code:

while read junk val junk junk junk junk junk junk junk junk junk junk
do
    ...
done < D.txt
I really don't understand why it's not running on AIX.
I tried it and tested it on both Solaris and HP-UX and it works fine.

Currently I don't have access to an AIX box, so I don't know why it's not working.
# 19  
Old 05-31-2006
Quote:
Originally Posted by madhunk
The Detail file is almost 15 million records

....
I tried Madan's script with the actual file and it produced an output like this
parse.ksh[22]: no space
Quote:
Originally Posted by matrixmadhan
I tried with 150009 records and it completed in less than 21 seconds
.....


I really don't understand why it's not running on AIX.
I tried it and tested it on both Solaris and HP-UX and it works fine ...
Code:
#! /usr/bin/ksh

index=1
tmp=1

for val in `awk '{print $2}' d.txt`
do

A line like:
for val in `awk '{print $2}' d.txt`
when run with 15,000,000 lines in d.txt is asking ksh to construct a command line of over 15,000,000 words. Remember that the command extends until the "done". Once ksh has the entire compound command in memory, it will compile it and execute the compiled code. The compiled code will be almost as long as the original compound command in this case. ksh is willing to attempt this feat; whether or not it succeeds depends on how the kernel was tuned, how much virtual memory is available, how much tmp space is available, etc.

Consider switching to:
awk '{print $2}' d.txt | while read val

This is almost the same thing, but you don't need two copies of 15,000,000 items in core simultaneously.

The
while read junk val junk junk junk junk junk junk junk junk junk junk
solution is better still. You no longer have an awk process pumping 15,000,000 words through a pipe. But
while read junk val junk
would do just as nicely. The end of the line accumulates in the last variable of a read regardless of IFS setting.
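As a sketch of what that looks like in practice (reusing the same fruit/cnt arrays as the earlier script, and assuming only that the partner name is the second field of d.txt), the first loop can be rewritten to read the file directly, with the trailing junk variable soaking up however many fields remain:

Code:
#! /usr/bin/ksh
# Sketch: build the per-partner counts without awk and without a backtick
# expansion, so ksh never has to hold the whole file as one command line.
index=0
while read junk val junk
do
   temp=1
   while [ $temp -le $index ]
   do
      if [ "${fruit[$temp]}" = "$val" ]
      then
         cnt[$temp]=$((${cnt[$temp]} + 1))
         break
      fi
      temp=$(($temp + 1))
   done
   if [ $temp -gt $index ]
   then
      index=$(($index + 1))
      fruit[$index]=$val
      cnt[$index]=1
   fi
done < d.txt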
# 20  
Old 05-31-2006
Very useful to know regarding ksh, and I succeeded in running a test on my Sun box with 1.1 million records. I had seen this effect while tracing, but I hadn't extended it to 15 million in my mind until you pointed it out.

Again, to beat my previous point to death further still and for what it's worth:

Original shell script (which, interestingly enough, runs faster than the "while read junk val junk" version):
Code:
time { ksh s2.sh ; print $? ; }
1

real    4m45.44s
user    4m43.56s
sys     0m1.32s

Records per second: 3,935
Estimated execution time on 15 million records: slightly longer than 1 hour
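(A rough check of that estimate using the measured rate: 15,000,000 records ÷ 3,935 records/second ≈ 3,812 seconds, or about 64 minutes.)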

Awk solution:
Code:
time { ksh s.sh ; print $? ; } 
Partner              Summary    Detail     Count Mismatch 
==================== ========== ========== ===============
AIRTRAN                   45000     336963 <==== Error ===
BACON                     15000     112321 <==== Error ===
CAESAR                    30000     224642 <==== Error ===
THOMAS                        1          0 <==== Error ===
MIDWEST                   60000     449304 <==== Error ===
1

real    0m9.25s
user    0m7.78s
sys     0m0.49s

Records per second: 121,430
Estimated execution time on 15 million records: slightly longer than 2 minutes
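(Again as a rough check: 15,000,000 records ÷ 121,430 records/second ≈ 124 seconds, just over 2 minutes.)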
