The UNIX and Linux Forums  
#1  05-26-2006  madhunk
Grep, count and match two files

I am writing the script below to grep and count the number of occurrences between two tab-delimited files.

What I am trying to achieve:

1) Extract column 2 and column 3 from the S.txt file and put them in a temporary pattern file
2) Grep and count column 2 in the D.txt file
3) Compare the counts between the D.txt and S.txt files
4) Abort if the counts don't match

Example: APPLES occurs 4 times in D.txt, which matches the count in S.txt

Code:
#!/usr/bin/ksh

SUM_COUNT=`nawk '{if ($0 ~ /^S/) print $2,$3 >"S1.txt" }'` S.txt

for i in S1.txt
do
DETAIL_COUNT=`grep $i D.txt | wc -l`
if [ ${DETAIL_COUNT} -eq ${SUM_COUNT} ]
then
     echo "Count between Detail and Summary matches"
else
     echo "Count didn't match"
      exit
fi
done
The script hangs and never exits. I am not sure if this is the right way to code it.

S.txt
Code:
S       APPLES                          4
S       ORANGES                         1
S       PEARS                           1
S       PINEAPPLES                      1
S       TOMATOES                        0
S       PEPPERS                         1 
D.txt
Code:
D       PINEAPPLES
D       ORANGES
D       PEARS
D       APPLES
D       APPLES
D       APPLES
D       APPLES
D       PEPPERS 
I am still in the learning phase and would appreciate any input on this.
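The script above hangs because S.txt sits outside the backticks, so nawk starts without an input file and waits on standard input; in addition, `for i in S1.txt` iterates once over the literal word S1.txt rather than over the lines of that file. A minimal sketch of the intended loop follows, assuming whitespace- or tab-delimited files laid out like the samples. The function name `check_counts` is hypothetical, and plain `awk` stands in for `nawk`.

```shell
#!/bin/sh
# check_counts SUMMARY DETAIL   (hypothetical helper, not from the thread)
# For every summary row tagged "S", count the detail rows whose second field
# equals the key and compare with the expected count. Returns 1 at the first
# mismatch, which covers requirement 4.
check_counts() {
    summary=$1
    detail=$2
    # Read tag, key and expected count straight from the summary file;
    # no temporary pattern file is needed.
    while read -r tag key expected
    do
        [ "$tag" = "S" ] || continue
        # Exact match on field 2, so APPLES does not also count PINEAPPLES
        # (a plain "grep APPLES" would match both).
        actual=$(awk -v k="$key" '$2 == k { n++ } END { print n + 0 }' "$detail")
        if [ "$actual" -ne "$expected" ]
        then
            echo "Count didn't match for $key: summary=$expected detail=$actual"
            return 1
        fi
    done < "$summary"
    echo "Count between Detail and Summary matches"
}
```

Called as `check_counts S.txt D.txt`, this rescans the detail file once per summary row and only checks keys that appear in the summary, so it suits a summary of a few rows rather than a large one.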
#2  05-26-2006  tmarikle
Gosh, I hope this isn't homework-related, but this mishmash of commands was too much for me. This kind of task is much easier to do in awk by itself. Take a look at this quick example (it just so happens that I recently helped a coworker with a nearly identical business problem on very large flat files):

Code:
nawk '
    # Compile summary array
    FILENAME=="S.txt" {
        SKeys[$2]=$3
    }

    # Compile details array
    FILENAME=="D.txt" {
        DKeys[$2]++
    }

    END {
        # Look for summary records that do not match detail counts
        for (i in SKeys) {
            if (SKeys[i] != DKeys[i]) {
                # Add a record to the merged set
                MKeys[i]++
            }
        }
        # Look for detail counts where a summary record is missing
        for (i in DKeys) {
            if (SKeys[i] != DKeys[i]) {
                # Add a record to the merged set
                MKeys[i]++
            }
        }

        # Print a merged set of records
        printf("%-20s %10s %10s\n", "Fruit", "Summary", "Detail")
        printf("%-20s %10s %10s\n", "====================", "==========", "==========")

        for (i in MKeys) printf("%-20s %10d %10d\n", i, SKeys[i], DKeys[i])
    }
' S.txt D.txt
results:
Code:
Fruit                   Summary     Detail
==================== ========== ==========
BACON                         0          8
APPLES                        4          6
S.txt
Code:
S       APPLES                          4 
S       ORANGES                         1 
S       PEARS                           1 
S       PINEAPPLES                      1 
S       TOMATOES                        0 
S       PEPPERS                         1
D.txt
Code:
D       PINEAPPLES 
D       ORANGES 
D       PEARS 
D       APPLES 
D       APPLES 
D       APPLES 
D       BACON
D       BACON
D       BACON
D       APPLES 
D       APPLES 
D       APPLES 
D       PEPPERS 
D       BACON
D       BACON
D       BACON
D       BACON
D       BACON
#3  05-30-2006  madhunk
Thank you, Thomas. It is absolutely not a homework problem; it is an ACR validation check for a bank. We have an authorization file: the Detail text file contains partner names, and the Summary file contains the partner names along with the count of records for each partner name.

We get this file every week and need to validate the count of records for each partner name (the keys in your code).

The Detail file has almost 15 million records and the Summary file has about 4 or 5 records.

I will take your example and test it out on these huge files tomorrow...
#4  05-30-2006  matrixmadhan
Try this one; it is much simpler:

Code:
#! /usr/bin/ksh

index=1
tmp=1

# Pass 1: build parallel arrays of distinct names (fruit) and their counts (cnt)
for val in `sed 's/^.*  //' d.txt`
do
  # Seed the arrays with the first value read
  if [ $tmp -eq 1 ]
  then
     fruit[$index]=$val
     cnt[$index]=0
     tmp=$(($tmp + 1))
  fi
  # Linear search for the name among those seen so far
  temp=1
  while [ $temp -le $index ]
  do
     if [ "${fruit[$temp]}" = "$val" ]
     then
        cnt[$temp]=$((${cnt[$temp]} + 1))
        break
     fi
     temp=$(($temp + 1))
  done
  # Not found: append a new name with a count of 1
  if [ $temp -gt $index ]
  then
     index=$(($index + 1))
     fruit[$index]=$val
     cnt[$index]=1
  fi
done

# Pass 2: check each summary record against the collected counts
awk '{print $2, $3}' s.txt | while read first second
do
  temp=1
  while [ $temp -le $index ]
  do
     if [ "${fruit[$temp]}" = "$first" ]
     then
        if [ ${cnt[$temp]} -eq $second ]
        then
           print ${fruit[$temp]} ${cnt[$temp]}
           break
        else
           # Count mismatch: abort
           exit 1
        fi
     fi
     temp=$(($temp + 1))
  done
  # Name missing from the detail file: abort
  if [ $temp -gt $index ]
  then
     exit 1
  fi
done

exit 0
#5  05-30-2006  tmarikle
Quote:
Originally Posted by matrixmadhan

Quote:
4) Abort if the counts don't match
The above point is not met.

Unnecessary parsing of the entire file each time; this could have been avoided.
Agreed, my awk solution missed requirement 4, but consider this:

Quote:
Originally Posted by madhunk
It is an ACR validation check for a bank. We have an authorization file: the Detail text file contains partner names, and the Summary file contains the partner names along with the count of records for each partner name.

We get this file every week and need to validate the count of records for each partner name (the keys in your code).

The Detail file has almost 15 million records and the Summary file has about 4 or 5 records.
It does have additional benefits, such as:
  1. Complete analysis reporting based on both the available summary and the available detail information.

    Presumably the test is being performed as a control mechanism for some other automated task, and certainly a break must occur once a problem is identified. However, in my experience, more specific analysis will eventually be needed regardless of how the failure affects the driving process. Since a complete analysis can be obtained at run time (as is true in this case), failing the process and producing a complete analysis report at the same time, rather than simply failing the process, may be worth the extra "effort."
  2. Speed over the shell script solution.

    Replicating the same D.txt lines so that 150,000 records exist in the file produced a 22 second execution time before the shell script determined that it needed to exit (more than 150,000 records caused the shell script to fail on my AIX system). The awk solution completed a total analysis, from both the summary and detail perspectives, in less than 1 second. Given this, which solution processes unnecessarily? Also, if an exit code is all that is necessary, the awk solution can exit as soon as a mismatch is found (albeit after collecting data from the entire detail file).
In this case, missing the letter of the requirement is probably forgivable.
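The timing test described above depends on inflating a small detail file to a known size. One way this might be done is sketched below; the function name `replicate` and the output file name in the usage note are hypothetical, and the thread does not say how the test file was actually produced.

```shell
#!/bin/sh
# replicate FILE COPIES   (hypothetical helper)
# Write COPIES back-to-back copies of FILE to standard output.
replicate() {
    awk -v copies="$2" '
        { line[NR] = $0 }                  # hold the small seed file in memory
        END {
            for (c = 1; c <= copies; c++)
                for (n = 1; n <= NR; n++)
                    print line[n]
        }
    ' "$1"
}
```

For an 8-line seed such as the D.txt in the first post, `replicate D.txt 18750 > D_big.txt` yields 150,000 lines, after which each candidate script can be run under `time` against the larger file.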
#6  05-31-2006  matrixmadhan
Quote:
Originally Posted by tmarikle
Complete analysis reporting based on both the available summary and the available detail information.

Presumably the test is being performed as a control mechanism for some other automated task, and certainly a break must occur once a problem is identified. However, in my experience, more specific analysis will eventually be needed regardless of how the failure affects the driving process. Since a complete analysis can be obtained at run time (as is true in this case), failing the process and producing a complete analysis report at the same time, rather than simply failing the process, may be worth the extra "effort."
I accept that a complete analysis prior to termination is needed, and that would be preferred. Your solution was more generic!

Quote:
Originally Posted by tmarikle
Speed over the shell script solution.

Replicating the same D.txt lines so that 150,000 records exist in the file produced a 22 second execution time before the shell script determined that it needed to exit (more than 150,000 records caused the shell script to fail on my AIX system). The awk solution completed a total analysis, from both the summary and detail perspectives, in less than 1 second. Given this, which solution processes unnecessarily? Also, if an exit code is all that is necessary, the awk solution can exit as soon as a mismatch is found (albeit after collecting data from the entire detail file).
It might have failed because the parsing of the Detail file was not correct; I have modified it now.
I tried it with 150009 records and it completed in less than 21 seconds. Maybe you could try it once more and see.

Code:
time ./fine.ksh
MIDWEST 150004
AIRTRAN 3
CAESAR 2

real    0m20.84s
user    0m20.91s
sys     0m0.09s
Until then, we cannot conclude which solution is faster for this requirement.
Quote:
In this case, missing the letter of the requirement is probably forgivable.
I always treat the requirement as the higher priority, then work on optimization and speed once the requirement is met.
#7  05-31-2006  tmarikle
For what it's worth, and in defense of thinking through the performance implications of a scripting solution before coding one:

First, sorry, but I couldn't get your modified script to work on my AIX system with 1 million records without changing the first loop as follows:
Code:
while read junk val junk junk junk junk junk junk junk junk junk junk
do
    ...
done < D.txt
This at least eliminates one additional process call to awk but still takes nearly two minutes to execute:
Code:
time ksh s2.sh

real    1m57.53s
user    1m57.26s
sys     0m0.10s
There are other methods that can significantly improve the shell script's performance, but it will always fall short of the awk approach because awk is optimized for text processing.

The awk solution ran in 10 seconds for the same 1 million records:
Code:
 time { ksh s.sh ; print $? ;  }
Partner              Summary    Detail     Count Mismatch 
==================== ========== ========== ===============
AIRTRAN                       3     314565 <==== Error ===
ORBITZ                        0          0                
FRONTIER                      0          0                
BACON                         0     104854 <==== Error ===
CAESAR                        2     209710 <==== Error ===
THOMAS                       44          0 <==== Error ===
MIDWEST                       4     419419 <==== Error ===
1

real    0m10.30s
user    0m10.07s
sys     0m0.12s
Running the test again on 150000 records:
Shell script solution:
Code:
 time { ksh s2.sh ; print $? ; }
1

real    0m16.72s
user    0m16.67s
sys     0m0.02s
Awk solution:
Code:
 time { ksh s.sh ; print $? ; } 
Partner              Summary    Detail     Count Mismatch 
==================== ========== ========== ===============
AIRTRAN                       3      45000 <==== Error ===
ORBITZ                        0          0                
FRONTIER                      0          0                
BACON                         0      15000 <==== Error ===
CAESAR                        2      30000 <==== Error ===
THOMAS                       44          0 <==== Error ===
MIDWEST                       4      60000 <==== Error ===
1

real    0m1.40s
user    0m1.37s
sys     0m0.02s
The shell script processes between 8,500 and 9,000 records per second, depending on where the first count mismatch is found. The awk solution processes between 97,000 and 107,000 records per second.
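The throughput figures follow from dividing each record count by the elapsed "real" time reported above. The arithmetic can be reproduced with a small helper; the function name `rate` is hypothetical.

```shell
#!/bin/sh
# rate RECORDS SECONDS   (hypothetical helper)
# Print records per second, rounded to a whole number.
rate() {
    awk -v n="$1" -v s="$2" 'BEGIN { printf "%.0f\n", n / s }'
}

rate 1000000 117.53   # shell script, 1m57.53s  -> 8508
rate 150000  16.72    # shell script, 0m16.72s  -> 8971
rate 1000000 10.30    # awk,          0m10.30s  -> 97087
rate 150000  1.40     # awk,          0m1.40s   -> 107143
```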

To be fair, the shell script isn't saving much of its own potential workload: the detail file must be processed fully because a mismatch can occur anywhere in it. The summary file contains only 4 or 5 records, so a few seconds are saved at best. The potential benefit grows as the summary file grows but, again, it is fairly minimal.

Therefore, the letter of the requirement can still be met in the awk solution by adding a test in the END procedure, but the performance gain would be measured in milliseconds. I'll leave it out and seek forgiveness instead.
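For readers who do need the abort behaviour, a test in the END procedure might look like the sketch below. It is not the s.sh script timed above, whose source is not shown in the thread. The wrapper name `compare_counts` is hypothetical, plain `awk` stands in for `nawk`, and the first file argument is assumed to be the summary.

```shell
#!/bin/sh
# compare_counts SUMMARY DETAIL   (hypothetical wrapper)
# One awk pass over both files; prints a line per mismatched key and
# exits with status 1 if any key's counts disagree, 0 otherwise.
compare_counts() {
    awk '
        NF < 2 { next }                               # skip blank or short lines
        FILENAME == ARGV[1] { want[$2] = $3; next }   # summary: key -> expected count
        { have[$2]++ }                                # detail: key -> observed count
        END {
            # Keys whose counts differ, then keys present only in the detail file
            for (k in want) if (want[k] + 0 != have[k] + 0) bad[k] = 1
            for (k in have) if (!(k in want)) bad[k] = 1
            for (k in bad) {
                printf "%-20s %10d %10d\n", k, want[k], have[k]
                rc = 1
            }
            exit rc + 0
        }
    ' "$1" "$2"
}
```

A driving script could then run `compare_counts S.txt D.txt || exit 1`, which keeps the full mismatch report while still failing the job.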