Grep, count and match two files

05-30-2006

Registered User

683, 5

Join Date: Jan 2005

Last Activity: 27 September 2011, 12:36 PM EDT

Posts: 683

Thanks Given: 0

Thanked 5 Times in 5 Posts

Quote:

Originally Posted by matrixmadhan

Quote:

4) Abort if counts doesn't match

the above point is not met.

unnecessary entire file parsing each time.
this could have been avoided

Agreed, my awk solution missed requirement 4 but considering this:

Quote:

Originally Posted by madhunk

It is an ACR validation check for a bank. Actually we have a authorization file which has partner names in the Detail text file and the Summary file with partner names and the count of records for each partner name.

We get this file every week and we need to validate the count of records for each partner name. (The keys in your code).

The Detail file is almost 15 million records and the Summary file has about 4 or 5 records.

It does have additional benefits such as:

Complete analysis reporting based on both available summary and available detail information.

Presumably the test is being performed as a control mechanism for some other automated task, and certainly a break must occur once a problem is identified; however, in my experience, at some point more specific analysis will need to be performed regardless of how the failure affects the driving process. Since a complete analysis can be obtained at run time (as is true in this case), failing the process and producing a complete analysis report at the same time, rather than simply failing the process, may be worth the extra "effort."

Speed over the shell script solution.

Replicating the same D.txt lines so that 150,000 records exist in the file produced a 22 second execution time before the shell script determined that it needed to exit (more than 150,000 records caused the shell script to fail on my AIX system). The awk solution completed a total analysis both from summary and detail perspectives in less than 1 second. Now, given this, which solution processes unnecessarily? Also, if an exit code is all that is necessary, the awk solution can exit as soon as a mismatch occurs (albeit, after collecting data from the entire detail file).

In this case, missing the letter of the requirement is probably forgivable.
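
For what it's worth, if an exit code really is all that is needed, the early-exit variant could look roughly like this. It is only a sketch: it assumes the same layout as the scripts in this thread (partner name in field 2, count in field 3 of the summary), the sample rows and the temporary directory are made up for illustration, and plain awk is used, so substitute nawk where needed. It reads the detail file first and stops at the first summary line whose count differs. It does not catch partners that appear only in the detail file.

```shell
#!/bin/sh
# Hypothetical sample data: field 2 = partner, field 3 (summary only) = expected count
tmp=$(mktemp -d)
printf 'S AIRTRAN 2\nS MIDWEST 1\n' > "$tmp/S.txt"
printf 'D AIRTRAN x\nD AIRTRAN y\nD MIDWEST z\nD MIDWEST w\n' > "$tmp/D.txt"

status=0
out=$(awk '
    # First operand (detail file): count records per partner
    FILENAME == ARGV[1] { DKeys[$2]++; next }
    # Second operand (summary file): stop at the first count that differs
    $3 + 0 != DKeys[$2] + 0 {
        printf("%s: summary %d, detail %d\n", $2, $3, DKeys[$2])
        exit 1
    }
' "$tmp/D.txt" "$tmp/S.txt") || status=$?

echo "$out"
echo "exit status: $status"
rm -rf "$tmp"
```

The detail file still has to be read completely before the first comparison can happen, so the saving is only in the reporting step.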

tmarikle


05-30-2006


Quote:

Originally Posted by madhunk

I tested Thomas script and it produced no output somehow....It also didn't abort...

Whoops, look at your data sample. My awk solution didn't print anything because all record counts matched. It was designed to isolate mismatches only. You can change it if you want a complete report as follows:

Code:

nawk '
    # Compile summary array
    FILENAME=="S.txt" {
        SKeys[$2]=$3
    }

    # Compile details array
    FILENAME=="D.txt" {
        DKeys[$2]++
    }

    END {
        # Look for summary records that do not match detail counts
        for (i in SKeys) {
            if (SKeys[i] != DKeys[i]) {
                # Add a record to the merged set
                MismatchedKeys[i]++
            } else {
                MatchedKeys[i]++
            }
        }
        # Look for detail counts where a summary record is missing
        for (i in DKeys) {
            if (SKeys[i] != DKeys[i]) {
                # Add a record to the merged set
                MismatchedKeys[i]++
            } else {
                MatchedKeys[i]++
            }
        }

        # Print a merged set of records
        print "Mismatched summary:"
        printf("%-20s %10s %10s\n", "Fruit", "Summary", "Detail")
        printf("%-20s %10s %10s\n", "====================", "==========", "==========")

        for (i in MismatchedKeys) printf("%-20s %10d %10d\n", i, SKeys[i], DKeys[i])

        print ""
        print "Matched summary:"
        printf("%-20s %10s %10s\n", "Fruit", "Summary", "Detail")
        printf("%-20s %10s %10s\n", "====================", "==========", "==========")
        for (i in MatchedKeys) printf("%-20s %10d %10d\n", i, SKeys[i], DKeys[i])
    }
' S.txt D.txt

Results:

Code:

Mismatched summary:
Fruit                   Summary     Detail
==================== ========== ==========

Matched summary:
Fruit                   Summary     Detail
==================== ========== ==========
AIRTRAN                       3          3
ORBITZ                        0          0
FRONTIER                      0          0
CAESAR                        2          2
MIDWEST                       4          4

Quote:

I tried Madan's script with the actual file and it produced an output like this
parse.ksh[22]: no space

This is my experience also. Probably a variable is being overloaded, but I have not analyzed the script to find out for sure.

tmarikle


05-30-2006


A more concise report can be achieved as follows:

Code:

nawk '
    # Compile summary array
    FILENAME=="S.txt" {
        SKeys[$2]=$3
        GlobalKeys[$2]++
    }

    # Compile details array
    FILENAME=="D.txt" {
        DKeys[$2]++
        GlobalKeys[$2]++
    }

    END {
        printf("%-20s %-10s %-10s %-15s\n", "Partner", "Summary", "Detail", "Count Mismatch")
        printf("%-20s %10s %10s %15s\n", "====================", "==========", "==========", "===============")

        # Look for summary records that do not match detail counts
        for (i in GlobalKeys) {
            printf("%-20s %10d %10d %-15s\n", i, SKeys[i], DKeys[i], (DKeys[i] == SKeys[i] ? "" : "<==== Error ==="))
        }
    }
' S.txt D.txt

Note: I added one more "MIDWEST" and a "BACON" record to D.txt to show the mismatch flag.
Report:

Code:

Partner              Summary    Detail     Count Mismatch 
==================== ========== ========== ===============
AIRTRAN                       3          3                
ORBITZ                        0          0                
FRONTIER                      0          0                
BACON                         0          1 <==== Error ===
CAESAR                        2          2                
MIDWEST                       4          5 <==== Error ===
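
The row order in a report like this is whatever for (i in array) happens to return, which awk leaves unspecified. If a stable order matters, one option (a sketch, not part of the script above) is to let awk print only the data rows and pipe them through sort, printing the header separately. The sample summary rows are made up and follow the same field 2 / field 3 layout; plain awk is used here.

```shell
#!/bin/sh
# Hypothetical summary rows: field 2 = partner, field 3 = count
tmp=$(mktemp -d)
printf 'S MIDWEST 4\nS AIRTRAN 3\nS CAESAR 2\n' > "$tmp/S.txt"

# awk emits the data rows in arbitrary order; sort makes the order stable
rows=$(
    awk '
        { SKeys[$2] = $3 }
        END { for (i in SKeys) printf("%-20s %10d\n", i, SKeys[i]) }
    ' "$tmp/S.txt" | sort
)

# Header is printed outside awk so it is not sorted into the rows
printf '%-20s %10s\n' "Partner" "Summary"
printf '%s\n' "$rows"
rm -rf "$tmp"
```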

Testing the two methods with 100,000 records took 6.11 seconds for the shell script and 0.80 seconds for the awk script.

tmarikle


05-30-2006

Registered User

95, 0

Join Date: Nov 2005

Last Activity: 21 September 2017, 9:57 PM EDT

Posts: 95

Thanks Given: 1

Thanked 0 Times in 0 Posts

That is awesome Thomas.....

Please see the report...

PHP Code:


Mismatched summary:
Partner                 Summary     Detail
==================== ========== ==========

Matched summary:
Partner                 Summary     Detail
==================== ========== ==========
AIRTRAN                 3191969    3191969
ORBITZ                  5995609    5995609
FRONTIER                1672209    1672209
CAESAR                        0          0
MIDWEST                 1577373    1577373
BESTWESTERN              582813     582813

real    3m40.28s
user    1m45.91s
sys     0m9.46s

I have added time to see how long it takes....I tested on the real file and it is perfect...

Please find the code below...I have tried to parameterize it but am not sure how to do that in nawk..

Code:

#!/usr/bin/ksh

if [ $# -ne 3 ]
then
   echo " "
   echo " Incorrect number of parameters entered..."
   echo " Correct usage: " $0 "<DIR> <SUMMARY FILE> <DETAIL FILE>"
   echo " "
   exit 1
fi


DIR=$1
SUMMARY_FILE=$2
DETAIL_FILE=$3

cd ${DIR}

time {
nawk '
    # Compile summary array
    FILENAME=="$(SUMMARY_FILE}" {
        SKeys[$2]=$3
    }

    # Compile details array
    FILENAME=="$(DETAIL_FILE}" {
        DKeys[$2]++
    }

    END {
        # Look for summary records that do not match detail counts
        for (i in SKeys) {
            if (SKeys[i] != DKeys[i]) {
                # Add a record to the merged set
                MismatchedKeys[i]++
            } else {
                MatchedKeys[i]++
            }
        }
        # Look for detail counts where a summary record is missing
        for (i in DKeys) {
            if (SKeys[i] != DKeys[i]) {
                # Add a record to the merged set
                MismatchedKeys[i]++
            } else {
                MatchedKeys[i]++
            }
        }

        # Print a merged set of records
        print "Mismatched summary:"
        printf("%-20s %10s %10s\n", "Partner", "Summary", "Detail")
        printf("%-20s %10s %10s\n", "====================", "==========", "==========")

        for (i in MismatchedKeys) printf("%-20s %10d %10d\n", i, SKeys[i], DKeys[i])

        print ""
        print "Matched summary:"
        printf("%-20s %10s %10s\n", "Partner", "Summary", "Detail")
        printf("%-20s %10s %10s\n", "====================", "==========", "==========")
        for (i in MatchedKeys) printf("%-20s %10d %10d\n", i, SKeys[i], DKeys[i])
    }' $(SUMMARY_FILE} $(DETAIL_FILE}
}

Does the code abort with status 1 if there is output in the Mismatched summary?

Thank you again for all the help..

madhunk


05-30-2006


To parameterize this correctly, you can add variables to know which file you are looking at:

Code:

nawk -v summary_file="${SUMMARY_FILE}" -v detail_file="${DETAIL_FILE}" '
    # Compile summary array
    FILENAME==summary_file {
        SKeys[$2]=$3
    }

    # Compile details array
    FILENAME==detail_file {
        DKeys[$2]++
    }
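
The -v route is needed because the awk program sits inside single quotes, where the shell expands nothing, so a reference such as ${SUMMARY_FILE} reaches awk as literal text. A small sketch of the difference follows. The variable name is taken from the script above, the value S.txt is only an example, and plain awk is used in place of nawk.

```shell
#!/bin/sh
SUMMARY_FILE=S.txt

# Inside single quotes the shell leaves ${SUMMARY_FILE} alone,
# so awk sees it as an ordinary string constant
literal=$(awk 'BEGIN { print "${SUMMARY_FILE}" }')

# With -v the shell expands the variable first and awk receives the value
passed=$(awk -v summary_file="${SUMMARY_FILE}" 'BEGIN { print summary_file }')

echo "inside single quotes: $literal"
echo "passed with -v:       $passed"
```

One caveat: awk interprets backslash escapes in a -v value, so this approach assumes file names without backslashes.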

To exit with a non zero result add a flag to the "END" procedure:

Code:

        # Look for summary records that do not match detail counts
        for (i in SKeys) {
            if (SKeys[i] != DKeys[i]) {
                # Add a record to the merged set
                MismatchedKeys[i]++
                exit_code=1
            } else {
                MatchedKeys[i]++
            }
        }
        # Look for detail counts where a summary record is missing
        for (i in DKeys) {
            if (SKeys[i] != DKeys[i]) {
                # Add a record to the merged set
                MismatchedKeys[i]++
                exit_code=1
            } else {
                MatchedKeys[i]++
            }
        }
        ...
        # Exit only after the report has been printed: "exit" ends the END procedure immediately
        exit (exit_code)

By the way, I like the last awk script that I provided you since it provides a complete report and it's more concise.
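
On the calling side, the non-zero result can then drive the abort. This is a minimal sketch under a few assumptions: validate_counts is a hypothetical wrapper name, the sample files are made up (partner in field 2, count in field 3 of the summary), the report printing is left out to keep it short, and plain awk stands in for nawk.

```shell
#!/bin/sh
# validate_counts SUMMARY DETAIL
# Hypothetical wrapper: returns 0 when every count matches, 1 otherwise
validate_counts() {
    awk -v summary_file="$1" -v detail_file="$2" '
        FILENAME == summary_file { SKeys[$2] = $3; GlobalKeys[$2]++ }
        FILENAME == detail_file  { DKeys[$2]++;    GlobalKeys[$2]++ }
        END {
            for (i in GlobalKeys)
                if (SKeys[i] + 0 != DKeys[i] + 0) exit_code = 1
            exit (exit_code + 0)
        }
    ' "$1" "$2"
}

tmp=$(mktemp -d)
printf 'S AIRTRAN 2\n'               > "$tmp/S.txt"
printf 'D AIRTRAN x\nD AIRTRAN y\n'  > "$tmp/D_good.txt"
printf 'D AIRTRAN x\n'               > "$tmp/D_bad.txt"

# The caller decides what "abort" means; here the result is only recorded
if validate_counts "$tmp/S.txt" "$tmp/D_good.txt"; then good=ok; else good=mismatch; fi
if validate_counts "$tmp/S.txt" "$tmp/D_bad.txt";  then bad=ok;  else bad=mismatch;  fi

echo "good detail file: $good"
echo "bad detail file:  $bad"
rm -rf "$tmp"
```

In a real driver script the else branch would typically print a message and exit 1 instead of setting a variable.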

tmarikle


05-30-2006


Thank You Thomas...

I used your last script, which is more concise...But somehow I am unable to get the parameters passed to the script..

I have also added the exit status as you mentioned...Can you please take a quick glance and see what I am doing wrong here...

Code:

#!/usr/bin/ksh

if [ $# -ne 3 ]
then
   echo " "
   echo " Incorrect number of parameters entered..."
   echo " Correct usage: " $0 "<DIR> <SUMMARY FILE> <DETAIL FILE>"
   echo " "
   exit 1
fi


DIR=$1
SUMMARY_FILE=$2
DETAIL_FILE=$3

time {
nawk -v directory=${DIR} -v summary_file=${SUMMARY_FILE} -v detail_file=${DETAIL_FILE} '
    # Compile summary array
    FILENAME==summary_file {
        SKeys[$2]=$3
        GlobalKeys[$2]++
    }

    # Compile details array
    FILENAME==detail_file {
        DKeys[$2]++
        GlobalKeys[$2]++
    }

    END {
        printf("%-20s %-10s %-10s %-15s\n", "Partner", "Summary", "Detail", "Count Mismatch")
        printf("%-20s %10s %10s %15s\n", "====================", "==========", "==========", "===============")

        # Look for summary records that do not match detail counts
        for (i in GlobalKeys) {
            printf("%-20s %10d %10d %-15s\n", i, SKeys[i], DKeys[i], (DKeys[i] == SKeys[i] ? "" : "<==== Error ==="))
             _ex=1
            }
    }
' directory/summary_file directory/detail_file
}

madhunk


05-30-2006


Code:

#!/usr/bin/ksh

if [ $# -ne 3 ]
then
   echo " "
   echo " Incorrect number of parameters entered..."
   echo " Correct usage: " $0 "<DIR> <SUMMARY FILE> <DETAIL FILE>"
   echo " "
   exit 1
fi


DIR=$1
SUMMARY_FILE=$2
DETAIL_FILE=$3

time {
nawk -v summary_file="${DIR}/${SUMMARY_FILE}" -v detail_file="${DIR}/${DETAIL_FILE}" '
    # Compile summary array
    FILENAME==summary_file {
        SKeys[$2]=$3
        GlobalKeys[$2]++
    }

    # Compile details array
    FILENAME==detail_file {
        DKeys[$2]++
        GlobalKeys[$2]++
    }

    END {
        printf("%-20s %-10s %-10s %-15s\n", "Partner", "Summary", "Detail", "Count Mismatch")
        printf("%-20s %10s %10s %15s\n", "====================", "==========", "==========", "===============")

        # Report every key and flag those whose summary and detail counts differ
        for (i in GlobalKeys) {
            printf("%-20s %10d %10d %-15s\n", i, SKeys[i], DKeys[i], (DKeys[i] == SKeys[i] ? "" : "<==== Error ==="))
            # Set the exit code only when a mismatch is found
            if (DKeys[i] != SKeys[i]) _ex=1
        }
        exit (_ex)
    }
' "${DIR}/${SUMMARY_FILE}" "${DIR}/${DETAIL_FILE}"
}

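${DIR}, if needed, has to be prepended to the file names in both places: in the -v assignments and in the file arguments on the last line, e.g. ${DIR}/${SUMMARY_FILE}. FILENAME holds the name exactly as it appears on the command line, so the two must match.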

Also, you need to make the last executable line of the "END" procedure an "exit" as follows: exit (_ex)
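
To see why the directory has to be part of the -v value, here is a small sketch showing what FILENAME contains. The temporary directory and the single sample row are made up for the demonstration, and plain awk stands in for nawk.

```shell
#!/bin/sh
tmp=$(mktemp -d)
echo 'S AIRTRAN 3' > "$tmp/S.txt"

# FILENAME is the operand exactly as given on the command line, directory included
seen=$(awk 'FNR == 1 { print FILENAME }' "$tmp/S.txt")

# Comparing against the bare name never matches ...
bare=$(awk -v summary_file="S.txt" \
    'FILENAME == summary_file { n++ } END { print n + 0 }' "$tmp/S.txt")

# ... while comparing against the full path matches every line of the file
full=$(awk -v summary_file="$tmp/S.txt" \
    'FILENAME == summary_file { n++ } END { print n + 0 }' "$tmp/S.txt")

echo "FILENAME seen by awk:     $seen"
echo "lines matched, bare name: $bare"
echo "lines matched, full path: $full"
rm -rf "$tmp"
```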

tmarikle
