Grep, count and match two files


 
# 8  
Old 05-30-2006
Quote:
Originally Posted by matrixmadhan

Quote:
4) Abort if counts doesn't match
the above point is not met.

unnecessary entire file parsing each time.
this could have been avoided
Agreed, my awk solution missed requirement 4 but considering this:

Quote:
Originally Posted by madhunk
It is an ACR validation check for a bank. Actually we have a authorization file which has partner names in the Detail text file and the Summary file with partner names and the count of records for each partner name.

We get this file every week and we need to validate the count of records for each partner name. (The keys in your code).

The Detail file is almost 15 million records and the Summary file has about 4 or 5 records.
It does have additional benefits such as:
  1. Complete analysis reporting based on both available summary and available detail information.

    Presumably the test is being performed as a control mechanism for some other automated task, and certainly a break must occur once a problem is identified; however, in my experience, at some point more specific analysis will need to be performed regardless of how the failure affects the driving process. Since a complete analysis can be obtained in the same run (as is true in this case), failing the process and producing a complete analysis report at the same time, rather than simply failing the process, may be worth the extra "effort."
  2. Speed over the shell script solution.

    Replicating the same D.txt lines so that 150,000 records exist in the file produced a 22 second execution time before the shell script determined that it needed to exit (more than 150,000 records caused the shell script to fail on my AIX system). The awk solution completed a total analysis both from summary and detail perspectives in less than 1 second. Now, given this, which solution processes unnecessarily? Also, if an exit code is all that is necessary, the awk solution can exit as soon as a mismatch occurs (albeit, after collecting data from the entire detail file).
In this case, missing the letter of the requirement is probably forgivable.
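For what it's worth, if only the exit code matters, a detail count that overshoots its summary count can be caught while the detail file is still being read. The following is a minimal sketch, not a drop-in replacement. It assumes the partner name is field 2 in both files and the expected count is field 3 of the summary file, as in the scripts in this thread. The sample data and the check_counts function name are made up for illustration, plain awk stands in for nawk, and the NR == FNR test assumes the summary file is not empty.

```shell
#!/bin/sh
# Hypothetical sample data: field 2 = partner, field 3 of the summary = expected count
tmp=$(mktemp -d)
printf 'S AIRTRAN 2\nS CAESAR 1\n' > "$tmp/S.txt"
printf 'D AIRTRAN x\nD CAESAR x\nD AIRTRAN x\n' > "$tmp/D_ok.txt"
printf 'D AIRTRAN x\nD CAESAR x\nD CAESAR x\nD AIRTRAN x\n' > "$tmp/D_bad.txt"

# check_counts SUMMARY DETAIL: returns 0 if every count matches, 1 otherwise
check_counts() {
    awk '
        # First file (summary): remember the expected count per partner
        NR == FNR { want[$2] = $3 + 0; next }

        # Second file (detail): count, and stop reading on an unknown
        # partner or on the first count that exceeds the summary
        {
            if (!($2 in want) || (++seen[$2] > want[$2])) { bad = 1; exit }
        }

        END {
            # "exit" above jumps here; a shortfall can only be seen at end of file
            if (!bad) for (k in want) if (seen[k] + 0 != want[k]) bad = 1
            exit (bad ? 1 : 0)
        }
    ' "$1" "$2"
}

ok_status=0;  check_counts "$tmp/S.txt" "$tmp/D_ok.txt"  || ok_status=$?
bad_status=0; check_counts "$tmp/S.txt" "$tmp/D_bad.txt" || bad_status=$?
echo "matching files: $ok_status, mismatching files: $bad_status"
rm -rf "$tmp"
```

The trade-off is that an early exit gives up the complete report; a shortfall (fewer detail records than the summary claims) still requires reading the whole detail file.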
# 9  
Old 05-30-2006
Quote:
Originally Posted by madhunk
I tested Thomas script and it produced no output somehow....It also didn't abort...
Whoops, look at your data sample. My awk solution didn't print anything because all record counts matched. It was designed to isolate mismatches only. You can change it if you want a complete report as follows:
Code:
nawk '
    # Compile summary array
    FILENAME=="S.txt" {
        SKeys[$2]=$3
    }

    # Compile details array
    FILENAME=="D.txt" {
        DKeys[$2]++
    }

    END {
        # Look for summary records that do not match detail counts
        for (i in SKeys) {
            if (SKeys[i] != DKeys[i]) {
                # Add a record to the merged set
                MismatchedKeys[i]++
            } else {
                MatchedKeys[i]++
            }
        }
        # Look for detail counts where a summary record is missing
        for (i in DKeys) {
            if (SKeys[i] != DKeys[i]) {
                # Add a record to the merged set
                MismatchedKeys[i]++
            } else {
                MatchedKeys[i]++
            }
        }

        # Print a merged set of records
        print "Mismatched summary:"
        printf("%-20s %10s %10s\n", "Fruit", "Summary", "Detail")
        printf("%-20s %10s %10s\n", "====================", "==========", "==========")

        for (i in MismatchedKeys) printf("%-20s %10d %10d\n", i, SKeys[i], DKeys[i])

        print ""
        print "Matched summary:"
        printf("%-20s %10s %10s\n", "Fruit", "Summary", "Detail")
        printf("%-20s %10s %10s\n", "====================", "==========", "==========")
        for (i in MatchedKeys) printf("%-20s %10d %10d\n", i, SKeys[i], DKeys[i])
    }
' S.txt D.txt

Results:
Code:
Mismatched summary:
Fruit                   Summary     Detail
==================== ========== ==========

Matched summary:
Fruit                   Summary     Detail
==================== ========== ==========
AIRTRAN                       3          3
ORBITZ                        0          0
FRONTIER                      0          0
CAESAR                        2          2
MIDWEST                       4          4

Quote:
I tried Madan's script with the actual file and it produced an output like this
parse.ksh[22]: no space
This is my experience also; probably a variable is being overloaded, but I have not analyzed the script to find out for sure.
# 10  
Old 05-30-2006
A more concise report can be achieved as follows:
Code:
nawk '
    # Compile summary array
    FILENAME=="S.txt" {
        SKeys[$2]=$3
        GlobalKeys[$2]++
    }

    # Compile details array
    FILENAME=="D.txt" {
        DKeys[$2]++
        GlobalKeys[$2]++
    }

    END {
        printf("%-20s %-10s %-10s %-15s\n", "Partner", "Summary", "Detail", "Count Mismatch")
        printf("%-20s %10s %10s %15s\n", "====================", "==========", "==========", "===============")

        # Look for summary records that do not match detail counts
        for (i in GlobalKeys) {
            printf("%-20s %10d %10d %-15s\n", i, SKeys[i], DKeys[i], (DKeys[i] == SKeys[i] ? "" : "<==== Error ==="))
        }
    }
' S.txt D.txt

Note: I added one more "MIDWEST" and a "BACON" record to D.txt to show the mismatch flag.
Report:
Code:
Partner              Summary    Detail     Count Mismatch 
==================== ========== ========== ===============
AIRTRAN                       3          3                
ORBITZ                        0          0                
FRONTIER                      0          0                
BACON                         0          1 <==== Error ===
CAESAR                        2          2                
MIDWEST                       4          5 <==== Error ===

Testing the two methods with 100,000 records took 6.11 seconds for the shell script and 0.80 seconds for the awk script.
# 11  
Old 05-30-2006
That is awesome Thomas.....

Please see the report...

Code:
Mismatched summary:
Partner                 Summary     Detail
==================== ========== ==========

Matched summary:
Partner                 Summary     Detail
==================== ========== ==========
AIRTRAN                 3191969    3191969
ORBITZ                  5995609    5995609
FRONTIER                1672209    1672209
CAESAR                        0          0
MIDWEST                 1577373    1577373
BESTWESTERN              582813     582813

real    3m40.28s
user    1m45.91s
sys     0m9.46s 
I have added time to see how long it takes....I tested on the real file and it is perfect...

Please find the code below...I have tried to parameterize it, but I am not sure how to do that in nawk..

Code:
#!/usr/bin/ksh

if [ $# -ne 3 ]
then
   echo " "
   echo " Incorrect number of parameters entered..."
   echo " Correct usage: " $0 "<DIR> <SUMMARY FILE> <DETAIL FILE>"
   echo " "
   exit 1
fi


DIR=$1
SUMMARY_FILE=$2
DETAIL_FILE=$3

cd ${DIR}

time {
nawk '
    # Compile summary array
    FILENAME=="$(SUMMARY_FILE}" {
        SKeys[$2]=$3
    }

    # Compile details array
    FILENAME=="$(DETAIL_FILE}" {
        DKeys[$2]++
    }

    END {
        # Look for summary records that do not match detail counts
        for (i in SKeys) {
            if (SKeys[i] != DKeys[i]) {
                # Add a record to the merged set
                MismatchedKeys[i]++
            } else {
                MatchedKeys[i]++
            }
        }
        # Look for detail counts where a summary record is missing
        for (i in DKeys) {
            if (SKeys[i] != DKeys[i]) {
                # Add a record to the merged set
                MismatchedKeys[i]++
            } else {
                MatchedKeys[i]++
            }
        }

        # Print a merged set of records
        print "Mismatched summary:"
        printf("%-20s %10s %10s\n", "Partner", "Summary", "Detail")
        printf("%-20s %10s %10s\n", "====================", "==========", "==========")

        for (i in MismatchedKeys) printf("%-20s %10d %10d\n", i, SKeys[i], DKeys[i])

        print ""
        print "Matched summary:"
        printf("%-20s %10s %10s\n", "Partner", "Summary", "Detail")
        printf("%-20s %10s %10s\n", "====================", "==========", "==========")
        for (i in MatchedKeys) printf("%-20s %10d %10d\n", i, SKeys[i], DKeys[i])
    }' $(SUMMARY_FILE} $(DETAIL_FILE}
}

Does the code abort with status 1 if there is output in the Mismatched summary?

Thank you again for all the help..
# 12  
Old 05-30-2006
To parameterize this correctly, you can add variables to know which file you are looking at:
Code:
nawk -v summary_file=${SUMMARY_FILE} -v detail_file=${DETAIL_FILE} '
    # Compile summary array
    FILENAME==summary_file {
        SKeys[$2]=$3
    }

    # Compile details array
    FILENAME==detail_file {
        DKeys[$2]++
    }

To exit with a non-zero result, add a flag to the "END" procedure:
Code:
        # Look for summary records that do not match detail counts
        for (i in SKeys) {
            if (SKeys[i] != DKeys[i]) {
                # Add a record to the merged set
                MismatchedKeys[i]++
                exit_code=1
            } else {
                MatchedKeys[i]++
            }
        }
        # Look for detail counts where a summary record is missing
        for (i in DKeys) {
            if (SKeys[i] != DKeys[i]) {
                # Add a record to the merged set
                MismatchedKeys[i]++
                exit_code=1
            } else {
                MatchedKeys[i]++
            }
        }
        ...
        # Exit last, after the report has been printed: "exit" inside END stops awk immediately
        exit (exit_code)

By the way, I like the last awk script that I provided you since it provides a complete report and it's more concise.
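To illustrate why the -v form is needed: inside the single-quoted awk program the shell never expands ${SUMMARY_FILE}, so awk compares FILENAME with that literal text. Below is a small sketch using a throwaway temporary file as a stand-in for the real summary file (the file content and the variable names literal and passed are made up for the demonstration; plain awk stands in for nawk).

```shell
#!/bin/sh
SUMMARY_FILE=$(mktemp)   # hypothetical stand-in for the real summary file
printf 'S AIRTRAN 3\n' > "$SUMMARY_FILE"

# Inside single quotes the shell leaves ${SUMMARY_FILE} alone, so awk compares
# FILENAME against the literal text "${SUMMARY_FILE}" and the rule never fires.
literal=$(awk 'FILENAME == "${SUMMARY_FILE}" { n++ } END { print n + 0 }' "$SUMMARY_FILE")

# With -v the shell expands the value before awk starts, and awk sees it as
# an ordinary awk variable.
passed=$(awk -v summary_file="$SUMMARY_FILE" 'FILENAME == summary_file { n++ } END { print n + 0 }' "$SUMMARY_FILE")

echo "literal: $literal, with -v: $passed"   # prints "literal: 0, with -v: 1"
rm -f "$SUMMARY_FILE"
```

Note that the value passed with -v has to be spelled exactly like the file argument at the end of the command, since FILENAME holds the name as it was given on the command line.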
# 13  
Old 05-30-2006
Thank You Thomas...

I used your last script which is more concise...But somehow I am unable to get the parameters passed to the script..

I have also added the exit status as you have mentioned...Can you please take a quick glance and see what I am doing wrong here...

Code:
#!/usr/bin/ksh

if [ $# -ne 3 ]
then
   echo " "
   echo " Incorrect number of parameters entered..."
   echo " Correct usage: " $0 "<DIR> <SUMMARY FILE> <DETAIL FILE>"
   echo " "
   exit 1
fi


DIR=$1
SUMMARY_FILE=$2
DETAIL_FILE=$3

time {
nawk -v directory=${DIR} -v summary_file=${SUMMARY_FILE} -v detail_file=${DETAIL_FILE} '
    # Compile summary array
    FILENAME==summary_file {
        SKeys[$2]=$3
        GlobalKeys[$2]++
    }

    # Compile details array
    FILENAME==detail_file {
        DKeys[$2]++
        GlobalKeys[$2]++
    }

    END {
        printf("%-20s %-10s %-10s %-15s\n", "Partner", "Summary", "Detail", "Count Mismatch")
        printf("%-20s %10s %10s %15s\n", "====================", "==========", "==========", "===============")

        # Look for summary records that do not match detail counts
        for (i in GlobalKeys) {
            printf("%-20s %10d %10d %-15s\n", i, SKeys[i], DKeys[i], (DKeys[i] == SKeys[i] ? "" : "<==== Error ==="))
             _ex=1
            }
    }
' directory/summary_file directory/detail_file
}

# 14  
Old 05-30-2006
Code:
#!/usr/bin/ksh

if [ $# -ne 3 ]
then
   echo " "
   echo " Incorrect number of parameters entered..."
   echo " Correct usage: " $0 "<DIR> <SUMMARY FILE> <DETAIL FILE>"
   echo " "
   exit 1
fi


DIR=$1
SUMMARY_FILE=$2
DETAIL_FILE=$3

time {
nawk -v summary_file="${DIR}/${SUMMARY_FILE}" -v detail_file="${DIR}/${DETAIL_FILE}" '
    # Compile summary array
    FILENAME==summary_file {
        SKeys[$2]=$3
        GlobalKeys[$2]++
    }

    # Compile details array
    FILENAME==detail_file {
        DKeys[$2]++
        GlobalKeys[$2]++
    }

    END {
        printf("%-20s %-10s %-10s %-15s\n", "Partner", "Summary", "Detail", "Count Mismatch")
        printf("%-20s %10s %10s %15s\n", "====================", "==========", "==========", "===============")

        # Print every partner; flag and remember any count mismatch
        for (i in GlobalKeys) {
            mismatch = (DKeys[i] != SKeys[i])
            printf("%-20s %10d %10d %-15s\n", i, SKeys[i], DKeys[i], (mismatch ? "<==== Error ===" : ""))
            if (mismatch) _ex=1
        }
        exit (_ex)
    }
' "${DIR}/${SUMMARY_FILE}" "${DIR}/${DETAIL_FILE}"
}

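${DIR}, if needed, has to be prepended to the file names in both places: in the -v assignments and in the file arguments on the last line, e.g. ${DIR}/${SUMMARY_FILE}. Otherwise FILENAME will not match the variables.

Also, you need to make the last executable line of the "END" procedure an "exit" as follows: exit (_ex). Note that _ex should be set only when a mismatch is found; if it is set unconditionally inside the loop, the script always exits with status 1.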
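On the calling side, the exit status can then drive the abort. Here is a rough sketch of how a wrapper might react, using made-up sample files in a temporary directory and a cut-down report (it only prints mismatches); the names status and report are illustrative, and plain awk stands in for nawk.

```shell
#!/bin/sh
# Hypothetical sample data: MIDWEST has one detail record too many
tmp=$(mktemp -d)
printf 'S AIRTRAN 1\nS MIDWEST 1\n' > "$tmp/S.txt"
printf 'D AIRTRAN x\nD MIDWEST x\nD MIDWEST x\n' > "$tmp/D.txt"

status=0
# The assignment takes the exit status of the command substitution, i.e. of awk
report=$(awk -v summary_file="$tmp/S.txt" '
    FILENAME == summary_file { SKeys[$2] = $3; GlobalKeys[$2]++; next }
    { DKeys[$2]++; GlobalKeys[$2]++ }
    END {
        for (i in GlobalKeys)
            if (SKeys[i] + 0 != DKeys[i] + 0) { print "MISMATCH " i; _ex = 1 }
        exit (_ex + 0)
    }
' "$tmp/S.txt" "$tmp/D.txt") || status=$?

if [ "$status" -ne 0 ]; then
    echo "Count validation failed (status $status): $report"
    # a real job would stop here, e.g. with: exit "$status"
fi
rm -rf "$tmp"
```

With this sample data the wrapper reports a failure for MIDWEST only; with matching counts, status stays 0 and nothing is printed.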