Grep, count and match two files
I am writing the script below to grep and count the number of occurrences between two tab-delimited files.
What I am trying to achieve:
1) Extract column 2 and column 3 from the S.txt file and put them in a temporary pattern file.
2) Grep and count the column 2 value in the D.txt file.
3) Compare the counts between the D.txt and S.txt files.
4) Abort if the counts don't match.
Example: APPLE occurs 4 times in D.txt and is a match in S.txt.
Code:
#!/usr/bin/ksh
SUM_COUNT=`nawk '{if ($0 ~ /^S/) print $2,$3 >"S1.txt" }'` S.txt
for i in S1.txt
do
DETAIL_COUNT=`grep $i D.txt | wc -l`
if [ ${DETAIL_COUNT} -eq ${SUM_COUNT} ]
then
echo "Count between Detail and Summary matches"
else
echo "Count didn't match"
exit
fi
done
|
|
||||
|
Gosh, I hope this isn't homework-related, but this mishmash of commands was too much for me. It's way too easy to do this kind of thing right in awk by itself. Take a look at this quick example (it just so happens that I recently helped a coworker with a nearly identical business-related problem on very large flat files):
Code:
nawk '
    # Compile summary array
    FILENAME=="S.txt" {
        SKeys[$2]=$3
    }
    # Compile details array
    FILENAME=="D.txt" {
        DKeys[$2]++
    }
    END {
        # Look for summary records that do not match detail counts
        for (i in SKeys) {
            if (SKeys[i] != DKeys[i]) {
                # Add a record to the merged set
                MKeys[i]++
            }
        }
        # Look for detail counts where a summary record is missing
        for (i in DKeys) {
            if (SKeys[i] != DKeys[i]) {
                # Add a record to the merged set
                MKeys[i]++
            }
        }
        # Print a merged set of records
        printf("%-20s %10s %10s\n", "Fruit", "Summary", "Detail")
        printf("%-20s %10s %10s\n", "====================", "==========", "==========")
        for (i in MKeys) printf("%-20s %10d %10d\n", i, SKeys[i], DKeys[i])
    }
' S.txt D.txt
Output:
Code:
Fruit                   Summary     Detail
==================== ========== ==========
BACON                         0          8
APPLES                        4          6
S.txt:
Code:
S APPLES 4
S ORANGES 1
S PEARS 1
S PINEAPPLES 1
S TOMATOES 0
S PEPPERS 1
D.txt:
Code:
D PINEAPPLES
D ORANGES
D PEARS
D APPLES
D APPLES
D APPLES
D BACON
D BACON
D BACON
D APPLES
D APPLES
D APPLES
D PEPPERS
D BACON
D BACON
D BACON
D BACON
D BACON
|
||||
|
Thank you, Thomas. It is absolutely not a homework problem; it is an ACR validation check for a bank. We have an authorization file: the Detail text file holds the partner names, and the Summary file holds the partner names and the count of records for each partner name.
We get this file every week and need to validate the count of records for each partner name (the keys in your code). The Detail file is almost 15 million records and the Summary file has about 4 or 5 records. I will take your example and test it on these huge files tomorrow...
|
||||
|
Try this one; it is much simpler:
Code:
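#! /usr/bin/ksh
# Count the occurrences of each key in d.txt using two parallel arrays
index=1
tmp=1
for val in `sed 's/^.* //' d.txt`
do
    # Seed the arrays with the first value seen
    if [ $tmp -eq 1 ]
    then
        fruit[$index]=$val
        cnt[$index]=0
        tmp=$(($tmp + 1))
    fi
    # Linear search for an existing entry
    temp=1
    while [ $temp -le $index ]
    do
        if [ "${fruit[$temp]}" = "$val" ]
        then
            cnt[$temp]=$((${cnt[$temp]} + 1))
            break
        fi
        temp=$(($temp + 1))
    done
    # Not found: append a new entry
    if [ $temp -gt $index ]
    then
        index=$(($index + 1))
        fruit[$index]=$val
        cnt[$index]=$((${cnt[$index]} + 1))
    fi
done
# Compare each summary record against the detail counts
awk '{print $2, $3}' s.txt | while read first second
do
    temp=1
    while [ $temp -le $index ]
    do
        if [ "${fruit[$temp]}" = "$first" ]
        then
            # Expand the array element explicitly rather than relying on
            # arithmetic evaluation of the bare name cnt[$temp]
            if [ "${cnt[$temp]}" -eq "$second" ]
            then
                print ${fruit[$temp]} ${cnt[$temp]}
                break
            else
                exit 1
            fi
        fi
        temp=$(($temp + 1))
    done
    # Summary key never seen in the detail file
    if [ $temp -gt $index ]
    then
        exit 1
    fi
done
exit 0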
|
|
||||
|
I tried it with 150009 records and it completed in less than 21 seconds; maybe you could try it once and see.
Code:
time ./fine.ksh
MIDWEST 150004
AIRTRAN 3
CAESAR 2
real 0m20.84s
user 0m20.91s
sys 0m0.09s
|
||||
|
For what it's worth, and in defense of thinking through the performance implications of a scripting solution before coding one:
First, sorry, but I couldn't get your modified script to work on my AIX system with 1 million records without changing the first loop as follows:
Code:
while read junk val junk junk junk junk junk junk junk junk junk junk
do
...
done < D.txt
Code:
time ksh s2.sh
real 1m57.53s
user 1m57.26s
sys 0m0.10s
The awk solution ran in 10 seconds for the same 1 million records:
Code:
time { ksh s.sh ; print $? ; }
Partner                 Summary     Detail Count Mismatch
==================== ========== ========== ===============
AIRTRAN                       3     314565 <==== Error ===
ORBITZ                        0          0
FRONTIER                      0          0
BACON                         0     104854 <==== Error ===
CAESAR                        2     209710 <==== Error ===
THOMAS                       44          0 <==== Error ===
MIDWEST                       4     419419 <==== Error ===
1
real 0m10.30s
user 0m10.07s
sys 0m0.12s
Shell script solution: Code:
time { ksh s2.sh ; print $? ; }
1
real 0m16.72s
user 0m16.67s
sys 0m0.02s
Awk solution: Code:
time { ksh s.sh ; print $? ; }
Partner                 Summary     Detail Count Mismatch
==================== ========== ========== ===============
AIRTRAN                       3      45000 <==== Error ===
ORBITZ                        0          0
FRONTIER                      0          0
BACON                         0      15000 <==== Error ===
CAESAR                        2      30000 <==== Error ===
THOMAS                       44          0 <==== Error ===
MIDWEST                       4      60000 <==== Error ===
1
real 0m1.40s
user 0m1.37s
sys 0m0.02s
To be fair, the shell script doesn't really save itself much work: the detail file must be processed in full, since a mismatch can occur anywhere in it. The summary file only contains 4 or 5 records, so a few seconds are saved at best. The potential benefit grows as the summary file grows but, again, it is fairly minimal. The letter of the requirement can therefore still be met in the awk solution by adding a test in the END procedure, but the performance gain would be measured in milliseconds. I'll leave it out and seek forgiveness instead.
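For anyone who does want the abort, here is a minimal sketch of what such an END test might look like. It assumes the same layout as the sample files earlier in the thread (key in column 2, summary count in column 3). The temporary directory, the tiny sample data, the variable names report, rc, bad and d, and the use of plain awk instead of nawk are all my own assumptions for illustration, not part of the scripts timed above.
Code:

```shell
#!/bin/sh
# Sketch only: build two tiny tab-delimited sample files (hypothetical data)
tmpdir=$(mktemp -d)
printf 'S\tAPPLES\t4\nS\tORANGES\t1\n' > "$tmpdir/S.txt"
printf 'D\tAPPLES\nD\tAPPLES\nD\tORANGES\n' > "$tmpdir/D.txt"

rc=0
report=$(awk '
    # First file (summary): remember the expected count per key
    FNR == NR { SKeys[$2] = $3; next }
    # Second file (detail): count the records per key
    { DKeys[$2]++ }
    END {
        bad = 0
        # Summary records whose count differs from the detail count
        for (i in SKeys) {
            d = (i in DKeys) ? DKeys[i] : 0
            if (SKeys[i] + 0 != d) {
                print "mismatch:", i, SKeys[i] + 0, d
                bad = 1
            }
        }
        # Detail keys that have no summary record at all
        for (i in DKeys) {
            if (!(i in SKeys)) {
                print "no summary:", i, DKeys[i]
                bad = 1
            }
        }
        # Non-zero exit status tells the calling script to abort
        exit bad
    }
' "$tmpdir/S.txt" "$tmpdir/D.txt") || rc=$?

printf '%s\n' "$report"
echo "exit status: $rc"
rm -rf "$tmpdir"
```

The calling script can then test the exit status and stop the job when it is non-zero. Note that the FNR == NR idiom assumes the summary file is not empty.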