Challenging Compare and validate question -- plus speed.

05-22-2006

Registered User

95, 0

Join Date: Nov 2005

Last Activity: 21 September 2017, 9:57 PM EDT

Posts: 95

Thanks Given: 1

Thanked 0 Times in 0 Posts

I have a tab delimited HUGE file (13 million records) with Detail, Metadata and Summary records.

Sample File looks like this

M BESTWESTERN 4 ACTIVITY_CNT_L12 A 3
M AIRTRAN 4 ACTIVITY_CNT_L12 A 3
D BESTWESTERN FIRSTNAME LASTNAME 209 N SANBORN AVE
D BESTWESTERN FIRSTNAME LASTNAME 6997 COUNTY ROAD D
D AIRTRAN FIRSTNAME LASTNAME 6997 COUNTY ROAD D
S BESTWESTERN 2
S AIRTRAN 2

I have split the file into three different files.

Metadata file
Detail file
Summary file
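A minimal sketch of how such a split could be done in a single pass, assuming the record type is always the first tab-delimited field. The function name `split_records` and the output names `meta.txt`, `data.txt` and `summ.txt` are illustrative assumptions, not something stated in the post.

```shell
#!/bin/sh
# split_records INPUT
# Write M, D and S records to meta.txt, data.txt and summ.txt in the
# current directory (hypothetical file names chosen for this sketch).
split_records() {
    awk -F '\t' '
        $1 == "M" { print > "meta.txt"; next }
        $1 == "D" { print > "data.txt"; next }
        $1 == "S" { print > "summ.txt"; next }
        # Any other record type is unexpected: report it on stderr.
        { printf("unknown record type on line %d: [%s]\n", NR, $1) | "cat 1>&2" }
    ' "$1"
}
```

Calling `split_records bigfile.txt` reads the input once, which matters for a 13 million record file. Note that an output file left over from an earlier run is not removed when the new input has no records of that type.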

The challenge is to check whether the information in the Metadata records exists in the Detail record file. The names are not constant and WILL change with every incoming file.

1) The script needs to dynamically check the column in the Metadata record file that contains the names, for example 'BESTWESTERN' and 'AIRTRAN', and make sure that each name also exists in the Detail record file.

This is a huge file and I need to know the fastest way to process it.

What is the best way to approach this dynamically changing file?
Please advise.

Thank You,
Madhu

madhunk

05-22-2006

The best option I could find is to do it this way:

#!/usr/bin/ksh

# Put the second column of file1 into a file,
# keeping only unique values.
awk '{ print $2 }' file1 | sort -u > patternfile

# Loop through the patterns from file1 and look for each one in file2;
# print any pattern that is not found.
# patternfile holds one name per line, so a single variable is read.
while read pattern1
do
    if [ ${#pattern1} -eq 0 ]; then    # skip empty lines
        continue
    fi
    grep "$pattern1" file2 > /dev/null
    if [ $? -ne 0 ]; then
        echo "$pattern1"
    fi
done < patternfile

If I change this script so that it aborts when a pattern is not found, would that be fine?
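One way the abort could look, as a sketch only: the loop is wrapped in a function (the name `check_patterns` is made up here) that returns a non-zero status at the first missing pattern. It assumes a POSIX `grep` that supports `-q` and `-F`; on Solaris that would be `/usr/xpg4/bin/grep` rather than `/usr/bin/grep`.

```shell
#!/bin/sh
# check_patterns PATTERNFILE DETAILFILE
# Returns 0 if every non-empty line of PATTERNFILE occurs somewhere in
# DETAILFILE; otherwise reports the first missing pattern and returns 1.
check_patterns() {
    while IFS= read -r pattern
    do
        [ -z "$pattern" ] && continue    # skip blank lines
        # -F: match the pattern as a fixed string, not a regular expression.
        # -q: print nothing; only the exit status is used.
        if ! grep -q -F -- "$pattern" "$2"; then
            echo "Pattern $pattern not found in $2, aborting" >&2
            return 1
        fi
    done < "$1"
    return 0
}
```

A calling script could then use `check_patterns patternfile file2 || exit 1`. This still scans the detail file once per pattern, so it will be slow when there are many names.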

madhunk

05-22-2006

Moderator

8,825, 1,112

Join Date: Feb 2005

Last Activity: 23 August 2021, 11:26 AM EDT

Location: Foxborough, MA

Posts: 8,825

Thanks Given: 579

Thanked 1,112 Times in 1,003 Posts

try something like this:
nawk -f mad.awk file1 file1

mad.awk:

Code:

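BEGIN {
  outMeta="meta.txt"
  outData="data.txt"
  outSumm="summ.txt"

  # command used to send warnings to standard error
  stderr="cat 1>&2"
}
# first pass over the file: remember every name that has a Detail record
FNR==NR{
  if ($1 == "D") data[$2];
  next
}

# second pass: write each record to its output file;
# the redirection target is parenthesized so the ?: is parsed as one expression
{
  if ($1 != "D" && ($2 in data) )
     print $0 >> (($1 == "M") ? outMeta : outSumm)
  else if ( $1 != "D" )
        printf("WARNING::[%d]: Meta or Summary is NOT in Data: [%s]\n", FNR, $2) | stderr

  if ($1 == "D" )
     print $0 >> outData
}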

vgersh99

05-22-2006

Thank you for the message. I am not sure if I have communicated correctly. But I am looking to do something like this:

Steps:
1) Put the second column from file1 (tab delimited Metadata file) into a pattern file.
2) Count the number of patterns and print the patterns.
3) Loop through the pattern file from file1 and look for those patterns in file2 (tab delimited Detail records file).
4) If a pattern is not found in file2, print the particular pattern that was not found and abort.

I could do something like this, but I am not sure it is the right approach. Any ideas will be very much appreciated.

#!/usr/bin/ksh

# Put the second column of file1 into a file,
# keeping only unique values.
awk '{ print $2 }' file1 | sort -u > patternfile

# Loop through the patterns from file1
# and look for each one in file2.
while read pattern1
do
    /usr/xpg4/bin/grep -q "$pattern1" file2
    rc=$?
    if [ ${rc} -eq 0 ]; then
        echo "Pattern $pattern1 found in file2 -- Successful"
    else
        echo "Pattern $pattern1 not found in file2, Failed"
        exit 1    # abort on the first missing pattern
    fi
done < patternfile
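The four steps listed above could also be sketched without a per-pattern loop, by comparing the two sets of names with `sort -u` and `comm`. This is a swapped-in technique, not the loop from the post: it compares the whole second field exactly instead of searching for a substring anywhere in the line. The function name `validate_names` is hypothetical, and `mktemp -d` is assumed to be available.

```shell
#!/bin/sh
# validate_names METAFILE DETAILFILE
# Returns 0 when every name in column 2 of METAFILE also appears in
# column 2 of DETAILFILE, 1 otherwise. Both files are tab delimited.
validate_names() {
    tmpdir=$(mktemp -d) || return 2
    # Step 1: unique names from column 2 of each file (cut splits on tabs).
    cut -f2 "$1" | sort -u > "$tmpdir/meta_names"
    cut -f2 "$2" | sort -u > "$tmpdir/detail_names"
    # Step 2: count and print the metadata names.
    echo "Metadata names: $(wc -l < "$tmpdir/meta_names" | tr -d ' ')"
    cat "$tmpdir/meta_names"
    # Steps 3 and 4: names that are in metadata but not in detail.
    missing=$(comm -23 "$tmpdir/meta_names" "$tmpdir/detail_names")
    rm -rf "$tmpdir"
    if [ -n "$missing" ]; then
        echo "Not found in detail file:" >&2
        echo "$missing" >&2
        return 1
    fi
    return 0
}
```

With this shape, `validate_names file1 file2 || exit 1` reads each file once. Sorting the names from 13 million detail records has its own cost, so whether this beats the loop depends on how many distinct names there are.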

madhunk

05-22-2006

Quote:

Originally Posted by madhunk

Thank you for the message. I am not sure if I have communicated correctly. But I am looking to do something like this:

Steps:
1) Put second column from file 1 (tab delimited Metadata file) into a pattern file.
2) Count the number of patterns and print the patterns.

Why do you need to count them, and why do you need to print them?

Quote:

Originally Posted by madhunk

3) Loop through the pattern file from file1 and look for those patterns in file2 (tab delimited Detail records file).

Hmm... I thought there was just ONE file.
Now you're saying there are TWO files?

It might be a good idea to post the sample INPUT file(s), if there are several, and, instead of outlining the algorithm, outline what needs to be done AND a sample end result given the sample input file(s).

Also, please use vB codes when posting code and/or quotes; it makes reading the post much easier.

vgersh99

05-22-2006

Hi vgersh99,

The printing is only for display purposes, to see how many partner names the metadata file has.

I am sorry for the miscommunication.

Sample Input File1 (Metadata File)
Sample Input File2 (Detail File)

The Metadata File has names such as ORBITZ, BESTWESTERN and so on. They should also exist in the Detail File, so a comparison needs to be made. In case they don't exist, the script should fail.

The current code cuts the names and puts them into a temporary file. Then it loops and checks for the
existence of these names in the Detail file. If any of the names doesn't exist, then the
script should abort.

I am getting confused about the looping process here. Is this the right way to work through the solution?
Moreover, the detail file in reality has 13 million records.

Code:

#!/usr/bin/ksh

# Put the second column of file1 into a file,
# keeping only unique values.
awk '{ print $2 }' file1 | sort -u > patternfile

# Loop through the patterns from file1
# and look for each one in file2.
while read pattern1
do
    # discard the matching lines; only the exit status is needed
    grep "$pattern1" file2 > /dev/null
    rc=$?
    if [ ${rc} -eq 0 ]; then
        echo "Pattern $pattern1 found in file2 -- Successful"
    else
        echo "Pattern $pattern1 not found in file2, Failed"
        exit 1    # abort on the first missing pattern
    fi
done < patternfile

Also attached are the sample files...

Sample Metadata file

M ORBITZ 8 LAST_BOOKED_DATE D
M AIRTRAN 8 TRIPS_YTD A 11
M FRONTIER 5 FLT_COUNT N
M CAESAR 7 DAYSPLAYED A 9
M BESTWESTERN 4 ACTIVITY_CNT_L12 A

Sample Detail file

D BESTWESTERN FIRST LAST 10545 WILLOWS RD NE
D ORBITZ FIRST LAST 550 N CENTRAL ROWIE AZ
D AIRTRAN FIRST LAST 6755B WILLOW BROOK PARK # P
D FRONTIER FIRST LASTNAME PO BOX 370
D CAESAR FIRST LAST 2113 CRIMSCENDDR # 10

I hope I am clear this time...

madhunk

05-22-2006

ok - not tested.

nawk -f mad.awk DetailFile.txt MetadataFile.txt

mad.awk:

Code:

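# first file on the command line (Detail): remember every name in column 2
FNR==NR{
   detail[$2]
   next
}
# second file (Metadata): report whether each name was seen in Detail
{
  found = ($2 in detail)
  printf("Meta [%s] %sfound in Detail -- %s\n",  $2, found ? "" : "NOT ",  found ? "Successful" : "Failed")
}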
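Since the requirement is that the script should fail when a name is missing, the same two-file idea could be wrapped so that it sets an exit status. This is only a sketch: the function name `check_meta_in_detail` is made up, plain `awk` is assumed (on Solaris `nawk` would be used, as in the command above), and it lists every missing name before failing rather than stopping at the first one.

```shell
#!/bin/sh
# check_meta_in_detail DETAILFILE METAFILE
# Prints one line per metadata name that is missing from the detail file
# and returns 1 if there was at least one, 0 otherwise.
check_meta_in_detail() {
    awk -F '\t' '
        # First file (Detail): remember every name seen in column 2.
        FNR == NR { detail[$2]; next }
        # Second file (Metadata): report each name that was never seen.
        !($2 in detail) {
            printf("Meta [%s] NOT found in Detail -- Failed\n", $2)
            missing++
        }
        # Non-zero exit status when at least one name is missing.
        END { exit (missing > 0 ? 1 : 0) }
    ' "$1" "$2"
}
```

A caller could run `check_meta_in_detail DetailFile.txt MetadataFile.txt || exit 1`. Each file is read once and only the distinct names are held in memory. One caveat of the `FNR == NR` idiom: if the detail file is empty, the metadata file is treated as the first file and nothing is reported.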

vgersh99
