Challenging Compare and validate question -- plus speed.


 
# 1  
Old 05-22-2006

I have a tab delimited HUGE file (13 million records) with Detail, Metadata and Summary records.

Sample File looks like this

M BESTWESTERN 4 ACTIVITY_CNT_L12 A 3
M AIRTRAN 4 ACTIVITY_CNT_L12 A 3
D BESTWESTERN FIRSTNAME LASTNAME 209 N SANBORN AVE
D BESTWESTERN FIRSTNAME LASTNAME 6997 COUNTY ROAD D
D AIRTRAN FIRSTNAME LASTNAME 6997 COUNTY ROAD D
S BESTWESTERN 2
S AIRTRAN 2

I have split the file into three different files.

Metadata file
Detail file
Summary file
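
A minimal sketch of how such a split might be done in a single pass, assuming the record type (M, D or S) is always the first tab-separated field. The function name `split_by_type` and the output file names are made up for illustration and are not from the original post. On Solaris, `awk` would likely need to be `nawk` or `/usr/xpg4/bin/awk`.

```shell
# split_by_type INPUT_FILE OUTPUT_DIR   (hypothetical helper name)
# Writes metadata.txt, detail.txt and summary.txt (assumed names) into
# OUTPUT_DIR, reading the input only once.
split_by_type() {
    awk -F'\t' -v dir="$2" '
        $1 == "M" { print > (dir "/metadata.txt"); next }
        $1 == "D" { print > (dir "/detail.txt");   next }
        $1 == "S" { print > (dir "/summary.txt");  next }
        { bad++ }   # count records with an unknown type
        END {
            if (bad)
                printf("%d unrecognised record(s)\n", bad) | "cat 1>&2"
        }
    ' "$1"
}
```

It could be called as `split_by_type bigfile.txt .` to write the three files into the current directory. Reading the input once matters here because each extra pass over 13 million records costs a full scan.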

The challenge is to check whether the information in the Metadata records exists in the Detail record file. The names are not constant and WILL change with every incoming file.

1) The script needs to dynamically read the column in the Metadata record file that contains the names (for example, 'BESTWESTERN' and 'AIRTRAN') and make sure that each name also exists in the Detail record file.

This is a huge file, and I need to know the fastest way to process it.

What is the best way to approach this dynamically changing file?
Please advise.

Thank You,
Madhu
# 2  
Old 05-22-2006
The best option I could find is to do it this way:

#!/usr/bin/ksh

# put the second column into a file,
# make it unique values

awk '{ print $2 } ' file1 | sort -u > patternfile
# loop thru the patterns from file1
# look for them in file2, print patterns if not found
while read pattern1 pattern2
do
    if [ ${#pattern2} -eq 0 ]; then # skip when pattern2 isn't there
        continue
    fi
    grep "$pattern1" file2 | grep -q "$pattern2"
    if [ $? -ne 0 ]; then
        echo "$pattern1" "$pattern2"
    fi
done < patternfile


If I change this script so that it aborts when it doesn't find a pattern, would that be fine?
# 3  
Old 05-22-2006
Try something like this:
nawk -f mad.awk file1 file1

mad.awk:
Code:
BEGIN {
  outMeta="meta.txt"
  outData="data.txt"
  outSumm="summ.txt"

  stderr="cat 1>&2"
}
FNR==NR{
  if ($1 == "D") data[$2];
  next
}

{
  if ($1 != "D" && ($2 in data) )
     print $0 >> (($1 == "M") ? outMeta : outSumm)
  else if ( $1 != "D" )
        printf("WARNING::[%d]: Meta or Summary is NOT in Data: [%s]\n", FNR, $2) | stderr

  if ($1 == "D" )
     print $0 >> outData
}

# 4  
Old 05-22-2006
Thank you for the message. I am not sure if I have communicated correctly. But I am looking to do something like this:

Steps:
1) Put second column from file 1 (tab delimited Metadata file) into a pattern file.
2) Count the number of patterns and print the patterns.
3) Loop through the pattern file from file1 and look for those patterns in file2 (tab delimited Detail records file).
4) If a pattern is not found in file2, print the particular pattern that was not found and abort.

I could do something like this, but it is going wrong somewhere. Any ideas will be very much appreciated.


#!/usr/bin/ksh

# put the second column into a file,
# make it unique values

awk '{ print $2 } ' file1 | sort -u > patternfile
# loop thru the patterns from file1
# look for them in file2
while read pattern1
do
    /usr/xpg4/bin/grep -q "$pattern1" file2
    rc=$?
    if [ ${rc} -eq 0 ]; then
        echo "Pattern found in file2 -- Successful"
    else
        # the variable is pattern1 (shell names are case-sensitive)
        echo "Pattern $pattern1 not found in file2, Failed"
        exit 1   # abort on the first missing pattern
    fi
done < patternfile
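
The steps listed in this post could also be sketched without running one grep per pattern, by treating both name columns as sorted sets and taking the difference with `comm`. This is only a sketch under assumptions: both files are tab-delimited with the name in column 2, `mktemp -d` is available, and the function name `missing_names` is made up for illustration.

```shell
# missing_names METADATA_FILE DETAIL_FILE   (hypothetical helper name)
# Prints every column-2 name that is in the metadata file but not in the
# detail file, one per line. Prints nothing when all names are present.
missing_names() {
    work=$(mktemp -d) || return 2
    # cut splits on tabs by default; sort -u gives each name once
    cut -f2 "$1" | LC_ALL=C sort -u > "$work/meta.names"
    cut -f2 "$2" | LC_ALL=C sort -u > "$work/detail.names"
    # comm -23 keeps only the lines unique to the first (metadata) list
    LC_ALL=C comm -23 "$work/meta.names" "$work/detail.names"
    rm -rf "$work"
}
```

A caller could abort with something like `[ -z "$(missing_names file1 file2)" ] || exit 1`, and the count from step 2 is `cut -f2 file1 | sort -u | wc -l`. Sorting the name column of a 13-million-record detail file is the expensive part, so this is likely slower than a hash lookup in awk, but it scans the detail file only once however many names there are.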
# 5  
Old 05-22-2006
Quote:
Originally Posted by madhunk
Thank you for the message. I am not sure if I have communicated correctly. But I am looking to do something like this:

Steps:
1) Put second column from file 1 (tab delimited Metadata file) into a pattern file.
2) Count the number of patterns and print the patterns.
Why do you need to count them, and why do you need to print them?
Quote:
Originally Posted by madhunk
3) Loop through the pattern file from file1 and look for those patterns in file2 (tab delimited Detail records file).
Hmm... I thought there was just ONE file.
Now you're saying there are TWO files?

It might be a good idea to post sample INPUT file(s) [if there are multiple] and, instead of outlining the algorithm, outline what needs to be done AND a sample end result given the sample input file(s).

Also, please use vB codes when posting code and/or quotes - it makes reading the posting much easier.

# 6  
Old 05-22-2006
Hi vgersh99,

The printing is only for display purposes, to see how many partner names the metadata file has.

I am sorry for the miscommunication.

Sample Input File1 (Metadata File)
Sample Input File2 (Detail File)

The Metadata File has names such as ORBITZ, BESTWESTERN and so on. They should also exist in the Detail File, so a comparison needs to be made. If any of them doesn't exist, the script should fail.

The current code cuts the names and puts that into a temporary file. Then it loops and checks the
existence of these names in the Detail file. If any of the names doesn't exist, then the
script should abort.

I am getting confused about the looping process here...Is this the right way to work through the solution?
Moreover, the detail file in reality has 13 million records.

Code:
#!/usr/bin/ksh

# put the second column into a file,
# make it unique values

awk '{ print $2 } ' file1 | sort -u > patternfile
# loop thru the patterns from file1
# look for them in file2
while read pattern1
do
    grep -q "$pattern1" file2
    rc=$?
    if [ ${rc} -eq 0 ]; then
        echo "Pattern found in file2 -- Successful"
    else
        echo "Pattern $pattern1 not found in file2, Failed"
        exit 1   # abort on the first missing name
    fi
done < patternfile

Also attached are the sample files...

Sample Metadata file

M ORBITZ 8 LAST_BOOKED_DATE D
M AIRTRAN 8 TRIPS_YTD A 11
M FRONTIER 5 FLT_COUNT N
M CAESAR 7 DAYSPLAYED A 9
M BESTWESTERN 4 ACTIVITY_CNT_L12 A

Sample Detail file

D BESTWESTERN FIRST LAST 10545 WILLOWS RD NE
D ORBITZ FIRST LAST 550 N CENTRAL ROWIE AZ
D AIRTRAN FIRST LAST 6755B WILLOW BROOK PARK # P
D FRONTIER FIRST LASTNAME PO BOX 370
D CAESAR FIRST LAST 2113 CRIMSCENDDR # 10

I hope I am clear this time...
# 7  
Old 05-22-2006
ok - not tested.

nawk -f mad.awk DetailFile.txt MetadataFile.txt

mad.awk:
Code:
FNR==NR{
   detail[$2]
   next
}
{ 
  printf("Meta [%s] %sfound in Detail -- %s\n",  $2, ($2 in detail) ? "" : "NOT ",  ($2 in detail) ? "Successful" : "Failed")
}
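
Since the thread asks both for an abort when a name is missing and for speed on 13 million records, here is a minimal sketch of one possible variant. It is not the code posted above: it reads the small metadata file first, then streams the detail file and stops reading as soon as every name has been seen. The function name `check_meta_in_detail` is made up for illustration, and both files are assumed to be tab-delimited with the name in column 2. On Solaris, `awk` would likely need to be `nawk` or `/usr/xpg4/bin/awk`.

```shell
# check_meta_in_detail METADATA_FILE DETAIL_FILE   (hypothetical helper name)
# Exit status 0 if every column-2 name in the metadata file also appears in
# column 2 of the detail file, 1 otherwise. Missing names go to stderr.
check_meta_in_detail() {
    awk -F'\t' '
        # First file (metadata): remember each distinct name once.
        FILENAME == ARGV[1] {
            if (!($2 in want)) { want[$2]; left++ }
            next
        }
        # Second file (detail): tick off names; stop once none are left.
        ($2 in want) {
            delete want[$2]
            if (--left == 0) exit
        }
        END {
            rc = 0
            for (name in want) {
                printf("Meta [%s] NOT found in Detail -- Failed\n", name) | "cat 1>&2"
                rc = 1
            }
            exit rc
        }
    ' "$1" "$2"
}
```

A calling script could then abort with `check_meta_in_detail MetadataFile.txt DetailFile.txt || exit 1`. Only the metadata names are held in memory, and when every name occurs early in the detail file most of the 13 million records are never read. The worst case, where a name is missing, still costs one full pass.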
