File delta detection


 
# 1  
Old 08-10-2012

Hello,

I need to compare two flat ASCII files, say OLD and NEW. Both are pipe-delimited (|), contain a few million records (lines) each, and share the same columns and the same key columns (for example, the 3rd and 5th columns). A record (row) can have any number of fields (columns), and the key columns will be provided as input to the program.

I need to compare the records in OLD and NEW by the values in their key columns and store the result in a DELTA file:

If the key set is present in NEW but not in OLD, append 'I' to the NEW record and store it in DELTA.
If the key set is present in OLD but not in NEW, append 'D' to the OLD record and store it in DELTA.
If the same key set is present in both OLD and NEW, append 'U' to the NEW record and store it in DELTA.

Please help me on the same.

Thanks,
Manu
# 2  
Old 08-10-2012
Examples, please. This sounds like something diff could handle perfectly if the lines consisted of keys only. How about cutting the keys out into temp files, diffing those, and going back to the original files with the result?
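The cut-and-compare idea could be sketched roughly as follows. This is only an illustration under assumptions: it uses comm instead of diff (comm splits sorted input into "only in first", "only in second" and "in both", which maps directly onto the three cases), the function name split_keys and the *.keys output files are made up, and columns 3 and 5 are assumed to be the keys.

```shell
#!/bin/sh
# Sketch only: split_keys and the *.keys file names are hypothetical.
# Usage: split_keys OLD NEW   (key columns 3 and 5, '|' delimited)
split_keys() {
    old=$1
    new=$2
    # Extract the key columns, then sort and deduplicate them.
    cut -d'|' -f3,5 "$old" | LC_ALL=C sort -u > old.keys
    cut -d'|' -f3,5 "$new" | LC_ALL=C sort -u > new.keys
    # comm needs sorted input; -13 keeps lines unique to the second file,
    # -23 lines unique to the first, -12 lines common to both.
    LC_ALL=C comm -13 old.keys new.keys > inserted.keys
    LC_ALL=C comm -23 old.keys new.keys > deleted.keys
    LC_ALL=C comm -12 old.keys new.keys > common.keys
}
```

The keys in common.keys still need a full-line comparison against the original files to tell updated records from identical ones.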
# 3  
Old 08-10-2012
Hi,

Please see the example below:

Columns 3 and 5 represent the key set.

OLD
Code:
a|ae|1|ecg|10|@jks*|xyz
b|dm|2|bp|20|$5lw!|qwe
c|co|3|mb|30|&dhf!|gfh
d|te|4|value|40|@kdk+|kdd

NEW
Code:
a|ae|1|ecg|10|@jks*|xyz
b|dm|2|bp|20|$lwa!|sdf
d|te|4|value|40|@kdk-|kdd
f|sv|5|jdjd|50|^ghd2|uyy


DELTA
Code:
b|dm|2|bp|20|$lwa!|sdf|u
c|co|3|mb|30|&dhf!|gfh|d
d|te|4|value|40|@kdk-|kdd|u
f|sv|5|jdjd|50|^ghd2|uyy|i

One more condition: records that are identical in OLD and NEW must not be included in the DELTA file (see the first line of both files in the example above).

I tried sorting on the key columns and then finding the differences, but I am not good enough with UNIX.

Thanks,
Manu
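For illustration, one way to get from the sample OLD and NEW to the sample DELTA is a single awk pass. This is a minimal sketch, not a tuned solution: make_delta is a made-up name, the key columns 3 and 5 are hard-coded, each key is assumed to occur at most once per file, and all of OLD is held in memory, which may or may not be acceptable for a few million records.

```shell
#!/bin/sh
# Sketch only: make_delta is a hypothetical helper; keys are columns 3 and 5.
# Usage: make_delta OLD NEW > DELTA
make_delta() {
    awk -F'|' '
        # First file (OLD): remember each full line under its key.
        FILENAME == ARGV[1] { old[$3 FS $5] = $0; next }
        # Second file (NEW): classify each line against OLD.
        {
            key = $3 FS $5
            if (!(key in old))        print $0 "|i"   # key only in NEW
            else if (old[key] != $0)  print $0 "|u"   # same key, changed line
            delete old[key]                           # key has been handled
        }
        # Whatever is left in old[] never appeared in NEW.
        END { for (key in old) print old[key] "|d" }
    ' "$1" "$2" | LC_ALL=C sort -t'|' -k3,3 -k5,5
}
```

The final sort is there because awk does not guarantee the order of the for-in loop; it orders the output by the key columns as plain strings.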
# 4  
Old 08-10-2012
Let me try to paraphrase your requirement:
If lines are identical, skip.
If different, but keys are same: output NEW's line to DELTA, adding a "u".
If a key exists in only one file: output NEW's line adding "i" when it is only in NEW, or OLD's line adding "d" when it is only in OLD.

What are the files' sorting criteria? I'm afraid we're going to lose sync once the deviations occur. What will the maximum count be between lines with identical key pairs?
# 5  
Old 08-10-2012
Yes, you got the requirements right.

I was sorting uniquely on the key columns. For example, with the 3rd and 5th columns as keys, I use a sort command like
Code:
# -t'|' makes sort split fields on the pipe delimiter
sort -t'|' -k3,3 -k5,5 | uniq

Code:
What will the maximum count be between lines with identical key pairs?

I am not sure what the above means.

Before comparing the files with each other, we should check each file on its own. If columns 3 and 5 are the key columns and a file contains the same record several times, like

Code:
[1]|[2]|[3]|[4]|[5]
 A    B    1   C    2
 A    B    1   C    2
 A    B    1   C    2
 D    E    3   F    4

then
Code:
 A    B    1   C    2

should be considered as single record.

and if

Code:
[1]|[2]|[3]|[4]|[5]
 A    B    1   C    2
 A    B    1   D    2
 A    E    1   F    2
 D    E    3   F    4

then records

Code:
 A    B    1   C    2
 A    B    1   D    2
 A    E    1   F    2

should be treated as error records and moved to a .err file.

Let me know if that is what you wanted to ask.

Regards,
Manu
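The two rules above (collapse exact duplicates, reject keys that map to more than one distinct record) could be sketched like this. The function name clean_file and the .clean, .err and .uniq suffixes are assumptions for the example, and the key columns are again hard-coded as 3 and 5.

```shell
#!/bin/sh
# Sketch only: clean_file and the .clean/.err/.uniq suffixes are hypothetical.
# Usage: clean_file FILE   -> writes FILE.clean and FILE.err
clean_file() {
    # Collapse exact duplicate lines first.
    LC_ALL=C sort -u "$1" > "$1.uniq"
    # Make sure both outputs exist even if nothing is routed to one of them.
    : > "$1.clean"
    : > "$1.err"
    # Read the deduplicated file twice.
    awk -F'|' -v good="$1.clean" -v bad="$1.err" '
        # Pass 1: count the distinct lines that share each key.
        NR == FNR { count[$3 FS $5]++; next }
        # Pass 2: a key with more than one distinct line is an error.
        { print > (count[$3 FS $5] > 1 ? bad : good) }
    ' "$1.uniq" "$1.uniq"
    rm -f "$1.uniq"
}
```

Running it on both OLD and NEW before the comparison would leave at most one record per key in each .clean file.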
# 6  
Old 08-10-2012
Quote:
Originally Posted by manubatham20

Code:
What will the maximum count be between lines with identical key pairs?

I am not sure what the above means.
How large will the maximum leap in the keys be? Will key 3, for example, leap from 3 to 5 or to 15 between two adjacent lines?

Quote:
Code:
[1]|[2]|[3]|[4]|[5]
 A    B    1   C    2
 A    B    1   C    2
 A    B    1   C    2
 D    E    3   F    4

This should not occur if you used uniq after sorting.
# 7  
Old 08-10-2012
It depends on the data, and we can't determine that in advance.