File delta detection

08-10-2012

Hello,

I need to compare two flat ASCII files, say OLD and NEW. Both have the same structure: they are |-delimited and hold a few million records (lines) each. Each file has the same set of columns and the same key columns (for example, the 3rd and 5th columns of the record). I need to compare all records in OLD and NEW on the values in the key columns and store the result in a DELTA file:

If a key set is present in NEW but not in OLD, append 'I' to the NEW record and store it in DELTA.
If a key set is present in OLD but not in NEW, append 'D' to the OLD record and store it in DELTA.
If the same key set is present in both OLD and NEW, append 'U' to the NEW record and store it in DELTA.

A record (row) can have any number of fields (columns). The key set will be provided as input to the program.

Please help me on the same.

Thanks,
Manu

manubatham20

08-10-2012

Examples, please. This sounds like something diff could handle perfectly if the lines consisted of keys only. How about cutting the keys out into temp files, diffing those, and then going back to the original files with the result?
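
A minimal sketch of the keys-only idea, assuming the files are |-delimited and fields 3 and 5 are the keys, as in the first post. It uses comm instead of diff, since comm splits two sorted inputs into "only in first", "only in second", and "in both" directly. The function name key_diff and the keys.* output file names are made up for illustration.

```shell
# key_diff OLD NEW
# Hypothetical helper: classifies key pairs (fields 3 and 5) by where they occur
# and writes keys.only_old, keys.only_new and keys.both to the current directory.
key_diff() {
    old_keys=$(mktemp)
    new_keys=$(mktemp)
    # Cut out the key columns and sort them; comm requires sorted input.
    cut -d'|' -f3,5 "$1" | LC_ALL=C sort -u > "$old_keys"
    cut -d'|' -f3,5 "$2" | LC_ALL=C sort -u > "$new_keys"
    LC_ALL=C comm -23 "$old_keys" "$new_keys" > keys.only_old  # keys missing from NEW
    LC_ALL=C comm -13 "$old_keys" "$new_keys" > keys.only_new  # keys missing from OLD
    LC_ALL=C comm -12 "$old_keys" "$new_keys" > keys.both      # keys present in both
    rm -f "$old_keys" "$new_keys"
}
```

The three keys.* files could then drive the lookup back into the original files.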

RudiC

08-10-2012

Hi,

Please see the example below:

Columns 3 and 5 represent the key set.

OLD

Code:

a|ae|1|ecg|10|@jks*|xyz
b|dm|2|bp|20|$5lw!|qwe
c|co|3|mb|30|&dhf!|gfh
d|te|4|value|40|@kdk+|kdd

NEW

Code:

a|ae|1|ecg|10|@jks*|xyz
b|dm|2|bp|20|$lwa!|sdf
d|te|4|value|40|@kdk-|kdd
f|sv|5|jdjd|50|^ghd2|uyy

DELTA

Code:

b|dm|2|bp|20|$lwa!|sdf|u
c|co|3|mb|30|&dhf!|gfh|d
d|te|4|value|40|@kdk-|kdd|u
f|sv|5|jdjd|50|^ghd2|uyy|i

One more condition: records that are identical in OLD and NEW must not be included in the DELTA file (see the first line of both files in the example above).

I tried sorting on the key columns and then finding the differences, but I am not good enough at UNIX.

Thanks,
Manu

manubatham20

08-10-2012

Let me try to paraphrase your requirement:
If lines are identical, skip.
If the lines differ but the keys are the same: output NEW's line to DELTA, adding a "u".
If a key pair occurs in only one file: output NEW's line adding "i", or OLD's line adding "d".

What are the files' sorting criteria? I'm afraid we're going to lose sync once the deviations occur. What will the maximum count be between lines with identical key pairs?
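
One way around the sync problem is not to rely on file order at all. A minimal awk sketch, assuming OLD fits in memory as an associative array (plausible for a few million lines, but not verified here), each key occurs at most once per file, and the key columns are passed as a comma-separated list. The function name delta is hypothetical. It applies the three rules above; the output comes out unsorted.

```shell
# delta OLD NEW KEYS
# Hypothetical helper. KEYS is a comma-separated list of key columns, e.g. 3,5.
# Prints the delta records to stdout.
delta() {
    awk -F'|' -v keys="$3" '
        BEGIN { n = split(keys, col, ",") }

        # Build the composite key of the current record.
        function key(   i, k) {
            k = $(col[1])
            for (i = 2; i <= n; i++) k = k FS $(col[i])
            return k
        }

        # First file (OLD): remember every record under its key.
        FILENAME == ARGV[1] { old[key()] = $0; next }

        # Second file (NEW): classify each record against OLD.
        {
            k = key()
            if (!(k in old))       print $0 FS "i"   # key only in NEW
            else if (old[k] != $0) print $0 FS "u"   # same key, record changed
            delete old[k]                            # key is dealt with
        }

        # Keys still left in old[] never showed up in NEW.
        END { for (k in old) print old[k] FS "d" }
    ' "$1" "$2"
}
```

Called as delta OLD NEW 3,5 > DELTA. Pipe the output through sort if the order of DELTA matters.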

RudiC

08-10-2012

Yes, you got the requirements right.

I was doing a unique sort on the key columns. For example, with the 3rd and 5th columns as keys, I used a sort command like

Code:

sort -t'|' -k3,3 -k5,5 | uniq

Code:

What will the maximum count be between lines with identical key pairs?

I am not sure what the above means.

Before comparing the files with each other, we should check each file on its own. If columns 3 and 5 are the key columns and one file contains the same record several times, like

Code:

[1]|[2]|[3]|[4]|[5]
 A    B    1   C    2
 A    B    1   C    2
 A    B    1   C    2
 D    E    3   F    4

then

Code:

 A    B    1   C    2

should be considered a single record.

and if

Code:

[1]|[2]|[3]|[4]|[5]
 A    B    1   C    2
 A    B    1   D    2
 A    E    1   F    2
 D    E    3   F    4

then the records

Code:

 A    B    1   C    2
 A    B    1   D    2
 A    E    1   F    2

should be treated as error records and moved to a .err file.
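
Both cases could be handled in one pre-processing step per file. A rough sketch, assuming |-delimited input and a comma-separated key list; the function name dedupe and its argument order are invented for this example. sort -u collapses fully identical lines, then awk counts the distinct records per key and routes every key with more than one record to the error file.

```shell
# dedupe FILE KEYS CLEAN ERR
# Hypothetical helper. KEYS is a comma-separated list of key columns, e.g. 3,5.
# Writes unambiguous records to CLEAN and conflicting ones to ERR.
dedupe() {
    tmp=$(mktemp)
    LC_ALL=C sort -u "$1" > "$tmp"   # identical lines collapse into one
    : > "$3"                         # make sure both output files exist,
    : > "$4"                         # even if nothing gets written to them
    awk -F'|' -v keys="$2" -v clean="$3" -v err="$4" '
        BEGIN { n = split(keys, col, ",") }

        # Build the composite key of the current record.
        function key(   i, k) {
            k = $(col[1])
            for (i = 2; i <= n; i++) k = k FS $(col[i])
            return k
        }

        # Pass 1: count the distinct records that share each key.
        pass == 1 { count[key()]++; next }

        # Pass 2: a key with more than one record is a conflict.
        { print > (count[key()] > 1 ? err : clean) }
    ' pass=1 "$tmp" pass=2 "$tmp"
    rm -f "$tmp"
}
```

For example, dedupe OLD 3,5 OLD.clean OLD.err, and likewise for NEW, before comparing the two clean files.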

Let me know if that is what you wanted to ask.

Regards,
Manu

manubatham20

08-10-2012

Quote:

Originally Posted by manubatham20

Code:

What will the maximum count be between lines with identical key pairs?

I am not sure what the above means.

How large will the maximum leap in the keys be? Will, for example, key 3 leap from 3 to 5 or to 15 between two adjacent lines?

Quote:

Code:

[1]|[2]|[3]|[4]|[5]
 A    B    1   C    2
 A    B    1   C    2
 A    B    1   C    2
 D    E    3   F    4

This should not occur if you used uniq after sorting.

RudiC

08-10-2012

It depends on the data, so we can't determine that in advance.

manubatham20
