Remove Duplicates on multiple Key Columns and get the Latest Record from Date/Time Column

04-23-2013

Registered User

2, 0

Join Date: Apr 2013

Last Activity: 26 April 2013, 12:03 AM EDT

Posts: 2

Thanks Given: 0

Thanked 0 Times in 0 Posts

Remove Duplicates on multiple Key Columns and get the Latest Record from Date/Time Column

Hi Experts ,

we have a CDC file where we need to get the latest record of the Key columns
Key Columns will be CDC_FLAG and SRC_PMTN_I
and fetch the latest record from the CDC_PRCS_TS

Can we do it with a single awk command.
Please help.

Code:

CDC_PRCS_TS|CDC_SEQ_I|CDC_FLAG|CDC_UPDT_USER|SRC_PMTN_I|TGT_PMTN_I|PMTN_N|PMTN_DESC_T
2013-03-27 10:32:30|0|I|NOT SET   |124|215|PROO-2:PMI PROJECT|INITIAL PHASE
2013-03-27 10:32:31|0|I|NOT SET   |124|215|PROO-2:PMI PROJECT|INITIAL PHASE
2013-03-27 10:32:32|0|I|NOT SET   |124|215|PRMO-2:PMI PROJECT|INITIAL PHASE
2013-03-27 10:32:33|0|I|NOT SET   |124|215|PROO-2:PMI PROJECT|INITIAL PHASE
2013-03-27 10:32:30|0|U|NOT SET   |125|216|PROO-2:PMI PROJECT|INITIAL PHASE
2013-03-27 10:32:31|0|U|NOT SET   |125|216|PROO-2:PMI PROJECT|INITIAL PHASE
2013-03-27 10:32:32|0|U|NOT SET   |125|216|PROO-2:PMI PROJECT|INITIAL PHASE
2013-03-27 10:32:33|0|U|NOT SET   |125|216|PROO-2:PMI PROJECT|INITIAL PHASE
2013-03-27 10:32:30|0|U|NOT SET   |125|216|PROO-2:PMI PROJECT|INITIAL PHASE
2013-03-27 10:32:30|0|I|NOT SET   |126|217|PROO-2:PMI PROJECT|INITIAL PHASE
2013-03-27 10:32:30|0|U|NOT SET   |127|768|MDS PROJECT|UAT PHASE
2013-03-27 10:32:30|0|U|NOT SET   |128|454|MDS PROJECT|UAT PHASE
2013-03-27 10:32:30|0|U|NOT SET   |129|234|PROJECT|UAT PHASE
2013-03-27 10:32:30|0|D|NOT SET   |130|123|PROMO-2:PMI PROJECT|INITIAL PHASE
2013-03-27 10:32:30|0|D|NOT SET   |131|212|PROO-2:PMI PROJECT|INITIAL PHASE
2013-03-27 10:32:30|0|D|NOT SET   |132|213|PROO-2:PMI PROJECT|INITIAL PHASE

Last edited by Franklin52; 04-26-2013 at 03:27 AM.. Reason: Please use code tags

vijaykodukula

View Public Profile for vijaykodukula

Find all posts by vijaykodukula

04-25-2013

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

Can we assume that CDC_FLAG will always be the 3rd field, SRC_PMTN_I will always be the 5th field, and CDC_PRCS_TS will always be the 1st field; or do we have to match the strings against the header line to determine which fields to use?

Do the output records need to be in the same order as they appeared in the input file or can the output be in random order except that the 1st output line must be the 1st input line (the headings)?

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

04-26-2013

Registered User

2, 0

Join Date: Apr 2013

Last Activity: 26 April 2013, 12:03 AM EDT

Posts: 2

Thanks Given: 0

Thanked 0 Times in 0 Posts

Hi Don,

We have multiple files .Each file has different layout and has its own awk command to identify the fields for the Key columns.This is one sample.

Advance thanks a lot for the help.

we are using this for one of the sample file.

sort -r -t'|' -k1 "sample.txt"|awk '!x[$2,$4,$5]++' FS='|'|sort -t'|' -k2

Regards

vijaykodukula

View Public Profile for vijaykodukula

Find all posts by vijaykodukula

04-26-2013

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

Quote:

Originally Posted by vijaykodukula

Hi,
Please use CODE tags (rather than the color blue) to display sample input, output, and code.

You didn't really answer my questions. If you're trying to get a single script that can handle any of your files, how will that script know which fields need to be used to select entries, which field will be used to select the latest entry, and which field should be used to sort the output?

Is it always fields with the headers CDC_FLAG and SRC_PMTN_I as keys, always the field with the header CDC_PRCS_TS to select a line out of all lines with matching keys, and always sort the output with CDC_SEQ_I as the primary sort key and CDC_PRCS_TS as the secondary sort key?

Do you have a file that contains filenames that will be processed and the associated field numbers (or names) from which the above data can be extracted by the script?

The sample input file you showed is not sorted in the order that you said should be the output sort order. It seems to be sorted with field 5 as the primary sort key and field 1 as the secondary sort key, not field 2 as the primary sort key. Is this a mistake in your specification or is the script supposed to reorder the output as well as eliminate duplicates?

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

Shell Programming and Scripting

Remove Duplicates on multiple Key Columns and get the Latest Record from Date/Time Column

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Remove duplicates in a dataframe (table) keeping all the different cells of just one of the columns

Discussion started by: pedro88

2. Shell Programming and Scripting

awk to Sum columns when other column has duplicates and append one column value to another with Care

Discussion started by: as7951

3. UNIX for Beginners Questions & Answers

Sort and remove duplicates in directory based on first 5 columns:

Discussion started by: gnnsprapa

4. UNIX for Dummies Questions & Answers

Display latest record from file based on multiple columns combination

Discussion started by: tmalik79

5. Shell Programming and Scripting

Remove the time from the date column

Discussion started by: villain41

6. Shell Programming and Scripting

Removing duplicates in fixed width file which has multiple key columns

Discussion started by: saj

7. Shell Programming and Scripting

finding duplicates in csv based on key columns

Discussion started by: baskivs

8. Shell Programming and Scripting

need to remove duplicates based on key in first column and pattern in last column

Discussion started by: script_op2a

9. Shell Programming and Scripting

Search based on 1,2,4,5 columns and remove duplicates in the same file.

Discussion started by: onesuri

10. Shell Programming and Scripting

Remove duplicates based on the two key columns

Discussion started by: kmsekhar