Remove Duplicates on multiple Key Columns and get the Latest Record from Date/Time Column Post: 302799171


Shell Programming and Scripting: Post 302799171 by Don Cragun on Friday 26th of April 2013 01:01:17 AM

04-26-2013


Quote:

Originally Posted by vijaykodukula

Hi Don,

We have multiple files. Each file has a different layout and its own awk command to identify the fields for the key columns. This is one sample.

Thanks a lot in advance for the help.

We are using this for one of the sample files:

sort -r -t'|' -k1 "sample.txt" | awk '!x[$2,$4,$5]++' FS='|' | sort -t'|' -k2

Regards
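To make the quoted pipeline concrete, here is a small trace on made-up data. The layout is an assumption (field 1 is a timestamp that sorts correctly as text, fields 2, 4 and 5 are the key columns); the real sample.txt is not reproduced in this post.

```shell
# Hypothetical sample: field 1 = timestamp, fields 2, 4, 5 = key columns.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
2013-04-25 10:00:00|A|x|K1|P1
2013-04-26 09:00:00|A|y|K1|P1
2013-04-24 08:00:00|B|z|K2|P2
EOF

# 1. Reverse-sort whole lines so the newest timestamp comes first.
# 2. Keep only the first line seen for each (field 2, field 4, field 5) key.
# 3. Sort the surviving lines on field 2 through end of line.
result=$(LC_ALL=C sort -r -t'|' -k1 "$tmp" |
    awk -F'|' '!seen[$2,$4,$5]++' |
    LC_ALL=C sort -t'|' -k2)
printf '%s\n' "$result"
rm -f "$tmp"
```

Note that -k1 and -k2 without an end position mean "from that field to the end of the line", so the first sort only picks the latest record if the timestamp really is the leading field.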

Hi,
Please use CODE tags (rather than the color blue) to display sample input, output, and code.

You didn't really answer my questions. If you're trying to get a single script that can handle any of your files, how will that script know which fields need to be used to select entries, which field will be used to select the latest entry, and which field should be used to sort the output?

Is it always fields with the headers CDC_FLAG and SRC_PMTN_I as keys, always the field with the header CDC_PRCS_TS to select a line out of all lines with matching keys, and always sort the output with CDC_SEQ_I as the primary sort key and CDC_PRCS_TS as the secondary sort key?
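If the answer to that question is yes, one possible approach is to look the columns up by header name instead of hard-coding field numbers. The following is only a sketch under several assumptions that the thread does not confirm: every file has a header line, the delimiter is |, CDC_PRCS_TS values compare correctly as text, CDC_SEQ_I is numeric, and no record contains a tab. The function name dedupe_latest and the sample rows are made up.

```shell
dedupe_latest() {
    # $1 = pipe-delimited file whose first line names the columns
    {
        IFS= read -r header
        printf '%s\n' "$header"          # header passes through unsorted
        awk -F'|' -v header="$header" '
            BEGIN {                      # map column names to positions
                n = split(header, name, "|")
                for (i = 1; i <= n; i++) col[name[i]] = i
            }
            {
                key = $col["CDC_FLAG"] SUBSEP $col["SRC_PMTN_I"]
                ts = $col["CDC_PRCS_TS"]
                # remember the record with the greatest timestamp per key,
                # prefixed with the two output sort keys
                if (!(key in line) || ts > latest[key]) {
                    latest[key] = ts
                    line[key] = $col["CDC_SEQ_I"] "\t" ts "\t" $0
                }
            }
            END { for (key in line) print line[key] }
        ' | LC_ALL=C sort -t "$(printf '\t')" -k1,1n -k2,2 | cut -f3-
    } < "$1"
}

# Made-up input to show the call.
sample=$(mktemp)
cat > "$sample" <<'EOF'
CDC_PRCS_TS|CDC_SEQ_I|CDC_FLAG|SRC_PMTN_I
2013-04-25 10:00:00|2|I|100
2013-04-26 09:00:00|3|I|100
2013-04-24 08:00:00|1|U|200
EOF
out=$(dedupe_latest "$sample")
printf '%s\n' "$out"
rm -f "$sample"
```

The sort keys are prepended by awk and stripped again by cut, so the sort step does not need to know where those columns sit in any particular layout.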

Do you have a file that contains filenames that will be processed and the associated field numbers (or names) from which the above data can be extracted by the script?
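If such a file exists, a single script could be driven from it. The sketch below assumes a hypothetical control file with one line per data file in the form file:key fields:timestamp field:sort field; that format, the function name process_layouts, and the sample data are all invented for illustration. It also assumes file names contain no colons and that timestamps compare correctly as text.

```shell
process_layouts() {
    # $1 = control file; prints the deduplicated records of each listed file
    while IFS=: read -r file keys tsfield sortfield; do
        case $file in ''|'#'*) continue ;; esac   # skip blank and comment lines
        awk -F'|' -v keys="$keys" -v ts="$tsfield" '
            BEGIN { nk = split(keys, k, ",") }
            {
                id = ""                           # build the composite key
                for (i = 1; i <= nk; i++) id = id SUBSEP $(k[i])
                if (!(id in best) || $ts > when[id]) {
                    when[id] = $ts
                    best[id] = $0
                }
            }
            END { for (id in best) print best[id] }
        ' "$file" | LC_ALL=C sort -t'|' -k"$sortfield,$sortfield"
    done < "$1"
}

# Made-up data file and control file.
dir=$(mktemp -d)
cat > "$dir/sample.txt" <<'EOF'
2013-04-25 10:00:00|B|x|K1|P1
2013-04-26 09:00:00|B|y|K1|P1
2013-04-24 08:00:00|A|z|K2|P2
EOF
printf '%s\n' "$dir/sample.txt:2,4,5:1:2" > "$dir/layouts.ctl"
listing=$(process_layouts "$dir/layouts.ctl")
printf '%s\n' "$listing"
rm -rf "$dir"
```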

The sample input file you showed is not sorted in the order that you said should be the output sort order. It seems to be sorted with field 5 as the primary sort key and field 1 as the secondary sort key, not field 2 as the primary sort key. Is this a mistake in your specification or is the script supposed to reorder the output as well as eliminate duplicates?
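A quick way to test a file against a claimed sort order is sort -c, which only checks the order and reports it through its exit status. The sketch below uses a made-up three-line file that is ordered on field 5 then field 1, but not on field 2; the helper name is_sorted_on is hypothetical.

```shell
is_sorted_on() {
    # $1 = file, remaining arguments = sort key options, e.g. -k5,5 -k1,1
    file=$1
    shift
    LC_ALL=C sort -c -t'|' "$@" "$file" 2>/dev/null
}

# Made-up data: ordered by field 5, then field 1.
sample=$(mktemp)
cat > "$sample" <<'EOF'
2013-04-24 08:00:00|B|z|K2|P1
2013-04-25 10:00:00|A|x|K1|P1
2013-04-23 09:00:00|C|y|K1|P2
EOF

if is_sorted_on "$sample" -k5,5 -k1,1; then by_5_then_1=yes; else by_5_then_1=no; fi
if is_sorted_on "$sample" -k2; then by_2=yes; else by_2=no; fi
echo "sorted on fields 5 then 1: $by_5_then_1; sorted on field 2: $by_2"
rm -f "$sample"
```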

Don Cragun



LEARN ABOUT PLAN9

join

JOIN(1) 						      General Commands Manual							   JOIN(1)

NAME

       join - relational database operator

SYNOPSIS

       join [ options ] file1 file2

DESCRIPTION

       Join forms, on the standard output, a join of the two relations specified by the lines of file1 and file2.  If one of the file names is -,
       the standard input is used.

       File1 and file2 must be sorted in increasing ASCII collating sequence on the fields on which they are to be joined, normally the  first	in
       each line.

       There  is  one line in the output for each pair of lines in file1 and file2 that have identical join fields.  The output line normally con-
       sists of the common field, then the rest of the line from file1, then the rest of the line from file2.

       Input fields are normally separated by spaces or tabs; output fields by a space.  In this case, multiple separators count as one, and leading
       separators are discarded.

       The following options are recognized, with POSIX syntax.

       -a n   In addition to the normal output, produce a line for each unpairable line in file n, where n is 1 or 2.

       -v n   Like -a, omitting output for paired lines.

       -e s   Replace empty output fields by string s.

       -1 m
       -2 m   Join on the mth field of file1 or file2.

       -jn m  Archaic equivalent for -n m.

       -ofields
	      Each  output  line  comprises the designated fields.  The comma-separated field designators are either 0, meaning the join field, or
	      have the form n.m, where n is a file number and m is a field number.  Archaic usage allows separate arguments for field designators.

       -tc    Use character c as the only separator (tab character) on input and output.  Every appearance of c in a line is significant.

EXAMPLES

       sort /adm/users | join -t: -a 1 -e "" - bdays
	      Add birthdays to password information, leaving unknown birthdays empty.  The layout of /adm/users is given in users(6); bdays  contains  sorted
	      lines like

       tr : ' ' </adm/users | sort -k 3,3 >temp
       join -1 3 -2 3 -o 1.1,2.1 temp temp | awk '$1 < $2'
	      Print all pairs of users with identical userids.
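
The first example above can be tried on any system with a POSIX join(1). The following sketch uses two small made-up files rather than /adm/users (the names and values are invented), and adds an explicit -o list so that -e has output fields to fill for unpaired lines:

```shell
# Made-up, pre-sorted input files keyed on the first colon-separated field.
dir=$(mktemp -d)
printf '%s\n' 'alice:1001' 'bob:1002' 'carol:1003' > "$dir/users"
printf '%s\n' 'alice:1990-01-01' 'carol:1985-06-15' > "$dir/bdays"

# -a 1 keeps users that have no birthday line; -e "" leaves that field empty.
joined=$(LC_ALL=C join -t: -a 1 -e "" -o 0,1.2,2.2 "$dir/users" "$dir/bdays")
printf '%s\n' "$joined"
rm -rf "$dir"
```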

SOURCE

       /sys/src/cmd/join.c

SEE ALSO

       sort(1), comm(1), awk(1)

BUGS

       With default field separation, the collating sequence is that of sort -b -ky,y; with -t, the sequence is that of sort -tx -ky,y.
       One of the files must be randomly accessible.

																	   JOIN(1)
