awk/sed to get unique row

Hello all,

I have a very large file, almost 25 GB in size.
Each row consists of "|"-delimited columns:

1396745|1078529|KDS|2011-04-21 00:00:00.0|1100|30|2|2011-04-20 22:35:24.0|2011-04-20 22:35:24.0|0|2011-04-21 00:00:00.0|1100|2222434|2011-04-21 11:00:00.0|0|0|2011-06-29 00:05:10
1396745|1078529|KDS|2011-04-21 00:00:00.0|1100|30|2|2011-04-20  22:35:24.0|2011-04-20 22:35:24.0|0|2011-04-21 00:00:00.0|1100|2222434|2011-04-21 11:00:00.0|0|0|2011-06-29 00:20:10

The combination of col1 and col2 is the key.

I need one unique row per key, based on these two columns:


1396745|1078529|KDS|2011-04-21 00:00:00.0|1100|30|2|2011-04-20 22:35:24.0|2011-04-20 22:35:24.0|0|2011-04-21 00:00:00.0|1100|2222434|2011-04-21 11:00:00.0|0|0|2011-06-29 00:20:10

I also need the row with the higher timestamp (the last column).

I don't want to load a 25 GB file with duplicates into the DB,
so please suggest an awk/sed command to remove the duplicates.

Is the data ordered by the key values (the first and the second field) and the timestamp in the last column?

In this case something like this should work:

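awk -F\| 'END {
  # print the last row of the final key group
  if (prev)
    print prev
  }
# first row of a new key: print the last row of the previous group
!key[$1, $2]++ && NR > 1 {
  print prev
  prev = x
  }
{ prev = $0 }' infile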
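
If the input really is sorted by key and then by timestamp, the same idea can be sketched without the key array, which may matter for a 25 GB file because nothing but the previous row is held in memory. This is only an illustration: the three short rows are made-up sample data, and the variable names k1, k2 and result are arbitrary.

```shell
# Made-up sample rows, already sorted by key (fields 1 and 2) and then by timestamp.
result=$(printf '%s\n' \
  'A|1|x|2011-06-29 00:05:10' \
  'A|1|x|2011-06-29 00:20:10' \
  'B|2|y|2011-06-29 00:01:00' |
awk -F'|' '
  # key changed: emit the last row of the previous group
  NR > 1 && ($1 != k1 || $2 != k2) { print prev }
  # remember the current row and its key
  { k1 = $1; k2 = $2; prev = $0 }
  # emit the last row of the final group
  END { if (NR) print prev }
')
printf '%s\n' "$result"
```

On this sample it keeps the 00:20:10 row for key A|1 and the single row for key B|2. It gives wrong results on unsorted input, since it only compares each row with the one directly before it.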

Last edited by radoulov; 06-30-2011 at 11:13 AM..
awk -F'|' '!(key[$1$2]){print;key[$1$2]=1}' yourfile
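
One thing to watch with $1$2 as the array index: the two fields are joined with nothing in between, so different key pairs can produce the same string. A small check with two made-up rows (12|3 and 1|23) shows the difference from the comma form $1, $2, which joins the fields with awk's SUBSEP character. The variable names are arbitrary.

```shell
# Two made-up rows whose keys differ but concatenate to the same string "123".
rows=$(printf '%s\n' '12|3|first' '1|23|second')

# Index built by plain concatenation: both rows map to "123".
concat=$(printf '%s\n' "$rows" | awk -F'|' '!(key[$1$2]){print;key[$1$2]=1}' | wc -l | tr -d ' ')

# Index built with the comma: fields are joined with SUBSEP and stay distinct.
subsep=$(printf '%s\n' "$rows" | awk -F'|' '!key[$1, $2]++' | wc -l | tr -d ' ')

printf 'concatenated key: %s row(s), SUBSEP key: %s row(s)\n' "$concat" "$subsep"
```

With fixed-width numeric IDs like the ones in the sample rows this may never happen in practice, but the comma form costs nothing and avoids the question.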

Last edited by pludi; 06-30-2011 at 11:10 AM.. Reason: correction
Originally Posted by radoulov
Is the data ordered by the key values (the first and the second field) and the timestamp in the last column?
No, the first two columns are not sorted.
The timestamp (last) column will most likely be sorted, but if picking the higher timestamp makes the script much longer, we can drop that requirement.
Ignoring the timestamp in the last column:

awk -F\| '!key[$1, $2]++' infile

And this should handle the timestamp too:

awk -F\| 'END {
  for (R in rec)
    print rec[R]
  }
# keep the row with the latest timestamp (last field) for each key
$NF > max[$1, $2] { 
    max[$1, $2] = $NF
    rec[$1, $2] = $0
    }' infile
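
A quick way to try this approach is to feed it a few unsorted rows. The sketch below uses made-up sample data and pipes the output through sort only to make the result order predictable, since the order of for (R in rec) is unspecified in awk. The comparison is a plain string comparison, which should be sufficient as long as every timestamp has the same fixed-width YYYY-MM-DD HH:MM:SS layout.

```shell
# Made-up sample rows, deliberately not sorted by key or timestamp.
result=$(printf '%s\n' \
  'A|1|x|2011-06-29 00:20:10' \
  'B|2|y|2011-06-29 00:01:00' \
  'A|1|x|2011-06-29 00:05:10' |
awk -F'|' '
  # string comparison works here because the timestamps are fixed-width
  $NF > max[$1, $2] { max[$1, $2] = $NF; rec[$1, $2] = $0 }
  END { for (k in rec) print rec[k] }
' | sort)
printf '%s\n' "$result"
```

Note that this keeps one full row per distinct key in memory until the end, so on a 25 GB file the memory use depends on how many distinct keys there are.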

Last edited by radoulov; 06-30-2011 at 11:23 AM.. Reason: Refactoring.
Originally Posted by pludi
awk -F'|' '!(key[$1$2]){print;key[$1$2]=1}' yourfile

Thanks, pludi.
One question: does this give the higher timestamp?
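
One way to check is to run the one-liner on two rows that share a key. The rows below are shortened, made-up versions of the sample rows (only the key, one column and the final timestamp), and first_kept is an arbitrary variable name.

```shell
# Two shortened rows with the same key; the older timestamp comes first.
first_kept=$(printf '%s\n' \
  '1396745|1078529|KDS|2011-06-29 00:05:10' \
  '1396745|1078529|KDS|2011-06-29 00:20:10' |
awk -F'|' '!(key[$1$2]){print;key[$1$2]=1}')
printf '%s\n' "$first_kept"
```

The one-liner prints a row the first time its key is seen and skips every later row with that key, so here it keeps the 00:05:10 row. It would return the higher timestamp only if the newest row for each key happened to come first in the file.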
