awk/sed to get unique row

06-30-2011

Hello all,

I have a very large file, almost 25 GB in size.
Each row consists of "|"-delimited columns.

Code:

eg:
1396745|1078529|KDS|2011-04-21 00:00:00.0|1100|30|2|2011-04-20 22:35:24.0|2011-04-20 22:35:24.0|0|2011-04-21 00:00:00.0|1100|2222434|2011-04-21 11:00:00.0|0|0|2011-06-29 00:05:10
1396745|1078529|KDS|2011-04-21 00:00:00.0|1100|30|2|2011-04-20 22:35:24.0|2011-04-20 22:35:24.0|0|2011-04-21 00:00:00.0|1100|2222434|2011-04-21 11:00:00.0|0|0|2011-06-29 00:20:10

The combination of col1 and col2 is the key.

I need one unique row for each combination of these two columns.

Code:

eg:

1396745|1078529|KDS|2011-04-21 00:00:00.0|1100|30|2|2011-04-20 22:35:24.0|2011-04-20 22:35:24.0|0|2011-04-21 00:00:00.0|1100|2222434|2011-04-21 11:00:00.0|0|0|2011-06-29 00:20:10

I also need the row with the highest timestamp (the last column).

I don't want to load a 25 GB file with duplicates into the DB,
so please suggest an awk/sed command to remove the duplicates.

Thanks

posner

06-30-2011

Is the data ordered by the key values (the first and the second field) and the timestamp in the last column?

In this case something like this should work:

Code:

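awk -F\| 'END {
  # print the last row of the final key group
  if (NR)
    print prev
  }
# a new key starts here, so prev holds the last row of the previous key
!key[$1, $2]++ && NR > 1 {
  print prev
  }
{ prev = $0 }' infile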
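
If the file is not already ordered, one option is to sort it first and then keep the last row of each key group. The following is only a sketch under assumptions: the function name last_per_key and the four-column sample rows are made up for illustration, and the timestamp column number is passed as the first argument (it would be 17 for the rows in the first post). sort spills to temporary files, so it should not need to hold the whole 25 GB in memory.

```shell
# Hypothetical helper: $1 is the timestamp column number; any further
# arguments are input files (stdin is read when none are given).
last_per_key() {
  ts_col=$1
  shift
  # Order by key (fields 1 and 2), then by timestamp within each key.
  LC_ALL=C sort -t'|' -k1,1 -k2,2 -k"$ts_col","$ts_col" "$@" |
  awk -F'|' '
    # The key changed, so prev is the last (latest) row of the old key.
    NR > 1 && ($1 FS $2) != prevkey { print prev }
    { prevkey = $1 FS $2; prev = $0 }
    END { if (NR) print prev }'
}

# Made-up rows, deliberately out of order.
printf '%s\n' \
  '1|2|KDS|2011-06-29 00:20:10' \
  '3|4|KDS|2011-06-29 00:05:10' \
  '1|2|KDS|2011-06-29 00:05:10' | last_per_key 4
# prints:
# 1|2|KDS|2011-06-29 00:20:10
# 3|4|KDS|2011-06-29 00:05:10
```

Because only the previous key and row are remembered, the awk step uses constant memory no matter how many distinct keys there are.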

Last edited by radoulov; 06-30-2011 at 11:13 AM..

radoulov

06-30-2011

Code:

awk -F'|' '!(key[$1,$2]){print;key[$1,$2]=1}' yourfile
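
A filter like this keeps whichever row of a key comes first in the file, so with timestamps in ascending order it keeps the oldest row for each key. A possible workaround, assuming GNU tac is available and that later rows always carry later timestamps, is to read the file in reverse so that the first occurrence seen is the newest row. The function name last_seen_per_key and the sample rows are made up for illustration.

```shell
# Hypothetical helper: arguments are input files (stdin when none given).
last_seen_per_key() {
  # tac (GNU coreutils) emits the last line first, so the first row awk
  # sees for each key is the last one in the original order.
  tac "$@" | awk -F'|' '!key[$1, $2]++'
}

# Made-up rows with ascending timestamps per key.
printf '%s\n' \
  '1|2|KDS|2011-06-29 00:05:10' \
  '3|4|KDS|2011-06-29 00:05:10' \
  '1|2|KDS|2011-06-29 00:20:10' | last_seen_per_key
# prints:
# 1|2|KDS|2011-06-29 00:20:10
# 3|4|KDS|2011-06-29 00:05:10
```

The output comes out in reverse file order, and the awk array still holds one entry per distinct key.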

Last edited by pludi; 06-30-2011 at 11:10 AM.. Reason: correction

pludi

06-30-2011

Quote:

Originally Posted by radoulov

Is the data ordered by the key values (the first and the second field) and the timestamp in the last column?

No, the first two columns are not sorted.
The timestamp (last) column will most likely be sorted, but if picking the highest timestamp makes the script much longer, we can drop that requirement.

posner

06-30-2011

Well,
ignoring the timestamp in the last column:

Code:

awk -F\| '!key[$1, $2]++' infile
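
To show what this one-liner keeps, here is a small sketch on made-up rows (the function name first_per_key is only for illustration). The expression key[$1, $2]++ is 0 the first time a key appears, so the negation is true and awk's default action prints the row; every later row with the same key is skipped.

```shell
# Hypothetical wrapper: arguments are input files (stdin when none given).
first_per_key() {
  # Post-increment returns the old count: 0 (false) on the first sighting.
  awk -F'|' '!key[$1, $2]++' "$@"
}

printf '%s\n' \
  '1|2|KDS|2011-06-29 00:05:10' \
  '1|2|KDS|2011-06-29 00:20:10' \
  '3|4|KDS|2011-06-29 00:05:10' | first_per_key
# prints:
# 1|2|KDS|2011-06-29 00:05:10
# 3|4|KDS|2011-06-29 00:05:10
```

Rows come out in input order, and the array grows by one entry per distinct key, which is worth keeping in mind for a 25 GB input.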

radoulov

06-30-2011

And this should handle the timestamp too:

Code:

awk -F\| 'END {
  for (R in rec)
    print rec[R]
  }
# keep the row with the greatest last field seen so far for each key
$NF > max[$1, $2] {
  max[$1, $2] = $NF
  rec[$1, $2] = $0
  }' infile
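
A small sketch of the same idea on made-up rows, in case it helps to see it run. The function name latest_per_key is hypothetical. The comparison works as a plain string comparison because a timestamp written as YYYY-MM-DD hh:mm:ss sorts chronologically, and because awk does not define the order of for (k in rec), the sketch pipes the result through sort to get a repeatable order.

```shell
# Hypothetical wrapper: arguments are input files (stdin when none given).
latest_per_key() {
  awk -F'|' '
    # An unseen key has an empty max, so its first row always wins.
    $NF > max[$1, $2] {
      max[$1, $2] = $NF
      rec[$1, $2] = $0
    }
    END { for (k in rec) print rec[k] }
  ' "$@" | LC_ALL=C sort -t'|' -k1,1 -k2,2
}

printf '%s\n' \
  '3|4|KDS|2011-06-29 00:05:10' \
  '1|2|KDS|2011-06-29 00:20:10' \
  '1|2|KDS|2011-06-29 00:05:10' | latest_per_key
# prints:
# 1|2|KDS|2011-06-29 00:20:10
# 3|4|KDS|2011-06-29 00:05:10
```

This version does not need sorted input, but it holds one full row per distinct key in memory until the end.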

Last edited by radoulov; 06-30-2011 at 11:23 AM.. Reason: Refactoring.

radoulov

06-30-2011

Quote:

Originally Posted by pludi

Code:

awk -F'|' '!(key[$1,$2]){print;key[$1,$2]=1}' yourfile

Thanks, pludi.
One question: does this keep the row with the higher timestamp?

posner
