How to delete duplicate records based on key


 
# 1  
Old 12-15-2008

For example, suppose I have a file containing:
$cat data
800,2
100,9
700,3
100,9
200,8
100,3

Now I want the output to be:
200,8
700,3
800,2

The key is the first three characters; I don't want any records that have duplicate keys.

Just as I can sort on the key with sort +0.0 -0.3 data, is there a similar way to restrict uniq to the key?
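
If I understand the man page, GNU uniq can do this directly: -w compares only the first N characters and -u keeps only lines that are not repeated under that comparison. So something like this might work (untested, and -w is a GNU extension, so it may not exist on other systems):

Code:
sort data | uniq -u -w 3   # GNU uniq: compare only the first 3 chars, keep non-repeated lines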

The actual file contains more than 3 million records, so I think a line-by-line shell loop would take a lot of processing time. I'm looking for something fast.

Please share your thoughts!

Thanks
Sumit
# 2  
Old 12-15-2008
Code:
awk -F "," ' {
  cnt[$1] ++
  sav[$1] = $0
} 
END {
  for (x in sav)
     if (cnt[x] == 1)
       print sav[x]
}' your-file

If you have enough memory, this should work.
If memory is tight, the following two-pass version may be more useful, since it stores only the per-key counts rather than the saved lines:

Code:
awk -F "," '
NR == FNR {
  cnt[$1] ++
}
NR != FNR {
  if (cnt[$1] == 1)
    print $0
}' your-file your-file
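
If the key is really the first three characters rather than the first comma-separated field, the same two-pass idea should work with substr; an untested sketch:

Code:
awk '
NR == FNR { cnt[substr($0,1,3)]++; next }   # pass 1: count each 3-character key
cnt[substr($0,1,3)] == 1                    # pass 2: print lines whose key is unique
' your-file your-file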


# 3  
Old 12-15-2008
Thanks a lot for your amazing code!

But it only worked for the sample data I gave.

Your first code gives the following error:
awk: 0602-590 Internal software error in the tostring function on

The second code really worked:
it took 3 minutes 16 seconds to process 3,407,871 records.

Really cool! I was breaking my head over the sort and uniq commands!

Once again thank you!

Regards
Sumit
# 4  
Old 12-15-2008
Code:
awk '{ x[substr($0,1,3)]++; y[substr($0,1,3)] = $0 }      # count each 3-char key and save the last line for it
END { for ( n in x ) if ( x[n] == 1 ) print y[n] }' data | sort   # print only unique keys, sorted
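
On the sample data posted above, this should print:

Code:
200,8
700,3
800,2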

# 5  
Old 12-15-2008
Thank you, it also worked!
# 6  
Old 12-15-2008
But can you please explain the code, so that I can understand exactly what it is doing? I really appreciate your help!

Regards
Sumit
# 7  
Old 12-15-2008
Only the second code works; the first and third give this error:
"awk: 0602-590 Internal software error in the tostring function on"

Thanks
sumit