How to delete duplicate records based on key


 
# 1  
Old 12-15-2008

For example, suppose I have a file containing:
$cat data
800,2
100,9
700,3
100,9
200,8
100,3

Now I want the output to be:
200,8
700,3
800,2

The key is the first three characters; I don't want any records that have duplicate keys.

Just as I can sort on the key with sort +0.0 -0.3 data, is there a similar way to restrict uniq to the key?
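
If I understand the man page, GNU uniq can do this directly: -w compares only the first N characters and -u keeps only lines that are not repeated under that comparison. So something like this might work (untested, and -w is a GNU extension, so it may not exist on other systems):

Code:
sort data | uniq -u -w 3   # GNU uniq: compare only the first 3 chars, keep non-repeated lines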

The actual file contains more than 3 million records, so I think a line-by-line shell loop would take a lot of processing time. I'm looking for something fast.

Please share your thoughts!

Thanks
Sumit
# 2  
Old 12-15-2008
Code:
awk -F "," ' {
  cnt[$1] ++
  sav[$1] = $0
} 
END {
  for (x in sav)
     if (cnt[x] == 1)
       print sav[x]
}' your-file

If you have enough memory, this should work.
If memory is tight, the following two-pass version may be more useful, since it stores only the per-key counts rather than the saved lines:

Code:
awk -F "," '
NR == FNR {
  cnt[$1] ++
}
NR != FNR {
  if (cnt[$1] == 1)
    print $0
}' your-file your-file
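
If the key is really the first three characters rather than the first comma-separated field, the same two-pass idea should work with substr; an untested sketch:

Code:
awk '
NR == FNR { cnt[substr($0,1,3)]++; next }   # pass 1: count each 3-character key
cnt[substr($0,1,3)] == 1                    # pass 2: print lines whose key is unique
' your-file your-file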


# 3  
Old 12-15-2008
Thanks a lot for your amazing code!

But it only worked for the sample data I gave.

Your first code gives the following error:
awk: 0602-590 Internal software error in the tostring function on

The second code really worked:
it took 3 minutes 16 seconds to process 3,407,871 records.

Really cool! I was breaking my head over the sort and uniq commands!

Once again thank you!

Regards
Sumit
# 4  
Old 12-15-2008
Code:
awk '{ x[substr($0,1,3)]++; y[substr($0,1,3)] = $0 }      # count each 3-char key and save the last line for it
END { for ( n in x ) if ( x[n] == 1 ) print y[n] }' data | sort   # print only unique keys, sorted
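
On the sample data posted above, this should print:

Code:
200,8
700,3
800,2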

# 5  
Old 12-15-2008
Thank you, it also worked!
# 6  
Old 12-15-2008
But can you please explain the code, so that I can understand exactly what it is doing? I really appreciate your help!

Regards
Sumit
# 7  
Old 12-15-2008
Only the second code works; the first and third give this error:
"awk: 0602-590 Internal software error in the tostring function on"

Thanks
sumit