Removing duplicates


 
# 8  
Old 09-14-2005
Wow. Thanks, guys. I tried Perderabo's solution and it worked perfectly.
I wasn't sure such a simple piece of code would work, but it does. I'm still a little unsure why it works... glad it does, though.

I'll test the other solutions as well, out of curiosity.
Thanks.
Gianni
# 9  
Old 09-14-2005
Quote:
Originally Posted by jim mcnamara
Or an even more cryptic version:
Code:
awk '!x[$1]++' filename > newfile

All this does is create an associative array keyed on $1, the first field in the record. The first time a given key is encountered, x[$1] is zero, so the whole record is printed. When the element is non-zero, we have seen the key before, so the record is not printed.
How cool is awk?!

I've seen this technique before, but thought I would test it on 1 million lines in a data file. It finished in half the time of the sort -mu command. awk also eliminated duplicates 1 million lines apart, as you would expect from the logic; the sort -mu command assumes the file is already sorted, so a duplicate 1 million lines apart is ignored.
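
For anyone else wondering why the one-liner works, here is a long-form sketch that behaves the same way (the expanded braces and comments are mine, not from Jim's post):
Code:
awk '{
    if (x[$1] == 0)    # count is still zero: first time we see this key
        print          # so print the whole record
    x[$1]++            # remember we have seen this key
}' filename > newfile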
# 10  
Old 09-14-2005
I tried the different solutions and the one that comes closest is Perderabo's.
The only time it doesn't work is if there are any blanks in the first set of alphanumerics (which I just found out is possible).
How would I modify any of the above solutions to look at, say, characters 1 through 30 of a 100-character record for exact matches, keep the first occurrence, and remove the rest of the duplicates?

Here are some of the records I found that put me back at square one...

Code:
92247140 1203QA RRN ..
92247140 1203QA RRP ...
92247140 1203QB RRP ...

Do I have to do an awk on this one with substrings? I tested Jim's solution as well, and it was fast... unfortunately it found a few more duplicates than I'd hoped, due to the way the records come in; otherwise I'd use it.

Thanks,
Gianni
# 11  
Old 09-14-2005
Jim's awk solution will work using substring:
Code:
awk '!a[substr($0,1,15)]++' inputfile

and it still runs in 12 seconds on 3.3 million lines of data (your three lines repeated and unsorted).

My test result:
Code:
92247140 1203QA RRN ..
92247140 1203QB RRP ...

If you need to retain the rest of the line for each unique key, the awk script would have to be modified a bit.
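
One possible modification, as a rough sketch of my reading of "retain the rest of the line" (the variable names are mine): keep one output line per 15-character key and append the tail of every duplicate record to it:
Code:
awk '{
    key = substr($0, 1, 15)                # fixed-width key, as above
    if (!(key in rest)) {
        order[++n] = key                   # remember first-seen order
        rest[key] = ""
    }
    rest[key] = rest[key] substr($0, 16)   # collect each record's tail
}
END {
    for (i = 1; i <= n; i++)
        print order[i] rest[order[i]]
}' inputfile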
# 12  
Old 09-14-2005
It might be better to think of your lines in terms of 'fields', in case the fields turn out to vary in length.

Right now all your fields are the same length, and 'substr($0,1,15)' seems to be referring to the first two fields. This is what makes your line/record unique.

If that's the case:
Code:
awk '!a[$1,$2]++' inputfile
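
As a side note, the comma inside the subscript joins $1 and $2 with awk's SUBSEP character (by default "\034"), so the two fields form one composite key instead of an ambiguous concatenated string.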

# 13  
Old 09-14-2005
Quote:
Originally Posted by vgersh99
It might be better to think of your lines in terms of 'fields', in case the fields turn out to vary in length.

Right now all your fields are the same length, and 'substr($0,1,15)' seems to be referring to the first two fields. This is what makes your line/record unique.

If that's the case:
Code:
awk '!a[$1,$2]++' inputfile

Fields 1 and 2 won't work in the OP's case, since the key loses its uniqueness when the character string has no space in it:
Code:
47147140631204DC ADK
47147140631204DC ALK

Quote:
Originally Posted by giannicello
How would I modify any of the above solutions to look at, say, characters 1 thru 30, out of a 100 character record for exact matches and keep first occurrence and remove the rest of the duplicates?
If characters 1 through 30 is a static rule, then you can use substr. If the key length varies, that's another problem.
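
For the static case, Jim's one-liner just needs the longer key substituted in, along these lines (a sketch using the 30-character key from your description):
Code:
awk '!a[substr($0,1,30)]++' inputfile > newfile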