Removing duplicates


 
# 8  
Old 09-14-2005
Wow. Thanks, guys. I tried Perderabo's solution and it worked perfectly.
I wasn't sure such a simple piece of code would work, but it does. I'm still a little unsure why it works... glad it does, though.

I'll test the other solutions as well, out of curiosity.
Thanks.
Gianni
# 9  
Old 09-14-2005
Quote:
Originally Posted by jim mcnamara
Or an even more cryptic version:
Code:
awk '!x[$1]++' filename > newfile

All this does is create an associative array keyed on $1, the first field in the record. The first time a given key is encountered, x[$1] is zero, so the whole record is printed. When the element is non-zero, we have seen the key before, so the record is not printed.
How cool is awk?!

I've seen this technique before, but thought I would test it on 1 million lines in a data file. It finished in half the time of the sort -mu command. awk also eliminated duplicates 1 million lines apart, as you would expect from the logic; the sort -mu command assumes the file is already sorted, so a duplicate 1 million lines apart is ignored.
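
For anyone else wondering why the one-liner works, here is a long-form sketch that behaves the same way (the expanded braces and comments are mine, not from Jim's post):
Code:
awk '{
    if (x[$1] == 0)    # count is still zero: first time we see this key
        print          # so print the whole record
    x[$1]++            # remember we have seen this key
}' filename > newfile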
# 10  
Old 09-14-2005
I tried the different solutions and the one that comes closest is Perderabo's.
The only time it doesn't work is if there are any blanks in the first set of alphanumerics (which I just found out is possible).
How would I modify any of the above solutions to look at, say, characters 1 through 30 of a 100-character record for exact matches, keep the first occurrence, and remove the rest of the duplicates?

Here are some of the records I found that put me back at square one...

Code:
92247140 1203QA RRN ..
92247140 1203QA RRP ...
92247140 1203QB RRP ...

Do I have to do an awk on this one with substrings? I tested Jim's solution as well, and it was fast... unfortunately it found a few more duplicates than I'd hoped, due to the way the records come in; otherwise I'd use it.

Thanks,
Gianni
# 11  
Old 09-14-2005
Jim's awk solution will work using substring:
Code:
awk '!a[substr($0,1,15)]++' inputfile

and it still runs in 12 seconds on 3.3 million lines of data (your three lines repeated and unsorted).

My test result:
Code:
92247140 1203QA RRN ..
92247140 1203QB RRP ...

If you need to retain the rest of the line for each unique key, the awk script would have to be modified a bit.
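
One possible modification, as a rough sketch of my reading of "retain the rest of the line" (the variable names are mine): keep one output line per 15-character key and append the tail of every duplicate record to it:
Code:
awk '{
    key = substr($0, 1, 15)                # fixed-width key, as above
    if (!(key in rest)) {
        order[++n] = key                   # remember first-seen order
        rest[key] = ""
    }
    rest[key] = rest[key] substr($0, 16)   # collect each record's tail
}
END {
    for (i = 1; i <= n; i++)
        print order[i] rest[order[i]]
}' inputfile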
# 12  
Old 09-14-2005
It might be better to think of your lines in terms of 'fields', in case the fields turn out to vary in length.

Right now all your fields are the same length, and 'substr($0,1,15)' seems to be referring to the first two fields. This is what makes your line/record unique.

If that's the case:
Code:
awk '!a[$1,$2]++' inputfile
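
As a side note, the comma inside the subscript joins $1 and $2 with awk's SUBSEP character (by default "\034"), so the two fields form one composite key instead of an ambiguous concatenated string.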

# 13  
Old 09-14-2005
Quote:
Originally Posted by vgersh99
It might be better to think of your lines in terms of 'fields', in case the fields turn out to vary in length.

Right now all your fields are the same length, and 'substr($0,1,15)' seems to be referring to the first two fields. This is what makes your line/record unique.

If that's the case:
Code:
awk '!a[$1,$2]++' inputfile

Fields 1 and 2 won't work in the OP's case, since the key loses its uniqueness when the character string has no space in it:
Code:
47147140631204DC ADK
47147140631204DC ALK

Quote:
Originally Posted by giannicello
How would I modify any of the above solutions to look at, say, characters 1 thru 30, out of a 100 character record for exact matches and keep first occurrence and remove the rest of the duplicates?
If characters 1 through 30 is a static rule, then you can use substr. If the key length varies, that's another problem.
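
For the static case, Jim's one-liner just needs the longer key substituted in, along these lines (a sketch using the 30-character key from your description):
Code:
awk '!a[substr($0,1,30)]++' inputfile > newfile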