The UNIX and Linux Forums  

  #8  
Old 09-14-2005
Registered User
 

Join Date: Sep 2001
Location: Phoenix
Posts: 76
Wow. Thanks guys. I tried Perderabo's solution and it worked perfectly.
I wasn't sure code that simple would work, but it does. I'm glad it does, though I'm still not sure why.

I'll test the other solutions as well, out of curiosity.
Thanks.
Gianni
  #9  
Old 09-14-2005
Registered User
 

Join Date: Jan 2005
Posts: 682
Quote:
Originally Posted by jim mcnamara
Or an even more cryptic version:
Code:
awk '!x[$1]++' filename > newfile
All this does is build an associative array keyed on $1, the first field in the record. The first time a key is encountered its array element is zero (unset), so !x[$1] is true and awk prints the whole record; the ++ then bumps the count. If the element is not zero we have seen the key before, so the record is not printed.
How cool is awk?!
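For anyone wondering why the one-liner works, here is a long-hand sketch of the same logic. The sample records, the function name `dedup_by_first_field`, and the array name `seen` are made up for illustration; they are not from the thread.

```shell
# Sample input (made up for illustration): the key is the first field.
input='A100 first
B200 second
A100 third
C300 fourth
B200 fifth'

# Long-hand equivalent of: awk '!x[$1]++'
dedup_by_first_field() {
  awk '{
    if (seen[$1] == 0) {   # an unset array element compares equal to 0
      print                # first time this key appears: keep the record
    }
    seen[$1]++             # count the key so later copies are skipped
  }'
}

result=$(printf '%s\n' "$input" | dedup_by_first_field)
printf '%s\n' "$result"
```

The input does not need to be sorted, and the surviving records stay in their original order, because the array remembers every key seen so far.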

I've seen this technique before but thought I would test it on 1 million lines in a data file. It finished in half the time of the sort -mu command. awk also eliminated duplicates 1 million lines apart, as you would expect based on the logic. The sort -mu command assumes that the file is already sorted, so a duplicate 1 million lines apart is not caught.
  #10  
Old 09-14-2005
Registered User
 

Join Date: Sep 2001
Location: Phoenix
Posts: 76
I tried the different solutions and the one that comes closest is Perderabo's.
The only time it doesn't work is if there are any blanks in the first set of alphanumerics (which I just found out is possible).
How would I modify any of the above solutions to look at, say, characters 1 thru 30, out of a 100 character record for exact matches and keep first occurrence and remove the rest of the duplicates?

Here are some records I found that put me back at square one...

92247140 1203QA RRN ..
92247140 1203QA RRP ...
92247140 1203QB RRP ...

Do I have to do an awk on this one with substrings? I tested Jim's solution also and it was fast. Unfortunately it found a few more dups than I'd hoped, due to the way the records come in; otherwise I'd use it.

Thanks,
Gianni
  #11  
Old 09-14-2005
Registered User
 

Join Date: Jan 2005
Posts: 682
Jim's awk solution will work using substring:
Code:
awk '!a[substr($0,1,15)]++' inputfile
and it still runs in 12 seconds on 3.3 million lines of data (your three lines repeated and unsorted).

My test result:
Code:
92247140 1203QA RRN ..
92247140 1203QB RRP ...
If you need to retain the rest of the line for each unique key, the awk script would have to be modified a bit.
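As a rough sketch of how the same one-liner could handle a different key width (such as the 30 characters asked about earlier), the width can be passed in with -v instead of being hard-coded. The function name `dedup_by_prefix` and the array name `seen` are made up for illustration; the sample records are the three posted above.

```shell
# Keep the first record for each distinct key, where the key is the
# leading n characters of the line. n is passed in with -v.
dedup_by_prefix() {
  awk -v n="$1" '!seen[substr($0, 1, n)]++'
}

# The three sample records posted earlier in the thread.
input='92247140 1203QA RRN ..
92247140 1203QA RRP ...
92247140 1203QB RRP ...'

result=$(printf '%s\n' "$input" | dedup_by_prefix 15)
printf '%s\n' "$result"
```

Assuming the key really is a fixed 30 characters, the same function would be called as `dedup_by_prefix 30 < inputfile > newfile`.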
  #12  
Old 09-14-2005
vgersh99
Moderator
 

Join Date: Feb 2005
Location: Boston, MA
Posts: 3,029
It might be better to think of your lines in terms of 'fields', in case your 'fields' ever vary in length.

Right now all your fields are of the same length and 'substr($0,1,15)' seems to be referring to the first two fields. This is what makes your line/record unique.

If that's the case:
Code:
awk '!a[$1,$2]++' inputfile
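A quick sketch of what the two-field key does, run against the three sample records posted earlier. In awk, `a[$1,$2]` is shorthand for `a[$1 SUBSEP $2]`, so the two fields are joined with a separator character before being used as one array key.

```shell
# The three sample records posted earlier in the thread.
input='92247140 1203QA RRN ..
92247140 1203QA RRP ...
92247140 1203QB RRP ...'

# a[$1,$2] is shorthand for a[$1 SUBSEP $2]; SUBSEP is a non-printing
# separator, so the fields "ab","c" and "a","bc" give different keys.
result=$(printf '%s\n' "$input" | awk '!a[$1,$2]++')
printf '%s\n' "$result"
```

On these records the result matches the substr version, since the first 15 characters are exactly fields 1 and 2.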
  #13  
Old 09-14-2005
Registered User
 

Join Date: Jan 2005
Posts: 682
Quote:
Originally Posted by vgersh99
It might be better to think of your lines in terms of 'fields', in case your 'fields' ever vary in length.

Right now all your fields are of the same length and 'substr($0,1,15)' seems to be referring to the first two fields. This is what makes your line/record unique.

If that's the case:
Code:
awk '!a[$1,$2]++' inputfile
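Fields 1 and 2 won't work in the OP's case, because the key can also arrive as a character string without the space. Then the whole key lands in $1, $2 becomes the next column, and records with the same key no longer look like duplicates: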
Code:
47147140631204DC ADK
47147140631204DC ALK
Quote:
Originally Posted by giannicello
How would I modify any of the above solutions to look at, say, characters 1 thru 30, out of a 100 character record for exact matches and keep first occurrence and remove the rest of the duplicates?
If characters 1 through 30 is a static rule then you can use substr. If the key length varies then that's another problem.
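To make the difference concrete, here is a small sketch comparing the two keys on the two no-space records above. It assumes the key is the leading 15 characters, as in the earlier substr example; the variable names are made up for illustration.

```shell
# The two records with no space inside the key.
input='47147140631204DC ADK
47147140631204DC ALK'

# Field-based key: $2 is ADK vs ALK, so both records look unique.
by_fields=$(printf '%s\n' "$input" | awk '!a[$1,$2]++')

# Fixed-width key (assumed to be 15 characters): both records share
# the same leading 15 characters, so only the first one is kept.
by_substr=$(printf '%s\n' "$input" | awk '!a[substr($0,1,15)]++')

printf 'by fields:\n%s\n' "$by_fields"
printf 'by substr:\n%s\n' "$by_substr"
```

The field-based version keeps both lines, while the fixed-width version keeps only the first.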