Removing Dupes from huge file- awk/perl/uniq


 
# 15  
Old 04-14-2012
Thanks, that is very useful to know: this cannot be relied upon to work on any system. So performing a stable sort appears to be an extension and not standard behavior. I guess a possible indicator might then be whether a particular sort supports a "stable sort" option in the first place. For example, if this option is called "-s", then if this works:

Code:
sort -st, -k1,3 infile | awk -F, '{n=$1 FS $2 FS $3}p!=n;{p=n}'

then probably this works too:
Code:
sort -ut, -k1,3 infile



I checked some man pages; on systems that provide a stable "-u", it says:
Code:
-u      with -c, check for strict ordering; without -c, output only the first of an equal run

instead of something like this:
Code:
-u      [..] If used with the -c option, check that there are no lines with duplicate keys, in addition to checking that the input file is sorted.

..
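A quick way to check which behavior a given implementation has (using made-up sample data, keyed on the first three comma-separated fields) might be:

Code:
printf 'a,1,x,FIRST\na,1,x,SECOND\n' | sort -ut, -k1,3

If it prints the FIRST line, that sort's "-u" keeps the first line of an equal run; if it prints SECOND (or varies between runs), "-u" only guarantees some line of each run.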
# 16  
Old 04-14-2012
Quote:
Originally Posted by Scrutinizer
Thanks, that is very useful to know: this cannot be relied upon to work on any system.
Just tested it on a Solaris 10 box with the same results...looks like the legacy UNIX offerings don't have a stable sort whereas Linux has it.
Quote:
Originally Posted by Scrutinizer
So performing a stable sort appears to be an extension and not standard behavior. I guess a possible indicator might then be whether a particular sort supports a "stable sort" option in the first place. For example, if this option is called "-s", then if this works:

Code:
sort -st, -k1,3 infile | awk -F, '{n=$1 FS $2 FS $3}p!=n;{p=n}'

then probably this works too:
Code:
sort -ut, -k1,3 infile

None of the legacy UNIX flavors (HP-UX, Solaris, or AIX) has the "-s" switch, so it looks like Linux may have optimized it to spit out the first record of a group.
Quote:
Originally Posted by Scrutinizer
I checked some man pages; on systems that provide a stable "-u", it says:
Code:
-u      with -c, check for strict ordering; without -c, output only the first of an equal run

That must only be on Linux systems...
Quote:
Originally Posted by Scrutinizer
instead of something like this:
Code:
-u      [..] If used with the -c option, check that there are no lines with duplicate keys, in addition to checking that the input file is sorted.

..
Yes, that is pretty much the case on all of AIX, Solaris, and HP-UX...so it looks like sort will break if the code is to be deployed on different platforms...unless it is different flavors of Linux.
# 17  
Old 04-14-2012
Quote:
Originally Posted by shamrock
Just tested it on a Solaris 10 box with the same results...looks like the legacy UNIX offerings don't have a stable sort whereas Linux has it.

None of the legacy UNIX flavors (HP-UX, Solaris, or AIX) has the "-s" switch, so it looks like Linux may have optimized it to spit out the first record of a group.

That must only be on Linux systems...

Yes, that is pretty much the case on all of AIX, Solaris, and HP-UX...so it looks like sort will break if the code is to be deployed on different platforms...unless it is different flavors of Linux.
Not just Linux systems, but any system that uses GNU sort, so also FreeBSD and OS X and any system that has GNU sort installed. But it does not work with POSIX sort, so in conclusion it is not a solution for UNIX systems in general... Thanks again Shamrock, for your effort...
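For what it is worth, a plain POSIX awk one-liner that keeps the first occurrence of each key without depending on sort behavior at all would be something like this (again assuming a comma-separated file keyed on the first three fields):

Code:
awk -F, '!seen[$1 FS $2 FS $3]++' infile

The catch is that it holds every distinct key in memory, which may not be workable for a really huge file; that is the main attraction of the sort-based pipelines.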

# 18  
Old 04-14-2012
So then I guess a portable solution would be to force a stable sort, for example like this:
Code:
nl -s, infile | sort -t, -k2,4 -k1 | cut -f2- -d, | awk -F, '{n=$1 FS $2 FS $3}p!=n;{p=n}'
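
As a rough illustration with made-up data, given an infile like:

Code:
a,1,x,first
b,2,y,other
a,1,x,second

the pipeline numbers each line, sorts on the three key fields with the line number as a tie-breaker (forcing the original relative order within each key group), strips the numbers again and lets awk keep the first line of each group, printing (in key order):

Code:
a,1,x,first
b,2,y,other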
