Thanks, that is very useful to know: this cannot be relied upon to work on every system.
Just tested it on a Solaris 10 box with the same results... it looks like the legacy Unix offerings don't have a stable sort, whereas Linux does.
Quote:
Originally Posted by Scrutinizer
So it appears to be an extension, and not standard, to perform a stable sort. I guess a possible indicator might then be whether a particular sort supports a "stable sort" option in the first place. For example, if this option is called "-s", then if this works:
None of the legacy Unix flavors (HP-UX, Solaris, or AIX) has the "-s" switch, so it looks like Linux may have optimized sort to spit out the first record of a group.
Quote:
Originally Posted by Scrutinizer
I checked some man pages, and on systems that provide a stable "-u" they say:
Code:
-u with -c, check for strict ordering; without -c, output only the first of an equal run
That must only be on Linux systems...
Quote:
Originally Posted by Scrutinizer
instead of something like this
Code:
-u [..] If used with the -c option, check that there are no lines with duplicate keys, in addition to checking that the input file is sorted.
..
Yes, that is pretty much the case on all of AIX, Solaris, and HP-UX... so it looks like sort will break if the code is deployed on different platforms, unless they are all flavors of Linux.
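Since a stable "-u" can't be assumed outside Linux, here is a portable sketch of the same "first record of a group" behavior using only POSIX awk (field 1 is the assumed key; sample data is made up):

```shell
# Keep the first record for each key (field 1) without relying on a
# stable "sort -u"; POSIX awk behaves the same on AIX, HP-UX,
# Solaris, and Linux.
printf '%s\n' 'a 1' 'a 2' 'b 3' | awk '!seen[$1]++'
# keeps "a 1" and "b 3"
```

If sorted output is also needed, the result can be piped through sort afterwards; the set of kept records is the same.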
Hey Guys,
I have a file which looks like this:
Contig201#numbPA
Contig1452#nmdynD6PA
dm022p15.r#CG6461PA
dm005e16.f#SpatPA
IGU001_0015_A06.f#CG17593PA
I need to remove duplicates based on the characters matching up to '#'.
For example, if we consider this:
Contig201#numbPA... (4 Replies)
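A minimal sketch for this kind of dedupe, assuming the key is everything before the first '#' and the first line per key should be kept ("infile" is a placeholder name):

```shell
# Split on '#' so $1 is the key; print a line only the first time
# its key is seen.
awk -F'#' '!seen[$1]++' infile
```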
Hello,
I have two files. File1, the master file, contains two columns separated by a delimiter:
a=b
b=d
e=f
g=h
File2, which is the file to be processed, has only a single column:
a
h
c
b
What I need is an awk script to identify unique names from file 2 which are not found in the... (6 Replies)
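One possible awk sketch, assuming "not found in File1" means not present in either column of the key=value pairs (the post is truncated, so that is a guess; "file1"/"file2" are placeholder names):

```shell
# Pass 1 (NR==FNR): remember both columns of file1 as seen names.
# Pass 2: print file2 lines whose name is in neither column.
awk -F'=' 'NR==FNR { seen[$1]; seen[$2]; next } !($1 in seen)' file1 file2
```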
Hello,
I have a large amount of data with the following structure:
Word=Transliterated word
I have written a Perl script (reproduced below) which goes through the full file and identifies all dupes on the right-hand side. It successfully creates a new file with two headers: Singletons and Dupes.... (5 Replies)
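The same split can be sketched in awk rather than Perl, assuming "dupes.txt" and "singletons.txt" as hypothetical output names: two passes over the file, counting right-hand sides first, then routing each line.

```shell
# Pass 1: count each right-hand side of "Word=Transliteration".
# Pass 2: lines whose RHS occurs more than once go to dupes.txt,
# the rest to singletons.txt.
awk -F'=' 'NR==FNR { count[$2]++; next }
           { print > (count[$2] > 1 ? "dupes.txt" : "singletons.txt") }' file file
```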
Hi,
I have a file that is 430K lines long. It has records like the ones below:
|site1|MAP
|site2|MAP
|site1|MODAL
|site2|MAP
|site2|MODAL
|site2|LINK
|site1|LINK
My task is to count the number of times MAP, MODAL, and LINK occur for a single site and write new records like the ones below to a new file:
... (5 Replies)
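A sketch of the counting step, assuming '|' is the delimiter so the site is field 2 and the type is field 3 ("infile" is a placeholder; the output order of a for-in loop over an awk array is unspecified):

```shell
# Tally each site/type combination, then print the totals.
awk -F'|' '{ count[$2 "|" $3]++ }
           END { for (k in count) print k, count[k] }' infile
```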
Hi, I want to fetch 100k records from a file which looks like the one below.
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
... (17 Replies)
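If the records are newline-delimited (the sample is truncated, so this is an assumption), taking the first 100k is a job for head ("infile" and "first100k" are placeholder names):

```shell
# Copy the first 100000 lines of the input to a new file.
head -n 100000 infile > first100k
```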
Hello,
I have a very large dictionary file in text format which contains a large number of sub-sections. Each sub-section starts with the following header:
#DATA
#VALID 1
and ends with a footer, as shown below:
#END
The data between the Header and the Footer consists of... (6 Replies)
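A sketch for pulling out just the payload between each header and footer, assuming the markers sit alone on their lines as shown ("dict.txt" is a placeholder name):

```shell
# Turn printing on at "#DATA", off at "#END", and skip the "#VALID"
# header line inside each block.
awk '/^#DATA$/ { blk=1; next }
     /^#END$/  { blk=0; next }
     blk && !/^#VALID/' dict.txt
```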
I'm trying to remove duplicate data from an unsorted input file larger than 50GB and write the unique records to a new file.
I have already tried a variety of options posted in similar threads/forums, but no luck so far...
Any suggestions, please?
Thanks !! (9 Replies)
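Two common sketches with different trade-offs at this size (filenames are placeholders): awk preserves input order but holds every unique line in memory, while sort -u works from disk via temporary files but re-orders the data.

```shell
# Order-preserving, but RAM-bound: every unique line stays in memory.
awk '!seen[$0]++' big.in > unique.out

# Disk-backed: point -T at a filesystem with room for the temp files;
# output is sorted, not in original order.
sort -u -T /bigdisk/tmp big.in > unique.out
```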
Dear all,
I have a large dictionary database which has the following structure
source word=target word
e.g.
book=livre
Since the database is very large, in spite of all the care taken it so happens that at times the source word is repeated,
e.g.
book=livre
book=tome
Since I want to... (7 Replies)
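A sketch, assuming the first translation seen for each source word should win ("dict.txt" is a placeholder name):

```shell
# '=' splits "source=target"; keep only the first line per source word.
awk -F'=' '!seen[$1]++' dict.txt
```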
I am trying to remove whitespace from a file containing sample data such as:
457 <EOFD> Mar 1 2007 12:00:00:000AM <EOFD> Mar 31 2007 12:00:00:000AM <EOFD> system <EORD> 458 <EOFD> Mar 1 2007 12:00:00:000AM<EOFD>agf <EOFD> Apr 20 2007 9:10:56:036PM <EOFD> prodiws<EORD> . Basically these... (11 Replies)
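A hedged sed sketch for this, assuming the goal is to trim the blanks around the <EOFD> and <EORD> markers rather than remove all whitespace ("infile" is a placeholder name):

```shell
# [[:blank:]] matches spaces and tabs portably; strip them on both
# sides of each field and record marker.
sed -e 's/[[:blank:]]*<EOFD>[[:blank:]]*/<EOFD>/g' \
    -e 's/[[:blank:]]*<EORD>[[:blank:]]*/<EORD>/g' infile
```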