Removing dupes from a huge file - awk/perl/uniq

04-13-2012


Hi,

I have the following command in place

Code:

nawk -F, '!a[$1,$2,$3]++' file > file.uniq

It has been working perfectly as required, removing duplicates based on the first 3 fields only. Recently it has started giving the error below:

Code:

bash-3.2$   nawk -F, '!a[$1,$2,$3]++' OTCTempD.dat > OTCTemp.uniq
nawk: symbol table overflow at 4044735840353890OTC
 input record number 5.42076e+07, file OTCTempD.dat
 source line number 1

More information:
1. Number of records in the file:

Code:

bash-3.2$ cat OTCTempD.dat | wc -l
 179128368

2. Size of the file:

Code:

-rw-r--r--   1 magt2    grip     7338355879 Apr 12 14:08 OTCTempD.dat

Contents of file:

Code:

a,b,c,2,2,3
a,b,c,1,2,3
a,b,E,1,2,3
a,b,c,1,2,3

Output should be:

Code:

a,b,c,2,2,3    //take first record only out of dupes 
a,b,E,1,2,3

How can I resolve this?
Does awk use some kind of memory, and is it exceeding its limit?
What would be the best approach to achieve the desired result?
Can't we use uniq directly to remove such criteria-based dupes?

Kindly suggest.

makn

04-13-2012

What's your OS? There are limitations, but they depend on which (n)awk you're using.

Is the file sorted?

CarloM

04-13-2012

Can't you use unique sort?

Code:

sort -ut, -k1,3 infile > outfile

Scrutinizer

04-13-2012

Quote:

Originally Posted by CarloM

What's your OS? There are limitations, but they depend on which (n)awk you're using.

Is the file sorted?

The file is not sorted.

Version:
System = SunOS
Node = gmmagappu1
Release = 5.10
KernelID = Generic_144489-17
Machine = i86pc

---------- Post updated at 07:26 PM ---------- Previous update was at 07:25 PM ----------

Quote:

Originally Posted by Scrutinizer

Can't you use unique sort?

Code:

sort -ut, -k1,3 infile > outfile

This gives a "No space left on device" error. I'll try again after clearing some space, but can you please explain how this actually works?

Shouldn't it be "-k1,2,3", since we consider the first three fields of a line to find dupes?

makn

04-13-2012

nawk does use memory for a hash table, and you are exceeding the limit. sort uses temporary files, which are written to whatever directory TMPDIR points to. Find a filesystem with free space and use a directory there that you have full access to. The files are temporary and exist only during sorting.
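
A minimal sketch of redirecting sort's temporary files, in case it helps. The scratch directory and the four-line sample file are stand-ins created with mktemp (assumptions for illustration); on the real system the directory would sit on a filesystem with several GB free and the input would be OTCTempD.dat. LC_ALL=C is also an assumption, added so that keys are compared byte by byte.

```shell
# Stand-in for a directory on a filesystem with plenty of free space (assumption)
scratch=$(mktemp -d)

# Small sample standing in for OTCTempD.dat
printf '%s\n' 'a,b,c,2,2,3' 'a,b,c,1,2,3' 'a,b,E,1,2,3' 'a,b,c,1,2,3' > "$scratch/sample.dat"

# -T names the directory where sort writes its temporary files
LC_ALL=C sort -T "$scratch" -t, -u -k1,3 "$scratch/sample.dat" > "$scratch/sample.uniq"

# Same idea through the environment variable instead of the option
TMPDIR="$scratch" LC_ALL=C sort -t, -u -k1,3 "$scratch/sample.dat" > "$scratch/sample.uniq2"

# Check the free space on that filesystem before a real run
df -k "$scratch"
```

For a multi-gigabyte input, the scratch filesystem should have room for roughly another copy of the file while the sort is running.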

Scrutinizer's syntax is correct for the sort command you should use.
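
On the "-k1,2,3" question: the -k option takes a start position and an optional end position, so -k1,3 means "one key running from field 1 through field 3". The sketch below compares it with three single-field keys on the sample data from post #1; the temporary directory and file names are made up for the demo, and LC_ALL=C is an assumption to get byte-wise comparison.

```shell
demo=$(mktemp -d)
printf '%s\n' 'a,b,c,2,2,3' 'a,b,c,1,2,3' 'a,b,E,1,2,3' 'a,b,c,1,2,3' > "$demo/sample.dat"

# -t,    fields are separated by commas
# -k1,3  one key that starts at field 1 and ends at field 3
# -u     keep a single line for each distinct key
LC_ALL=C sort -t, -u -k1,3 "$demo/sample.dat" > "$demo/span.out"

# Three single-field keys compare the same three fields one at a time
LC_ALL=C sort -t, -u -k1,1 -k2,2 -k3,3 "$demo/sample.dat" > "$demo/split.out"

# Both runs leave one a,b,E line and one a,b,c line
cut -d, -f1-3 "$demo/span.out"
cut -d, -f1-3 "$demo/split.out"
```

Which of the a,b,c lines survives is up to the sort implementation, so only the first three fields are printed here.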

jim mcnamara

04-13-2012

Using a2p to convert the awk to perl...

Code:

# 0-based field indices; the a2p-generated "$[=1" is a fatal error in Perl 5.30+
perl -ne '@Fld = split(/[,\n]/, $_, -1); print unless $seen{$Fld[0],$Fld[1],$Fld[2]}++' file > file.uniq

Ygor

04-13-2012

The nawk in post #1 and the sort in post #2 give different results on my system. This is because nawk always keeps the first of the records sharing a duplicate key, whereas sort -u keeps an arbitrary one.
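
One way to get the "keep the first record" behaviour without a large in-memory table is to tag each line with its line number, sort on the key, keep the lowest-numbered line per key, and then restore the input order. This is only a sketch under stated assumptions: the temporary directory and sample file are made up for the demo, LC_ALL=C is assumed for byte-wise key comparison, and on Solaris nawk may need to stand in for awk.

```shell
work=$(mktemp -d)
printf '%s\n' 'a,b,c,2,2,3' 'a,b,c,1,2,3' 'a,b,E,1,2,3' 'a,b,c,1,2,3' > "$work/sample.dat"

awk '{ print NR "," $0 }' "$work/sample.dat" |   # prefix each line with its line number
  LC_ALL=C sort -t, -k2,4 -k1,1n |               # group equal keys, earliest line first
  awk -F, '{ key = $2 FS $3 FS $4 }
           NR == 1 || key != prev { print; prev = key }' |   # first line of each group
  LC_ALL=C sort -t, -k1,1n |                     # back to the original input order
  cut -d, -f2- > "$work/sample.uniq"             # drop the line-number prefix

cat "$work/sample.uniq"
```

The awk stages remember only one key at a time, so the memory pressure moves to sort's temporary files; the -T option or TMPDIR can point those at a filesystem with enough room.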

Shell tools were never designed to process multi-gigabyte files. Do you have a database engine and access to a programmer?

methyl
