Removing Dupes from huge file- awk/perl/uniq


 
# 1  
Old 04-13-2012
Removing Dupes from huge file- awk/perl/uniq

Hi,

I have the following command in place

Code:
nawk -F, '!a[$1,$2,$3]++' file > file.uniq

It has been working perfectly as per requirements, removing duplicates based only on the first 3 fields. Recently it has started giving the error below:

Code:
bash-3.2$   nawk -F, '!a[$1,$2,$3]++' OTCTempD.dat > OTCTemp.uniq
nawk: symbol table overflow at 4044735840353890OTC
 input record number 5.42076e+07, file OTCTempD.dat
 source line number 1

More information:
1. No of records in file:
Code:
bash-3.2$ cat OTCTempD.dat | wc -l
 179128368

2. Size of the file :
Code:
-rw-r--r--   1 magt2    grip     7338355879 Apr 12 14:08 OTCTempD.dat

Contents of file:
Code:
a,b,c,2,2,3
a,b,c,1,2,3
a,b,E,1,2,3
a,b,c,1,2,3

Output should be:
Code:
a,b,c,2,2,3    //take first record only out of dupes 
a,b,E,1,2,3

Now how do I resolve this?
Does awk use some kind of in-memory table, and is it exceeding its limit?
What would be the best approach to achieve the desired result?
Can't we use uniq directly to remove such criteria-based dupes?

Kindly Suggest.

# 2  
Old 04-13-2012
What's your OS? There are limitations, but they depend on which (n)awk you're using.

Is the file sorted?
# 3  
Old 04-13-2012
Can't you use unique sort?
Code:
sort -ut, -k1,3 infile > outfile

# 4  
Old 04-13-2012
Quote:
Originally Posted by CarloM
What's your OS? There are limitations, but they depend on which (n)awk you're using.

Is the file sorted?
File is not sorted

Version:
i86pcSystem = SunOS
Node = gmmagappu1
Release = 5.10
KernelID = Generic_144489-17
Machine = i86pc


Quote:
Originally Posted by Scrutinizer
Can't you use unique sort?
Code:
sort -ut, -k1,3 infile > outfile

This is giving a "No space left on device" error. I'll try it after clearing some space, but can you please explain how this actually works?

Shouldn't it be "-k1,2,3", since we consider the first three fields of a line to find the dupes?
# 5  
Old 04-13-2012
nawk does use memory for a hash table, and you are exceeding the limit. sort uses temporary files, which are written to whatever directory TMPDIR points to. Find a filesystem with free space and use a directory there that you have full access to. The files are temporary and exist only during sorting.

Scrutinizer's syntax is correct for the sort command you should use.
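For example, something like this (the spare filesystem path is only an example; use any directory you can write to):
Code:
# point sort's temporary files at a filesystem with free space
# (/export/spare is just an example path; many sort implementations also
#  accept "-T directory" for the same purpose)
export TMPDIR=/export/spare/sorttmp
mkdir -p "$TMPDIR"

# -t,    use comma as the field separator
# -k1,3  the key runs from field 1 through field 3; the -k option takes a
#        start,end pair, which is why it is "-k1,3" and not "-k1,2,3"
# -u     output only one line from each group of lines with an equal key
sort -t, -k1,3 -u OTCTempD.dat > OTCTemp.uniq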
# 6  
Old 04-13-2012
Using a2p to convert the awk to perl...
Code:
perl -e '$[=1; $FS=","; while(<>){ @Fld=split(/[,\n]/,$_,-1); print $_ if !$a{$Fld[1],$Fld[2],$Fld[3]}++ }' file > file.uniq
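A hand-written equivalent (not the a2p output) would be something like the one-liner below. Note that, just like the nawk version, it keeps every distinct key in memory, so it may hit the same limit on ~179 million records:
Code:
# -F, -a -n : read line by line and autosplit each line on commas into @F
# print a line only the first time its first-three-field key is seen
perl -F, -ane 'print unless $seen{join ",", @F[0..2]}++' file > file.uniq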

# 7  
Old 04-13-2012
The nawk in post #1 and the sort in post #3 give different results on my system. This is because the nawk always keeps the first of the duplicate key records, but the sort selects an arbitrary one.

Shell tools were never designed to process multi-gigabyte files. Do you have a database engine and access to a programmer?
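If it has to stay with shell tools, one way to keep the first record of each key without a big in-memory table is to number the lines, let the disk-based sort do the grouping, and then restore the original order. A sketch, untested at this volume:
Code:
# 1. prefix every line with its line number (constant memory in nawk)
# 2. sort by the key (now fields 2-4), then numerically by line number,
#    so the earliest record of each key comes first in its group
# 3. print only the first line of each key group; this compares against a
#    single saved key instead of building a growing hash table
# 4. re-sort by line number to restore the input order and strip the prefix
nawk '{ print NR "," $0 }' OTCTempD.dat |
  sort -t, -k2,4 -k1,1n |
  nawk -F, '($2 FS $3 FS $4) != prev { print; prev = $2 FS $3 FS $4 }' |
  sort -t, -k1,1n |
  cut -d, -f2- > OTCTemp.uniq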