Performance problem with removing duplicates in a huge file (50+ GB)


 
# 8  
Old 01-07-2013
What about a regular sort and then awk:
Code:
sort infile | awk '$1!=p; {p=$1}'    # print a line only when its first field differs from the previous line's

--
Never mind, it doesn't make much of a difference...

Last edited by Scrutinizer; 01-07-2013 at 03:59 PM..
# 9  
Old 01-07-2013
Using mmap64() means only a hash map of file offsets has to live on the heap, not the record data itself. If a record's hash is new, the record is unique so far: write it out and remember its offset under that hash. If the hash is not new, compare the actual keys at the two mmap'd locations; if they differ (a hash collision), add the offset as a second record in that hash bucket. The empty hash map is an array of N null hash-bucket pointers, where keys are hashed modulo N. If you write your own container, you can make N a power of 2, so the modulus operation becomes a mask of the low bits. For scale: a 50 GB file with 50-byte records holds a billion records, so on a 4 GHz CPU every 4 cycles spent per record adds a second of run time.
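Here is a minimal C sketch of the idea, assuming newline-terminated records where the whole line is the key. The FNV-1a hash, the fixed power-of-two bucket count, and the chaining layout are illustrative choices, not anything prescribed above:
Code:
/* dedup_mmap.c - sketch: drop duplicate lines from a large file.
 * Only (offset, length) pairs live on the heap; the record text stays
 * in the mapping. Build with: cc -D_FILE_OFFSET_BITS=64 dedup_mmap.c
 * so mmap()/off_t are the 64-bit (mmap64) variants; a 50 GB mapping
 * also needs a 64-bit address space. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define NBUCKETS (1u << 24)  /* power of 2: modulo N is a mask of low bits */

struct node { off_t off; size_t len; struct node *next; };
static struct node *bucket[NBUCKETS]; /* empty map: N null bucket pointers */

static uint64_t fnv1a(const char *p, size_t n)   /* illustrative hash */
{
    uint64_t h = 1469598103934665603ULL;
    while (n--) { h ^= (unsigned char)*p++; h *= 1099511628211ULL; }
    return h;
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);
    char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    for (off_t off = 0; off < st.st_size; ) {
        char *nl = memchr(base + off, '\n', st.st_size - off);
        size_t len = nl ? (size_t)(nl - (base + off))
                        : (size_t)(st.st_size - off);
        struct node **pp = &bucket[fnv1a(base + off, len) & (NBUCKETS - 1)];
        struct node *n;
        for (n = *pp; n; n = n->next)   /* hash hit: compare the actual
                                           keys at the two mmap'd spots */
            if (n->len == len && memcmp(base + n->off, base + off, len) == 0)
                break;
        if (!n) {                       /* unique so far: write it out and
                                           remember its offset */
            fwrite(base + off, 1, len, stdout);
            putchar('\n');
            n = malloc(sizeof *n);
            n->off = off; n->len = len; n->next = *pp; *pp = n;
        }
        off += (off_t)len + 1;
    }
    return 0;
}

Heap use stays proportional to the number of unique records; the kernel pages the 50 GB of record text in and out through the mapping as needed.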

Alternatively, some sorts can use parallel resources. Professional tools like abInitio divide files up for parallel processing. Record boundaries are not predictable from byte offsets alone, but each thread knows it owns only the records that start after the first newline at or beyond its starting offset, and its last record may run past its nominal end point. Eventually, merging the sorted chunks generally becomes the bottleneck, unless the input threads instead send records to N key-segmented output sort threads, in which case the outputs can simply be concatenated. SyncSort makes an app out of exactly that.
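The boundary rule is mechanical enough to sketch as well. A minimal example, again with illustrative choices (a hard-coded four workers, whole-file mmap); it only computes each worker's adjusted byte range and leaves the per-chunk sort/dedup out:
Code:
/* chunk_bounds.c - sketch: carve a file into byte ranges so that every
 * newline-terminated record is owned by exactly one parallel worker. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* A record belongs to the range in which it *starts*. Skip any partial
 * leading line: the first owned record begins right after the first
 * newline at or beyond offset start-1 (offset 0 always starts a record). */
static off_t first_owned(const char *base, off_t start, off_t size)
{
    if (start <= 0) return 0;
    const char *nl = memchr(base + start - 1, '\n',
                            (size_t)(size - start + 1));
    return nl ? (off_t)(nl - base) + 1 : size;
}

int main(int argc, char **argv)
{
    enum { NCHUNKS = 4 };              /* illustrative worker count */
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);
    char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    for (int i = 0; i < NCHUNKS; i++) {
        off_t lo = first_owned(base, st.st_size * i / NCHUNKS, st.st_size);
        off_t hi = first_owned(base, st.st_size * (i + 1) / NCHUNKS,
                               st.st_size);
        /* Worker i would sort/dedup the records in [lo, hi); its last
         * record may run past the nominal end point, as noted above. */
        printf("worker %d: bytes [%lld, %lld)\n", i,
               (long long)lo, (long long)hi);
    }
    return 0;
}

Because each range ends exactly where the next one begins, the workers tile the file with no record processed twice; their sorted outputs still need a merge (or up-front key segmentation) to produce one result.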
# 10  
Old 01-09-2013
Probably an iteration?
Assuming awk can comfortably hold 10000 lines in memory:
Code:
awk '{if (NR<=10000) {if (!s[$0]++) print >> "uniq"} else {if (!($0 in s)) print}}' < file1 > file2

Then repeat with the same awk program, with the files swapped:
Code:
awk '...' < file2 > file1

...

Repeat until the intermediate file comes out empty; the accumulated result is in the file uniq.

Last edited by MadeInGermany; 01-09-2013 at 05:53 PM..