Remove dupes in a large file


 
# 1  
Old 10-13-2018

I have a large file (1.5 GB) and want to sort it.
I used the following awk script to do the job:
Code:
!x[$0]++

The script works, but it is very slow and takes over an hour to do the job. I suspect this is because the file is not sorted.
Any solution to speed up the awk script, or a Perl script, would be appreciated.
I work in a Windows environment, so Unix tools don't work.
Many thanks.
P.S. I have checked an earlier solution available in the repository, but it is just as slow, if not slower.
# 2  
Old 10-13-2018
Quote:
Originally Posted by gimley
I have a large file (1.5 GB) and want to sort it.
I used the following awk script to do the job:
Code:
!x[$0]++

The script works, but it is very slow and takes over an hour to do the job. I suspect this is because the file is not sorted.
Hi,

I presume you mean you want to dedupe the file (because that is what your script does and that is in the title), not necessarily sort it.

You can try the difference between
Code:
awk '!X[$0]++' file > file.dedup

and
Code:
sort -u file > file.deduped_sort

The awk version is typically a lot faster because it does not have to sort the file; whether the input happens to be sorted makes no difference to the awk command.
Both commands dedupe, but the second one sorts as well.
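The difference is easy to see on a toy file (filenames here are illustrative):

```shell
printf 'pear\napple\npear\nbanana\napple\n' > toy.txt

# awk keeps the first occurrence of each line, in original order:
# pear, apple, banana
awk '!X[$0]++' toy.txt

# sort -u also dedupes, but the output comes out sorted:
# apple, banana, pear
sort -u toy.txt
```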


------- Edit ---------

I just did a test with a 1.6 GiB file and it took under 3 minutes to dedupe, so I would examine what exactly you are doing.
  • Are you deduping and then sorting?
  • Are you running out of memory, so that your system is paging/swapping?
Otherwise, can you post the exact script/command that you are using?

Last edited by Scrutinizer; 10-13-2018 at 07:14 AM..
This User Gave Thanks to Scrutinizer For This Post:
# 3  
Old 10-13-2018
In case there is a RAM shortage, the following variant helps (it saves a few bytes per line):
Code:
awk '!($0 in X) { print; X[$0] }' file > file.dedup

This User Gave Thanks to MadeInGermany For This Post:
# 4  
Old 10-13-2018
Hi MadeInGermany,

Would you mind explaining that approach? Is it because X[$0]++ stores a number and consumes a float's worth of space, whereas X[$0] just creates the index and points to nowhere?
# 5  
Old 10-14-2018
Exactly: X[$0]++ holds a number value, i.e. each new line consumes a number's worth of space.
This User Gave Thanks to MadeInGermany For This Post:
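Both variants produce identical output; the membership-test version just avoids storing a counter in each array element. A quick sanity check (a sketch, with illustrative filenames):

```shell
printf 'x\ny\nx\nz\ny\nx\n' > sample.txt

# Counter version: each array element holds an integer.
awk '!X[$0]++' sample.txt > out1

# Membership-test version: elements exist in the array but hold
# no value, saving a number's worth of memory per unique line.
awk '!($0 in X) { print; X[$0] }' sample.txt > out2

# The two outputs are byte-identical.
cmp out1 out2 && echo identical
```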