Honey, I broke awk! (duplicate line removal in 30M line 3.7GB csv file)


 
# 8  
Old 03-27-2014
My original code spits out about 2.5 GB (less than the full dataset) after about 70 minutes and then never completes.

I tried Chubler_X's code above and terminated it after 16 hours with no output.

One way I can definitely split the file without duplicates spanning the split files is to split on date. It only takes about one minute to split the data in two with the following code:
Code:
awk -F, -v startDT="$startDate" -v endDT="$endDate" '
    BEGIN   { s = mktime(startDT); e = mktime(endDT) }    # bounds passed in as "YYYY MM DD HH MM SS"
    NR == 1 { print }                                      # always keep the header row
    NR > 1  { t = $6; gsub(/-|:/, " ", t); t = mktime(t)   # field 6 holds the row timestamp
              if (s <= t && t <= e) print }'
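
For reference, driving that split and then deduplicating each half separately might look something like the sketch below (the file names and dedup_ prefix are placeholders, not part of the actual pipeline; the idea is that the dedupe array only ever has to hold half the data at once):

Code:
# assumed names: the date split above wrote jan_jun.csv and jul_dec.csv
for f in jan_jun.csv jul_dec.csv; do
    awk 'NR == 1 || !seen[$0]++' "$f" > "dedup_$f"   # keep the header plus the first copy of each row
done
head -1 dedup_jan_jun.csv    >  combined.csv         # header once
tail -n +2 dedup_jan_jun.csv >> combined.csv
tail -n +2 dedup_jul_dec.csv >> combined.csv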

Mike

Last edited by Michael Stora; 03-27-2014 at 01:28 PM.
# 9  
Old 03-27-2014
Quote:
Originally Posted by Michael Stora
Is there a way to do a checksum or fairly robust hash in awk? That might be the best way to shorten the array keys, which appears to be what is killing awk.
I take it sorting is absolutely out of the question...? It would safely handle files of arbitrary size.

Perl would be better for comparing via hashes. Doing an md5 or the like in awk would mean calling an external md5 utility 30 million times, whereas Perl at least has a built-in module for it.
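
If sorting is on the table, a disk-backed pipeline needs almost no memory, at the cost of losing the original row order; a minimal sketch, assuming the file is data.csv and duplicates are exact whole-line matches:

Code:
head -1 data.csv > dedup.csv                          # carry the header through untouched
tail -n +2 data.csv | LC_ALL=C sort -u >> dedup.csv   # external merge sort drops duplicate lines

sort spills to temporary files on disk, so the 3.7 GB file size is not a problem; only the row order changes.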
# 10  
Old 03-27-2014
I would definitely want a built-in, especially in a Windows environment, where creating new threads is very efficient but new processes come with a lot of extra overhead.

Sorting is not out of the question, unless it runs into the same serious performance problems the duplicate removal does.

Mike
# 11  
Old 03-27-2014
Did you try my perl code? I have fairly high hopes it could do the job.
# 12  
Old 03-27-2014
Still no reply -- is this 32-bit Windows running Cygwin running whatever? If so, the 4 GB address space can make hash tools fail, often not gracefully, and often well before the nominal 4 GB limit (sometimes around 1.7 GB), tripping over a signed 32-bit int somewhere or over the address space already taken by code and other data. It sounds like you need a 64-bit CPU and OS.

Once you run past RAM, the sequential reading and writing of a sort may outperform the random access pattern of a hash. Also, not every hash implementation is written to dynamically expand its bucket count, so the amount of linear searching inside each bucket can grow. In RogueWave, for instance, you should set the bucket count according to the size of the data set at the start (see Extendible hashing on Wikipedia).
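
If the sort in use is GNU sort (as under Cygwin), its memory footprint can also be capped explicitly, so the 32-bit address-space ceiling never comes into play; the buffer size and temp directory below are just placeholders:

Code:
# bound the in-memory buffer; anything beyond it is merged sequentially from temp files
tail -n +2 data.csv | LC_ALL=C sort -u -S 512M -T /cygdrive/d/tmp >> dedup.csv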
# 13  
Old 03-27-2014
Quote:
Originally Posted by DGPickett
Still no reply -- is this 32-bit Windows running Cygwin running whatever? If so, the 4 GB address space can make hash tools fail, often not gracefully, and often well before the nominal 4 GB limit (sometimes around 1.7 GB), tripping over a signed 32-bit int somewhere or over the address space already taken by code and other data. It sounds like you need a 64-bit CPU and OS.

Once you run past RAM, the sequential reading and writing of a sort may outperform the random access pattern of a hash. Also, not every hash implementation is written to dynamically expand its bucket count, so the amount of linear searching inside each bucket can grow. In RogueWave, for instance, you should set the bucket count according to the size of the data set at the start (see Extendible hashing on Wikipedia).
Sorry, it did not appear to me that you were asking a question in your response.
I am running 64-bit Windows with dual i7-2860QM CPUs.
It looks like Cygwin 1.7.17 is 32-bit; 64-bit support started with 1.7.22 in July of last year. I will try it.

Mike
# 14  
Old 03-27-2014
Agreed, it would be a very tight squeeze to solve in-memory in a 32-bit environment: with 30M records and the roughly 1.5 GB of address space actually usable, that leaves only about 50 bytes per record to play with. This is why large datasets are usually stored in databases.