Dedup a large file (30M rows)


 
# 1  
Old 09-25-2012
Dedup a large file (30M rows)

Hi, I have a large file with a number of records in it. I need some help to keep only the first row for each key and ignore the other rows with the same key. I have tried a few things, but the file is huge (30 million rows), so I need a solution that is very efficient.

e.g.
Code:
Junk|Apple|7|Random|data|here...
Junk|Apple|1|Random|data|here...
Junk|Apple|5|Random|data|here...
Junk|Orange|1|Random|data|here...
Junk|Orange|9|Random|data|here...

Here the second field is the key. So I want only the first record with 'Apple' and then the first record with the next key (in this case 'Orange'). So the output should be:
Code:
Junk|Apple|7|Random|data|here...
Junk|Orange|1|Random|data|here...

Since the file is large, I need help with a solution that does not run out of memory.

Thank you...

Last edited by Corona688; 09-25-2012 at 04:05 PM..
# 2  
Old 09-25-2012
Code:
awk -F\| '!a[$2]++' file
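
In case that one-liner looks cryptic: a[$2]++ evaluates to 0 (false) the first time a key appears and to a positive number afterwards, so !a[$2]++ is true only for the first record with each key, and awk's default action is to print the line. Memory grows with the number of distinct keys, not with the 30 million rows. A spelled-out version of the same logic (an untested sketch, same behaviour):
Code:
awk -F'|' '{
    if (!($2 in seen)) {   # first time this key (field 2) shows up
        print              # default action: print the whole record
        seen[$2] = 1       # remember the key so later duplicates are skipped
    }
}' file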


If records with the same key are always contiguous (as in your example), an even more efficient solution is possible.
Code:
awk -F\| '$2 != o; {o=$2}' file

For the corner case of the first record, that implementation assumes that the key field is not empty.
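
If an empty key on the first line is actually possible in your data, one small (untested) variation sidesteps that by printing the first record unconditionally:
Code:
awk -F\| 'NR == 1 || $2 != o; {o = $2}' file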

Regards,
Alister

Last edited by alister; 09-25-2012 at 03:10 PM..
# 3  
Old 09-25-2012
Code:
perl -F'\|' -alne  '{if(!$hash{$F[1]}){$hash{$F[1]}++;print $_;}}' input_file

The same solution, cut down:

Code:
perl -F'\|' -alne  '{if(!$hash{$F[1]}++){print}}' input_file
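
If records with the same key are always grouped together, as alister noted above, a hash-free perl variant along the same lines should also work (untested sketch):
Code:
perl -F'\|' -alne 'print if $F[1] ne $prev; $prev = $F[1]' input_file

Like the contiguous-key awk version, it assumes the key on the very first record is not empty.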


Last edited by msabhi; 09-25-2012 at 03:37 PM..