Dedup a large file (30M rows)


 
# 1  
Old 09-25-2012
Dedup a large file (30M rows)

Hi, I have a large file with a number of records in it. I need some help to keep only the first row for each key and ignore the other rows with the same key. I have tried a few things, but the file is huge (30 million rows), so I need a solution that is very efficient.

e.g.
Code:
Junk|Apple|7|Random|data|here...
Junk|Apple|1|Random|data|here...
Junk|Apple|5|Random|data|here...
Junk|Orange|1|Random|data|here...
Junk|Orange|9|Random|data|here...

Here the second field is the key. So I want only the first record with 'Apple' and then the first record with the next key (in this case 'Orange'). So the output should be:
Code:
Junk|Apple|7|Random|data|here...
Junk|Orange|1|Random|data|here...

Since the file is large, I need help with a solution that does not run out of memory.

Thank you...

Last edited by Corona688; 09-25-2012 at 04:05 PM..
# 2  
Old 09-25-2012
Code:
awk -F\| '!a[$2]++' file
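
In case that one-liner looks cryptic: a[$2]++ evaluates to 0 (false) the first time a key appears and to a positive number afterwards, so !a[$2]++ is true only for the first record with each key, and awk's default action is to print the line. Memory grows with the number of distinct keys, not with the 30 million rows. A spelled-out version of the same logic (an untested sketch, same behaviour):
Code:
awk -F'|' '{
    if (!($2 in seen)) {   # first time this key (field 2) shows up
        print              # default action: print the whole record
        seen[$2] = 1       # remember the key so later duplicates are skipped
    }
}' file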


If records with the same key are always contiguous (as in your example), an even more efficient solution is possible.
Code:
awk -F\| '$2 != o; {o=$2}' file

For the corner case of the first record, that implementation assumes that the key field is not empty.
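
If an empty key on the first line is actually possible in your data, one small (untested) variation sidesteps that by printing the first record unconditionally:
Code:
awk -F\| 'NR == 1 || $2 != o; {o = $2}' file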

Regards,
Alister

Last edited by alister; 09-25-2012 at 03:10 PM..
# 3  
Old 09-25-2012
Code:
perl -F'\|' -alne  '{if(!$hash{$F[1]}){$hash{$F[1]}++;print $_;}}' input_file

The same solution, cut down:

Code:
perl -F'\|' -alne  '{if(!$hash{$F[1]}++){print}}' input_file
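
If records with the same key are always grouped together, as alister noted above, a hash-free perl variant along the same lines should also work (untested sketch):
Code:
perl -F'\|' -alne 'print if $F[1] ne $prev; $prev = $F[1]' input_file

Like the contiguous-key awk version, it assumes the key on the very first record is not empty.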


Last edited by msabhi; 09-25-2012 at 03:37 PM..