Checking file for duplicates
# 1  
Old 05-19-2010
Hi all,

I am due to start receiving a weekly CSV file containing around 6 million rows. I need to do some processing on this file and then send it on elsewhere.

My problem is that after week 1 the files that I will receive are likely to contain data already received in previous files and I need to strip this data out before sending on.

Initially my plan was to keep a list of each row of data sent and then to check if each row in a new file is already present in my sent list. However it soon became clear that at week 2 I would be checking each of 6 million rows to see if they appeared on a list of 6 million already sent, but at week 5 would be checking against 30 million rows.

I was hoping that someone may have a more efficient way to achieve this.

It is likely that the data will start to be purged after week 10, so I would expect a maximum sent list of around 60 million rows.

Any ideas would be appreciated
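
For reference, the "sent list" idea can be kept cheap by storing a short fixed-size digest of each row instead of the row itself, grouped by week so that old weeks can be dropped. The following is only a minimal sketch under assumptions not stated in the thread: the names `SentList`, `row_digest` and `filter_new` are hypothetical, a row counts as a duplicate only if it matches byte for byte (apart from the line ending), and the 10-week window is taken from the estimate above.

```python
import hashlib
from collections import OrderedDict

MAX_WEEKS = 10  # assumed retention window, taken from the "purged after week 10" estimate


def row_digest(row: str) -> bytes:
    """Return a 16-byte digest of one CSV row, ignoring the line ending."""
    return hashlib.blake2b(row.rstrip("\r\n").encode("utf-8"), digest_size=16).digest()


class SentList:
    """Remembers digests of rows already sent, grouped by week (hypothetical helper)."""

    def __init__(self, max_weeks: int = MAX_WEEKS):
        self.max_weeks = max_weeks
        self.weeks = OrderedDict()  # week label -> set of digests, oldest first

    def seen(self, digest: bytes) -> bool:
        """True if the digest was recorded in any retained week."""
        return any(digest in digests for digests in self.weeks.values())

    def filter_new(self, week: str, rows):
        """Yield rows not sent before; record them under `week` once the input is exhausted."""
        current = set()
        for row in rows:
            digest = row_digest(row)
            if digest in current or self.seen(digest):
                continue  # duplicate within this file or of an earlier week
            current.add(digest)
            yield row
        self.weeks[week] = current
        while len(self.weeks) > self.max_weeks:
            self.weeks.popitem(last=False)  # purge the oldest week
```

Each membership test is a hash lookup per retained week, so the cost per row does not grow with the number of rows already sent. The raw digests for 60 million rows come to roughly 1 GB (60 million x 16 bytes) and Python's per-object overhead adds more on top, so at that scale an on-disk store may be the better fit.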
# 2  
Old 05-19-2010
This is more of a theoretical question than a real how-to question, so: what are you already doing to process the first file? Bear in mind that what you're asking about is more of a framework question than a scripting issue.

There are plenty of ways to skin the cat, but which way have you started to do it? No sense in us providing a method that doesn't fit your approach.
# 3  
Old 05-19-2010
We must assume you have a mainstream database engine which can handle CSV files and a computer which has capacity for this task. You really don't mention much about the data or the computer.

At a design level, each record must contain a unique key and whatever information is a parameter to the "purge". Unless you know the source "purge" rules, your database of "data already processed" will just grow.


It would make more sense to fix the data feed design at source. A convention is to mark the record in the source database with a unique extract run reference to prevent repeat extracts, whilst also allowing a rerun.
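
As a rough illustration of the "unique key plus purge parameter" design, here is a minimal sketch using SQLite from the Python standard library. Everything specific in it is an assumption, since the thread never shows the file layout: the table name `sent`, the helper names, the choice of the first CSV column as the key (`key_col=0`), and the week number as the purge parameter are all hypothetical.

```python
import csv
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store of keys already processed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sent ("
        " rec_key TEXT PRIMARY KEY,"   # assumed unique key of a record
        " week INTEGER NOT NULL)"      # assumed purge parameter
    )
    return conn


def new_rows(conn: sqlite3.Connection, week: int, lines, key_col: int = 0):
    """Return the rows whose key is not yet stored, recording each new key."""
    fresh = []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        cur = conn.execute(
            "INSERT OR IGNORE INTO sent (rec_key, week) VALUES (?, ?)",
            (row[key_col], week),
        )
        if cur.rowcount == 1:  # the insert happened, so the key was new
            fresh.append(row)
    conn.commit()
    return fresh


def purge(conn: sqlite3.Connection, oldest_week_to_keep: int) -> int:
    """Delete keys recorded before `oldest_week_to_keep`; return how many were removed."""
    cur = conn.execute("DELETE FROM sent WHERE week < ?", (oldest_week_to_keep,))
    conn.commit()
    return cur.rowcount
```

The primary key index does the duplicate check, and the purge is a single DELETE. Without a known purge rule the table simply keeps growing, which is the point made above.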
# 4  
Old 05-19-2010
I've had to create a differential list before for a similar task.

Records 1 to 100 would be sent, followed by 90 - 300, followed by 250 - whatever.

Each time I would create a list of the last N lines captured. In my case, 5 was sufficient; you may need more or fewer. I would then search the new file for the last lines that I had captured and process from there. Upon completion, I create my new 'last N lines' and repeat for the next time 'round.
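
The 'last N lines' approach could be sketched roughly as follows. This is a hedged illustration, not the script actually used: the function name `new_lines` is hypothetical, it assumes the files arrive in a consistent order with the overlap at the start of each new file, and it assumes the remembered lines do not occur as a block anywhere earlier in the file. If the remembered lines are not found at all, the sketch falls back to treating the whole file as new.

```python
def new_lines(lines, tail, n=5):
    """Return (lines that follow the remembered tail, tail to remember next time).

    `tail` is the list of the last n lines captured on the previous run,
    or an empty list on the first run.
    """
    lines = list(lines)
    start = 0  # default: no overlap found, so everything is new
    k = len(tail)
    if k:
        # Look for the remembered block and resume just after it.
        for i in range(len(lines) - k + 1):
            if lines[i:i + k] == tail:
                start = i + k
                break
    fresh = lines[start:]
    # Remember the end of this file for the next run; keep the old tail if the file is empty.
    new_tail = lines[-n:] if lines else list(tail)
    return fresh, new_tail
```

With records 1 to 100 followed by 90 to 300, the second call would find records 96 to 100 near the top of the new file and return 101 onwards.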

Ultimately, the solution is to fix the distribution method so that it is consistent.
# 5  
Old 05-19-2010
Please show us the structure of your csv file (fields, keys...)

Jean-Pierre.
# 6  
Old 05-19-2010
A CSV is a text file; is the diff command not suitable for you?
# 7  
Old 05-20-2010
Quote:
Originally Posted by methyl
We must assume you have a mainstream database engine which can handle CSV files and a computer which has capacity for this task. You really don't mention much about the data or the computer.

At a design level, each record must contain a unique key and whatever information is a parameter to the "purge". Unless you know the source "purge" rules, your database of "data already processed" will just grow.



It would make more sense to fix the data feed design at source. A convention is to mark the record in the source database with a unique extract run reference to prevent repeat extracts, whilst also allowing a rerun.

Unique key: this is not required. The record as a whole can serve as the key, so there is no need for a unique column in it. Worse, whether two records count as unique can change once normalization or some other transformation has been applied.

Fixing the problem at source is definitely out of the question. I agree it makes sense to do that, but most of the time it is completely out of scope. Not everything flows in line. We need to work with what has been received, or what could potentially be received, in a real-world scenario.
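
To make the normalization point concrete, here is a small sketch that uses the whole record as the key after a normalization step. The rules in `normalize` (trim surrounding whitespace, ignore case) are purely an assumption for illustration, as are the function names; the real rules depend on the data, which the thread does not describe.

```python
import csv
import hashlib


def normalize(row):
    """Assumed normalization: trim whitespace around each field and ignore case."""
    return [field.strip().lower() for field in row]


def record_key(line: str) -> str:
    """Return a digest of the whole record after normalization."""
    row = next(csv.reader([line]))
    # Join with a separator that is unlikely to occur inside a field.
    canonical = "\x1f".join(normalize(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two rows that differ as raw text but are the same record once normalized.
same = record_key("Smith, 42 ,LONDON\n") == record_key("smith,42,london")
```

Comparing raw lines would treat those two rows as different records, so the duplicate check has to run on whatever form of the record is actually sent on.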