Checking file for duplicates

05-19-2010

Hi all,

I am due to start receiving a weekly CSV file containing around 6 million rows. I need to do some processing on this file and then send it on elsewhere.

My problem is that after week 1 the files that I will receive are likely to contain data already received in previous files and I need to strip this data out before sending on.

Initially my plan was to keep a list of every row sent and then check whether each row in a new file is already on that sent list. However, it soon became clear that at week 2 I would be checking each of 6 million rows against a list of 6 million already sent, but at week 5 I would be checking against 30 million rows.

I was hoping that someone may have a more efficient way to achieve this.

It is likely that the data will start to be purged after week 10, so I would expect the sent list to peak at around 60 million rows.
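One possible shape for such a sent list, as a rough sketch only: store a short fixed-size digest of each row instead of the row itself, and group the digests by week so that a whole week can be dropped once it passes the purge horizon. The names `SentLog`, `row_digest` and `MAX_WEEKS` are made up for illustration, and the 10-week limit is simply the figure mentioned above.

```python
import hashlib
from collections import OrderedDict

MAX_WEEKS = 10  # assumption: the purge horizon mentioned in the post


def row_digest(row: str) -> bytes:
    """Return a 16-byte fingerprint of one CSV line (line ending ignored)."""
    data = row.rstrip("\r\n").encode("utf-8")
    return hashlib.blake2b(data, digest_size=16).digest()


class SentLog:
    """Hypothetical sent list: digests of sent rows, grouped by week."""

    def __init__(self, max_weeks: int = MAX_WEEKS):
        self.max_weeks = max_weeks
        self.weeks = OrderedDict()  # week label -> set of digests
        self.current = None         # set for the week being processed

    def start_week(self, week: str) -> None:
        """Open a bucket for `week` and drop the oldest weeks beyond the limit."""
        self.current = self.weeks.setdefault(week, set())
        while len(self.weeks) > self.max_weeks:
            self.weeks.popitem(last=False)  # discard the oldest week

    def is_new(self, row: str) -> bool:
        """Return True and remember the row if no retained week has seen it."""
        digest = row_digest(row)
        if any(digest in seen for seen in self.weeks.values()):
            return False
        self.current.add(digest)
        return True
```

Each lookup is a hash probe per retained week rather than a scan of the whole list, so the cost per row does not grow with the number of rows already sent. Tens of millions of digests held in Python sets would probably need several gigabytes of memory, so a disk-backed store may be the more realistic choice at the full 60 million rows.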

Any ideas would be appreciated.

pxy2d1

05-19-2010

Since this is a theoretical question rather than a real how-to question: what are you already doing to process the first file? Bear in mind that what you're asking about is more of a framework question than a scripting issue.

There are plenty of ways to skin the cat, but which way have you started to do it? No sense in us providing a method that doesn't fit your approach.

curleb

05-19-2010

We must assume you have a mainstream database engine which can handle CSV files and a computer which has capacity for this task. You really don't mention much about the data or the computer.

At a design level, each record must contain a unique key and whatever information is a parameter to the "purge". Unless you know the source "purge" rules, your database of "data already processed" will just grow.
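As a rough illustration of the unique-key idea, here is a minimal sketch using SQLite from Python's standard library. The table name `sent`, the columns `rec_key` and `first_week`, and the helper functions are all hypothetical; it also assumes week labels sort chronologically as text (for example `2010-W20`).

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a hypothetical store of keys already sent."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sent ("
        " rec_key TEXT PRIMARY KEY,"    # the record's unique key
        " first_week TEXT NOT NULL)"    # purge parameter: week first seen
    )
    return conn


def record_if_new(conn, rec_key, week):
    """Insert the key; return True only if it was not already present."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO sent (rec_key, first_week) VALUES (?, ?)",
        (rec_key, week),
    )
    return cur.rowcount == 1


def purge_before(conn, week):
    """Delete keys first seen before `week`; return how many were removed."""
    cur = conn.execute("DELETE FROM sent WHERE first_week < ?", (week,))
    conn.commit()
    return cur.rowcount
```

The primary key gives an indexed lookup per record, and the `first_week` column is what lets the store shrink once the purge rule is known.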

It would make more sense to fix the data feed design at source. A convention is to mark each record in the source database with a unique extract run reference, which prevents repeat extracts while still allowing a rerun.
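The run-reference convention could look something like the sketch below, again using SQLite purely for illustration. The `source` table, its `extract_run` column and both functions are assumptions, not a description of any real feed.

```python
import sqlite3


def make_demo_source(payloads):
    """Build a hypothetical in-memory source table with unextracted rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE source ("
        " id INTEGER PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " extract_run TEXT)"  # NULL until the row has been extracted
    )
    conn.executemany(
        "INSERT INTO source (payload) VALUES (?)", [(p,) for p in payloads]
    )
    conn.commit()
    return conn


def extract(conn, run_id):
    """Claim unextracted rows for run_id, then return every row in that run."""
    conn.execute(
        "UPDATE source SET extract_run = ? WHERE extract_run IS NULL", (run_id,)
    )
    conn.commit()
    rows = conn.execute(
        "SELECT id, payload FROM source WHERE extract_run = ? ORDER BY id",
        (run_id,),
    )
    return rows.fetchall()
```

Calling `extract` with a new run reference only picks up rows that no earlier run has claimed, while calling it again with an old reference reproduces that run's rows exactly.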

methyl

05-19-2010

I've had to create a differential list before for a similar task.

Records 1 to 100 would be sent, followed by 90 - 300, followed by 250 - whatever.

Each time, I would create a list of the last N lines captured. In my case 5 was sufficient; you may need more or fewer. I would then search the next file for those last lines and process from there. Upon completion, I would create my new 'last N lines' and repeat the next time 'round.
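A minimal sketch of that last-N-lines approach follows. It assumes each file is in the same order and that the overlap is one contiguous run at the start of the new file; the function names are invented, and resuming after the last occurrence of the marker (rather than the first) is a choice made here, not something stated above.

```python
from collections import deque


def last_lines(lines, n=5):
    """Return the final n lines of an iterable as a list (the marker)."""
    return list(deque(lines, maxlen=n))


def find_resume_index(lines, marker):
    """Index of the first line after the last occurrence of `marker`.

    Returns 0 when the marker is empty or never appears, meaning the
    whole file is treated as new.
    """
    n = len(marker)
    if n == 0:
        return 0
    window = deque(maxlen=n)  # sliding window over the most recent n lines
    resume = 0
    for i, line in enumerate(lines):
        window.append(line)
        if list(window) == marker:
            resume = i + 1
    return resume
```

With the numbers above, the marker from records 1 to 100 is records 96 to 100, and the next file (90 to 300) resumes at record 101. If the rows can arrive in a different order each week, this approach does not apply.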

Ultimately, the solution is to fix the distribution method so that it is consistent.

avronius

05-19-2010

Please show us the structure of your CSV file (fields, keys...).

Jean-Pierre.

aigles

05-19-2010

A CSV is a text file; is the diff command not suitable for you?
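Note that diff compares files position by position, whereas the task here is closer to a set difference: keep the lines of the new file that never appeared before. A different technique that fits that need, sketched here under the assumption that both inputs are already sorted the way Python compares strings and have their line endings stripped, is a single merge-style pass. The function name is hypothetical.

```python
def only_in_new(sorted_old, sorted_new):
    """Yield lines of sorted_new that do not occur in sorted_old.

    Both inputs must be sorted ascending by Python's string ordering.
    Each input is read once, so memory use stays constant.
    """
    old_iter = iter(sorted_old)
    old = next(old_iter, None)
    for line in sorted_new:
        # Advance the old side until it catches up with the current line.
        while old is not None and old < line:
            old = next(old_iter, None)
        if old is None or old != line:
            yield line
```

Lines repeated within the new file itself are yielded each time they occur, so they would need a separate pass if they also count as duplicates.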

rdcwayx

05-20-2010

Quote:

Originally Posted by methyl

We must assume you have a mainstream database engine which can handle CSV files and a computer which has capacity for this task. You really don't mention much about the data or the computer.

At a design level, each record must contain a unique key and whatever information is a parameter to the "purge". Unless you know the source "purge" rules, your database of "data already processed" will just grow.

It would make more sense to fix the data feed design at source. A convention is to mark each record in the source database with a unique extract run reference, which prevents repeat extracts while still allowing a rerun.

Unique key: this is not required. The record as a whole can be unique without any single column in it being unique. Worse, after applying normalization or some other transformation, whether two records count as unique can change.
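To make the point concrete, here is a small sketch in which the whole record serves as the key and a made-up normalization step (trim whitespace, lowercase every field) changes whether two rows count as the same. The functions `normalize` and `record_key` are hypothetical, and real normalization rules would depend on the data.

```python
import csv
import hashlib


def normalize(fields):
    """Hypothetical normalization: trim whitespace and lowercase each field."""
    return [f.strip().lower() for f in fields]


def record_key(line, normalized=True):
    """Derive a key from the whole record rather than from one column."""
    fields = next(csv.reader([line]))  # parse a single CSV line
    if normalized:
        fields = normalize(fields)
    # Join with a separator unlikely to occur inside a field.
    joined = "\x1f".join(fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

For example, `Smith, 42 ,Leeds` and `smith,42,leeds` get different keys when compared raw but the same key after this normalization, so the transformation has to be fixed before deciding what counts as a duplicate.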

Fixing the problem at source is definitely out of the question. I agree it makes sense to do that, but most of the time it is completely out of scope. Not everything flows in line. We need to work with what has been received, or what could potentially be received, in a real-world scenario.

matrixmadhan
