CSV file: Find duplicates, save original and duplicate records in a new file


 
# 1  
Old 07-05-2011
CSV file: Find duplicates, save original and duplicate records in a new file

Hi Unix gurus,

Maybe it is too much to ask, but please take a moment and help me out; this is a very humble request. I'm new to Unix and have just started learning it, and I have a project that is way too advanced for me.

File format: CSV
The file has four columns and no header.
File size: 120GB

Here are a few sample rows:

Code:
72426459560          2010-06-2 ABC                           LC11100619758

95327GNFA4S          2010-06-2 XYZ                           97BCX3AMD10G

95327GNFA4S          2010-06-2 XYZ                           97BCX3AMKLMO

900278VGA4T          2010-06-2 KLM                            QVA697C8LAYMACBF

900278VG567          2010-06-2 LUF                            QVA697C8LAYMACBF

There are duplicates in columns 1 and 4 (I know this for a fact).
I would like to find all the duplicates in columns 1 and 4. In the example above, I want rows 2 and 3 (since column 1 has duplicate values) and also rows 4 and 5 (since column 4 has duplicate values).

If this is too complicated, maybe I can look for duplicates in column 1 first and save a new file, and then look for duplicates in column 4. (Since I am new to Unix, maybe that's the way to go.)

I want to save all the duplicates along with the original records (as in the example above) in a new CSV file.

---------- Post updated at 01:59 PM ---------- Previous update was at 01:56 PM ----------

For more clarity, my results would look like this:

Code:
95327GNFA4S 2010-06-2 XYZ 97BCX3AMD10G

95327GNFA4S 2010-06-2 XYZ 97BCX3AMKLMO

900278VGA4T 2010-06-2 KLM QVA697C8LAYMACBF

900278VG567 2010-06-2 LUF QVA697C8LAYMACBF
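
For reference, on a file small enough to read twice (not the full 120GB original), a common awk idiom produces this kind of output for a single column. This is only a sketch: data.csv is a placeholder name, and the fields are assumed to be whitespace-separated as in the sample above (use -F',' if they are truly comma-separated).

Code:
# First pass counts how often each column-1 value occurs; second pass
# prints every row whose column-1 value occurs more than once.
# Change $1 to $4 to do the same for column 4.
awk 'NR==FNR { count[$1]++; next } count[$1] > 1' data.csv data.csv > dups_col1.csv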

# 2  
Old 07-05-2011
Answered a kind-of similar question recently

Take a look at:
https://www.unix.com/unix-dummies-que...#post302534704
where I find the duplicates and then act on them with a grep command, using a file that contains the matching data.
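
A rough sketch of that general approach (the file names here are placeholders, and the exact commands differ from that thread):

Code:
# 1) Collect the column-1 values that occur more than once.
awk '{ print $1 }' data.csv | sort | uniq -d > dupkeys.txt

# 2) Pull every full record that contains one of those keys.
#    Note: grep -F matches a key anywhere on the line, which is fine
#    as long as key values do not also appear in other columns.
grep -F -f dupkeys.txt data.csv > duplicates.txt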
# 3  
Old 07-05-2011
Most of the ways to do this don't work on a 120 GIGABYTE data file. Finding duplicates, in particular, means checking each row against all other rows. The easy/fast/efficient ways depend on having enough memory to hold at least the relevant fraction of the data.

I'll have to think about this. Is your OS 32 or 64 bit? How much memory does it have? Do you have a lot of free disk space?
# 4  
Old 07-05-2011
I'm on 64-bit Mac OS with 8GB of RAM and plenty of hard disk space. By the way, for something else, I had already split this file into files of 12 to 15GB based on column 3. I could run Unix code on those 12-15GB files and that would fulfill my needs. Thanks a ton for looking into my question; I appreciate your effort.
# 5  
Old 07-05-2011
15 gigs is still too large to sort in one go.

Here's a smaller-scale version that works on column 3. I set the split size to 100K so I could do sensible testing on my 5 MB test data.

Code:
#!/bin/sh

COL=3

# Break the file into sortable chunks named xaa, xab, ... xzz.
# Try a chunk size between 500M and 2G for your gigantic file.
split -C 100K < megadata.txt

# Sort each chunk individually on the given column.
for FILE in x??
do
        sort -k "$COL" < "${FILE}" > "${FILE}.tmp"
        mv "${FILE}.tmp" "${FILE}"
done

# Merge the sorted chunks into one big sorted stream and run it through awk,
# which prints only the first row seen for each value of column COL.
sort -k "$COL" -m x?? | awk -v COL="${COL}" '{
        if (LAST != $COL) print;
        LAST = $COL;
}' > megasorted.txt

# Delete the split chunks
# rm -f x??

It keeps only the first row that has a given value in column COL and discards the rest.

Last edited by Corona688; 07-05-2011 at 05:00 PM..
# 6  
Old 07-05-2011
Thanks a ton, Corona688! I'm not sure I was clear in my previous post. Basically, I also have files of about 15GB in size. If I take one of these files, the sample data would look like this:

72426459560 2010-06-2 ABC LC11100619758

95327GNFA4S 2010-06-2 ABC 97BCX3AMD10G

95327GNFA4S 2010-06-2 ABC 97BCX3AMKLMO

900278VGA4T 2010-06-2 ABC QVA697C8LAYMACBF

900278VG567 2010-06-2 ABC QVA697C8LAYMACBF

(column 3 would be the same for the entire 15GB file)

From this file, I want to find duplicates in columns 1 and 4. The output would look something like this:

95327GNFA4S 2010-06-2 ABC 97BCX3AMD10G

95327GNFA4S 2010-06-2 ABC 97BCX3AMKLMO

900278VGA4T 2010-06-2 ABC QVA697C8LAYMACBF

900278VG567 2010-06-2 ABC QVA697C8LAYMACBF

I can use code that finds duplicates only in column 1 and saves them to a file, ABC.txt, then rerun the same code to find duplicates in column 4 and save them to ABC2.txt. Two different files is fine.

I want to save all the duplicates along with the original records (as in the example above) in a new CSV file.

Last edited by arvindosu; 07-05-2011 at 05:16 PM..
# 7  
Old 07-05-2011
Does your data really have all those blank lines in it?

---------- Post updated at 02:27 PM ---------- Previous update was at 02:24 PM ----------

Assuming it doesn't actually have all those extra blank lines:

Code:
#!/bin/sh

COL=4

# Break the file into manageable chunks named xaa, xab, ...
split -C 100K < megadata.txt

# Sort each chunk individually on the given column.
for FILE in x??
do
        sort -k "$COL" < "${FILE}" > "${FILE}.tmp"
        mv "${FILE}.tmp" "${FILE}"
done

# Merge the sorted chunks and print every row whose column COL value matches
# the previous row's, including the first ("original") row of each group.
sort -k "$COL" -m x?? | awk -v COL="${COL}" '{
        if ($COL == LAST)
        {
                if (orig)
                {
                        print orig;
                        orig = "";
                }
                print;
        }
        else
        {
                LAST = $COL;
                orig = $0;
        }
}' > output.txt

# Delete the split chunks
rm -f x??

Will find
Code:
900278VG567 2010-06-2 ABC QVA697C8LAYMACBF
900278VGA4T 2010-06-2 ABC QVA697C8LAYMACBF

based on your input data.

I don't know of a way to do both columns at once. That'd bring you back to the original problem of needing to store everything in memory at once to tell if there were duplicates.
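
If two separate passes are acceptable, though, one way to end up with a single file (a sketch only, and my assumption rather than anything from the thread: it presumes the script above is saved as find_dups.sh, changed to take the column number as its first argument via COL=$1, and to write to output.$1.txt instead of output.txt) would be:

Code:
#!/bin/sh
# Hypothetical wrapper around the script above (find_dups.sh is an assumed name).
./find_dups.sh 1    # duplicates in column 1 -> output.1.txt
./find_dups.sh 4    # duplicates in column 4 -> output.4.txt

# Combine both passes, dropping any record flagged by both:
sort -u output.1.txt output.4.txt > all_duplicates.txt

A record that is a duplicate in both columns then shows up only once in all_duplicates.txt, since sort -u drops identical whole lines.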

Last edited by Corona688; 07-05-2011 at 05:45 PM..