find duplicate records... again


 
# 1  
Old 01-26-2009

Hi all:

Let's suppose I have a file like this (but with many more records).

Code:
XX ME   342 8688 2006  7  6 3c  60.029  -38.568  2901 0001   74   4 7603  8
    969.8    958.4   3.6320  34.8630
    985.5    973.9   3.6130  34.8600
    998.7    986.9   3.6070  34.8610
   1003.6    991.7   3.6240  34.8660
**
XX ME   342 8689 2006  7  6 3c  60.065  -38.617  2890 0001   74   4 7603  8
    960.9    949.6   3.6020  34.8580
    976.5    965.0   3.5870  34.8580
    991.6    979.9   3.5800  34.8580
   1002.8    990.9   3.5760  34.8580
   1003.9    992.0   3.5760  34.8590
**
XX ME   342 9690 2006  7  7 3c  60.100  -38.669  2876 0001   74   4 7603  8
    975.3    963.8   3.5820  34.8580
    992.3    980.6   3.5660  34.8570
   1003.3    991.4   3.5640  34.8580
   1004.4    992.5   3.5630  34.8590
**
XX ME   342 8688 2006  7  6 3c  60.029  -38.568  2901 0001   74   4 7603  8
      1.6      1.6   8.9330  34.9230
     13.5     13.4   8.4880  34.9200
**

That is a sequence of records, each composed of a header line, the data lines, and an end-of-record delimiter ('**').

I'd like to:
1. retain the unique data, that is, exclude the duplicate records. Duplicates should be identified by comparing fields 5, 6, 7, 9 and 10 of the header lines.
2. list ALL the duplicates (for further examination).

In the example above, it should return:

Code:
XX ME   342 8689 2006  7  6 3c  60.065  -38.617  2890 0001   74   4 7603  8
    960.9    949.6   3.6020  34.8580
    976.5    965.0   3.5870  34.8580
    991.6    979.9   3.5800  34.8580
   1002.8    990.9   3.5760  34.8580
   1003.9    992.0   3.5760  34.8590
**
XX ME   342 9690 2006  7  7 3c  60.100  -38.669  2876 0001   74   4 7603  8
    975.3    963.8   3.5820  34.8580
    992.3    980.6   3.5660  34.8570
   1003.3    991.4   3.5640  34.8580
   1004.4    992.5   3.5630  34.8590
**

for the unique records, and

Code:
XX ME   342 8688 2006  7  6 3c  60.029  -38.568  2901 0001   74   4 7603  8
    969.8    958.4   3.6320  34.8630
    985.5    973.9   3.6130  34.8600
    998.7    986.9   3.6070  34.8610
   1003.6    991.7   3.6240  34.8660
**
XX ME   342 8688 2006  7  6 3c  60.029  -38.568  2901 0001   74   4 7603  8
      1.6      1.6   8.9330  34.9230
     13.5     13.4   8.4880  34.9200
**

for the dupes. Is there a simple way to achieve this?

Thanks,

r.-
# 2  
Old 01-27-2009
Try...
Code:
gawk 'BEGIN{RS="\\*\\*\n+";ORS="**\n"}
      NR==FNR{a[$5,$6,$7,$9,$10]++;next}
      {print $0 > FILENAME "." (a[$5,$6,$7,$9,$10]==1?"uniq":"dupe")}' file file
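
The same file is named twice, so the script makes two passes over it. The first pass (NR==FNR is true only while the first file is being read) counts how many records share the key built from header fields 5, 6, 7, 9 and 10; the second pass writes each record to file.uniq if its key occurs once, or to file.dupe otherwise. gawk is needed because a regular expression as RS is a gawk extension. The same script, spread out with comments:

Code:
gawk 'BEGIN { RS = "\\*\\*\n+"; ORS = "**\n" }    # a record is everything up to a "**" line
      NR == FNR {                                 # pass 1: first reading of the file
          a[$5, $6, $7, $9, $10]++                # count records per header key
          next
      }
      {                                           # pass 2: route each record by its count
          print $0 > (FILENAME "." (a[$5, $6, $7, $9, $10] == 1 ? "uniq" : "dupe"))
      }' file file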

Tested...
Code:
$ head -1000 file.*
==> file.dupe <==
XX ME   342 8688 2006  7  6 3c  60.029  -38.568  2901 0001   74   4 7603  8
    969.8    958.4   3.6320  34.8630
    985.5    973.9   3.6130  34.8600
    998.7    986.9   3.6070  34.8610
   1003.6    991.7   3.6240  34.8660
**
XX ME   342 8688 2006  7  6 3c  60.029  -38.568  2901 0001   74   4 7603  8
      1.6      1.6   8.9330  34.9230
     13.5     13.4   8.4880  34.9200
**

==> file.uniq <==
XX ME   342 8689 2006  7  6 3c  60.065  -38.617  2890 0001   74   4 7603  8
    960.9    949.6   3.6020  34.8580
    976.5    965.0   3.5870  34.8580
    991.6    979.9   3.5800  34.8580
   1002.8    990.9   3.5760  34.8580
   1003.9    992.0   3.5760  34.8590
**
XX ME   342 9690 2006  7  7 3c  60.100  -38.669  2876 0001   74   4 7603  8
    975.3    963.8   3.5820  34.8580
    992.3    980.6   3.5660  34.8570
   1003.3    991.4   3.5640  34.8580
   1004.4    992.5   3.5630  34.8590
**
$

# 3  
Old 01-27-2009
Great, that did it seamlessly! I've tried to understand how this piece of code works, but it's beyond my skills.

Now, let me take it a step further. If I have this list of duplicate records:

Code:
58 JH     0  650 1996  6 14 4b  60.000   -6.250   783 0000   28   4 7600  6
    950.0    938.9  -9.9000  34.9112
    972.0    960.6  -9.9000  34.9117
**
RU P5     0   94 1993  4 28 4b  60.000   -5.500   878 0000   15   6 7600  5
    606.0    599.4   7.5300  35.1760    6.591    0.990
    758.0    749.5   0.8000  34.9130    7.074    1.020
**
58 JH     0  650 1996  6 14 4c  60.000   -6.250   783 0000   98   4 7600  6
    962.0    950.7  -9.9000  34.9108
    972.0    960.6  -9.9000  34.9117
**
90 AM   264 9854 1990  4 18 3c  60.000   -7.002   483 0001   42   4 7600  7
    394.0    389.9   6.8000  35.1780
    404.0    399.8   6.7400  35.1690
    414.0    409.7   6.5600  35.1590
**
06 AZ   290 1741 1996  7  9 3c  60.000   -6.845   489 0001   45   4 7600  6
    420.0    415.6   8.7735  35.2983
    430.0    425.5   8.7678  35.2970
    439.0    434.4   8.7582  35.2979
**
XX UN   104 2267 1999 10  2 3u  60.420   -8.580   485 0001    5   3 7600  8
     74.0     73.3  10.4000
    104.0    103.0   9.7000
**
XX IN   104 2286 1999 10  2 3u  60.420   -8.580   485 0001    6   3 7600  8
     74.0     73.3  10.4000
    104.0    103.0   9.7000
**
74 XX 10251 9893 1949  7 30 6b  60.000   -5.420   784 0000   13   4 7600  5
    505.5    500.0   7.9600  35.2200
    596.7    590.0   6.5200  35.1600
**
74 SC  1335   74 1949  7 30 6b  60.000   -5.420   784 0000   13   4 7600  5
    404.3    400.0   8.3900  35.2400
    505.5    500.0   7.1800  35.1900
    596.7    590.0   6.5200  35.1600
**
90 P5 12461 2819 1993  4 28 6b  60.000   -5.500   878 0000   15   6 7600  5
    606.8    600.0   7.5300  35.1800    6.390    0.990
    758.8    750.0   0.8000  34.9100    6.850    1.020
**
06 AZ 10389 5882 1996  7  9 6c  60.000   -6.845   489 0000   50   4 7600  6
    427.6    423.0   8.7777  35.2983
    436.7    432.0   8.7670  35.2970
    443.8    439.0   8.7582  35.2979
**
58 GS  3233  869 1990  4 18 6c  60.000   -7.002   483 0000   42   4 7600  7
    404.0    399.8   6.7400  35.1690
    414.0    409.7   6.5600  35.1590
**

I want to retain only one (or more) of the dupes and send the rest to another file. The criteria for deciding which record to retain would be:

if the second characters of $8 in the headers are different, retain both
else retain the one with the greater first character of $8
else retain the one with the greater $13
else retain the one with $1 ~ XX
else retain the one with $1 ~ UN

In this case the output should be something like:
Code:
58 JH     0  650 1996  6 14 4b  60.000   -6.250   783 0000   28   4 7600  6
    950.0    938.9  -9.9000  34.9112
    972.0    960.6  -9.9000  34.9117
**
RU P5     0   94 1993  4 28 4b  60.000   -5.500   878 0000   15   6 7600  5
    606.0    599.4   7.5300  35.1760    6.591    0.990
    758.0    749.5   0.8000  34.9130    7.074    1.020
**
58 JH     0  650 1996  6 14 4c  60.000   -6.250   783 0000   98   4 7600  6
    962.0    950.7  -9.9000  34.9108
    972.0    960.6  -9.9000  34.9117
**
90 AM   264 9854 1990  4 18 3c  60.000   -7.002   483 0001   42   4 7600  7
    394.0    389.9   6.8000  35.1780
    404.0    399.8   6.7400  35.1690
    414.0    409.7   6.5600  35.1590
**
06 AZ   290 1741 1996  7  9 3c  60.000   -6.845   489 0001   45   4 7600  6
    420.0    415.6   8.7735  35.2983
    430.0    425.5   8.7678  35.2970
    439.0    434.4   8.7582  35.2979
**
XX IN   104 2286 1999 10  2 3u  60.420   -8.580   485 0001    6   3 7600  8
     74.0     73.3  10.4000
    104.0    103.0   9.7000
**
74 SC  1335   74 1949  7 30 6b  60.000   -5.420   784 0000   13   4 7600  5
    404.3    400.0   8.3900  35.2400
    505.5    500.0   7.1800  35.1900
    596.7    590.0   6.5200  35.1600
**

and the rejected:
Code:
90 P5 12461 2821 1993  4 28 6b  60.000   -6.500   458 0000   13   6 7600  6
    303.2    300.0   8.0500  35.2200    6.290    0.860
    404.3    400.0   7.9900  35.2100    6.280    0.890
    460.0    455.0   7.5400  35.1800    6.360    0.910
**
06 AZ 10389 5882 1996  7  9 6c  60.000   -6.845   489 0000   50   4 7600  6
    427.6    423.0   8.7777  35.2983
    436.7    432.0   8.7670  35.2970
    443.8    439.0   8.7582  35.2979
**
58 GS  3233  869 1990  4 18 6c  60.000   -7.002   483 0000   42   4 7600  7
    404.0    399.8   6.7400  35.1690
    414.0    409.7   6.5600  35.1590
**
XX UN   104 2267 1999 10  2 3u  60.420   -8.580   485 0001    5   3 7600  8
     74.0     73.3  10.4000
    104.0    103.0   9.7000
**
74 XX 10251 9893 1949  7 30 6b  60.000   -5.420   784 0000   13   4 7600  5
    505.5    500.0   7.9600  35.2200
    596.7    590.0   6.5200  35.1600
**

I hope you can help me. Thanks,

r.-
# 4  
Old 01-27-2009
While I'm happy to help, I don't have time to do it all for you. Try to understand the awk code provided and modify it to fit your requirements. Others may help if you get stuck.
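
That said, here is an untested sketch along the same lines to get you started. It groups records by the five key fields plus the second character of $8 (so two records whose second characters differ are both retained), keeps the highest-ranking record in each group according to the remaining tie-breakers in order, and sends the rest to a reject file. The input name dupes and the .keep/.reject suffixes are placeholders; check the logic against your data before trusting it:

Code:
gawk 'BEGIN { RS = "\\*\\*\n+"; ORS = "**\n" }
      NR == FNR {                                 # pass 1: find the winner of each group
          grp = $5 SUBSEP $6 SUBSEP $7 SUBSEP $9 SUBSEP $10 SUBSEP substr($8, 2, 1)
          c1 = substr($8, 1, 1) + 0               # greater first character of $8 wins...
          c2 = $13 + 0                            # ...then greater $13...
          c3 = ($1 == "XX")                       # ...then $1 being XX...
          c4 = ($1 == "UN")                       # ...then $1 being UN
          if (!(grp in s1) || c1 > s1[grp] ||
              (c1 == s1[grp] && (c2 > s2[grp] ||
              (c2 == s2[grp] && (c3 > s3[grp] ||
              (c3 == s3[grp] && c4 > s4[grp])))))) {
              s1[grp] = c1; s2[grp] = c2; s3[grp] = c3; s4[grp] = c4
              keep[grp] = FNR                     # record number of the current best
          }
          next
      }
      {                                           # pass 2: winner goes to .keep, the rest to .reject
          grp = $5 SUBSEP $6 SUBSEP $7 SUBSEP $9 SUBSEP $10 SUBSEP substr($8, 2, 1)
          print $0 > (FILENAME "." (keep[grp] == FNR ? "keep" : "reject"))
      }' dupes dupes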
# 5  
Old 01-28-2009
Quote:
Originally Posted by Ygor
While I'm happy to help, I don't have time to do it all for you. Try to understand the awk code provided and modify it to fit your requirements. Others may help if you get stuck.
Thanks. I understand and appreciate your help.
Regs,

r.-