find duplicate records... again


 
# 1  
Old 01-26-2009

Hi all:

Let's suppose I have a file like this (but with many more records).

Code:
XX ME   342 8688 2006  7  6 3c  60.029  -38.568  2901 0001   74   4 7603  8
    969.8    958.4   3.6320  34.8630
    985.5    973.9   3.6130  34.8600
    998.7    986.9   3.6070  34.8610
   1003.6    991.7   3.6240  34.8660
**
XX ME   342 8689 2006  7  6 3c  60.065  -38.617  2890 0001   74   4 7603  8
    960.9    949.6   3.6020  34.8580
    976.5    965.0   3.5870  34.8580
    991.6    979.9   3.5800  34.8580
   1002.8    990.9   3.5760  34.8580
   1003.9    992.0   3.5760  34.8590
**
XX ME   342 9690 2006  7  7 3c  60.100  -38.669  2876 0001   74   4 7603  8
    975.3    963.8   3.5820  34.8580
    992.3    980.6   3.5660  34.8570
   1003.3    991.4   3.5640  34.8580
   1004.4    992.5   3.5630  34.8590
**
XX ME   342 8688 2006  7  6 3c  60.029  -38.568  2901 0001   74   4 7603  8
      1.6      1.6   8.9330  34.9230
     13.5     13.4   8.4880  34.9200
**

That is a sequence of records, each composed of a header line, the data lines, and an end-of-record delimiter ('**').

I'd like to:
1. retain the unique data, that is, exclude the duplicate records. Duplicates should be identified by comparing fields 5, 6, 7, 9 and 10 of the header lines.
2. list ALL the duplicates (for further examination).

In the example above, it should return:

Code:
XX ME   342 8689 2006  7  6 3c  60.065  -38.617  2890 0001   74   4 7603  8
    960.9    949.6   3.6020  34.8580
    976.5    965.0   3.5870  34.8580
    991.6    979.9   3.5800  34.8580
   1002.8    990.9   3.5760  34.8580
   1003.9    992.0   3.5760  34.8590
**
XX ME   342 9690 2006  7  7 3c  60.100  -38.669  2876 0001   74   4 7603  8
    975.3    963.8   3.5820  34.8580
    992.3    980.6   3.5660  34.8570
   1003.3    991.4   3.5640  34.8580
   1004.4    992.5   3.5630  34.8590
**

for the unique records, and

Code:
XX ME   342 8688 2006  7  6 3c  60.029  -38.568  2901 0001   74   4 7603  8
    969.8    958.4   3.6320  34.8630
    985.5    973.9   3.6130  34.8600
    998.7    986.9   3.6070  34.8610
   1003.6    991.7   3.6240  34.8660
**
XX ME   342 8688 2006  7  6 3c  60.029  -38.568  2901 0001   74   4 7603  8
      1.6      1.6   8.9330  34.9230
     13.5     13.4   8.4880  34.9200
**

for the dupes. Is there a simple way to achieve this?

Thanks,

r.-
# 2  
Old 01-27-2009
Try...
Code:
gawk 'BEGIN{RS="\\*\\*\n+";ORS="**\n"}
      NR==FNR{a[$5,$6,$7,$9,$10]++;next}
      {print $0 > FILENAME "." (a[$5,$6,$7,$9,$10]==1?"uniq":"dupe")}' file file
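
The same file is named twice, so the script makes two passes over it. The first pass (NR==FNR is true only while the first file is being read) counts how many records share the key built from header fields 5, 6, 7, 9 and 10; the second pass writes each record to file.uniq if its key occurs once, or to file.dupe otherwise. gawk is needed because a regular expression as RS is a gawk extension. The same script, spread out with comments:

Code:
gawk 'BEGIN { RS = "\\*\\*\n+"; ORS = "**\n" }    # a record is everything up to a "**" line
      NR == FNR {                                 # pass 1: first reading of the file
          a[$5, $6, $7, $9, $10]++                # count records per header key
          next
      }
      {                                           # pass 2: route each record by its count
          print $0 > (FILENAME "." (a[$5, $6, $7, $9, $10] == 1 ? "uniq" : "dupe"))
      }' file file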

Tested...
Code:
$ head -1000 file.*
==> file.dupe <==
XX ME   342 8688 2006  7  6 3c  60.029  -38.568  2901 0001   74   4 7603  8
    969.8    958.4   3.6320  34.8630
    985.5    973.9   3.6130  34.8600
    998.7    986.9   3.6070  34.8610
   1003.6    991.7   3.6240  34.8660
**
XX ME   342 8688 2006  7  6 3c  60.029  -38.568  2901 0001   74   4 7603  8
      1.6      1.6   8.9330  34.9230
     13.5     13.4   8.4880  34.9200
**

==> file.uniq <==
XX ME   342 8689 2006  7  6 3c  60.065  -38.617  2890 0001   74   4 7603  8
    960.9    949.6   3.6020  34.8580
    976.5    965.0   3.5870  34.8580
    991.6    979.9   3.5800  34.8580
   1002.8    990.9   3.5760  34.8580
   1003.9    992.0   3.5760  34.8590
**
XX ME   342 9690 2006  7  7 3c  60.100  -38.669  2876 0001   74   4 7603  8
    975.3    963.8   3.5820  34.8580
    992.3    980.6   3.5660  34.8570
   1003.3    991.4   3.5640  34.8580
   1004.4    992.5   3.5630  34.8590
**
$

# 3  
Old 01-27-2009
Great, that did it seamlessly! I've tried to understand how this piece of code works, but it's beyond my skills.

Now, let me take it a step further. If I have this list of duplicate records:

Code:
58 JH     0  650 1996  6 14 4b  60.000   -6.250   783 0000   28   4 7600  6
    950.0    938.9  -9.9000  34.9112
    972.0    960.6  -9.9000  34.9117
**
RU P5     0   94 1993  4 28 4b  60.000   -5.500   878 0000   15   6 7600  5
    606.0    599.4   7.5300  35.1760    6.591    0.990
    758.0    749.5   0.8000  34.9130    7.074    1.020
**
58 JH     0  650 1996  6 14 4c  60.000   -6.250   783 0000   98   4 7600  6
    962.0    950.7  -9.9000  34.9108
    972.0    960.6  -9.9000  34.9117
**
90 AM   264 9854 1990  4 18 3c  60.000   -7.002   483 0001   42   4 7600  7
    394.0    389.9   6.8000  35.1780
    404.0    399.8   6.7400  35.1690
    414.0    409.7   6.5600  35.1590
**
06 AZ   290 1741 1996  7  9 3c  60.000   -6.845   489 0001   45   4 7600  6
    420.0    415.6   8.7735  35.2983
    430.0    425.5   8.7678  35.2970
    439.0    434.4   8.7582  35.2979
**
XX UN   104 2267 1999 10  2 3u  60.420   -8.580   485 0001    5   3 7600  8
     74.0     73.3  10.4000
    104.0    103.0   9.7000
**
XX IN   104 2286 1999 10  2 3u  60.420   -8.580   485 0001    6   3 7600  8
     74.0     73.3  10.4000
    104.0    103.0   9.7000
**
74 XX 10251 9893 1949  7 30 6b  60.000   -5.420   784 0000   13   4 7600  5
    505.5    500.0   7.9600  35.2200
    596.7    590.0   6.5200  35.1600
**
74 SC  1335   74 1949  7 30 6b  60.000   -5.420   784 0000   13   4 7600  5
    404.3    400.0   8.3900  35.2400
    505.5    500.0   7.1800  35.1900
    596.7    590.0   6.5200  35.1600
**
90 P5 12461 2819 1993  4 28 6b  60.000   -5.500   878 0000   15   6 7600  5
    606.8    600.0   7.5300  35.1800    6.390    0.990
    758.8    750.0   0.8000  34.9100    6.850    1.020
**
06 AZ 10389 5882 1996  7  9 6c  60.000   -6.845   489 0000   50   4 7600  6
    427.6    423.0   8.7777  35.2983
    436.7    432.0   8.7670  35.2970
    443.8    439.0   8.7582  35.2979
**
58 GS  3233  869 1990  4 18 6c  60.000   -7.002   483 0000   42   4 7600  7
    404.0    399.8   6.7400  35.1690
    414.0    409.7   6.5600  35.1590
**

I want to retain only one (or more) of the dupes and send the rest to another file. The criteria for deciding which record to retain would be:

if the second characters of $8 in the headers are different, retain both
else retain the one with the greater first character of $8
else retain the one with the greater $13
else retain the one with $1 ~ XX
else retain the one with $1 ~ UN

In this case the output should be something like:
Code:
58 JH     0  650 1996  6 14 4b  60.000   -6.250   783 0000   28   4 7600  6
    950.0    938.9  -9.9000  34.9112
    972.0    960.6  -9.9000  34.9117
**
RU P5     0   94 1993  4 28 4b  60.000   -5.500   878 0000   15   6 7600  5
    606.0    599.4   7.5300  35.1760    6.591    0.990
    758.0    749.5   0.8000  34.9130    7.074    1.020
**
58 JH     0  650 1996  6 14 4c  60.000   -6.250   783 0000   98   4 7600  6
    962.0    950.7  -9.9000  34.9108
    972.0    960.6  -9.9000  34.9117
**
90 AM   264 9854 1990  4 18 3c  60.000   -7.002   483 0001   42   4 7600  7
    394.0    389.9   6.8000  35.1780
    404.0    399.8   6.7400  35.1690
    414.0    409.7   6.5600  35.1590
**
06 AZ   290 1741 1996  7  9 3c  60.000   -6.845   489 0001   45   4 7600  6
    420.0    415.6   8.7735  35.2983
    430.0    425.5   8.7678  35.2970
    439.0    434.4   8.7582  35.2979
**
XX IN   104 2286 1999 10  2 3u  60.420   -8.580   485 0001    6   3 7600  8
     74.0     73.3  10.4000
    104.0    103.0   9.7000
**
74 SC  1335   74 1949  7 30 6b  60.000   -5.420   784 0000   13   4 7600  5
    404.3    400.0   8.3900  35.2400
    505.5    500.0   7.1800  35.1900
    596.7    590.0   6.5200  35.1600
**

and the rejected:
Code:
90 P5 12461 2821 1993  4 28 6b  60.000   -6.500   458 0000   13   6 7600  6
    303.2    300.0   8.0500  35.2200    6.290    0.860
    404.3    400.0   7.9900  35.2100    6.280    0.890
    460.0    455.0   7.5400  35.1800    6.360    0.910
**
06 AZ 10389 5882 1996  7  9 6c  60.000   -6.845   489 0000   50   4 7600  6
    427.6    423.0   8.7777  35.2983
    436.7    432.0   8.7670  35.2970
    443.8    439.0   8.7582  35.2979
**
58 GS  3233  869 1990  4 18 6c  60.000   -7.002   483 0000   42   4 7600  7
    404.0    399.8   6.7400  35.1690
    414.0    409.7   6.5600  35.1590
**
XX UN   104 2267 1999 10  2 3u  60.420   -8.580   485 0001    5   3 7600  8
     74.0     73.3  10.4000
    104.0    103.0   9.7000
**
74 XX 10251 9893 1949  7 30 6b  60.000   -5.420   784 0000   13   4 7600  5
    505.5    500.0   7.9600  35.2200
    596.7    590.0   6.5200  35.1600
**

I hope you can help me. Thanks,

r.-
# 4  
Old 01-27-2009
While I'm happy to help, I don't have time to do it all for you. Try to understand the awk code provided and modify it to fit your requirements. Others may help if you get stuck.
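
That said, here is an untested sketch along the same lines to get you started. It groups records by the five key fields plus the second character of $8 (so two records whose second characters differ are both retained), keeps the highest-ranking record in each group according to the remaining tie-breakers in order, and sends the rest to a reject file. The input name dupes and the .keep/.reject suffixes are placeholders; check the logic against your data before trusting it:

Code:
gawk 'BEGIN { RS = "\\*\\*\n+"; ORS = "**\n" }
      NR == FNR {                                 # pass 1: find the winner of each group
          grp = $5 SUBSEP $6 SUBSEP $7 SUBSEP $9 SUBSEP $10 SUBSEP substr($8, 2, 1)
          c1 = substr($8, 1, 1) + 0               # greater first character of $8 wins...
          c2 = $13 + 0                            # ...then greater $13...
          c3 = ($1 == "XX")                       # ...then $1 being XX...
          c4 = ($1 == "UN")                       # ...then $1 being UN
          if (!(grp in s1) || c1 > s1[grp] ||
              (c1 == s1[grp] && (c2 > s2[grp] ||
              (c2 == s2[grp] && (c3 > s3[grp] ||
              (c3 == s3[grp] && c4 > s4[grp])))))) {
              s1[grp] = c1; s2[grp] = c2; s3[grp] = c3; s4[grp] = c4
              keep[grp] = FNR                     # record number of the current best
          }
          next
      }
      {                                           # pass 2: winner goes to .keep, the rest to .reject
          grp = $5 SUBSEP $6 SUBSEP $7 SUBSEP $9 SUBSEP $10 SUBSEP substr($8, 2, 1)
          print $0 > (FILENAME "." (keep[grp] == FNR ? "keep" : "reject"))
      }' dupes dupes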
# 5  
Old 01-28-2009
Quote:
Originally Posted by Ygor
While I'm happy to help, I don't have time to do it all for you. Try to understand the awk code provided and modify it to fit your requirements. Others may help if you get stuck.
Thanks. I understand and appreciate your help.
Regs,

r.-