Identifying dupes within a database and creating unique sub-sets

12-16-2013

Hello,
I have a database of name variants with the following structure:

Code:

variant=variant=variant

A single line can hold as many as thirty to forty variants.
Since the database is quite large (at present around 60,000 lines), duplicate sets of variants creep in. For example:

Code:

John=Johann=Jon

and, some hundred lines further on:

Code:

Jan=Johann

What I need is a script (Perl or AWK, since I work under Windows) that does the following:
1. Identify such duplicates. In the example above,

Code:

Johann

is a duplicate entry, since it appears in both lines.
2. Connect both entries into one single entry:

Code:

John=Johann=Jon=Jan=Johann

3. Remove the duplicate(s), leaving one single set of unique name variants:

Code:

John=Johann=Jon=Jan
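
Steps 2 and 3 amount to joining the two lines with "=" and keeping only the first occurrence of each name. A minimal awk sketch of just that part, using the John example. It assumes every input line belongs to the same set (it does not decide which lines overlap), and the names `join_and_dedupe` and `uniq_fields` are made up for illustration:

```shell
# join_and_dedupe and uniq_fields are hypothetical names used only in this sketch.
join_and_dedupe() {
    awk '
    # Return the =-separated list with empty and repeated names dropped.
    function uniq_fields(list,    seen, n, parts, i, out) {
        n = split(list, parts, "=")
        for (i = 1; i <= n; i++)
            if (parts[i] != "" && !(parts[i] in seen)) {
                seen[parts[i]]
                out = out "=" parts[i]
            }
        return substr(out, 2)
    }
    { joined = joined "=" $0 }          # step 2: connect the lines
    END { print uniq_fields(joined) }   # step 3: keep each name once
    '
}

printf '%s\n' 'John=Johann=Jon' 'Jan=Johann' | join_and_dedupe
```

For the two lines shown, this prints John=Johann=Jon=Jan.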

The script, I am sure, would also prove useful for others who face similar duplication problems in their databases.
Here is a sample input:

Code:

Peter=Pieter=Piotr
Mary=Mariam
Pierre=Peter
Marium=Mary=Marie=Maria
Shyam=Syam=Siam
Shym=Shyam=Shhyam=Shayam=Sham=Syam=Siam=Sam

The expected output would be:

Code:

Marium=Mary=Marie=Maria=Mariam
Peter=Pieter=Piotr=Pierre
Sam=Sham=Shayam=Shhyam=Shyam=Shym=Siam=Syam

Many thanks in advance for your help

gimley

12-16-2013

You could try this, but I'm not sure how quick it will be:

Code:

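awk '
# Return "list" (an =-separated string) with empty and repeated names dropped,
# keeping the first occurrence of each. Parameters after "list" are locals.
function remove_dups(list,    have, num, keys, i, new) {
    have[""]                          # pre-seed so empty fields are skipped
    num = split(list, keys, "=")
    for (i = 1; i <= num; i++) {
        if (!(keys[i] in have)) new = new "=" keys[i]
        have[keys[i]]
    }
    return substr(new, 2)             # strip the leading "="
}
# Fold one input line into every stored set that shares a name with it.
# List[master] holds a merged set; Found[name] is the master of the set
# that currently contains that name.
function merge(list,    num, keys, i, new, master) {
    new = remove_dups(list)
    num = split(new, keys, "=")
    master = keys[1]
    for (i = 1; i <= num; i++)
        if (keys[i] in Found) {
            # Pull in the existing set, then retire its old entry.
            new = remove_dups(List[Found[keys[i]]] "=" new)
            delete List[Found[keys[i]]]
        }
    num = split(new, keys, "=")
    List[master] = new
    for (i = 1; i <= num; i++) Found[keys[i]] = master
}
NF { merge($0) }                      # skip blank lines
END { for (l in List) print List[l] }' infile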
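
One caveat on ordering: awk does not specify the order in which `for (l in List)` visits entries, so the merged lines can come out in any order, and the names within a line stay in the order in which they were first seen. If an alphabetised result is wanted, a second pass is one option. A minimal sketch, assuming input without blank lines; `sort_names` is a hypothetical name, not part of the script above:

```shell
# sort_names is a hypothetical helper: sorts the names inside each line,
# then sorts the lines themselves.
sort_names() {
    awk '
    {
        # Insertion sort on the names of the current line (string comparison).
        n = split($0, a, "=")
        for (i = 2; i <= n; i++) {
            v = a[i]
            for (j = i - 1; j >= 1 && (a[j] "") > (v ""); j--) a[j + 1] = a[j]
            a[j + 1] = v
        }
        line = a[1]
        for (i = 2; i <= n; i++) line = line "=" a[i]
        print line
    }' | LC_ALL=C sort
}

printf '%s\n' 'Shym=Shyam=Sham=Sam' 'Mary=Mariam' | sort_names
```

The output of the merge script could be piped into such a function in the same way as the `printf` test data here.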

Last edited by Chubler_XL; 12-16-2013 at 11:22 PM.. Reason: Standardise variable names

Chubler_XL

12-17-2013

Many thanks. It was pretty fast. Zipped through 20,000 lines in a few seconds. I doubt that there are any issues, since I tested the output file for dupes and there were none.
Many thanks.
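
A check of that kind can itself be a short awk script. The sketch below is one possible way to do it, not necessarily what was actually run: it prints every name that occurs more than once anywhere in the file, and prints nothing when no name is repeated. `find_dupes` is an illustrative name:

```shell
# find_dupes is a hypothetical helper; it reads files given as arguments,
# or standard input when none are given.
find_dupes() {
    awk -F= '
    {
        for (i = 1; i <= NF; i++)
            if ($i != "" && count[$i]++ == 1)   # report on the second sighting only
                print $i
    }' "$@"
}

# Unmerged input: Johann appears on both lines, so it is reported.
printf '%s\n' 'John=Johann=Jon' 'Jan=Johann' | find_dupes

# Merged output: every name is unique, so nothing is printed.
printf '%s\n' 'John=Johann=Jon=Jan' 'Mary=Mariam' | find_dupes
```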

gimley
