Identifying dupes within a database and creating unique sub-sets


 
# 1  
Old 12-16-2013

Hello,
I have a database of name variants with the following structure:
Code:
variant=variant=variant

The number of variants can be as many as thirty to forty.
Since the database is quite large (at present around 60,000 lines), duplicate sets of variants creep in. For example:
Code:
John=Johann=Jon

and some hundred lines further on:
Code:
Jan=Johann

What I need is a script (Perl or AWK, since I work under Windows) which could do the following:
1. Identify such duplicates. In the example above,
Code:
Johann

appears in both entries, so the two entries overlap.
2. Join both entries into one single entry:
Code:
John=Johann=Jon=Jan=Johann

3. Clean up the dupe(s) and provide one single set of unique name variants:
Code:
John=Johann=Jon=Jan

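As a rough illustration of step 3 on its own, here is a small awk sketch that removes repeated names from a single `=`-separated line and keeps the first occurrence of each. The function name `dedupe_line` is made up for this example, and it assumes names are compared exactly (case-sensitive, no trimming of spaces).

```shell
# Hypothetical helper: reads lines on stdin, prints each line with repeats removed.
dedupe_line() {
  awk -F= '{
      out = ""
      split("", seen)                 # portable way to empty the array for each line
      for (i = 1; i <= NF; i++)
          if ($i != "" && !seen[$i]++)    # skip empty fields and names already seen
              out = (out == "" ? $i : out "=" $i)
      print out
  }'
}
```

Feeding it the joined entry from step 2 (`echo 'John=Johann=Jon=Jan=Johann' | dedupe_line`) gives the line shown in step 3.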
The script, I am sure, would also prove useful for others who face similar problems of duplication in their databases.
Below is a made-up example as input:
Code:
Peter=Pieter=Piotr
Mary=Mariam
Pierre=Peter
Marium=Mary=Marie=Maria
Shyam=Syam=Siam
Shym=Shyam=Shhyam=Shayam=Sham=Syam=Siam=Sam

The expected output would be:
Code:
Marium=Mary=Marie=Maria=Mariam
Peter=Pieter=Piotr=Pierre
Sam=Sham=Shayam=Shhyam=Shyam=Shym=Siam=Syam

Many thanks in advance for your help
# 2  
Old 12-16-2013
You could try this, but I'm not sure how quick it will be:

Code:
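awk '
# Return "list" with repeated (and empty) names removed; first occurrence wins.
function remove_dups(list,    have, num, keys, i, new) {
    have[""]                          # pre-seed so that empty fields are skipped
    num = split(list, keys, "=")
    for (i = 1; i <= num; i++) {
        if (!(keys[i] in have)) new = new "=" keys[i]
        have[keys[i]]
    }
    return substr(new, 2)             # drop the leading "="
}
# Fold one input line into List, absorbing every stored set that shares a name with it.
function merge(list,    num, keys, i, new, master) {
    new = remove_dups(list)
    num = split(new, keys, "=")
    master = keys[1]                  # first name on the line labels the merged set
    for (i = 1; i <= num; i++)
        if (keys[i] in Found) {
            new = remove_dups(List[Found[keys[i]]] "=" new)
            delete List[Found[keys[i]]]
        }
    num = split(new, keys, "=")
    List[master] = new
    for (i = 1; i <= num; i++) Found[keys[i]] = master   # name -> label of its set
}
{ merge($0) }
END { for (l in List) print List[l] }' infile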


Last edited by Chubler_XL; 12-16-2013 at 11:22 PM.. Reason: Standardise variable names
# 3  
Old 12-17-2013
Many thanks. It was pretty fast. Zipped through 20,000 lines in a few seconds. I doubt that there are any issues, since I tested the output file for dupes and there were none.
Many thanks.
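For anyone who wants to repeat that kind of check, a minimal sketch is given below. It reports every name that turns up on more than one line, so an empty report means the sets no longer overlap. The function name `find_cross_line_dupes` and the message format are assumptions made for this example, not the test that was actually run.

```shell
# Hypothetical checker: list names that occur on more than one line.
find_cross_line_dupes() {
  awk -F= '{
      for (i = 1; i <= NF; i++) {
          if ($i == "") continue
          if (($i in line) && line[$i] != NR)   # seen before, on a different line
              print $i ": lines " line[$i] " and " NR
          else
              line[$i] = NR                     # remember where it first appeared
      }
  }' "$@"
}
```

Run it on the merged output, for example `find_cross_line_dupes outfile`, where `outfile` stands for wherever the result was saved.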