Merging dupes on different lines in a dictionary

09-09-2012



I am working on a homonym dictionary of names, i.e. names that are clustered together according to their "sound-alike" pronunciation.
An example will make this clear:

Quote:

ameer=aamir=aameer=amir

Since the dictionary is constructed manually, it often happens that two sets of "homonyms" which should be grouped together are inadvertently grouped separately. Thus:

Quote:

bisnnu=vishnu
vishno=wishnu=vishnoo=vaishnu=vishnu=visnu=veeshnu=vishanu=vishnau

"vishnu" appears in both the first set and the second, so the two sets should be reduced to one:

Quote:

bisnnu=vishnu=vishno=wishnu=vishnoo=vaishnu=vishnu=visnu=veeshnu=vishanu=vishnau

I have written a program which points out such "dupes" and also the line on which they occur in the database. But since I am a newbie in Perl, try as I might, I cannot write a Perl program which will safely merge both sets where there are dupes. I have a script in UltraEdit format which does the job, but it is dreadfully slow.

I am giving below a sample of such dupes:

Quote:

yasmine=yashmeen=yasamin=yasameen=yaasmin=yashameen=yasmeen=yasmin
yashmeen=yazmeen=yasmeen=yasmin=yashmin
watson=vatson=wattson
watson
tekchand=teckchand=tekchanda=tekachnda=tekachnd=teckchanda
tekchand=tekachand
sailesh
shailaesh=shailesh=sailesh

The expected output is:

Quote:

yasmine=yashmeen=yasamin=yasameen=yaasmin=yashameen=yasmeen=yasmin=yashmeen=yazmeen=yasmeen=yasmin=yashmin
watson=vatson=wattson=watson
tekchand=teckchand=tekchanda=tekachnda=tekachnd=teckchanda=tekchand=tekachand
sailesh=shailaesh=shailesh=sailesh

Ideally the program should also weed out duplicates within a given row, but I already have an awk program that does that job efficiently.

Any help would be really great. Many thanks in advance for a Perl or awk script. I work under Windows and hence sed will not help.

gimley


09-09-2012


Here is an awk script:

Code:

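awk -F= '
{
  # Default key for this line is its first name.
  k = $1
  # If a name on this line already belongs to a group, reuse that group key
  # (when several names match, the last match wins).
  for (i = 1; i <= NF; i++)
     if ($i in same) k = same[$i]
  # File every name on this line under the chosen key.
  # Note: names of an older group that are absent from this line keep their old key.
  for (i = 1; i <= NF; i++)
     same[$i] = k
}
END {
   # Build one "="-joined string per key; each name is an array index,
   # so it is listed only once.
   for (i in same)
      keys[same[i]] = keys[same[i]] "=" i
   # Print each group without the leading "="; the output order is unspecified.
   for (k in keys)
      print substr(keys[k], 2)
}' infile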
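
One caveat: the script above files each line under a single existing key, so if a later line shares names with two groups that were built separately (for example `a=b`, then `c=d`, then `b=c`), the older members are not all pulled into one group. If the data can contain such bridging lines, a union-find variant along the following lines may be safer. This is only a sketch: `merge_groups`, `find`, `parent`, `order`, `members` and `roots` are names made up here for illustration, and the carriage-return stripping is an assumption about files saved with Windows line endings.

```shell
# Hypothetical helper: merge_groups [FILE...]
# Prints one merged group per line, names in first-seen order.
merge_groups() {
  awk -F= '
  # Follow parent links up to the representative name of a group.
  function find(x) {
    while (parent[x] != x) x = parent[x]
    return x
  }
  {
    sub(/\r$/, "")                      # assumption: drop a trailing CR if present
    first = ""
    for (i = 1; i <= NF; i++) {
      if ($i == "") continue            # ignore empty fields such as a trailing "="
      if (!($i in parent)) {            # first time this name is seen
        parent[$i] = $i
        order[++n] = $i
      }
      if (first == "") first = $i
      else parent[find($i)] = find(first)   # join the two groups
    }
  }
  END {
    # Walk the names in first-seen order and append each to its group.
    for (j = 1; j <= n; j++) {
      r = find(order[j])
      if (r in members) members[r] = members[r] "=" order[j]
      else { members[r] = order[j]; roots[++m] = r }
    }
    for (j = 1; j <= m; j++) print members[roots[j]]
  }' "$@"
}
```

Usage would be something like `merge_groups infile > merged.txt`. Groups come out in order of first appearance and each name is listed once. Without a POSIX shell, the awk program between the single quotes can be saved to a file and run with `awk -F= -f`.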


Chubler_XL


09-10-2012


Dear Chubler_XL,
It works like magic. For the first time my database has no errors and all the names are perfectly merged. This has saved me days of checking and validation. The diagnostic routine I had written to identify dupes across multiple lines now shows that there are no dupes in any file.
Many thanks.

gimley

