I am working on a homonym dictionary of names, i.e. names which are clustered together according to their “sound-alike” pronunciation.
An example will make this clear:
Since the dictionary is constructed manually, it often happens that two sets of “homonyms” which should be grouped together are inadvertently grouped separately. Thus:
Quote:
bisnnu=vishnu
vishno=wishnu=vishnoo=vaishnu=vishnu=visnu=veeshnu=vishanu=vishnau
“vishnu” appears in both the first set and the second, so the two sets should be merged into one:
Quote:
bisnnu=vishnu=vishno=wishnu=vishnoo=vaishnu=vishnu=visnu=veeshnu=vishanu=vishnau
I have written a program which points out such “dupes” and the line on which each occurs in the database. But since I am a newbie in Perl, try as I might, I cannot write a Perl program which will safely merge the sets that share a dupe. I have an UltraEdit script which does the job, but it is dreadfully slow.
Below is a sample of such dupes:
Quote:
yasmine=yashmeen=yasamin=yasameen=yaasmin=yashameen=yasmeen=yasmin
yashmeen=yazmeen=yasmeen=yasmin=yashmin
watson=vatson=wattson
watson
tekchand=teckchand=tekchanda=tekachnda=tekachnd=teckchanda
tekchand=tekachand
sailesh
shailaesh=shailesh=sailesh
The expected output is:
Quote:
yasmine=yashmeen=yasamin=yasameen=yaasmin=yashameen=yasmeen=yasmin=yashmeen=yazmeen=yasmeen=yasmin=yashmin
watson=vatson=wattson=watson
tekchand=teckchand=tekchanda=tekachnda=tekachnd=teckchanda=tekchand=tekachand
sailesh=shailaesh=shailesh=sailesh
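To make the merge rule concrete, here is a minimal sketch of the behaviour I am after. It is written in Python rather than Perl or AWK, purely as an illustration, and the function name merge_homonym_sets is made up for this post. It assumes one set per line with names separated by "=", treats any two lines that share a name as belonging together (also through a chain of other lines), and keeps repeated names in the merged line, as in the expected output above.

```python
def merge_homonym_sets(lines):
    """Merge '='-separated name sets that share at least one name."""
    parent = {}  # union-find forest: name -> parent name

    def find(name):
        # Register unseen names as their own root, then walk up to the root.
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Split every non-empty line into its list of names.
    rows = [line.strip().split("=") for line in lines if line.strip()]

    # Link every name on a line to the first name on that line.
    for names in rows:
        find(names[0])  # also registers single-name lines
        for other in names[1:]:
            union(names[0], other)

    # Concatenate the lines of each group, in input order.
    merged = {}  # root name -> all names of the group
    for names in rows:
        merged.setdefault(find(names[0]), []).extend(names)
    return ["=".join(members) for members in merged.values()]


sample = [
    "watson=vatson=wattson",
    "watson",
    "sailesh",
    "shailaesh=shailesh=sailesh",
]
for line in merge_homonym_sets(sample):
    print(line)
```

Groups come out in the order in which their first line appears in the input, and lines that share nothing with any other line pass through unchanged.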
Ideally the program should also weed out duplicates within a given row, but I have an awk program that does that job efficiently.
Any help would be really great. Many thanks in advance for a Perl or AWK script. I work under Windows and hence sed will not help.
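For completeness, the row-level clean-up I mean is just this: keep the first occurrence of each name in a row and drop later repeats. Again this is only a Python sketch for illustration, and dedupe_row is a name made up for this post, not my actual awk program.

```python
def dedupe_row(row):
    """Drop repeated names from one '='-separated row, keeping first occurrences."""
    # dict.fromkeys keeps one key per distinct name, in first-seen order.
    return "=".join(dict.fromkeys(row.strip().split("=")))


print(dedupe_row("watson=vatson=wattson=watson"))
```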