Merging dupes on different lines in a dictionary


 
# 1  
Old 09-09-2012

I am working on a homonym dictionary of names, i.e. names that are clustered together because they sound alike when pronounced.
An example will make this clear:
Quote:
ameer=aamir=aameer=amir
Since the dictionary is constructed manually, it often happens that two sets of “homonyms” which belong together are inadvertently entered as separate groups. Thus:
Quote:
bisnnu=vishnu
vishno=wishnu=vishnoo=vaishnu=vishnu=visnu=veeshnu=vishanu=vishnau
“vishnu” appears in both the first set and the second, so the two sets should be reduced to one:
Quote:
bisnnu=vishnu=vishno=wishnu=vishnoo=vaishnu=vishnu=visnu=veeshnu=vishanu=vishnau
I have written a program which points out such “dupes” and the lines on which they occur in the database. But since I am a newbie in Perl, try as I might, I cannot write a Perl program which will safely merge both sets where there are dupes. I have a script in UltraEdit format which does the job, but it is dreadfully slow.
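For illustration, a dupe check of the kind described above could be sketched in Python as follows. This is a minimal sketch and not the program mentioned in this post; the function name `find_dupes` and the in-memory list of lines are assumptions made for the example.

```python
from collections import defaultdict


def find_dupes(lines):
    """Map each name that occurs on more than one line to its 1-based line numbers."""
    seen = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        # set() so that a name repeated within one row is counted once for that row
        for name in set(filter(None, line.strip().split("="))):
            seen[name].append(lineno)
    return {name: nums for name, nums in seen.items() if len(nums) > 1}


sample = [
    "bisnnu=vishnu",
    "vishno=wishnu=vishnoo=vaishnu=vishnu=visnu=veeshnu=vishanu=vishnau",
]
print(find_dupes(sample))  # {'vishnu': [1, 2]}
```

Only names that cross line boundaries are reported; repeats inside a single row are ignored here.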

Below is a sample of such dupes:
Quote:
yasmine=yashmeen=yasamin=yasameen=yaasmin=yashameen=yasmeen=yasmin
yashmeen=yazmeen=yasmeen=yasmin=yashmin
watson=vatson=wattson
watson
tekchand=teckchand=tekchanda=tekachnda=tekachnd=teckchanda
tekchand=tekachand
sailesh
shailaesh=shailesh=sailesh
The expected output is:
Quote:
yasmine=yashmeen=yasamin=yasameen=yaasmin=yashameen=yasmeen=yasmin=yashmeen=yazmeen=yasmeen=yasmin=yashmin
watson=vatson=wattson=watson
tekchand=teckchand=tekchanda=tekachnda=tekachnd=teckchanda=tekchand=tekachand
sailesh=shailaesh=shailesh=sailesh
Ideally the program should also weed out duplicates within a given row, but I already have an awk program that does that job efficiently.
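The in-row clean-up mentioned above could look like the following Python sketch. The awk program referred to is not shown in this thread, so this is only an assumed equivalent, and `dedupe_row` is a hypothetical name.

```python
def dedupe_row(row):
    """Drop repeated names from one '='-separated row, keeping first occurrences in order."""
    # dict.fromkeys keeps insertion order (Python 3.7+), so the row order survives
    return "=".join(dict.fromkeys(row.strip().split("=")))


print(dedupe_row("watson=vatson=wattson=watson"))  # watson=vatson=wattson
```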

Any help would be really great. Many thanks in advance for a Perl or AWK script. I work under Windows and hence sed will not help.
# 2  
Old 09-09-2012
Here is an awk script:

Code:
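awk -F= '
# Union-find: same[x] is the parent of name x; a root is its own parent.
function find(x) {
   while (same[x] != x) x = same[x];
   return x;
}
{ k = "";
  for (i = 1; i <= NF; i++) {
     if ($i == "") continue;            # skip empty fields
     if (!($i in same)) same[$i] = $i;  # first sighting: new one-name group
     r = find($i);
     if (k == "") k = r;                # first name on the line sets the root
     else same[r] = k;                  # link every later group to that root
  }
}
END {
   # Collect the names under their root and print one merged row per group.
   for (i in same) {
      r = find(i);
      keys[r] = keys[r] "=" i;
   }
   for (k in keys)
      print substr(keys[k], 2);
}' infile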
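For comparison, the merge can also be sketched in Python with a union-find (disjoint-set) structure. This is an illustrative sketch rather than a translation of the script above; `merge_groups` and the list-of-lines input are assumptions. Unlike the expected output in post #1, it lists each name once per group, and it also merges correctly when a later line joins two groups that were built up separately.

```python
def merge_groups(lines):
    """Merge '='-separated rows that share at least one name, directly or transitively."""
    parent = {}

    def find(name):
        # Walk up to the root of the group, halving the path as we go
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for line in lines:
        names = [n for n in line.strip().split("=") if n]
        for name in names:
            parent.setdefault(name, name)        # first sighting: its own group
        for name in names[1:]:
            root_first = find(names[0])
            root_other = find(name)
            if root_first != root_other:
                parent[root_other] = root_first  # link the two groups

    groups = {}
    for name in parent:                          # dicts keep first-seen order
        groups.setdefault(find(name), []).append(name)
    return ["=".join(members) for members in groups.values()]


# The third row bridges two groups that were created separately.
print(merge_groups(["a=b", "c=d", "b=c"]))  # ['a=b=c=d']
```

Because the names are stored in a dict, each group comes out in first-seen order, which makes the result easy to diff against the input file.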

# 3  
Old 09-10-2012
Dear Chubler_XL,
It works like magic. For the first time my database has no errors and all the names are perfectly merged. This has saved me days of checking and validation. The diagnostic routine I had written to identify dupes across multiple lines now shows that there are no dupes in any file.
Many thanks