Edit distance using perl or awk


 
11-15-2015

Dear all,
I am working on a large Sindhi lexicon, which I hope to complete by 2017 and release as open source. The database is in Arabic script, in two columns delimited by an equals sign (=).
Column 1 contains a word or words without the short vowels, along with some extraneous information stored in brackets.
Column 2 contains the word with the short vowels, which are listed below with their Unicode values.
Code:
َ U+064E
ُ U+064F
ِ U+0650
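
For illustration only, here is a minimal sketch of how those three marks could be removed before two spellings are compared. It is written in Python rather than the awk or Perl asked for, and the names `SHORT_VOWELS`, `strip_short_vowels` and `vowel_names` are hypothetical helpers, not part of any existing script.

```python
import unicodedata

# The three short-vowel marks listed above: U+064E, U+064F, U+0650.
SHORT_VOWELS = ("\u064E", "\u064F", "\u0650")

# Translation table mapping each mark to None, which deletes it.
_DELETE = dict.fromkeys(map(ord, SHORT_VOWELS))


def strip_short_vowels(text):
    """Return text with the three short-vowel marks removed."""
    return text.translate(_DELETE)


def vowel_names():
    """Return the official Unicode names of the three marks."""
    return [unicodedata.name(ch) for ch in SHORT_VOWELS]
```

Any other diacritic is left untouched, so only the three listed characters are treated as ignorable.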

Column 2 may also simply repeat the same word without any short vowels, as in line 5 of the database:
Code:
کاتو=کاتو

What I need is an awk or Perl script that identifies only those words whose edit distance is limited to the conditions outlined above, i.e. it should:
a. ignore all words/strings in brackets
b. identify words that are similar
c. identify words that are distinguished only by the three short-vowel characters
and store such words in a separate file, delimited by an equals sign.
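
One possible reading of conditions a to c, sketched in Python rather than awk or Perl: bracketed text is dropped from both columns, the brackets are treated as separators between alternative forms, and a form is kept only if it equals the other column once the three short vowels are removed. This is an assumption about the intended rule, and `clean_line`, `filter_file` and the file paths passed to it are hypothetical names.

```python
import re

SHORT_VOWELS = "\u064E\u064F\u0650"   # the three marks listed in the post
BRACKETED = re.compile(r"\([^)]*\)")  # any "(...)" group


def strip_vowels(text):
    """Remove the three short-vowel marks so spellings can be compared."""
    return text.translate({ord(ch): None for ch in SHORT_VOWELS})


def clean_line(line):
    """Return the cleaned 'forms=word' line, or None if nothing matches."""
    left, sep, right = line.rstrip("\n").partition("=")
    if not sep:
        return None  # no equals sign, so not a two-column line
    # (a) drop bracketed material from the second column
    bare = " ".join(BRACKETED.sub(" ", right).split())
    # brackets in the first column separate alternative forms
    candidates = [" ".join(part.split()) for part in BRACKETED.split(left)]
    # (b) and (c): keep forms equal to the second column once vowels are gone
    target = strip_vowels(bare)
    kept = [c for c in candidates if c and strip_vowels(c) == target]
    if not kept:
        return None
    return " ".join(kept) + "=" + bare


def filter_file(src_path, dst_path):
    """Write the cleaned form of every matching line to a separate file."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            cleaned = clean_line(line)
            if cleaned is not None:
                dst.write(cleaned + "\n")
```

Under this reading, a line such as `کاتو (جمع) کاتا=کاتو` becomes `کاتو=کاتو`, because the second form differs by a letter and not merely by a short vowel. Each line is handled in time proportional to its length, which should be comfortable for roughly 80,000 entries.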
A small input sample is provided below:
Code:
کاٻِيو (جمع) کاٻِيا (ٿ)=کاٻيو
کاتَرَ (ٿ)=کاتر (ٿ)
کاتِرو ڙ (ٿ)=کاترو ڙ (ٿ)
کاتِڙو (جمع) کاتِڙا (ٿ)=کاتڙو
کاتو (جمع) کاتا=کاتو
کاتي چاڙَهڻُ=کاتي چاڙهڻ
کاتيدارُ (جمع) کاتيدارَ=کاتيدار
کاتي لِکَڻُ=کاتي لکڻ
کاتو پَوَڻُ (ٿ)=کاتو پوڻ (ٿ)
کاٽُ (جمع) کاٽَ=کاٽ
کاٽِڙِيو (جمع) کاٽِڙِيا=کاٽڙيو
کاٽُ هَڻَڻُ=کاٽ هڻڻ
کاٽا کُٽائِڻُ=کاٽا کٽائڻ
کاٽائو (جمع) کاٽائُو=کاٽائو
کاٽائُو پُٽُ=کاٽائو پٽ
کاٽڙو=کاٽڙو
کاٽِڙِي=کاٽڙي
کاٽِڙِيو (جمع) کاٽِڙِيا=کاٽڙيو

The expected output is shown below. It was cleaned by hand and hopefully meets the conditions specified above.
Code:
کاٻِيو=کاٻيو
کاتَرَ=کاتر
کاتِرو ڙ=کاترو ڙ
کاتِڙو=کاتڙو
کاتو=کاتو
کاتي چاڙَهڻُ=کاتي چاڙهڻ
کاتيدارُ کاتيدارَ=کاتيدار
کاتي لِکَڻُ=کاتي لکڻ
کاتو پَوَڻُ=کاتو پوڻ
کاٽُ کاٽَ=کاٽ
کاٽِڙِيو=کاٽڙيو
کاٽُ هَڻَڻُ=کاٽ هڻڻ
کاٽا کُٽائِڻُ=کاٽا کٽائڻ
کاٽائو کاٽائُو=کاٽائو
کاٽائُو پُٽُ=کاٽائو پٽ
کاٽڙو=کاٽڙو
کاٽِڙِي=کاٽڙي
کاٽِڙِيو=کاٽڙيو

Since around 80,000 words have been compiled, a script using edit distance would help. I looked at awk and Perl scripts for edit distance and tried to adapt them for this purpose, but without success.
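
For reference, the classic edit distance (Levenshtein) can be sketched as below. This is a generic textbook version in Python, not one of the scripts that were tried, and the function name `levenshtein` is only illustrative.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]
```

Python strings are sequences of code points, so each short-vowel mark counts as one edit: a form carrying a single U+064F has distance 1 from its unvowelled spelling. The distance alone does not say which characters differ, so a check that the differences are only short vowels still has to be made separately.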
Many thanks in advance from me and from the community who, hopefully, will benefit from the database.