Writing a clustering concordance for a Perso-Arabic script

08-07-2015

I am working on a database of a language written in Arabic script. One major issue is that the shape of a character changes according to whether it is in initial, medial or final position. Another is the clustering of vowels within a word: a cluster completely changes the pronunciation.
What I am looking for is a concordance of such clusters, read from a file, showing each cluster in initial, medial or final position with a couple of examples read from the database.
Two files will be provided: a look-up file called clusters and a database called dictionary.

An example will make this clear (I will use English to make it understandable).
The clusters file will be a repertoire of single characters or strings of two or more characters, as in the example below:

Code:

Clusters
a
oi
oa 
ai
ea
ui

The dictionary will consist of each word followed by its mapping, delimited by an equals sign, as in the example below. The mappings are pseudo-transcriptions; in the real dictionary they will be in the International Phonetic Alphabet.

Code:

Dictionary
act=akt
ball=ball
beta=bita
coat=kot
load=lod
approach=eproch
goal=gol
rain=ren
paint=pent
rail=rel
failure=felyer
sea=si
beans=bins
easy=izi
please=pliz
beach=bich
leather=lethar
already=alredi
early=erli
break=brek
bread=bred
juice=jus
fruit=frut
suit=sut

The expected output would be as follows:

Code:

Keyword from the clusters file
Position: Initial, Medial or Final [if no example is found, just a dash]
Frequency of occurrence
Two or three examples of the word from the database

Only one example is given below

Code:

a
Init	3	act=akt,approach=eproch,already=alredi
Mid	1	ball
Fin	1	beta=bita

There is one condition: only the longest matching string from the clusters file will be considered. If a character has already been found as part of a longer cluster, it will be ignored. Thus a in final position also occurs in sea, but is ignored because the cluster ea already covers it. Similarly, a in medial position has only one example, since elsewhere it occurs as part of other clusters.
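
The "longest cluster wins" condition can be sketched as follows. This is only an illustration in Python (not the Perl or Awk requested), and the function name classify and the init/mid/fin keys are made up for the example. It assumes that clusters are tried longest first and that every matched stretch of the word is masked so a shorter cluster cannot be found inside it.

```python
def classify(word, clusters):
    """Count cluster hits in one word as init / mid / fin.

    Assumption for this sketch: longer clusters are tried first and each
    match is masked out, so "a" is never counted inside "ea" or "oa".
    A word that is exactly one cluster is reported as "init" here;
    standalone words are not treated separately.
    """
    counts = {}
    chars = list(word)
    # Stable sort: clusters of equal length keep their file order.
    ordered = sorted((c for c in clusters if c), key=len, reverse=True)
    for cluster in ordered:
        n = len(cluster)
        target = list(cluster)
        i = 0
        while i + n <= len(chars):
            if chars[i:i + n] == target:
                if i == 0:
                    pos = "init"
                elif i + n == len(chars):
                    pos = "fin"
                else:
                    pos = "mid"
                slot = counts.setdefault(cluster, {"init": 0, "mid": 0, "fin": 0})
                slot[pos] += 1
                chars[i:i + n] = [None] * n  # mask the matched characters
                i += n
            else:
                i += 1
    return counts
```

With the sample clusters, classify("sea", ...) reports only a final ea, and classify("approach", ...) reports a medial oa plus an initial a.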

Since I work under Windows, a Perl or Awk script would help. I do write scripts in Perl and Awk, but this is beyond my skill set.
Any help would be greatly appreciated, since the final output will help create standards for that particular linguistic community, and this work will be made freely available.

---------- Post updated 08-07-15 at 03:27 AM ---------- Previous update was 08-06-15 at 08:35 PM ----------

My sincere apologies to all who took pains to read the request. I guess my memory isn't what it used to be (I am nearly 70 years old). Still, I should have checked on the forum before posting, which I did not. I will be more careful next time.
I found that I had already written similar code in Perl, which was improved by folks on the forum. Here is the code that was put up:

Code:

#! /usr/bin/perl

use strict;  # These two lines save you endless trouble
use warnings; # without them typos and such errors get missed

open (my $corpus_file, '<', 'Corpus') or die "Cannot open Corpus: $!"; # Test corpus containing just the sample lines
# $/="\r\n"; # Again with the DOS files
chomp(my @corpus = (<$corpus_file>)); # Load the corpus file into an array for faster access
open (my $syllables_file, '<', 'Syllables') or die "Cannot open Syllables: $!";
while(<$syllables_file>){
    chomp(my $syllable = $_);
    my $count = 0;
    my $init = my $med = my $fin = my $stdalone = "NONE";
    for my $word (@corpus) {
        if ( $word =~ /^$syllable.+/) {
            if ($init eq "NONE") {
                $init = $word;
                $count++;
            }
        }
        elsif ($word =~ /.+$syllable.+/) {
            if ($med eq "NONE") {
                $med = $word;
                $count++;
            }
        }
        elsif ($word =~ /.+$syllable$/) {
            if ($fin eq "NONE") {
                $fin = $word;
                $count++;
            }
        }
        elsif ($word =~ /^$syllable$/) {
            if ($stdalone eq "NONE") {
                $stdalone = $word;
                $count++;
            }
        }
        last if $count == 4;
    }
    print "$syllable\nInitial $init\nMedial $med\nFinal $fin\nStandalone $stdalone\n";
    #print "$init\t$med\t$fin\t$stdalone\n";
}

However, I would still appreciate it if the two changes I requested earlier could be incorporated.
Since the data contains the Perso-Arabic script and its IPA, delimited by an equals sign, the present code does not correctly identify the initial syllables. This may be because of the delimiter and the IPA string that follows it.
It would also be a great help if the output could contain the frequency, and if the number of sample occurrences could be increased to at least 4 or 5.
Sorry once more for the lapse of memory, and many thanks for your understanding.
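
The two requested changes can be illustrated with a minimal sketch. It is in Python purely for illustration, and the function name tally and the max_examples parameter are assumptions, not part of the posted Perl. It matches only the text to the left of the first equals sign, keeps a frequency per position, and stores a capped number of examples. Like the Perl script, it classifies each word once.

```python
from collections import defaultdict


def tally(lines, syllable, max_examples=5):
    """Count positions of one syllable and collect a few example entries.

    Only the part of each line before the first "=" is examined, so the
    transcription on the right cannot hide a final match.
    """
    if not syllable:
        raise ValueError("syllable must not be empty")
    counts = defaultdict(int)
    examples = defaultdict(list)
    for line in lines:
        entry = line.strip()
        word = entry.partition("=")[0]  # left-hand side only
        if syllable not in word:
            continue
        if word == syllable:
            pos = "Standalone"
        elif word.startswith(syllable):
            pos = "Initial"
        elif word.endswith(syllable):
            pos = "Final"
        else:
            pos = "Medial"
        counts[pos] += 1                      # frequency keeps growing
        if len(examples[pos]) < max_examples:  # examples are capped
            examples[pos].append(entry)
    return dict(counts), dict(examples)
```

For the line sea=si, the whole line ends in "i", so a pattern anchored at the end of the line cannot see the final "a"; with the split, the word "sea" is tested on its own.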

gimley

08-09-2015

Are the words Clusters and Dictionary in your sample input files intended to be the names of those files, or are they headers that actually appear as the first line in those files? When you first mentioned those files, you said they had the names clusters and dictionary (with the 1st character being a lowercase letter). In the code in your update to your post, they had the names Syllables and Corpus, respectively. So what are your actual filenames? (Note that case does matter in filenames on standards-conforming UNIX and Linux filesystems.)

Does the order in which the clusters appear in the output matter? If it does matter, will the lines in the cluster input file always be in an order such that all lines with the same number of characters are adjacent and the lines with fewer characters come before lines with more characters? (The code I'm playing with now will produce output with the shortest clusters first and, within each cluster length, will output the clusters in the order in which they appear in the input file. Extra code will be needed if that is not acceptable.)
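
The output order described in the parenthesis (shortest clusters first, input order preserved within each length) is what a stable sort on length gives. Here is a small Python sketch of that ordering, for illustration only; the function name output_order is an assumption. Note that Python's len counts code points, so a letter followed by a combining vowel mark counts as two.

```python
def output_order(clusters):
    """Shortest clusters first; equal lengths keep their input order.

    sorted() is stable, so no secondary key is needed to preserve the
    order in which equal-length clusters appeared in the file.
    """
    return sorted(clusters, key=len)
```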

In your sample output:

Code:

a
Init	3	act=akt,approach=eproch,already=alredi
Mid	1	ball
Fin	1	beta=bita

why is the format of the 3rd line different than the format of the 2nd and 4th lines? Why aren't the 2nd and 4th lines:

Code:

a
Init	3	act,approach,already
Mid	1	ball
Fin	1	beta

or, why isn't the 3rd line:

Code:

a
Init	3	act=akt,approach=eproch,already=alredi
Mid	1	ball=ball
Fin	1	beta=bita

so all of the output is in the same format?

Don Cragun

08-09-2015

Many thanks for responding.
I understand your queries, and I would like to clarify the details so that the script is more comprehensible.
DETAILS
The script reads 2 files:
1. Syllables: A list of all the syllables.
2. Corpus: A list of words in Arabic script followed by their Indic equivalent, delimited by an equals sign (=).
EXPECTED FORMAT
In each case the output is supposed to contain:
a. The syllable in question, and whether it is Initial, Medial or Final.
b. At least 6 to 10 examples (at present only one is output).
c. Additional bells and whistles: a frequency count of all the words [not present in my script: I don't know how to maintain two sets of counts].
In other words, the output should be as follows:

Code:

SYLLABLE: FREQUENCY 
Initial 6 EXAMPLES 
Medial 6 EXAMPLES 
Final 6 EXAMPLES 
Standalone 6 EXAMPLES

Each example should have the string in Arabic and also in Indic script.
If there are no examples, or fewer than requested, the output should say so. At present only one example is output.
The script does work to a certain extent, but it has the following major problems.
PROBLEMS
1. The script should address only the Perso-Arabic side, using the = delimiter, and ignore the Indic side. It does not do that, so valid final occurrences in Arabic are not detected and not all final occurrences are shown. I don't know how to instruct the program to limit its analysis to the Arabic side of the corpus and ignore the rest.
2. I need at least 6-10 instances of tokens from the corpus file. At present only one is given.
3. If possible, the frequency should be provided. [I don't know how to maintain two sets of counts.]
I have racked my brains over this, and all attempts to get this type of output have failed.
To make the scenario clearer, I am attaching the data files as well as the script file.
I have tried again and again to modify the script, but it does not produce the desired formatted output.
One is never too old to learn, and I still feel that at my age I can master the intricacies of Perl and handle strings.
Many thanks for your help.

data n script.zip (192.2 KB)

gimley

08-09-2015

Thanks for the information. You didn't answer all of the questions, and I now have a new question. What is "Standalone"? Does it mean that the syllable appeared as an entire word? Am I correct in assuming that with a in Syllables and with a=a in Corpus, the standalone count (and only that count) should be incremented by 1 and with the word abracadabra=whatever in Corpus, the initial and final counts should each be incremented by 1, the medial counter should be incremented by 3, and the standalone counter should not change?
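
To make the counting rule in that question concrete, here is a minimal Python sketch (Python only for illustration; the function name position_counts is an assumption). It counts every non-overlapping occurrence of one syllable in one word, which is the interpretation under which abracadabra gives 1 initial, 3 medial and 1 final hit.

```python
def position_counts(word, syllable):
    """Count occurrences of syllable in word by position.

    A word equal to the syllable counts as standalone only; otherwise
    each non-overlapping occurrence is initial, final or medial.
    """
    counts = {"initial": 0, "medial": 0, "final": 0, "standalone": 0}
    if not syllable:
        return counts
    if word == syllable:
        counts["standalone"] = 1
        return counts
    n = len(syllable)
    start = word.find(syllable)
    while start != -1:
        if start == 0:
            counts["initial"] += 1
        elif start + n == len(word):
            counts["final"] += 1
        else:
            counts["medial"] += 1
        start = word.find(syllable, start + n)  # skip past this match
    return counts
```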

I am much more fluent with awk than perl. Will you accept an awk script instead of a perl script?

You didn't answer the question about output order. I assume the output order doesn't matter.

You didn't answer the question about headings in your input files. I assume that there are no headings in either of your input files.

Don Cragun

08-09-2015

Many thanks for your interest.
A Standalone means that the particular syllable is also a word. To take an example from English, "of" is a standalone syllable in the word "of", but an initial in the word "office".

Coming to your second question: what I needed and tried to do was identify syllables in the corpus in terms of their positions.
An Initial syllable would be a string from the Syllables list which comes at the beginning of the word.
A Medial would be a string from the Syllables list which comes in the middle of the word.
A Final would be a string from the Syllables list which comes at the end of the word.
In all cases I would be working only with the Arabic data and would ignore all data to the right-hand side of the = delimiter.
The script would classify each syllable by its position, report its frequency, and then provide at least six examples for each position [in some cases there would be none, or fewer than 6].
I trust the above clarifies the issue a bit more.
I do not mind an Awk script. In fact I started with Awk but felt the problem at hand was too complex for Awk to handle. I seem to have been mistaken.
Many thanks once again for your help, which will eventually help the community to develop a standard.

gimley

08-09-2015

Here is an awk attempt that addresses some but not all of your conditions/problems:

Code:

awk '
FNR==NR         {SYL[FNR]=$1
                 MX=FNR
                 next
                }
                {for (c=1; c<=MX; c++)
                        {s=SYL[c]
                         if ($1==s)     {STDAL[s]++
                                         next
                                        }
                         else if ($1 ~ "^"s)
                                        {if (++INIT[s] <= FREQMX)
                                                EXI[s]=EXI[s] "," $0
                                         next
                                        }
                         else if ($1 ~ s"$")
                                        {if (++FIN[s] <= FREQMX)
                                                EXF[s]=EXF[s] "," $0
                                         next
                                        }
                         else if ($1 ~ s)
                                        {MID[s]+=gsub(s,s)
                                         if (MID[s] <= FREQMX)
                                                EXM[s]=EXM[s] "," $0
                                         next
                                        }

                        }
                }
END             {for (c=1; c<=MX; c++)
                                {s=SYL[c]
                                 print s
                                 printf "Init:  %3d\tExample: %s\n", INIT[s], substr (EXI[s],2)
                                 printf "Mid:   %3d\tExample: %s\n", MID[s],  substr (EXM[s],2)
                                 printf "Fin:   %3d\tExample: %s\n", FIN[s],  substr (EXF[s],2)
                                 printf "Alone: %3d", STDAL[s]
                                        if (STDAL[s] > 0) printf "\tExample: %s", s
                                        printf "\n"
                                }
                }

' syllables FS="=" FREQMX=6 corpus
oi
Init:    0    Example: 
Mid:     0    Example: 
Fin:     0    Example: 
Alone:   0
oa
Init:    0    Example: 
Mid:     4    Example: coat=kot,load=lod,approach=eproch,goal=gol
Fin:     0    Example: 
Alone:   0
ai
Init:    0    Example: 
Mid:     4    Example: rain=ren,paint=pent,rail=rel,failure=felyer
Fin:     0    Example: 
Alone:   0
ea
Init:    2    Example: easy=izi,early=erli
Mid:     9    Example: beans=bins,please=pliz,beach=bich,leather=lethar,already=alredi,break=brek
Fin:     1    Example: sea=si
Alone:   0
ui
Init:    0    Example: 
Mid:     3    Example: juice=jus,fruit=frut,suit=sut
Fin:     0    Example: 
Alone:   0
a
Init:    1    Example: act=akt
Mid:     2    Example: ball=ball
Fin:     1    Example: beta=bita
Alone:   1    Example: a

It was tested on an extended version of your samples in post #1. The condition that "Only the largest string from the clusters file will be considered" is covered by placing the larger clusters ahead of the smaller ones, i.e. "a" is analysed after "ea", "oa", etc. Unfortunately, some hits on "a" (already, approach) are lost, as those words are already counted under those clusters ("ea", "oa"). However, if you think this is a promising approach, one could try to refine it...

---------- Post updated at 14:30 ---------- Previous update was at 14:17 ----------

OK, this one

Code:

awk '
FNR==NR         {SYL[FNR]=$1
                 MX=FNR
                 next
                }
                {TOTLINE=$0
                 for (c=1; c<=MX; c++)
                        {s=SYL[c]
                         if ($1==s)     {STDAL[s]++
                                         next
                                        }
                         else if (gsub ("^"s, "@", $1))
                                        {if (++INIT[s] <= FREQMX)
                                                EXI[s]=EXI[s] "," TOTLINE
                                        }
                         else if (gsub (s"$", "@", $1))
                                        {if (++FIN[s] <= FREQMX)
                                                EXF[s]=EXF[s] "," TOTLINE
                                        }
                         else if (n=gsub (s, "@", $1))
                                        {MID[s]+=n
                                         if (MID[s] <= FREQMX)
                                                EXM[s]=EXM[s] "," TOTLINE
                                        }

                        }
                }
END             {for (c=1; c<=MX; c++)
                                {s=SYL[c]
                                 print s
                                 printf "Init:  %3d\tExample: %s\n", INIT[s], substr (EXI[s],2)
                                 printf "Mid:   %3d\tExample: %s\n", MID[s],  substr (EXM[s],2)
                                 printf "Fin:   %3d\tExample: %s\n", FIN[s],  substr (EXF[s],2)
                                 printf "Alone: %3d", STDAL[s]
                                        if (STDAL[s] > 0) printf "\tExample: %s", s
                                        printf "\n"
                                }
                }

' syllables~ FS="=" FREQMX=8 corpus
oi
Init:    0    Example: 
Mid:     0    Example: 
Fin:     0    Example: 
Alone:   0
oa
Init:    0    Example: 
Mid:     4    Example: coat=kot,load=lod,approach=eproch,goal=gol
Fin:     0    Example: 
Alone:   0
ai
Init:    0    Example: 
Mid:     4    Example: rain=ren,paint=pent,rail=rel,failure=felyer
Fin:     0    Example: 
Alone:   0
ea
Init:    2    Example: easy=izi,early=erli
Mid:     9    Example: beans=bins,please=pliz,beach=bich,leather=lethar,already=alredi,break=brek,bread=bred,heading=heding
Fin:     1    Example: sea=si
Alone:   0
ui
Init:    0    Example: 
Mid:     3    Example: juice=jus,fruit=frut,suit=sut
Fin:     0    Example: 
Alone:   0
a
Init:    3    Example: act=akt,approach=eproch,already=alredi
Mid:     1    Example: ball=ball
Fin:     1    Example: beta=bita
Alone:   1    Example: a

should cover the above mentioned problem. Please report back!

---------- Post updated at 14:39 ---------- Previous update was at 14:30 ----------

And this one

Code:

awk '
FNR==NR         {SYL[FNR]=$1
                 MX=FNR
                 next  
                }
                {TOTLINE=$0
                 for (c=1; c<=MX; c++)
                        {s=SYL[c]
                         if ($1==s)     {STDAL[s]++
                                         next
                                        }
                         if (gsub ("^"s, "@", $1))
                                        {if (++INIT[s] <= FREQMX)
                                                EXI[s]=EXI[s] "," TOTLINE
                                        }
                         if (gsub (s"$", "@", $1))
                                        {if (++FIN[s] <= FREQMX)
                                                EXF[s]=EXF[s] "," TOTLINE
                                        }
                         if (n=gsub (s, "@", $1))
                                        {MID[s]+=n
                                         if (MID[s] <= FREQMX)
                                                EXM[s]=EXM[s] "," TOTLINE
                                        }
                                         
                        }
                }
END             {for (c=1; c<=MX; c++)
                                {s=SYL[c]
                                 print s 
                                 printf "Init:  %3d\tExample: %s\n", INIT[s], substr (EXI[s],2)
                                 printf "Mid:   %3d\tExample: %s\n", MID[s],  substr (EXM[s],2)
                                 printf "Fin:   %3d\tExample: %s\n", FIN[s],  substr (EXF[s],2)
                                 printf "Alone: %3d", STDAL[s]
                                        if (STDAL[s] > 0) printf "\tExample: %s", s
                                        printf "\n"
                                }
                }

' syllables~ FS="=" FREQMX=10 corpus
a
Init:    4    Example: act=akt,approach=eproch,alabama=asdfjg,already=alredi
Mid:     3    Example: ball=ball,alabama=asdfjg
Fin:     2    Example: beta=bita,alabama=asdfjg
Alone:   1    Example: a

would cover even the case of "alabama".

RudiC

08-09-2015

The following seems to address all of your issues (although I had to guess on the format of some things):

Code:

#!/bin/ksh
sf="syllables"
cf="corpus"
sample_max=${3:-10}
awk -v sm="$sample_max" '
{	gsub(/\r/, "")
}
FNR == NR {
	# Read the list of syllables to be processed...
	len = length($1)
	if(NR == 1) 
		l = L = len
	else	if(l > len)
			l = len
		else if(L < len)
			L = len
	syl[len, ++lenc[len]] = $1
	next
}
{	# Accumulate data from dictionary entries...
	len = length(word = $1)
	for(i = (L > len) ? len : L; i > 0 && len > 0; i--)
		for(j = 1; j <= lenc[i]; j++) {
			s = syl[i, j]
			if(s == word) {
				# Process standalone match...
				++sa[s]
				len = 0
				if(sasc[s] < sm)
					sasam[s, ++sasc[s]] = $0
			} else {if(sub("^"s, "\a", word)) {
					# Process initial match...
					++init[s]
					len--
					if(insc[s] < sm)
						insam[s, ++insc[s]] = $0
				}
				if(sub(s"$", "\a", word)) {
					# Process finish match...
					++fin[s]
					len--
					if(fisc[s] < sm)
						fisam[s, ++fisc[s]] = $0
				}
				if(c = gsub(s, "\a", word)) {
					# Process medial matches...
					med[s] += c
					len -= c
					if(mesc[s] < sm)
						mesam[s, ++mesc[s]] = $0
				}
			}
		}
}
END {	# Dump collected data...
	for(i = l; i <= L; i++) {
		for(j = 1; j <= lenc[i]; j++) {
			printf("%-11s %7s\n", (s = syl[i, j]) ":",
				sa[s] + init[s] + fin[s] + med[s])
			printf("%-11s %7s%s", "Initial",
				init[s] ? init[s]: "NONE",
				init[s] ? "\t" : "\n")
			for(k = 1; k <= insc[s]; k++)
				printf("%s%s", insam[s, k],
					k == insc[s] ? "\n" : ",")
			printf("%-11s %7s%s", "Medial",
				med[s] ? med[s]: "NONE",
				med[s] ? "\t" : "\n")
			for(k = 1; k <= mesc[s]; k++)
				printf("%s%s", mesam[s, k],
					k == mesc[s] ? "\n" : ",")
			printf("%-11s %7s%s", "Final",
				fin[s] ? fin[s]: "NONE",
				fin[s] ? "\t" : "\n")
			for(k = 1; k <= fisc[s]; k++)
				printf("%s%s", fisam[s, k],
					k == fisc[s] ? "\n" : ",")
			printf("%-11s %7s%s", "Standalone",
				sa[s] ? sa[s]: "NONE",
				sa[s] ? "\t" : "\n")
			for(k = 1; k <= sasc[s]; k++)
				printf("%s%s", sasam[s, k],
					k == sasc[s] ? "\n" : ",")
			print ""
		}
	}
}' "$sf" FS="[=]" "$cf"

Note that this uses the default FS when processing syllables so it ignores extraneous spaces (such as the space after oa ) in your sample syllables file and uses the = as the field separator for the corpus file.
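
For readers more at home outside awk, the same two parsing rules can be sketched in Python (illustration only; the helper names read_syllable and read_entry are assumptions). One difference from the awk script is that Python's split() with no arguments also treats a carriage return as whitespace, so no separate step is needed to strip it from the syllables file.

```python
def read_syllable(line):
    """Return the first whitespace-delimited field, or None if blank.

    Trailing spaces and DOS line endings are dropped, as with awk's
    default field splitting plus the carriage-return removal.
    """
    fields = line.split()
    return fields[0] if fields else None


def read_entry(line):
    """Split a corpus line into (word, mapping) at the first "="."""
    entry = line.rstrip("\r\n")
    word, _sep, mapping = entry.partition("=")
    return word, mapping
```

A line without any equals sign comes back as (line, ""), which a caller may want to treat as malformed.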

With the file syllables containing:

Code:

a
ue
ueue
oi
oa 
ai
ea
ui

(the lines ue and ueue are additions to your sample). And, with the corpus file containing:

Code:

act=akt
ball=ball
beta=bita
coat=kot
load=lod
approach=eproch
goal=gol
rain=ren
paint=pent
rail=rel
failure=felyer
sea=si
beans=bins
easy=izi
please=pliz
beach=bich
leather=lethar
already=alredi
early=erli
break=brek
bread=bred
juice=jus
fruit=frut
suit=sut
queue=ku
query=kwiri
a=a1
a=a2
a=a3
a=a4
a=a5
a=a6
a=a7
a=a8
a=a9
a=a10
a=a11
abracadabra=abracadabra

(again with additions: every line from queue=ku onward), and with the code above stored in a file named conc that has been made executable, you can see the results from running that code below:

Code:

$ ./conc
a:               21
Initial           4	act=akt,approach=eproch,already=alredi,abracadabra=abracadabra
Medial            4	ball=ball,abracadabra=abracadabra
Final             2	beta=bita,abracadabra=abracadabra
Standalone       11	a=a1,a=a2,a=a3,a=a4,a=a5,a=a6,a=a7,a=a8,a=a9,a=a10

ue:               1
Initial        NONE
Medial            1	query=kwiri
Final          NONE
Standalone     NONE

oi:               0
Initial        NONE
Medial         NONE
Final          NONE
Standalone     NONE

oa:               4
Initial        NONE
Medial            4	coat=kot,load=lod,approach=eproch,goal=gol
Final          NONE
Standalone     NONE

ai:               4
Initial        NONE
Medial            4	rain=ren,paint=pent,rail=rel,failure=felyer
Final          NONE
Standalone     NONE

ea:              10
Initial           2	easy=izi,early=erli
Medial            7	beans=bins,please=pliz,beach=bich,leather=lethar,already=alredi,break=brek,bread=bred
Final             1	sea=si
Standalone     NONE

ui:               3
Initial        NONE
Medial            3	juice=jus,fruit=frut,suit=sut
Final          NONE
Standalone     NONE

ueue:             1
Initial        NONE
Medial         NONE
Final             1	queue=ku
Standalone     NONE

$

which I think is close to (if not exactly) what you want.

Unfortunately, it doesn't even come close to working with the data you supplied in the zip file you uploaded. The code has been set up to remove the DOS format <carriage-return> characters in both of your input files, but that can't make up for the fact that your uploaded corpus file contains LOTS of lines with no equal sign characters and LOTS of lines that have a 1st character that is an equal sign. Both files also contain lots of byte sequences that do not form valid UTF-8 characters. So, with the two files you uploaded, it produces the following output:

Code:

آ:             555
Initial         548	آنڊن=आंडनि,آنڊا=आंडा,آنڊي=आंडे,آنڊو=आंडो,آندو=आंदो,آنڌِ=आंधिमांधि,آنوِرن=आंविरनि,آنوِرا=आंविरा,آنوِري=आंविरे,آنوِرو=आंविरो
Medial            6	غيَرآبادُ=ग़ैरआबादु,دوآب=दोआब,دوآبن=दोआबनि,دوآبُ=दोआबु,زيرآبيِ=ज़ेरआबी,زيرآبيِئَ=ज़ेरआबीअ
Final          NONE
Standalone        1	آ=आ

:              1
Initial           1	اَ=अ
Medial         NONE
Final          NONE
Standalone     NONE

اَ:          1367
Initial        1352	اَنگ=अंग,اَنگَڻُ=अंगणु,اَنگن=अंगनि,اَنگَلَ=अंगल,اَنگَلن=अंगलनि,اَنگَلُ=अंगलु,اَنگِليِ=अंगिली,اَنگيِڪارُ=अंगीकारु,اَنگُ=अंगु,اَنگُستانن=अंगुस्ताननि
Medial           14	اَفسانناَويِسن=अफ़सानानवीसनि,دَراَصَلِ=दरअसलि,نَظَراَندازُ=नज़रअंदाज़ु,بااَدَبُ=बाअदबु,بااَثَرُ=बाअसरि,سَراَنجاميُن=सरअंजामियुनि,سَراَنجاميوُن=सरअंजामियूं,سَراَنجاميِ=सरअंजामी,سَراَنجاميِئَ=सरअंजामीअ,خيراَنديشيُن=ख़ैरअंदेशियुनि
Final             1	اَ=अ
Standalone     NONE

وَ:           942
Initial         426	وَنگ=वंग,وَنگن=वंगनि,وَنگا=वंगा,وَنگيُن=वंगियुनि,وَنگيوُن=वंगियूं,وَنگيِ=वंगी,وَنگيِئَ=वंगीअ,وَنگُ=वंगु,وَنگي=वंगे,وَنگو=वंगो
Medial          449	اَٽِڪاوَن=अटिकावनि,اَٺُوَنجاهيِن=अठुवंजाहीं,اَٺُوَنجاهُە=अठुवंजाहु,اَٺُوَنجاهين=अठुवंजाहें,اَٺُوَنجاهون=अठुवंजाहों,اَڻَوَنڊِئي=अणवंडिए,اَڻَوَنڊِيَن=अणवंडियनि,اَڻَوَنڊِيا=अणवंडिया,اَڻَوَنڊِيُن=अणवंडियुनि,اَڻَوَنڊِيوُن=अणवंडियूं
Final            67	اَٽِڪاوَ=अटिकाव,اَٿَوَ=अथव,اِطِلاوَ=इतिलाव,ايِذاوَ=ईज़ाव,اُڪِلاوَ=उकिलाव,اُپاوَ=उपाव,اوَ=औ,ڪَهڪاوَ=कहिकाव,ڪانوَ=कांव,گهَٽاوَ=घटाव
Standalone     NONE

وا:          1086
Initial         318	واهيِ=ओहीवाहीअ,واري=पुॼंदीअवारनि,وارن=पुॼंदीअवारियुनि,وارا=पुॼंदीअवारीअ,وارو=पुॼंदीअवारे,واري=फेरीअवारनि,وارو=फेरीअवारे,واري=मिठाईअवारनि,وارو=मिठाईअवारे,وانگُرُ=वांगुरु
Medial          737	اَڻَواقُفُ=अणवाक़ुफ़ु,اَپواد=अपवाद,اَپوادَن=अपवादनि,اَپوادُ=अपवादु,اَوائِليِ=अवाइली,عَواميِ=अवामी,عَوامُ=अवामु,اَسُوار=असुवार,اَسُوارن=असुवारनि,اَسُوارُ=असुवारु
Final            31	اَڻاوا=अणावा,عَلاوا=अलावा,اَغُوا=अग़ुवा,آبهَوا=आबहवा,عُضوا=उज्*़वा,جَلِوا=जलिवा,تَوا=तवा,دَوا=दवा,پَٿراوا=पथिरावा,پَراوا=परावा
Standalone     NONE

يِ:          7941
Initial        NONE
Medial         4998	اَنگيِڪارُ=अंगीकारु,اَنگوُڙيِئَ=अंगूड़ीअ,اَنجيِرَ=अंजीर,اَنجيِرن=अंजीरनि,اَنجيِرُ=अंजीरु,اَندِريِن=अंदिरीं,اَندِريِنئَ=अंदिरींअ,اَنڌائيِئَ=अंधाईअ,اَنڌاريِئَ=अंधारीअ,اَنڌيِئَ=अंधीअ
Final          2943	اَنگِليِ=अंगिली,اَنگوُڙيِ=अंगूड़ी,اَنگوُريِ=अंगूरी,اَنتَرياميِ=अंतरियामी,اَندروُنيِ=अंदिरूनी,اَنڌائيِ=अंधाई,اَنڌاريِ=अंधारी,اَنڌيِ=अंधी,اَنڌيرگَرديِ=अंधेरगर्दी,نَگريِ=अंधेरनगरीअ
Standalone     NONE

which to me seems to be garbage. I don't know enough about Arabic or Indic to make any guess at whether or not your input files could be cleaned up programmatically. If they can be, I don't have the expertise to do it unless you can provide explicit directions on how to do it.
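
As a first step towards any cleanup, the three problems named above (lines without an equals sign, lines starting with one, and byte sequences that are not valid UTF-8) can at least be located. The following is a hedged Python sketch, not a repair tool; the function name check_corpus_lines and the wording of the messages are assumptions for illustration.

```python
def check_corpus_lines(raw_lines):
    """Return (line_number, problem) pairs for suspicious corpus lines.

    raw_lines must yield bytes, so that invalid UTF-8 can be detected
    before any decoding is attempted. Blank lines are reported as
    having no equals sign.
    """
    problems = []
    for number, raw in enumerate(raw_lines, start=1):
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            problems.append((number, "invalid UTF-8"))
            continue
        text = text.rstrip("\r\n")
        if "=" not in text:
            problems.append((number, "no equals sign"))
        elif text.startswith("="):
            problems.append((number, "empty word before equals sign"))
    return problems
```

To run it on a file, open the file in binary mode, for example with open("corpus", "rb"), and pass the file object, which yields one bytes line at a time.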

Don Cragun
