Writing a clustering concordance for a Perso-Arabic script


 
# 1  
Old 08-07-2015

I am working on a database of a language written in Arabic script. One major issue is that the shape of a character changes according to whether it occurs in initial, medial or final position. Another is the clustering of vowels within a word: a cluster completely changes the pronunciation.
What I am looking for is a concordance of such clusters, read from a file, showing each cluster in initial, medial or final position with a couple of examples taken from the database.
Two files will be provided:
Code:
A look-up file called clusters and a database termed dictionary

An example will make this clear: (I will use English to make this understandable)
The cluster file will be a repertoire of single characters or clusters of two or more characters, as in the example below:
Code:
Clusters
a
oi
oa 
ai
ea
ui

The dictionary will consist of each word followed by its mapping, delimited by an equals sign, as in the example below. The mappings here are pseudo-mappings; in the real dictionary they will be in the International Phonetic Alphabet.
Code:
Dictionary
act=akt
ball=ball
beta=bita
coat=kot
load=lod
approach=eproch
goal=gol
rain=ren
paint=pent
rail=rel
failure=felyer
sea=si
beans=bins
easy=izi
please=pliz
beach=bich
leather=lethar
already=alredi
early=erli
break=brek
bread=bred
juice=jus
fruit=frut
suit=sut

The expected output would be as under.
Code:
keyword from cluster
position Initial Medial or Final [In case no example is found just a dash]
Frequency of occurrence
Two or three examples of the word from the database

Only one example is given below
Code:
a
Init	3	act=akt,approach=eproch,already=alredi
Mid	1	ball
Fin	1	beta=bita

There is one condition: only the longest matching string from the clusters file will be considered. If a character has already been matched as part of a longer cluster, it will be ignored. Thus
Code:
a in final position also occurs in sea but is ignored because the cluster ea is already there.

Code:
Similarly a in medial position has only one example, since it occurs elsewhere in different combinations.
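
A minimal sketch of the longest-cluster rule described above, in awk wrapped in a shell function. The function name longest_clusters and its calling convention are made up for illustration, and it has only been reasoned through on the English samples; how length() and substr() count multi-byte characters depends on the awk implementation and locale.

```shell
# Hypothetical helper, not from the thread: list the clusters found in one
# word, always preferring the longest cluster that matches at each position.
# Usage: longest_clusters WORD CLUSTER...
longest_clusters() {
    word=$1
    shift
    printf '%s\n' "$@" |
    awk -v word="$word" '
    { c[++n] = $1; if (length($1) > max) max = length($1) }
    END {
        len = length(word)
        i = 1
        while (i <= len) {
            hit = ""
            # Try the longest candidates first so that "ea" wins over "a".
            for (k = max; k >= 1 && hit == ""; k--)
                for (j = 1; j <= n; j++)
                    if (length(c[j]) == k && substr(word, i, k) == c[j]) {
                        hit = c[j]
                        break
                    }
            if (hit == "") { i++; continue }
            last = i + length(hit) - 1
            if (i == 1 && last == len) pos = "Alone"
            else if (i == 1)           pos = "Init"
            else if (last == len)      pos = "Fin"
            else                       pos = "Mid"
            print hit, pos
            i = last + 1
        }
    }'
}

longest_clusters sea a ea       # prints: ea Fin
longest_clusters beta a ea      # prints: a Fin
```

Because the comparison uses substr() and == rather than a regular expression, a cluster is always treated as literal text.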

Since I work under Windows a Perl or Awk script could help. I do write scripts in Perl and Awk, but this is beyond my skill-set.
Any help would be greatly appreciated, since the final output will help create standards for that particular linguistic community and this work will be put up free for use.

---------- Post updated 08-07-15 at 03:27 AM ---------- Previous update was 08-06-15 at 08:35 PM ----------

My sincere apologies to all who took pains to read the request. I guess my memory isn't what it used to be (I am nearly 70 years old). Still, I should have checked on the forum before posting, which I did not. I will be more careful next time.
I found that I had already written similar code in Perl, which folks on the forum then improved. Here is the code that was posted:
Code:
#! /usr/bin/perl

use strict;  # These two lines save you endless trouble
use warnings; # without them typos and such errors get missed

open (my $corpus_file, '<', 'Corpus') or die "Cannot open Corpus: $!"; # Created a test corpus with just the contained lines
# $/="\r\n"; # Again with the DOS files
chomp(my @corpus = (<$corpus_file>)); # Load the corpus file into an array for faster access
open (my $syllables_file, '<', 'Syllables') or die "Cannot open Syllables: $!";
while(<$syllables_file>){
    chomp(my $syllable = $_);
    my $count = 0;
    my $init = my $med = my $fin = my $stdalone = "NONE";
    for my $word (@corpus) {
        if ( $word =~ /^$syllable.+/) {
            if ($init eq "NONE") {
                $init = $word;
                $count++;
            }
        }
        elsif ($word =~ /.+$syllable.+/) {
            if ($med eq "NONE") {
                $med = $word;
                $count++;
            }
        }
        elsif ($word =~ /.+$syllable$/) {
            if ($fin eq "NONE") {
                $fin = $word;
                $count++;
            }
        }
        elsif ($word =~ /^$syllable$/) {
            if ($stdalone eq "NONE") {
                $stdalone = $word;
                $count++;
            }
        }
        last if $count == 4;
    }
    print "$syllable\nInitial $init\nMedial $med\nFinal $fin\nStandalone $stdalone\n";
    #print "$init\t$med\t$fin\t$stdalone\n";
}

However, I would still appreciate it if, as I had requested earlier, two changes could be incorporated.
Since the data contains the Perso-Arabic script and its IPA delimited by an equals sign, the present code does not correctly identify the initial syllables. This may be because of the delimiter and the IPA string that follows.
It would also be a great help if the output could contain the frequency and if the number of sample occurrences could be increased to at least 4 or 5.
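
One way to keep the IPA side out of the match is to split each line on the equals sign and test only the first field. The sketch below does that in awk for the final position; final_matches is a hypothetical name, and the longest-cluster condition is deliberately left out to keep it short.

```shell
# Hypothetical helper: print dictionary lines whose headword (the text
# before "=") ends with SYL. The mapping after "=" is never inspected.
final_matches() {
    awk -F= -v syl="$1" '
    {
        n = length($1)
        m = length(syl)
        # Literal comparison against the tail of field 1 only.
        if (n > m && substr($1, n - m + 1) == syl)
            print
    }'
}

printf '%s\n' 'sea=si' 'beta=bita' 'sit=sita' | final_matches a
# prints sea=si and beta=bita; sit=sita is skipped although the line ends in "a"
```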
Sorry once more for the lapse of memory and many thanks for your comprehension.

# 2  
Old 08-09-2015
Are the words Clusters and Dictionary in your sample input files intended to be the names of those files, or are they headers that actually appear as the first line in those files? When you first mentioned those files, you said they had the names clusters and dictionary (with the 1st character being a lowercase letter). In the code in your update to your post, they had the names Syllables and Corpus, respectively. So what are your actual filenames? (Note that case does matter in filenames on standards-conforming UNIX and Linux filesystems.)

Does the order in which the clusters appear in the output matter? If it does matter, will the lines in the cluster input file always be in an order such that all lines with the same number of characters are adjacent and the lines with fewer characters come before lines with more characters? (The code I'm playing with now will produce output with the shortest clusters first and, within each cluster length, will output the clusters in the order in which they appear in the input file. Extra code will be needed if that is not acceptable.)

In your sample output:
Code:
a
Init	3	act=akt,approach=eproch,already=alredi
Mid	1	ball
Fin	1	beta=bita

why is the format of the 3rd line different than the format of the 2nd and 4th lines? Why aren't the 2nd and 4th lines:
Code:
a
Init	3	act,approach,already
Mid	1	ball
Fin	1	beta

or, why isn't the 3rd line:
Code:
a
Init	3	act=akt,approach=eproch,already=alredi
Mid	1	ball=ball
Fin	1	beta=bita

so all of the output is in the same format?
# 3  
Old 08-09-2015
Many thanks for responding.
I understand your queries and would like to clarify the details so that the script is easier to understand.
DETAILS
The script invokes 2 files:
1. Syllables: A list of all the syllables.
2. Corpus: A list of words in Arabic script followed by their Indic equivalent, delimited by
Code:
=

EXPECTED FORMAT
In each case the output is supposed to spew out
a. The syllable in question whether it is Initial Medial or Final.
b. At least 6 to 10 examples (at present only one is spewed out)
c. Additional Bells and whistles: A frequency count of all the words [not present in my script: I don't know how to tailor two sets of counts]
In other words the output should be as under:
Code:
SYLLABLE: FREQUENCY 
Initial 6 EXAMPLES 
Medial 6 EXAMPLES 
Final 6 EXAMPLES 
Standalone 6 EXAMPLES

The example should have the String in Arabic and also in Indic script.
If there are none, or fewer than six, the output should say so. At present only one example is spewed out.
The script does work to a certain extent, but the following major problems remain.
PROBLEMS
1. The script should address only the Perso-Arabic side using the
Code:
=

delimiter and ignore the Indic side. It does not do that, and as a result not all final occurrences are shown: because of the delimiter, valid final occurrences in Arabic are not detected. I don't know how to instruct the program to limit analysis to the Arabic side of the corpus and ignore the rest.
2. I need at least 6-10 instances of tokens from the corpus file. At present only one is given.
3. If possible, the frequency should be provided. [I don't know how to tailor two sets of counts]
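
The "two sets of counts" in problem 3 can be kept as two separate counters: one that increases on every hit and one that only tracks how many examples have been stored. A minimal awk sketch for the initial position follows; initial_report and its arguments are hypothetical, and the same pattern would be repeated for the other positions.

```shell
# Hypothetical helper: count every headword that starts with SYL, but keep
# at most MAX example lines. Usage: initial_report SYL MAX < corpus
initial_report() {
    awk -F= -v syl="$1" -v max="$2" '
    length($1) > length(syl) && substr($1, 1, length(syl)) == syl {
        freq++                              # counter 1: every hit
        if (kept < max)                     # counter 2: examples stored so far
            ex = ex (kept++ ? "," : "") $0
    }
    END { printf "Initial\t%d\t%s\n", freq, (kept ? ex : "NONE") }'
}

printf '%s\n' 'act=akt' 'ball=ball' 'approach=eproch' 'already=alredi' |
    initial_report a 2
# prints: Initial	3	act=akt,approach=eproch
```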
I have racked my brains over this and all attempts to get this type of output have failed.
To make the scenario more clear I am attaching the data files as well as the script file.
I have tried again and again to modify the script but the desired formatted output is not spewed out.
One is never too old to learn and I still feel at my age I can master the intricacies of Perl and handle strings.
Many thanks for your help
# 4  
Old 08-09-2015
Thanks for the information. You didn't answer all of the questions, and I now have a new question. What is "Standalone"? Does it mean that the syllable appeared as an entire word? Am I correct in assuming that, with a in Syllables and a=a in Corpus, the standalone count (and only that count) should be incremented by 1? And that, with the word abracadabra=whatever in Corpus, the initial and final counts should each be incremented by 1, the medial count should be incremented by 3, and the standalone count should not change?
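
To make the counting rule in that question concrete, here is a small awk sketch that tallies the positions of one syllable in one word. The name position_counts is invented for illustration; it scans left to right, matches literally, and does not let matches overlap.

```shell
# Hypothetical helper: tally where SYL occurs in WORD.
# Usage: position_counts WORD SYL
position_counts() {
    awk -v word="$1" -v syl="$2" 'BEGIN {
        init = med = fin = alone = 0
        n = length(word)
        m = length(syl)
        if (word == syl) {
            alone = 1                       # the syllable is the whole word
        } else {
            for (i = 1; m > 0 && i + m - 1 <= n; ) {
                if (substr(word, i, m) == syl) {
                    if (i == 1)              init++
                    else if (i + m - 1 == n) fin++
                    else                     med++
                    i += m                  # skip past this match
                } else {
                    i++
                }
            }
        }
        printf "init=%d med=%d fin=%d alone=%d\n", init, med, fin, alone
    }'
}

position_counts abracadabra a   # prints: init=1 med=3 fin=1 alone=0
position_counts a a             # prints: init=0 med=0 fin=0 alone=1
```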

I am much more fluent with awk than perl. Will you accept an awk script instead of a perl script?

You didn't answer the question about output order. I assume the output order doesn't matter.

You didn't answer the question about headings in your input files. I assume that there are no headings in either of your input files.
# 5  
Old 08-09-2015
Many thanks for your interest.
A Standalone means that the particular syllable is also a word. To take an example from English
Code:
of

is a standalone syllable in the word
Code:
of

, but an initial in the word
Code:
office

Coming to your second question what I needed and tried to do was identify syllables in the corpus in terms of their positions
An Initial syllable would be a string from the Syllables list which comes in the beginning of the word.
A Medial would be a string from the Syllables list which comes in the middle of the word.
A Final would be a string from the Syllables list which comes at the end of the word.
In all cases I would be working only with the Arabic data and ignore all data to the right hand side of the delimiter
Code:
=

The Script would identify each syllable as per its position and identify the frequency and then provide for each at least six examples [in some cases there would be None or less than 6].
I trust the above clarifies a bit more the issue.
I do not mind an Awk script. In fact I started with AWK but felt the problem at hand was too complex for Awk to handle. I seem to be mistaken.
Many thanks once again for your help which eventually will help the community to develop a standard.
# 6  
Old 08-09-2015
Here is an awk attempt that addresses some but not all of your conditions/problems:
Code:
awk '
FNR==NR         {SYL[FNR]=$1
                 MX=FNR
                 next
                }
                {for (c=1; c<=MX; c++)
                        {s=SYL[c]
                         if ($1==s)     {STDAL[s]++
                                         next
                                        }
                         else if ($1 ~ "^"s)
                                        {if (++INIT[s] <= FREQMX)
                                                EXI[s]=EXI[s] "," $0
                                         next
                                        }
                         else if ($1 ~ s"$")
                                        {if (++FIN[s] <= FREQMX)
                                                EXF[s]=EXF[s] "," $0
                                         next
                                        }
                         else if ($1 ~ s)
                                        {MID[s]+=gsub(s,s)
                                         if (MID[s] <= FREQMX)
                                                EXM[s]=EXM[s] "," $0
                                         next
                                        }

                        }
                }
END             {for (c=1; c<=MX; c++)
                                {s=SYL[c]
                                 print s
                                 printf "Init:  %3d\tExample: %s\n", INIT[s], substr (EXI[s],2)
                                 printf "Mid:   %3d\tExample: %s\n", MID[s],  substr (EXM[s],2)
                                 printf "Fin:   %3d\tExample: %s\n", FIN[s],  substr (EXF[s],2)
                                 printf "Alone: %3d", STDAL[s]
                                        if (STDAL[s] > 0) printf "\tExample: %s", s
                                        printf "\n"
                                }
                }

' syllables FS="=" FREQMX=6 corpus
oi
Init:    0    Example: 
Mid:     0    Example: 
Fin:     0    Example: 
Alone:   0
oa
Init:    0    Example: 
Mid:     4    Example: coat=kot,load=lod,approach=eproch,goal=gol
Fin:     0    Example: 
Alone:   0
ai
Init:    0    Example: 
Mid:     4    Example: rain=ren,paint=pent,rail=rel,failure=felyer
Fin:     0    Example: 
Alone:   0
ea
Init:    2    Example: easy=izi,early=erli
Mid:     9    Example: beans=bins,please=pliz,beach=bich,leather=lethar,already=alredi,break=brek
Fin:     1    Example: sea=si
Alone:   0
ui
Init:    0    Example: 
Mid:     3    Example: juice=jus,fruit=frut,suit=sut
Fin:     0    Example: 
Alone:   0
a
Init:    1    Example: act=akt
Mid:     2    Example: ball=ball
Fin:     1    Example: beta=bita
Alone:   1    Example: a

It is tested on an extended version of your samples in post#1; the condition that "Only the largest string from the clusters file will be considered." is covered by having the larger clusters in front of the smaller ones, i.e. "a" is analysed after "ea" and "oa" etc. Unfortunately, some hits on "a" (already, approach) are lost as they are already counted in those clusters ("ea", "oa"). However, if you think this a promising approach, one could try to refine...
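
If the syllables file is not already ordered with the larger clusters in front, it could be reordered before the main script runs. A minimal sketch, assuming one cluster per line; longest_first is a hypothetical name and the original order is kept among clusters of equal length.

```shell
# Hypothetical helper: reorder a cluster list so longer clusters come first.
# Reads the named files, or standard input when no file is given.
longest_first() {
    awk 'NF { print length($1), NR, $1 }' "$@" |   # prefix length and line number
    sort -k1,1nr -k2,2n |                          # longest first, then input order
    awk '{ print $3 }'                             # drop the sort keys again
}

printf '%s\n' a oi 'oa ' ea ueue | longest_first
# prints ueue, oi, oa, ea, a (one per line)
```

Note that length() counts characters only in an awk that is aware of the input encoding; a byte-oriented awk would rank clusters by byte length instead.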

---------- Post updated at 14:30 ---------- Previous update was at 14:17 ----------

OK, this one
Code:
awk '
FNR==NR         {SYL[FNR]=$1
                 MX=FNR
                 next
                }
                {TOTLINE=$0
                 for (c=1; c<=MX; c++)
                        {s=SYL[c]
                         if ($1==s)     {STDAL[s]++
                                         next
                                        }
                         else if (gsub ("^"s, "@", $1))
                                        {if (++INIT[s] <= FREQMX)
                                                EXI[s]=EXI[s] "," TOTLINE
                                        }
                         else if (gsub (s"$", "@", $1))
                                        {if (++FIN[s] <= FREQMX)
                                                EXF[s]=EXF[s] "," TOTLINE
                                        }
                         else if (n=gsub (s, "@", $1))
                                        {MID[s]+=n
                                         if (MID[s] <= FREQMX)
                                                EXM[s]=EXM[s] "," TOTLINE
                                        }

                        }
                }
END             {for (c=1; c<=MX; c++)
                                {s=SYL[c]
                                 print s
                                 printf "Init:  %3d\tExample: %s\n", INIT[s], substr (EXI[s],2)
                                 printf "Mid:   %3d\tExample: %s\n", MID[s],  substr (EXM[s],2)
                                 printf "Fin:   %3d\tExample: %s\n", FIN[s],  substr (EXF[s],2)
                                 printf "Alone: %3d", STDAL[s]
                                        if (STDAL[s] > 0) printf "\tExample: %s", s
                                        printf "\n"
                                }
                }

' syllables~ FS="=" FREQMX=8 corpus
oi
Init:    0    Example: 
Mid:     0    Example: 
Fin:     0    Example: 
Alone:   0
oa
Init:    0    Example: 
Mid:     4    Example: coat=kot,load=lod,approach=eproch,goal=gol
Fin:     0    Example: 
Alone:   0
ai
Init:    0    Example: 
Mid:     4    Example: rain=ren,paint=pent,rail=rel,failure=felyer
Fin:     0    Example: 
Alone:   0
ea
Init:    2    Example: easy=izi,early=erli
Mid:     9    Example: beans=bins,please=pliz,beach=bich,leather=lethar,already=alredi,break=brek,bread=bred,heading=heding
Fin:     1    Example: sea=si
Alone:   0
ui
Init:    0    Example: 
Mid:     3    Example: juice=jus,fruit=frut,suit=sut
Fin:     0    Example: 
Alone:   0
a
Init:    3    Example: act=akt,approach=eproch,already=alredi
Mid:     1    Example: ball=ball
Fin:     1    Example: beta=bita
Alone:   1    Example: a

should cover the above mentioned problem. Please report back!

---------- Post updated at 14:39 ---------- Previous update was at 14:30 ----------

And this one
Code:
awk '
FNR==NR         {SYL[FNR]=$1
                 MX=FNR
                 next  
                }
                {TOTLINE=$0
                 for (c=1; c<=MX; c++)
                        {s=SYL[c]
                         if ($1==s)     {STDAL[s]++
                                         next
                                        }
                         if (gsub ("^"s, "@", $1))
                                        {if (++INIT[s] <= FREQMX)
                                                EXI[s]=EXI[s] "," TOTLINE
                                        }
                         if (gsub (s"$", "@", $1))
                                        {if (++FIN[s] <= FREQMX)
                                                EXF[s]=EXF[s] "," TOTLINE
                                        }
                         if (n=gsub (s, "@", $1))
                                        {MID[s]+=n
                                         if (MID[s] <= FREQMX)
                                                EXM[s]=EXM[s] "," TOTLINE
                                        }
                                         
                        }
                }
END             {for (c=1; c<=MX; c++)
                                {s=SYL[c]
                                 print s 
                                 printf "Init:  %3d\tExample: %s\n", INIT[s], substr (EXI[s],2)
                                 printf "Mid:   %3d\tExample: %s\n", MID[s],  substr (EXM[s],2)
                                 printf "Fin:   %3d\tExample: %s\n", FIN[s],  substr (EXF[s],2)
                                 printf "Alone: %3d", STDAL[s]
                                        if (STDAL[s] > 0) printf "\tExample: %s", s
                                        printf "\n"
                                }
                }

' syllables~ FS="=" FREQMX=10 corpus
a
Init:    4    Example: act=akt,approach=eproch,alabama=asdfjg,already=alredi
Mid:     3    Example: ball=ball,alabama=asdfjg
Fin:     2    Example: beta=bita,alabama=asdfjg
Alone:   1    Example: a

would cover even the case of "alabama".
# 7  
Old 08-09-2015
The following seems to address all of your issues (although I had to guess on the format of some things):
Code:
#!/bin/ksh
sf="syllables"
cf="corpus"
sample_max=${3:-10}
awk -v sm="$sample_max" '
{	gsub(/\r/, "")
}
FNR == NR {
	# Read the list of syllables to be processed...
	len = length($1)
	if(NR == 1) 
		l = L = len
	else	if(l > len)
			l = len
		else if(L < len)
			L = len
	syl[len, ++lenc[len]] = $1
	next
}
{	# Accumulate data from dictionary entries...
	len = length(word = $1)
	for(i = (L > len) ? len : L; i > 0 && len > 0; i--)
		for(j = 1; j <= lenc[i]; j++) {
			s = syl[i, j]
			if(s == word) {
				# Process standalone match...
				++sa[s]
				len = 0
				if(sasc[s] < sm)
					sasam[s, ++sasc[s]] = $0
			} else {if(sub("^"s, "\a", word)) {
					# Process initial match...
					++init[s]
					len--
					if(insc[s] < sm)
						insam[s, ++insc[s]] = $0
				}
				if(sub(s"$", "\a", word)) {
					# Process finish match...
					++fin[s]
					len--
					if(fisc[s] < sm)
						fisam[s, ++fisc[s]] = $0
				}
				if(c = gsub(s, "\a", word)) {
					# Process medial matches...
					med[s] += c
					len -= c
					if(mesc[s] < sm)
						mesam[s, ++mesc[s]] = $0
				}
			}
		}
}
END {	# Dump collected data...
	for(i = l; i <= L; i++) {
		for(j = 1; j <= lenc[i]; j++) {
			printf("%-11s %7s\n", (s = syl[i, j]) ":",
				sa[s] + init[s] + fin[s] + med[s])
			printf("%-11s %7s%s", "Initial",
				init[s] ? init[s]: "NONE",
				init[s] ? "\t" : "\n")
			for(k = 1; k <= insc[s]; k++)
				printf("%s%s", insam[s, k],
					k == insc[s] ? "\n" : ",")
			printf("%-11s %7s%s", "Medial",
				med[s] ? med[s]: "NONE",
				med[s] ? "\t" : "\n")
			for(k = 1; k <= mesc[s]; k++)
				printf("%s%s", mesam[s, k],
					k == mesc[s] ? "\n" : ",")
			printf("%-11s %7s%s", "Final",
				fin[s] ? fin[s]: "NONE",
				fin[s] ? "\t" : "\n")
			for(k = 1; k <= fisc[s]; k++)
				printf("%s%s", fisam[s, k],
					k == fisc[s] ? "\n" : ",")
			printf("%-11s %7s%s", "Standalone",
				sa[s] ? sa[s]: "NONE",
				sa[s] ? "\t" : "\n")
			for(k = 1; k <= sasc[s]; k++)
				printf("%s%s", sasam[s, k],
					k == sasc[s] ? "\n" : ",")
			print ""
		}
	}
}' "$sf" FS="[=]" "$cf"

Note that this uses the default FS when processing syllables so it ignores extraneous spaces (such as the space after oa ) in your sample syllables file and uses the = as the field separator for the corpus file.
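
The effect of placing an FS assignment between the two file operands can be seen in isolation with a tiny demo. Everything here (the function name fs_demo and the temporary files) is made up for illustration.

```shell
# Hypothetical demo: default FS for the first file, FS="=" for the second.
fs_demo() {
    dir=$(mktemp -d) || return 1
    printf 'oa \nea\n' > "$dir/syllables"          # note the space after "oa"
    printf 'coat=kot\nsea=si\n' > "$dir/corpus"
    awk '
    FNR == NR { printf "syllable [%s]\n", $1; next }
              { printf "word [%s] ipa [%s]\n", $1, $2 }
    ' "$dir/syllables" FS='=' "$dir/corpus"        # FS changes before corpus is read
    rm -rf "$dir"
}

fs_demo
# prints:
# syllable [oa]
# syllable [ea]
# word [coat] ipa [kot]
# word [sea] ipa [si]
```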

With the file syllables containing:
Code:
a
ue
ueue
oi
oa 
ai
ea
ui

(the additional lines are ue and ueue). And, with the corpus file containing:
Code:
act=akt
ball=ball
beta=bita
coat=kot
load=lod
approach=eproch
goal=gol
rain=ren
paint=pent
rail=rel
failure=felyer
sea=si
beans=bins
easy=izi
please=pliz
beach=bich
leather=lethar
already=alredi
early=erli
break=brek
bread=bred
juice=jus
fruit=frut
suit=sut
queue=ku
query=kwiri
a=a1
a=a2
a=a3
a=a4
a=a5
a=a6
a=a7
a=a8
a=a9
a=a10
a=a11
abracadabra=abracadabra

(again with additional lines, from queue=ku to the end), and with the code above stored in a file named conc that has been made executable, you can see the results from running that code below:
Code:
$ ./conc
a:               21
Initial           4	act=akt,approach=eproch,already=alredi,abracadabra=abracadabra
Medial            4	ball=ball,abracadabra=abracadabra
Final             2	beta=bita,abracadabra=abracadabra
Standalone       11	a=a1,a=a2,a=a3,a=a4,a=a5,a=a6,a=a7,a=a8,a=a9,a=a10

ue:               1
Initial        NONE
Medial            1	query=kwiri
Final          NONE
Standalone     NONE

oi:               0
Initial        NONE
Medial         NONE
Final          NONE
Standalone     NONE

oa:               4
Initial        NONE
Medial            4	coat=kot,load=lod,approach=eproch,goal=gol
Final          NONE
Standalone     NONE

ai:               4
Initial        NONE
Medial            4	rain=ren,paint=pent,rail=rel,failure=felyer
Final          NONE
Standalone     NONE

ea:              10
Initial           2	easy=izi,early=erli
Medial            7	beans=bins,please=pliz,beach=bich,leather=lethar,already=alredi,break=brek,bread=bred
Final             1	sea=si
Standalone     NONE

ui:               3
Initial        NONE
Medial            3	juice=jus,fruit=frut,suit=sut
Final          NONE
Standalone     NONE

ueue:             1
Initial        NONE
Medial         NONE
Final             1	queue=ku
Standalone     NONE

$

which I think is close to (if not exactly) what you want.

Unfortunately, it doesn't even come close to working with the data you supplied in the zip file you uploaded. The code has been set up to remove the DOS format <carriage-return> characters in both of your input files, but that can't make up for the fact that your uploaded corpus file contains LOTS of lines with no equal sign characters and LOTS of lines that have a 1st character that is an equal sign. Both files also contain lots of byte sequences that do not form valid UTF-8 characters. So, with the two files you uploaded, it produces the following output:
Code:
آ:             555
Initial         548	آنڊن=आंडनि,آنڊا=आंडा,آنڊي=आंडे,آنڊو=आंडो,آندو=आंदो,آنڌِ=आंधिमांधि,آنوِرن=आंविरनि,آنوِرا=आंविरा,آنوِري=आंविरे,آنوِرو=आंविरो
Medial            6	غيَرآبادُ=ग़ैरआबादु,دوآب=दोआब,دوآبن=दोआबनि,دوآبُ=दोआबु,زيرآبيِ=ज़ेरआबी,زيرآبيِئَ=ज़ेरआबीअ
Final          NONE
Standalone        1	آ=आ

:              1
Initial           1	اَ=अ
Medial         NONE
Final          NONE
Standalone     NONE

اَ:          1367
Initial        1352	اَنگ=अंग,اَنگَڻُ=अंगणु,اَنگن=अंगनि,اَنگَلَ=अंगल,اَنگَلن=अंगलनि,اَنگَلُ=अंगलु,اَنگِليِ=अंगिली,اَنگيِڪارُ=अंगीकारु,اَنگُ=अंगु,اَنگُستانن=अंगुस्ताननि
Medial           14	اَفسانناَويِسن=अफ़सानानवीसनि,دَراَصَلِ=दरअसलि,نَظَراَندازُ=नज़रअंदाज़ु,بااَدَبُ=बाअदबु,بااَثَرُ=बाअसरि,سَراَنجاميُن=सरअंजामियुनि,سَراَنجاميوُن=सरअंजामियूं,سَراَنجاميِ=सरअंजामी,سَراَنجاميِئَ=सरअंजामीअ,خيراَنديشيُن=ख़ैरअंदेशियुनि
Final             1	اَ=अ
Standalone     NONE

وَ:           942
Initial         426	وَنگ=वंग,وَنگن=वंगनि,وَنگا=वंगा,وَنگيُن=वंगियुनि,وَنگيوُن=वंगियूं,وَنگيِ=वंगी,وَنگيِئَ=वंगीअ,وَنگُ=वंगु,وَنگي=वंगे,وَنگو=वंगो
Medial          449	اَٽِڪاوَن=अटिकावनि,اَٺُوَنجاهيِن=अठुवंजाहीं,اَٺُوَنجاهُە=अठुवंजाहु,اَٺُوَنجاهين=अठुवंजाहें,اَٺُوَنجاهون=अठुवंजाहों,اَڻَوَنڊِئي=अणवंडिए,اَڻَوَنڊِيَن=अणवंडियनि,اَڻَوَنڊِيا=अणवंडिया,اَڻَوَنڊِيُن=अणवंडियुनि,اَڻَوَنڊِيوُن=अणवंडियूं
Final            67	اَٽِڪاوَ=अटिकाव,اَٿَوَ=अथव,اِطِلاوَ=इतिलाव,ايِذاوَ=ईज़ाव,اُڪِلاوَ=उकिलाव,اُپاوَ=उपाव,اوَ=औ,ڪَهڪاوَ=कहिकाव,ڪانوَ=कांव,گهَٽاوَ=घटाव
Standalone     NONE

وا:          1086
Initial         318	واهيِ=ओहीवाहीअ,واري=पुॼंदीअवारनि,وارن=पुॼंदीअवारियुनि,وارا=पुॼंदीअवारीअ,وارو=पुॼंदीअवारे,واري=फेरीअवारनि,وارو=फेरीअवारे,واري=मिठाईअवारनि,وارو=मिठाईअवारे,وانگُرُ=वांगुरु
Medial          737	اَڻَواقُفُ=अणवाक़ुफ़ु,اَپواد=अपवाद,اَپوادَن=अपवादनि,اَپوادُ=अपवादु,اَوائِليِ=अवाइली,عَواميِ=अवामी,عَوامُ=अवामु,اَسُوار=असुवार,اَسُوارن=असुवारनि,اَسُوارُ=असुवारु
Final            31	اَڻاوا=अणावा,عَلاوا=अलावा,اَغُوا=अग़ुवा,آبهَوا=आबहवा,عُضوا=उज्*़वा,جَلِوا=जलिवा,تَوا=तवा,دَوا=दवा,پَٿراوا=पथिरावा,پَراوا=परावा
Standalone     NONE

يِ:          7941
Initial        NONE
Medial         4998	اَنگيِڪارُ=अंगीकारु,اَنگوُڙيِئَ=अंगूड़ीअ,اَنجيِرَ=अंजीर,اَنجيِرن=अंजीरनि,اَنجيِرُ=अंजीरु,اَندِريِن=अंदिरीं,اَندِريِنئَ=अंदिरींअ,اَنڌائيِئَ=अंधाईअ,اَنڌاريِئَ=अंधारीअ,اَنڌيِئَ=अंधीअ
Final          2943	اَنگِليِ=अंगिली,اَنگوُڙيِ=अंगूड़ी,اَنگوُريِ=अंगूरी,اَنتَرياميِ=अंतरियामी,اَندروُنيِ=अंदिरूनी,اَنڌائيِ=अंधाई,اَنڌاريِ=अंधारी,اَنڌيِ=अंधी,اَنڌيرگَرديِ=अंधेरगर्दी,نَگريِ=अंधेरनगरीअ
Standalone     NONE

which to me seems to be garbage. I don't know enough about Arabic or Indic to make any guess at whether or not your input files could be cleaned up programmatically. If they can be, I don't have the expertise to do it unless you can provide explicit directions on how to do it.
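
As a first step towards cleaning such input, the structurally broken lines could at least be listed. The sketch below is only that: check_corpus is a hypothetical name, it reports lines with no equals sign or with nothing before it, and it makes no attempt to detect invalid UTF-8 byte sequences.

```shell
# Hypothetical helper: report corpus lines that cannot be split into
# headword=mapping. Reads the named files, or standard input.
check_corpus() {
    awk '
    { sub(/\r$/, "") }                      # drop a DOS carriage return, if any
    index($0, "=") == 0 { printf "line %d: no equal sign: %s\n", NR, $0; bad++; next }
    index($0, "=") == 1 { printf "line %d: empty headword: %s\n", NR, $0; bad++; next }
    END { printf "%d malformed line(s)\n", bad }
    ' "$@"
}

printf 'act=akt\r\nball\n=bita\nsea=si\n' | check_corpus
# prints:
# line 2: no equal sign: ball
# line 3: empty headword: =bita
# 2 malformed line(s)
```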