Writing a clustering concordance for a Perso-Arabic script

08-09-2015

Many thanks for the awk script; it seems to be running. Initially I used a very large syllable list in which some syllables were contained in longer syllables. Those shorter syllables produced no output, since once a longer syllable was matched, any syllable contained in it was automatically excluded. I assume this is why the large syllable list did not yield results. I also sorted the syllable list by length, longest first, and this did improve the output.
Thanks a lot.
P.S. I just got a mail from Mr Don Cragun, who has also proposed an awk solution, saying that my corpus had flaws. I corrected them and the output is now as desired. Many thanks for your help.
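
As an illustration of the longest-first ordering mentioned above, here is a minimal sketch using standard tools. It is not the script discussed in this thread; the function name `sort_by_length_desc` and the sample syllables are made up for the example. Note that `length()` counts characters only in awk implementations and locales that are multibyte-aware; some awks count bytes, which would skew the order for Perso-Arabic text.

```shell
#!/bin/sh
# Hypothetical helper: read one syllable per line on stdin and
# write the lines back out, longest first.
sort_by_length_desc() {
	# Prefix each line with its length, sort numerically in reverse
	# on that prefix (ties broken by the text itself), then strip it.
	awk '{ print length($0) "\t" $0 }' |
	sort -k1,1nr -k2 |
	cut -f2-
}

# Made-up sample data, ASCII only to keep the demo locale-independent.
printf '%s\n' ab abcd a abc | sort_by_length_desc
```

With the sample input this prints abcd, abc, ab and a, one per line.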

---------- Post updated at 08:44 AM ---------- Previous update was at 08:39 AM ----------

Thanks a lot. I am so sorry that the corpus data was faulty; I should have checked it first. I removed all the garbage from the data and the script ran just fine. I had a syllable list of over 300 syllables and a corpus of around 37000 and the script sent by [...] The data was sent by the community and, since it was entered by hand, it had flaws. The responsibility is entirely mine.
Your solution as well as RudiC's worked just great. Many thanks once more to the forum for its generous help.

gimley

08-09-2015

You might also want to try the following:

Code:

#!/bin/ksh
sf="syllables"
cf="corpus"
sample_max=10
awk -v sm="$sample_max" '
{	gsub(/\r/, "")
}
FNR == NR {
	# Read the list of syllables to be processed...
	len = length($1)
	if(NR == 1) 
		l = L = len
	else	if(l > len)
			l = len
		else if(L < len)
			L = len
	syl[len, ++lenc[len]] = $1
	next
}
{	# Accumulate data from dictionary entries...
	len = length(word = $1)
	for(i = (L > len) ? len : L; i >= l && len >= l; i--)
		for(j = 1; j <= lenc[i]; j++) {
			# If the syllables matched so far leave fewer unmatched
			# characters in word than we are currently trying to
			# match, short circuit to a shorter syllable length...
			if(len < i)
				break
			s = syl[i, j]
			if(s == word) {
				# Process standalone match...
				++sa[s]
				len = 0
				if(sasc[s] < sm)
					sasam[s] = (sasc[s]++ ? \
					    sasam[s] "," : "\t") $0
			} else {
				if(sub("^"s, "\a", word)) {
					# Process initial match...
					++init[s]
					len -= i
					if(insc[s] < sm)
						insam[s] = (insc[s]++ ? \
						    insam[s] "," : "\t") $0
				}
				if(sub(s"$", "\a", word)) {
					# Process finish match...
					++fin[s]
					len -= i
					if(fisc[s] < sm)
						fisam[s] = (fisc[s]++ ? \
						    fisam[s] "," : "\t") $0
				}
				if(c = gsub(s, "\a", word)) {
					# Process medial matches...
					med[s] += c
					len -= c * i
					if(mesc[s] < sm)
						mesam[s] = (mesc[s]++ ? \
						    mesam[s] "," : "\t") $0
				}
			}
		}
}
END {	# Dump collected data...
	for(i = l; i <= L; i++) {
		for(j = 1; j <= lenc[i]; j++) {
			printf("%-11s %7s\n", (s = syl[i, j]) ":",
			    sa[s] + init[s] + fin[s] + med[s])
			printf("%-11s %7s%s\n", "Initial",
			    init[s] ? init[s] : "NONE", insam[s])
			printf("%-11s %7s%s\n", "Medial",
			    med[s] ? med[s] : "NONE", mesam[s])
			printf("%-11s %7s%s\n", "Final",
			    fin[s] ? fin[s] : "NONE", fisam[s])
			printf("%-11s %7s%s\n\n", "Standalone",
			    sa[s] ? sa[s] : "NONE", sasam[s])
		}
	}
}' "$sf" FS="[=]" "$cf"

It produces exactly the same output as my earlier suggestion, but incorporates improvements from RudiC's suggestion and finishes performance enhancements that were incomplete in my earlier post. This script will run faster if you have a lot of relatively short words and some relatively long syllables, words that match relatively long syllables (leaving a small number of unmatched characters), and words that contain a few medium syllables that have been matched (again leaving a small number of unmatched characters). (Further improvements could be made by keeping track of the longest sequence of unmatched characters instead of just keeping track of the number of unmatched characters, but I'll leave that as an exercise for the reader.)
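
One possible reading of that exercise, as a sketch only and not part of the script above: since matched syllables are replaced by `\a` in `word`, the longest unmatched run is the longest piece left when `word` is split on `\a`. The awk function `longest_run`, the shell wrapper `longest_unmatched_run`, and the demo input layout (a word followed by the syllables to mark as matched) are all hypothetical. As in the script, syllables are used as regular expressions, and `length()` is assumed to count characters rather than bytes.

```shell
#!/bin/sh
# Hypothetical demo: each input line holds a word followed by the
# syllables to treat as already matched in that word.
longest_unmatched_run() {
	awk '
	# Return the length of the longest stretch of characters in w
	# that has not been replaced by the "\a" marker.
	function longest_run(w,    parts, n, k, best) {
		n = split(w, parts, "\a")
		best = 0
		for(k = 1; k <= n; k++)
			if(length(parts[k]) > best)
				best = length(parts[k])
		return best
	}
	{
		word = $1
		# Mark every occurrence of each listed syllable as matched.
		for(i = 2; i <= NF; i++)
			gsub($i, "\a", word)
		print longest_run(word)
	}'
}

# Made-up sample: "ab" and "cd" are matched, leaving xx, yyy and z.
printf '%s\n' 'xxabyyycdz ab cd' | longest_unmatched_run
```

For the sample line this prints 3. In a loop like the one in the script, a syllable length `i` larger than this value could not match anything that is left, so it could presumably be skipped, at the cost of recomputing the value after each successful substitution.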

Cheers,
Don

Don Cragun

08-09-2015

Sorry for the late response: it was night here when you posted this solution. I tested this script and it does give better results. Many thanks for all your help.

gimley
