Writing a clustering concordance for a Perso-Arabic script


 
# 8  
Old 08-09-2015
Many thanks for the awk script. It seems to be running. Initially I used a very large syllable list in which some syllables were contained in larger syllables. Those shorter syllables gave no output because, once a larger syllable had been matched, any syllable contained in it was automatically excluded. I assume this is why the large syllable list did not yield results. I also sorted the syllable list by length, longest first, and this did improve the output.
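The length-first ordering described above could be produced with something like the following. This is only a sketch: `sort_by_length` is a hypothetical helper name, it assumes one syllable per line, and `length()` counts characters rather than bytes only when awk is multibyte-aware and run in a UTF-8 locale.

```shell
# Hypothetical helper: print the lines of the file named in $1, longest first.
# Each line is prefixed with its length, sorted numerically in descending
# order on that prefix, and the prefix is then stripped again.
sort_by_length() {
	awk '{ gsub(/\r/, ""); print length($0) "\t" $0 }' "$1" |
	sort -k1,1nr |
	cut -f2-
}

# Possible usage (file names are assumptions):
#   sort_by_length syllables > syllables.sorted
```

Syllables of equal length come out in whatever order `sort` gives them, which should not matter for longest-first matching.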
Thanks a lot
P.S. I just got a mail from Mr Don Cragun, who has also proposed an awk solution, saying that my corpus had flaws. I corrected them and the output is now as desired. Many thanks for your help.

---------- Post updated at 08:44 AM ---------- Previous update was at 08:39 AM ----------

Thanks a lot. I am so sorry that the corpus data was faulty; I should have checked it. I removed all the garbage from the data and the script ran just fine. I had a syllable list of over 300 syllables and a corpus of around 37000 and the script sent by [...] The data was sent by the community and, since it was entered by hand, it had flaws. The responsibility is entirely mine.
Your solution as well as RudiC's worked just great. Many thanks once more to the forum for its generous help.
# 9  
Old 08-09-2015
You might also want to try the following:
Code:
#!/bin/ksh
sf="syllables"
cf="corpus"
sample_max=10
awk -v sm="$sample_max" '
{	gsub(/\r/, "")
}
FNR == NR {
	# Read the list of syllables to be processed...
	len = length($1)
	if(NR == 1) 
		l = L = len
	else	if(l > len)
			l = len
		else if(L < len)
			L = len
	syl[len, ++lenc[len]] = $1
	next
}
{	# Accumulate data from dictionary entries...
	len = length(word = $1)
	for(i = (L > len) ? len : L; i >= l && len >= l; i--)
		for(j = 1; j <= lenc[i]; j++) {
			# If the syllables already matched leave fewer unmatched
			# characters in word than we are currently trying to
			# match, short circuit to a shorter syllable length...
			if(len < i)
				break
			s = syl[i, j]
			if(s == word) {
				# Process standalone match...
				++sa[s]
				len = 0
				if(sasc[s] < sm)
					sasam[s] = (sasc[s]++ ? \
					    sasam[s] "," : "\t") $0
			} else {if(sub("^"s, "\a", word)) {
					# Process initial match...
					++init[s]
					len -= i
					if(insc[s] < sm)
						insam[s] = (insc[s]++ ? \
						    insam[s] "," : "\t") $0
				}
				if(sub(s"$", "\a", word)) {
					# Process finish match...
					++fin[s]
					len -= i
					if(fisc[s] < sm)
						fisam[s] = (fisc[s]++ ? \
						    fisam[s] "," : "\t") $0
				}
				if(c = gsub(s, "\a", word)) {
					# Process medial matches...
					med[s] += c
					len -= c * i
					if(mesc[s] < sm)
						mesam[s] = (mesc[s]++ ? \
						    mesam[s] "," : "\t") $0
				}
			}
		}
}
END {	# Dump collected data...
	for(i = l; i <= L; i++) {
		for(j = 1; j <= lenc[i]; j++) {
			printf("%-11s %7s\n", (s = syl[i, j]) ":",
			    sa[s] + init[s] + fin[s] + med[s])
			printf("%-11s %7s%s\n", "Initial",
			    init[s] ? init[s] : "NONE", insam[s])
			printf("%-11s %7s%s\n", "Medial",
			    med[s] ? med[s] : "NONE", mesam[s])
			printf("%-11s %7s%s\n", "Final",
			    fin[s] ? fin[s] : "NONE", fisam[s])
			printf("%-11s %7s%s\n\n", "Standalone",
			    sa[s] ? sa[s] : "NONE", sasam[s])
		}
	}
}' "$sf" FS="[=]" "$cf"
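To see the positional classification in isolation, here is a stripped-down sketch for a single syllable and a single word. The `classify` function is a hypothetical helper, not part of the script above. It uses the same order of tests (standalone, then initial, then final, then medial) and the same trick of replacing each matched syllable with a BEL character so it cannot be counted twice. Like the script, it treats the syllable as a regular expression, so it assumes the syllable contains no regex metacharacters.

```shell
# Hypothetical helper: report where syllable $1 occurs in word $2.
classify() {
	awk -v s="$1" -v word="$2" 'BEGIN {
		# The whole word is the syllable: standalone match.
		if (s == word) { print "standalone"; exit }
		# Each match is replaced by "\a" so later tests skip it.
		if (sub("^" s, "\a", word)) print "initial"
		if (sub(s "$", "\a", word)) print "final"
		# Whatever is left can only be in the middle of the word.
		if (c = gsub(s, "\a", word)) print "medial x" c
	}'
}

classify ab abxabyab
classify ab ab
```

The first call should report an initial, a final and one medial match; the second should report a standalone match.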

It produces exactly the same output as my earlier suggestion, but incorporates improvements from RudiC's suggestion and completes performance enhancements that were left unfinished in my earlier post. This script will run faster if you have a lot of relatively short words and some relatively long syllables, words that match relatively long syllables (leaving a small number of unmatched characters), or words that contain a few matched medium-length syllables (again leaving a small number of unmatched characters). (Further improvements could be made by keeping track of the longest sequence of unmatched characters instead of just the number of unmatched characters, but I'll leave that as an exercise for the reader.)
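One possible starting point for that exercise, offered as a sketch rather than a finished change to the script: since matched syllables are replaced by a BEL character, the longest run of unmatched characters is the longest piece left when the word is split on BEL. The helper name `longest_unmatched` is hypothetical.

```shell
# Hypothetical helper: length of the longest run of characters in $1 that
# has not been replaced by the BEL ("\a") marker.
longest_unmatched() {
	printf '%s\n' "$1" | awk '{
		# Split on the marker; every piece is a run of unmatched characters.
		n = split($0, run, "\a")
		max = 0
		for (k = 1; k <= n; k++)
			if (length(run[k]) > max)
				max = length(run[k])
		print max
	}'
}

# A word whose first and fifth characters have already been matched:
longest_unmatched "$(printf '\axyz\aq')"
```

Inside the main loop, the same `split()` logic could presumably be used in place of the `len < i` test, because a syllable of length `i` cannot fit in a word whose longest unmatched run is shorter than `i`. Whether the extra `split()` call pays for itself would need to be measured on the real corpus.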

Cheers,
Don
# 10  
Old 08-09-2015
Sorry for the late response: it was night here when you posted this solution. I tested this script and it does give better results. Many thanks for all your help.