Writing a clustering concordance for a Perso-Arabic script
Post 302951664 by RudiC on Sunday 9th of August 2015 08:39:29 AM
Here is an awk attempt that addresses some, but not all, of your conditions/problems:
Code:
awk '
FNR==NR         {SYL[FNR]=$1            # first file (syllables): store the clusters in file order
                 MX=FNR                 # number of clusters read so far
                 next
                }
                {for (c=1; c<=MX; c++)
                        {s=SYL[c]
                         if ($1==s)     {STDAL[s]++
                                         next
                                        }
                         else if ($1 ~ "^"s)
                                        {if (++INIT[s] <= FREQMX)
                                                EXI[s]=EXI[s] "," $0
                                         next
                                        }
                         else if ($1 ~ s"$")
                                        {if (++FIN[s] <= FREQMX)
                                                EXF[s]=EXF[s] "," $0
                                         next
                                        }
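                         else if ($1 ~ s)
                                        {# gsub(s,s) leaves the text unchanged and returns the hit count;
                                         # it works on $0, so hits after the "=" are counted too
                                         MID[s]+=gsub(s,s)
                                         if (MID[s] <= FREQMX)
                                                EXM[s]=EXM[s] "," $0
                                         next
                                        }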

                        }
                }
END             {for (c=1; c<=MX; c++)
                                {s=SYL[c]
                                 print s
                                 printf "Init:  %3d\tExample: %s\n", INIT[s], substr (EXI[s],2)
                                 printf "Mid:   %3d\tExample: %s\n", MID[s],  substr (EXM[s],2)
                                 printf "Fin:   %3d\tExample: %s\n", FIN[s],  substr (EXF[s],2)
                                 printf "Alone: %3d", STDAL[s]
                                        if (STDAL[s] > 0) printf "\tExample: %s", s
                                        printf "\n"
                                }
                }

' syllables FS="=" FREQMX=6 corpus
oi
Init:    0    Example: 
Mid:     0    Example: 
Fin:     0    Example: 
Alone:   0
oa
Init:    0    Example: 
Mid:     4    Example: coat=kot,load=lod,approach=eproch,goal=gol
Fin:     0    Example: 
Alone:   0
ai
Init:    0    Example: 
Mid:     4    Example: rain=ren,paint=pent,rail=rel,failure=felyer
Fin:     0    Example: 
Alone:   0
ea
Init:    2    Example: easy=izi,early=erli
Mid:     9    Example: beans=bins,please=pliz,beach=bich,leather=lethar,already=alredi,break=brek
Fin:     1    Example: sea=si
Alone:   0
ui
Init:    0    Example: 
Mid:     3    Example: juice=jus,fruit=frut,suit=sut
Fin:     0    Example: 
Alone:   0
a
Init:    1    Example: act=akt
Mid:     2    Example: ball=ball
Fin:     1    Example: beta=bita
Alone:   1    Example: a

It was tested on an extended version of your samples in post #1. The condition that "Only the largest string from the clusters file will be considered." is covered by listing the larger clusters ahead of the smaller ones in the syllables file, i.e. "a" is analysed after "ea", "oa", etc. Unfortunately, some hits on "a" (already, approach) are lost, because each word is dropped (next) as soon as it has been counted for one of those clusters ("ea", "oa"). However, if you think this is a promising approach, one could try to refine it...
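
The script relies on the syllables file listing longer clusters before shorter ones. If the file is not already in that order, something like the following might do the reordering. This is only a sketch: sort_by_length is a made-up helper name, and length() counts bytes rather than characters in some awk versions/locales, which would matter for multi-byte Perso-Arabic text.

```shell
# Hypothetical helper (not part of the original script): print the input
# lines longest-first, keeping the input order among lines of equal length.
sort_by_length() {
    awk '
        { len = length($0)
          if (len > max) max = len
          bucket[len] = bucket[len] $0 "\n"     # collect lines per length
        }
    END { for (l = max; l >= 1; l--) printf "%s", bucket[l] }   # empty lines are dropped
    '
}

# Small demonstration with made-up clusters
sorted=$(printf '%s\n' a ea oi eau | sort_by_length)
printf '%s\n' "$sorted"
```

It could be used as sort_by_length < syllables > syllables.sorted (the output file name is just an example), with the sorted file then passed to awk in place of syllables.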

---------- Post updated at 14:30 ---------- Previous update was at 14:17 ----------

OK, this one
Code:
awk '
FNR==NR         {SYL[FNR]=$1
                 MX=FNR
                 next
                }
                {TOTLINE=$0             # keep the untouched record; $1 gets masked with "@" below
                 for (c=1; c<=MX; c++)
                        {s=SYL[c]
                         if ($1==s)     {STDAL[s]++
                                         next
                                        }
                         else if (gsub ("^"s, "@", $1))
                                        {if (++INIT[s] <= FREQMX)
                                                EXI[s]=EXI[s] "," TOTLINE
                                        }
                         else if (gsub (s"$", "@", $1))
                                        {if (++FIN[s] <= FREQMX)
                                                EXF[s]=EXF[s] "," TOTLINE
                                        }
                         else if (n=gsub (s, "@", $1))
                                        {MID[s]+=n
                                         if (MID[s] <= FREQMX)
                                                EXM[s]=EXM[s] "," TOTLINE
                                        }

                        }
                }
END             {for (c=1; c<=MX; c++)
                                {s=SYL[c]
                                 print s
                                 printf "Init:  %3d\tExample: %s\n", INIT[s], substr (EXI[s],2)
                                 printf "Mid:   %3d\tExample: %s\n", MID[s],  substr (EXM[s],2)
                                 printf "Fin:   %3d\tExample: %s\n", FIN[s],  substr (EXF[s],2)
                                 printf "Alone: %3d", STDAL[s]
                                        if (STDAL[s] > 0) printf "\tExample: %s", s
                                        printf "\n"
                                }
                }

' syllables~ FS="=" FREQMX=8 corpus
oi
Init:    0    Example: 
Mid:     0    Example: 
Fin:     0    Example: 
Alone:   0
oa
Init:    0    Example: 
Mid:     4    Example: coat=kot,load=lod,approach=eproch,goal=gol
Fin:     0    Example: 
Alone:   0
ai
Init:    0    Example: 
Mid:     4    Example: rain=ren,paint=pent,rail=rel,failure=felyer
Fin:     0    Example: 
Alone:   0
ea
Init:    2    Example: easy=izi,early=erli
Mid:     9    Example: beans=bins,please=pliz,beach=bich,leather=lethar,already=alredi,break=brek,bread=bred,heading=heding
Fin:     1    Example: sea=si
Alone:   0
ui
Init:    0    Example: 
Mid:     3    Example: juice=jus,fruit=frut,suit=sut
Fin:     0    Example: 
Alone:   0
a
Init:    3    Example: act=akt,approach=eproch,already=alredi
Mid:     1    Example: ball=ball
Fin:     1    Example: beta=bita
Alone:   1    Example: a

should cover the above-mentioned problem. Please report back!
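
What changed in this version: every hit is replaced by "@" in $1, so the letters used up by a longer cluster can no longer match a shorter one, while the word itself stays available for the remaining clusters. A stripped-down sketch of just that masking step (mask_demo is a made-up name; it assumes "@" never occurs in the corpus and that the clusters contain no regex metacharacters):

```shell
# Hypothetical demo: $1 = word, remaining arguments = clusters, longest first
mask_demo() {
    word=$1; shift
    printf '%s\n' "$@" | awk -v w="$word" '
        { s = $0
          n = gsub(s, "@", w)           # mask every hit so later clusters skip it
          print s ": " n " hit(s), word is now " w
        }'
}

masked=$(mask_demo approach oa a)
printf '%s\n' "$masked"
```

For "approach" this reports one hit for "oa" (leaving appr@ch) and then one hit for "a" (leaving @ppr@ch), which is why "approach" now shows up under both clusters in the output above.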

---------- Post updated at 14:39 ---------- Previous update was at 14:30 ----------

And this one
Code:
awk '
FNR==NR         {SYL[FNR]=$1
                 MX=FNR
                 next  
                }
                {TOTLINE=$0
                 for (c=1; c<=MX; c++)
                        {s=SYL[c]
                         if ($1==s)     {STDAL[s]++
                                         next
                                        }
                         # plain "if" instead of "else if": one word can now score in several positions
                         if (gsub ("^"s, "@", $1))
                                        {if (++INIT[s] <= FREQMX)
                                                EXI[s]=EXI[s] "," TOTLINE
                                        }
                         if (gsub (s"$", "@", $1))
                                        {if (++FIN[s] <= FREQMX)
                                                EXF[s]=EXF[s] "," TOTLINE
                                        }
                         if (n=gsub (s, "@", $1))
                                        {MID[s]+=n
                                         if (MID[s] <= FREQMX)
                                                EXM[s]=EXM[s] "," TOTLINE
                                        }
                                         
                        }
                }
END             {for (c=1; c<=MX; c++)
                                {s=SYL[c]
                                 print s 
                                 printf "Init:  %3d\tExample: %s\n", INIT[s], substr (EXI[s],2)
                                 printf "Mid:   %3d\tExample: %s\n", MID[s],  substr (EXM[s],2)
                                 printf "Fin:   %3d\tExample: %s\n", FIN[s],  substr (EXF[s],2)
                                 printf "Alone: %3d", STDAL[s]
                                        if (STDAL[s] > 0) printf "\tExample: %s", s
                                        printf "\n"
                                }
                }

' syllables~ FS="=" FREQMX=10 corpus
a
Init:    4    Example: act=akt,approach=eproch,alabama=asdfjg,already=alredi
Mid:     3    Example: ball=ball,alabama=asdfjg
Fin:     2    Example: beta=bita,alabama=asdfjg
Alone:   1    Example: a

would even cover the case of "alabama", where "a" occurs in initial, medial and final position within one word.
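
To see why the three independent tests are enough, here is the position logic on its own for a single word and a single cluster. Again only a sketch: count_positions is a made-up name, and the stand-alone case (word equals cluster) that the full script checks first is left out.

```shell
# Hypothetical demo: $1 = word, $2 = cluster
count_positions() {
    awk -v w="$1" -v s="$2" 'BEGIN {
        init = gsub("^" s, "@", w)      # at most one hit at the start of the word
        fin  = gsub(s "$", "@", w)      # at most one hit at the end of the word
        mid  = gsub(s, "@", w)          # whatever is still unmasked is word-internal
        print "init=" init, "mid=" mid, "fin=" fin
    }'
}

positions=$(count_positions alabama a)
printf '%s\n' "$positions"
```

For "alabama" and "a" this gives init=1 mid=2 fin=1; together with the single medial hit in "ball" that accounts for the Mid count of 3 shown above.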