Modifying an awk script for syllable splitting

03-24-2016


I found this syllable splitter written in awk; the code is given below. The script cuts words and names into syllables. However, it fails when a word contains two consonants that belong together as a single unit, such as sh or ph. Two examples are given below:

Code:

ashford
raphael

The output is:

Code:

ashford	as-hford	2	 VC-CCVrC
raphael	rap-ha-el	3	 rVC-CV-VC

instead of

Code:

ashford	ash-ford	2	 VCC-CVrC
raphael	ra-pha-el	3	 rV-CCV-VC

How do I modify the code so that sh or ph is treated as a single unit and is never split?
I contacted the authors, but they have not responded; the code is old, and perhaps they see no merit in changing it.
A single example of the modification, for either ph or sh, would help. I can then extend it to all other such combinations.
Out of respect for the authors I have removed their names from the script.
Many thanks.
The awk script follows:

Code:

# This script reads a tab-separated file and syllabifies the column pointed to by the variable 'phons' (or the first column, by default).
# gawk -f syll.gk fn>fn.out

BEGIN {
  FS="\t"; 
  OFS="\t";
  
  if (code=="brulex") {
    V="[aiouy�����^eE�AO_]"; # vowels
    C="[ptkbdgfs/vzjmn/shN�]"; # consonants except liquids & semivowels
    C1="[pkbgfs/vzj]";
    L="[lR]"; # liquids 
    Y="[��\377]"; # semi-vowels \377 stands for y-umlaut
    X="[ptkbdgfs/vzjmnN�xlR��\377]"; # all consonants 
  } else { # code == "LAIPTTS"
    V="[iYeE2591a@oO�uy*]";   # Vowels
    C="[pbmfvtdnNkgszxSZGh/sh]";  # Consonants except liquids & semivowels
    C1="[pkbgfsSvzZ]";
    L="[lR]"; # liquids
    Y="[j8w]"; # semi-vowels
    X="[pbmfvtdnNkgszSZGlRrhxGj8w]";   # all consonants, including semivowels
  }
  if (phons==0) phons=1;
}

{
 a=$phons;
 n=1
}

{
   while (i= match (a, V V)) {
    a=substr(a,1,i) "-" substr(a,i+1,length(a)); n++; }

  while (i= match(a, V X V)) { 
    a=substr(a,1,i) "-" substr(a,i+1,length(a)); n++}

  while (i=match(a, V Y Y V)) {
    a=substr(a,1,i+1) "-" substr(a,i+2, length(a)); n++} 

  while (i=match(a, V C Y V)) {
    a=substr(a,1,i) "-" substr(a,i+1, length(a)); n++} 

  while (i=match(a, V L Y V)) {
    a=substr(a,1,i) "-" substr(a,i+1, length(a)); n++}

  while (i=match(a, V "[td]R" V)) {
    a=substr(a,1,i) "-" substr(a,i+1, length(a)); n++} 

  while (i=match(a, V "[td]R" Y V)) {
    a=substr(a,1,i) "-" substr(a,i+1, length(a)); n++} 

  while (i=match(a, V C1 L V)) {
    a=substr(a,1,i) "-" substr (a,i+1,length(a)); n++}

  while (i=match(a, V X X V)) {
    a=substr(a,1,i+1) "-" substr(a,i+2, length(a)); n++}

  while (i= match(a, V X X X V)) {
    a=substr(a,1,i+1) "-" substr(a,i+2,length(a)); n++}

  while (i=match(a, V X X X X V)) {
    a=substr(a,1,i+1) "-" substr(a,i+2,length(a)); n++}

  while (i=match(a, V X X X X X V)) {
    a=substr(a,1,i+1) "-" substr(a,i+2,length(a)); n++}

# suppress the final schwa (^) in some multisyllabic words 
# notr^ -> notR
# ar-bR^   =>  aRbR
  b=gensub(/-([^-]+)\^$/,"\\1",1,a) ;  
  if (b!=a) { # there is a schwa to delete
    a=b; 
    $phons=substr($phons,1,length($phons)-1);
    n--;
      }
# same thing when schwa='*'
  b=gensub(/-([^-]+)\*$/,"\\1",1,a) ;  
  if (b!=a) { # there is a schwa to delete
    a=b; 
    $phons=substr($phons,1,length($phons)-1);
    n--;
      }


# compute the CVY skeleton
  sk= " ";
  for (i=1;i<=length(a);i++) {
    ph=substr(a,i,1);
    if (ph~V) sk=sk"V";
    else if ((ph~C)||(ph~L)) sk=sk"C";
    else if (ph~Y) sk=sk"Y";
    else sk=sk ph;
  }
}

{ print $0,a,n,sk }
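
For readers trying to follow the script, every rule above has the same shape: match() returns the 1-based position of the leftmost match, a hyphen is inserted just after that position (or one character later for the cluster rules), and the loop repeats until the pattern no longer matches. A minimal sketch of the V X V rule in isolation is shown here. The vowel and consonant sets are simplified ASCII assumptions for illustration, not the phoneme classes of the script, and the function name vxv_split is hypothetical.

```shell
# Sketch: a single V-X-V rule, run on plain ASCII words.
# The letter classes below are assumptions for illustration only.
vxv_split() {
  awk '
  BEGIN {
    V = "[aeiou]"                      # assumed vowel set
    X = "[bcdfghjklmnpqrstvwxyz]"      # assumed consonant set
  }
  {
    a = $0
    # match() gives the index of the vowel that starts the V X V match;
    # inserting "-" right after it breaks that match, so the loop ends.
    while (i = match(a, V X V))
      a = substr(a, 1, i) "-" substr(a, i + 1)
    print $0 "\t" a
  }'
}

printf 'banana\nashford\n' | vxv_split
```

With this rule alone, banana becomes ba-na-na, while ashford is left unsplit because its consonants come in runs of two or more and are handled by the later V X X V style rules.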

gimley


03-24-2016


Well, THAT is some piece o' code! I have only a rough idea of how it works and what it does, and I don't pretend this is a generally correct solution, but adding

Code:

  gsub (/sh/, "&-",a)
  gsub (/ph/, "-&",a)

just above the first while (i = match... will result in

Code:

ashford    ash-ford    1     VCC-CVrC
raphael    ra-pha-el    2     rV-CCV-VC

---------- Post updated at 15:53 ---------- Previous update was at 15:36 ----------

And this will correct for the syllable count:

Code:

  n+=gsub (/sh/, "&-",a)
  n+=gsub (/ph/, "-&",a)

resulting in

Code:

ashford    ash-ford    2     VCC-CVrC
raphael    ra-pha-el    3     rV-CCV-VC

---------- Post updated at 15:57 ---------- Previous update was at 15:53 ----------

However, with the overall algorithm, YMMV:

Code:

reel    re-el    2     rV-VC
real    re-al    2     rV-VC
cooperation    co-o-pe-ra-ti-on    6     cV-V-CV-rV-CV-VC
Liverpool    Li-ver-po-ol    4     LV-CVr-CV-VC

---------- Post updated at 16:07 ---------- Previous update was at 15:57 ----------

And I'm not sure you will like the way it now hyphenates shepherd:

Code:

shepherd    sh-e-pherd    3     CC-V-CCVrC
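
One possible way around the shepherd problem is a different technique from the gsub hyphen insertion above: replace each digraph with a one-character placeholder before the rules run, let the rules treat it as an ordinary consonant, then restore it afterwards. The sketch below is only an illustration under stated assumptions: lowercase ASCII input, simplified V and X classes, only three of the original rules, and "S" and "F" as hypothetical placeholders that must not otherwise occur in the data.

```shell
# Sketch: protect digraphs with one-character placeholders before splitting.
# Assumptions: lowercase ASCII input; "S" and "F" are hypothetical
# placeholders for sh and ph; V and X are simplified letter classes.
split_protected() {
  awk '
  BEGIN {
    OFS = "\t"
    V = "[aeiou]"
    X = "[bcdfghjklmnpqrstvwxyzSF]"   # consonants plus the placeholders
  }
  {
    a = tolower($1); n = 1
    gsub(/sh/, "S", a)                # each digraph becomes one symbol
    gsub(/ph/, "F", a)

    # Same loop shape as the original rules: hyphenate, count, repeat.
    while (i = match(a, V V))     { a = substr(a, 1, i)     "-" substr(a, i + 1); n++ }
    while (i = match(a, V X V))   { a = substr(a, 1, i)     "-" substr(a, i + 1); n++ }
    while (i = match(a, V X X V)) { a = substr(a, 1, i + 1) "-" substr(a, i + 2); n++ }

    gsub(/S/, "sh", a)                # restore the digraphs
    gsub(/F/, "ph", a)
    print $1, a, n
  }'
}

printf 'ashford\nraphael\nshepherd\n' | split_protected
```

Under these assumptions the three words come out as ash-ford (2), ra-pha-el (3) and she-pherd (2). The last one keeps both letter pairs intact, although a dictionary gives shep-herd, because the p and h in shepherd belong to different syllables; words like that would still need an exception list.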


RudiC


03-24-2016


Thanks for the help. I agree that

Code:

Shepherd    sh-e-pherd

is split incorrectly.
But at least the pointers you gave allow for a better split.

gimley


03-24-2016


Hi.

This is a file of consonant combinations that I occasionally use:

Code:

*       This file contains legitimate letter combinations.
*       I should probably add vowel combinations, ie, ey, etc.
*
*       First section, beginning of word.
pt ps
-
*       Second section, beginning and middle.
bl br
ch cl cr
dr
fl fr
gl gn gr
kl kr
pl pr pt
qu
sc sh sl sm sn sp sr st str sw
th tr
wh
-
*       Third section, middle and end word 
bj bs
ct
dg ds
ft
gh
ks
lch lk ls lv
mp ms
nd ng ns nt
ps
rch rk rg rs rt
tch ts
-
*       Fourth section, end of word.
dst dth ght nth rst
-
*       Fifth section, doubled letters.
bb cc dd gg ll mm nn pp ss tt

I use these to compose English-like words, but they may be useful for splitting as well. There might be others that could be added, e.g. "ff".
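
As a rough illustration of how such a list might drive a splitter, the sketch below loads a subset of the "beginning and middle" combinations and, for each run of consonants between two vowels, keeps the longest listed cluster at the end of the run together with the following vowel. This "largest listed onset" heuristic is an assumption of the sketch, not something proposed in the list itself, and the function name onset_split, the subset chosen, and the plain [aeiou] vowel set are all hypothetical simplifications.

```shell
# Sketch: use a list of legal clusters to decide where a consonant run is cut.
# Assumptions: lowercase ASCII input, vowels are just a e i o u.
onset_split() {
  awk '
  BEGIN {
    # A subset of the "beginning and middle" section of the list above.
    n = split("bl br ch cl cr dr fl fr gl gr pl pr sh sl sm sn sp st str sw th tr wh", t, " ")
    for (k = 1; k <= n; k++) onset[t[k]] = 1
  }
  {
    w = $1; out = ""
    while (match(w, /[aeiou][^aeiou]+[aeiou]/)) {
      run = substr(w, RSTART + 1, RLENGTH - 2)    # consonants between two vowels
      # Longest suffix of the run that is a listed cluster; else one consonant.
      keep = 1
      for (len = length(run); len > 1; len--)
        if ((substr(run, length(run) - len + 1)) in onset) { keep = len; break }
      cut = RSTART + length(run) - keep            # last char of the left syllable
      out = out substr(w, 1, cut) "-"
      w = substr(w, cut + 1)
    }
    print $1 "\t" out w
  }'
}

printf 'ashford\nchildren\nashtray\n' | onset_split
```

With this subset, ashford splits as ash-ford, children as chil-dren, and ashtray as ash-tray. Adjacent vowels are left together, and a digraph such as ph would have to be added to the list before raphael splits as intended.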

See also the results of a web search for "split by syllable", for example "Syllable Rules: Divide Into Syllables".

Best wishes ... cheers, drl


drl


03-24-2016


Many thanks for the useful pointers. I am trying to split words from Indian languages that have been romanised. These follow slightly different rules, but some of the rules you have provided apply to the transliterations as well, and they have given me a better insight into how the splitter should work.

gimley

