Modifying an awk script for syllable splitting


 
# 1  
Old 03-24-2016

I found this syllable splitter written in awk; the code is given below. The script cuts words and names into syllables, but it fails when a word contains two consonants that form a single unit, such as sh or ph. Two examples:
Code:
ashford
raphael

The output is as under:
Code:
ashford	as-hford	2	 VC-CCVrC
raphael	rap-ha-el	3	 rVC-CV-VC

instead of
Code:
ashford	ash-ford	2	 VCC-CVrC
raphael	ra-pha-el	3	 rV-CCV-VC

How do I modify the code so that sh or ph is treated as a single unit?
I contacted the authors, but they have not responded; the code is old, and maybe they see no merit in changing it.
A single example of the modification, for either ph or sh, would help. I can then extend the code to all other such combinations.
Out of respect for the authors I have removed their names from the script.
Many thanks
Awk script follows
Code:
# This script reads a tab-separated file and syllabifies the column pointed to by the variable 'phons' (or the first column, by default).
# Usage: gawk -f syll.gk fn > fn.out

BEGIN {
  FS="\t"; 
  OFS="\t";
  
  if (code=="brulex") {
    V="[aiouyîâêôû^eEéAO_]"; # vowels
    C="[ptkbdgfs/vzjmn/shN£]"; # consonants except liquids & semivowels
    C1="[pkbgfs/vzj]";
    L="[lR]"; # liquids 
    Y="[ïü\377]"; # semi-vowels \377 stands for y-umlaut
    X="[ptkbdgfs/vzjmnN£xlRïü\377]"; # all consonants 
  } else { # code == LAIPTTS
    V="[iYeE2591a@oO§uy*]";   # Vowels
    C="[pbmfvtdnNkgszxSZGh/sh]";  # Consonants except liquids & semivowels
    C1="[pkbgfsSvzZ]";
    L="[lR]"; # liquids
    Y="[j8w]"; # semi-vowels
    X="[pbmfvtdnNkgszSZGlRrhxGj8w]";   # all consonants, including semivowels
  }
  if (phons==0) phons=1;
}

{
 a=$phons;
 n=1
}

{
   while (i= match (a, V V)) {
    a=substr(a,1,i) "-" substr(a,i+1,length(a)); n++; }

  while (i= match(a, V X V)) { 
    a=substr(a,1,i) "-" substr(a,i+1,length(a)); n++}

  while (i=match(a, V Y Y V)) {
    a=substr(a,1,i+1) "-" substr(a,i+2, length(a)); n++} 

  while (i=match(a, V C Y V)) {
    a=substr(a,1,i) "-" substr(a,i+1, length(a)); n++} 

  while (i=match(a, V L Y V)) {
    a=substr(a,1,i) "-" substr(a,i+1, length(a)); n++}

  while (i=match(a, V "[td]R" V)) {
    a=substr(a,1,i) "-" substr(a,i+1, length(a)); n++} 

  while (i=match(a, V "[td]R" Y V)) {
    a=substr(a,1,i) "-" substr(a,i+1, length(a)); n++} 

  while (i=match(a, V C1 L V)) {
    a=substr(a,1,i) "-" substr (a,i+1,length(a)); n++}

  while (i=match(a, V X X V)) {
    a=substr(a,1,i+1) "-" substr(a,i+2, length(a)); n++}

  while (i= match(a, V X X X V)) {
    a=substr(a,1,i+1) "-" substr(a,i+2,length(a)); n++}

  while (i=match(a, V X X X X V)) {
    a=substr(a,1,i+1) "-" substr(a,i+2,length(a)); n++}

  while (i=match(a, V X X X X X V)) {
    a=substr(a,1,i+1) "-" substr(a,i+2,length(a)); n++}

# suppress the final schwa (^) in some multisyllabic words 
# notr^ -> notR
# ar-bR^   =>  aRbR
  b=gensub(/-([^-]+)\^$/,"\\1",1,a) ;  
  if (b!=a) { # there is a schwa to delete
    a=b; 
    $phons=substr($phons,1,length($phons)-1);
    n--;
      }
# same thing when the schwa is '*'
  b=gensub(/-([^-]+)\*$/,"\\1",1,a) ;  
  if (b!=a) { # there is a schwa to delete
    a=b; 
    $phons=substr($phons,1,length($phons)-1);
    n--;
      }


# compute the CVY skeleton
  sk= " ";
  for (i=1;i<=length(a);i++) {
    ph=substr(a,i,1);
    if (ph~V) sk=sk"V";
    else if ((ph~C)||(ph~L)) sk=sk"C";
    else if (ph~Y) sk=sk"Y";
    else sk=sk ph;
  }
}

{ print $0,a,n,sk }
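
To see why the digraphs get cut, the core of the script can be reduced to a few rules. The sketch below is an illustration under stated assumptions, not the authors' code: the classes V and X are simplified to plain lowercase letters (the original uses phonetic codes), only four of the rules are kept, and the function name split_demo is made up.

```sh
# split_demo: hypothetical helper; reads one lowercase word per line on stdin
split_demo() {
  awk '
  BEGIN {
    V = "[aeiou]"                    # assumption: plain-letter vowels
    X = "[bcdfghjklmnpqrstvwxyz]"    # assumption: every other letter is a consonant
  }
  {
    a = $0; n = 1
    # match() returns the position of the first vowel of the pattern (0 if none);
    # the hyphen goes right after that vowel (i) or after the next consonant (i + 1)
    while (i = match(a, V V))       { a = substr(a, 1, i)     "-" substr(a, i + 1); n++ }
    while (i = match(a, V X V))     { a = substr(a, 1, i)     "-" substr(a, i + 1); n++ }
    while (i = match(a, V X X V))   { a = substr(a, 1, i + 1) "-" substr(a, i + 2); n++ }
    while (i = match(a, V X X X V)) { a = substr(a, 1, i + 1) "-" substr(a, i + 2); n++ }
    print $0, a, n
  }'
}
```

With these simplifications, `printf 'ashford\nraphael\n' | split_demo` prints as-hford with a count of 2 and rap-ha-el with a count of 3, the same splits as the full script: nothing tells the rules that sh and ph belong together.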

# 2  
Old 03-24-2016
Well, THAT is some piece o' code! While just getting a remote idea of how it works and what it does, and not pretending this will be a generally correct solution, adding
Code:
  gsub (/sh/, "&-",a)
  gsub (/ph/, "-&",a)

just above the first while (i = match... will result in
Code:
ashford    ash-ford    1     VCC-CVrC
raphael    ra-pha-el    2     rV-CCV-VC

---------- Post updated at 15:53 ---------- Previous update was at 15:36 ----------

And this will correct for the syllable count:
Code:
  n+=gsub (/sh/, "&-",a)
  n+=gsub (/ph/, "-&",a)

resulting in
Code:
ashford    ash-ford    2     VCC-CVrC
raphael    ra-pha-el    3     rV-CCV-VC
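
The count comes out right because gsub returns the number of substitutions it made, so each inserted hyphen adds one syllable. A minimal sketch of just that step, outside the full script (the function name mark_digraphs is made up for the illustration):

```sh
# mark_digraphs: hypothetical helper; reads one word per line on stdin
mark_digraphs() {
  awk '{
    a = $0; n = 1
    n += gsub(/sh/, "&-", a)   # hyphen after every sh; & stands for the matched text
    n += gsub(/ph/, "-&", a)   # hyphen before every ph
    print a, n
  }'
}
```

`echo ashford | mark_digraphs` prints `ash-ford 2`. A word that ends in sh or begins with ph would get a dangling hyphen and an extra count, so those positions probably need their own check.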

---------- Post updated at 15:57 ---------- Previous update was at 15:53 ----------

However, with the overall algorithm, YMMV:
Code:
reel    re-el    2     rV-VC
real    re-al    2     rV-VC
cooperation    co-o-pe-ra-ti-on    6     cV-V-CV-rV-CV-VC
Liverpool    Li-ver-po-ol    4     LV-CVr-CV-VC

---------- Post updated at 16:07 ---------- Previous update was at 15:57 ----------

And I'm not sure you'll like the way it now hyphenates shepherd:
Code:
shepherd    sh-e-pherd    3     CC-V-CCVrC
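
One possible way around this (a different technique from the gsub lines above, and only a sketch) is to hide each digraph behind a single placeholder character before the rules run and to restore it afterwards, so the rules see sh or ph as one consonant. Assumptions: the input is plain lowercase, the placeholders S and F are otherwise unused (in the LAIPTTS table S is already a phoneme code, so other characters would be needed there), the classes V and X are simplified to plain letters, and split_protected is a made-up name.

```sh
# split_protected: hypothetical helper; reads one lowercase word per line on stdin
split_protected() {
  awk '
  BEGIN {
    V = "[aeiou]"                      # assumption: plain-letter vowels
    X = "[bcdfghjklmnpqrstvwxyzSF]"    # consonants plus the placeholders S and F
  }
  {
    a = $0; n = 1
    gsub(/sh/, "S", a)                 # hide the digraphs behind one character each
    gsub(/ph/, "F", a)
    while (i = match(a, V V))       { a = substr(a, 1, i)     "-" substr(a, i + 1); n++ }
    while (i = match(a, V X V))     { a = substr(a, 1, i)     "-" substr(a, i + 1); n++ }
    while (i = match(a, V X X V))   { a = substr(a, 1, i + 1) "-" substr(a, i + 2); n++ }
    while (i = match(a, V X X X V)) { a = substr(a, 1, i + 1) "-" substr(a, i + 2); n++ }
    gsub(/S/, "sh", a)                 # restore the digraphs after splitting
    gsub(/F/, "ph", a)
    print $0, a, n
  }'
}
```

Under these assumptions ashford comes out as ash-ford (2), raphael as ra-pha-el (3), and shepherd as she-pherd (2), without the stray sh-e break. Whether ph should count as one unit in shepherd at all is a separate question that no pattern rule can answer.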

This User Gave Thanks to RudiC For This Post:
# 3  
Old 03-24-2016
Thanks for the help. I agree that
Code:
Shepherd    sh-e-pherd

gets tagged incorrectly, but at least the pointers you gave allow for a better split.
# 4  
Old 03-24-2016
Hi.

* This is a file of consonant combinations that I occasionally use:
Code:
*       This file contains legitimate letter combinations.
*       I should probably add vowel combinations, ie, ey, etc.
*
*       First section, beginning of word.
pt ps
-
*       Second section, beginning and middle.
bl br
ch cl cr
dr
fl fr
gl gn gr
kl kr
pl pr pt
qu
sc sh sl sm sn sp sr st str sw
th tr
wh
-
*       Third section, middle and end word 
bj bs
ct
dg ds
ft
gh
ks
lch lk ls lv
mp ms
nd ng ns nt
ps
rch rk rg rs rt
tch ts
-
*       Fourth section, end of word.
dst dth ght nth rst
-
*       Fifth section, doubled letters.
bb cc dd gg ll mm nn pp ss tt

I use these to compose English-like words, but they may be useful for splitting as well. There might be others that could be added, e.g. "ff".
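
As a rough sketch of how such a file could feed a splitter, the awk below loads the combinations into an array and reports which ones occur in each word, trying three-letter combinations before two-letter ones. It assumes the file layout shown above (comment lines start with *, sections are separated by a lone -); the function name find_clusters and the idea of passing the file as the first argument are made up for the illustration.

```sh
# find_clusters: hypothetical helper
#   $1    path to the combination file (must not be empty)
#   stdin one word per line
find_clusters() {
  awk '
  NR == FNR {                          # first input: the combination file
    if ($0 ~ /^[*-]/) next             # skip comment lines and section separators
    for (k = 1; k <= NF; k++) known[$k] = 1
    next
  }
  {                                    # second input: the words
    out = ""
    for (i = 1; i <= length($0); i++) {
      for (len = 3; len >= 2; len--) { # longest candidates first
        s = substr($0, i, len)
        if (length(s) == len && (s in known)) {
          out = out (out == "" ? "" : " ") s
          i += len - 1                 # do not report overlapping combinations
          break
        }
      }
    }
    print $0 ": " out
  }' "$1" -
}
```

With the list above saved to a file, `echo ashford | find_clusters thatfile` should report sh as the only known combination in the word; a splitter could then refuse to put a hyphen inside any combination found this way.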

See also the results of a web search for "split by syllable", such as the page "Syllable Rules: Divide Into Syllables".

Best wishes ... cheers, drl
This User Gave Thanks to drl For This Post:
# 5  
Old 03-24-2016
Many thanks for the useful pointers. I am trying to divide words in Indian languages which are romanised into English. These follow slightly different rules. But some of the rules you have provided apply to the transliterations also. The rules you have provided have given me a better insight into how the splitter should work.