Creating a syllable concordance Post: 302522289


Post 302522289 by Skrynesaver on Saturday 14th of May 2011 07:20:42 AM (Top Forums, Shell Programming and Scripting)

05-14-2011


Ah, I misunderstood your intent: I read the intended output as the corpus data.

Try the following. It prints the first matching corpus word for each syllable; it doesn't create an index (though that would be an interesting project, hmm).

Code:

#!/usr/bin/perl

use strict;   # These two lines save you endless trouble:
use warnings; # without them, typos and similar errors go unnoticed

$/ = "\r\n"; # The input files are DOS-formatted, so records end in CRLF

# A test corpus built from just the lines given
open(my $corpus_file, '<', 'Corpus') or die "Cannot open Corpus: $!";
chomp(my @corpus = <$corpus_file>);  # Load the corpus file into an array for faster access
close $corpus_file;

open(my $syllables_file, '<', 'Syllables') or die "Cannot open Syllables: $!";
while (my $syllable = <$syllables_file>) {
    chomp $syllable;
    next unless length $syllable;  # A blank line would match every word
    for my $word (@corpus) {
        if ($word =~ /\Q$syllable\E/) {  # \Q...\E matches the syllable literally, not as a pattern
            print "$syllable=$word\n";
            last;  # Stop scanning the words, as we have an example
        }
    }
}
close $syllables_file;
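
For the index idea mentioned above, here is a rough sketch, not a tested solution for your data. It collects every corpus word containing each syllable instead of stopping at the first. The helper name build_index and the sample words are made up for illustration; in practice you would fill the two arrays from the Corpus and Syllables files as in the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): map each syllable
# to the list of all corpus words that contain it.
sub build_index {
    my ($syllables, $corpus) = @_;
    my %index;
    for my $syllable (@$syllables) {
        next unless length $syllable;    # skip blank entries
        # \Q...\E makes the syllable match literally, not as a pattern
        $index{$syllable} = [ grep { /\Q$syllable\E/ } @$corpus ];
    }
    return \%index;
}

# Made-up sample data, standing in for the contents of the two files
my @corpus    = qw(banana bandana cabana tomato);
my @syllables = qw(ba na to xy);

my $index = build_index(\@syllables, \@corpus);

# One line per syllable, followed by all the words it occurs in
for my $syllable (sort keys %$index) {
    print "$syllable=", join(',', @{ $index->{$syllable} }), "\n";
}
```

A syllable with no match still gets an entry with an empty list, so you can spot syllables the corpus doesn't cover.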



