Sponsored Content
Top Forums Shell Programming and Scripting Creating a syllable concordance Post 302522289 by Skrynesaver on Saturday 14th of May 2011 07:20:42 AM
Old 05-14-2011
Ah, I misunderstood your intent, I read the intended output as the corpus data.

Try the following, it doesn't create an index (though that would be an interesting project to do so, hmmn)
Code:
#! /usr/bin/perl

use strict;   # These two lines save you endless trouble 
use warnings; # without them typos and such errors get missed

open (my $corpus_file, '<', 'Corpus'); # Created a test corpus with just the contained lines
$/="\r\n"; # Again with the DOS files
chomp(my @corpus = (<$corpus_file>));  # Load the corpus file into an array for faster access
open (my $syllables_file, '<', 'Syllables');
while(<$syllables_file>){
    chomp(my $syllable = $_);
    for my $word (@corpus){
        if ( $word =~ /$syllable/){  # use a regular expression to find a match for the syllable
            print "$syllable=$word\n";
            last; #Stop processing the array of words as we have an example
        }
    }
}

 

5 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

CREATING A SYLLABLE CONCORDANCE WITH POSITIONAL VARIANTS

Hello, Some time back I had posted a request for a syllable concordance in which if a syllable was provided in a file, the program would extract a word from a file entitled "Corpus" matching that syllable. The program was The following script was provided which did the job and for which I am... (3 Replies)
Discussion started by: gimley
3 Replies

2. Shell Programming and Scripting

Syllable splitter in Perl

Hello, I am a relative newbie and want to split Names in English into syllables. Does anyone know of a perl script which does that. Since my main area is linguistics, I would be happy to add rules to it and post the perl script back for other users. I tried the CPan perl modules but they don't... (6 Replies)
Discussion started by: gimley
6 Replies

3. Shell Programming and Scripting

Writing a clustering concordance for a Perso-Arabic script

I am working on a database of a language using Arabic Script. One of the major issues is that the shape of the characters changes according to their initial, medial or final positioning. Another major issue is that of the clustering of vowels within the word: the clustering changes totally the... (9 Replies)
Discussion started by: gimley
9 Replies

4. Shell Programming and Scripting

Modifying an awk script for syllable splitting

I have found this syllable splitter in awk. The code is given below. Basically the script cuts words and names into syllables. However it fails when the word contains 2 consonants which constitute a single syllable. An example is given below ashford raphael The output is as under: ... (4 Replies)
Discussion started by: gimley
4 Replies

5. Shell Programming and Scripting

Find Syllable count mismatch

Hello, I have written a syllable splitter for Pseudo English and Indic. I have a large database with the following structure Syllables in Pseudo English delimited by |=Syllables in Devanagari delimited by | The tool produces syllables in both scripts. An example is given below: ... (2 Replies)
Discussion started by: gimley
2 Replies
MMSEG(1)						User Contributed Perl Documentation						  MMSEG(1)

NAME
mmseg - maximum matching segment Chinese text. SYNOPSIS
mmseg -d dict_file [option]... [corpus_file]... DESCRIPTION
mmseg is a tool for segmenting Chinese text into words using maximum matching algorithm. mmseg segments corpus_file, or standard input if no filename is specified, and write the segmented result to standard output. OPTIONS
-d dict_file Use dict_file as lexicon. A default lexicon can be found at /usr/share/sunpinyin-slm/dict.utf8. -f,--format (text|bin) Output Format, can be 'text' or 'bin'. default 'bin'. Normally, in text mode, word text are output, while in binary mode, binary short integer of the word-ids are written to stdout. -s, --stok STOK_ID Sentence token id. Default 10. It will be written to output in binary mode after every sentence. -i, --show-id Show Id info. Under text output format mode, attach id after known words. If under binary mode, print id(s) in text. -a, --ambiguious-id AMBI-ID Ambiguious means ABC => A BC or AB C. If specified (AMBI-ID != 0), The sequence ABC will not be segmented, in binary mode, the AMBI-ID is written out; in text mode, "<ambi>ABC</ambi>" will be output. Default is 0. NOTES
Under binary mode, consecutive id of 0 are merged into one 0. Under text mode, no space are inserted between unknown-words. AUTHOR
Originally written by Phill.Zhang <phill.zhang@sun.com>. Currently maintained by Kov.Chai <tchaikov@gmail.com>. SEE ALSO
slmseg(1), ids2ngram (1). perl v5.14.2 2012-06-09 MMSEG(1)
All times are GMT -4. The time now is 06:49 PM.
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.
Privacy Policy