Creating a syllable concordance


 
# 1  
Old 05-14-2011

Hello,
I have two files. The first contains specific syllables of a language (Hindi), and the second contains a large database from which these syllables have been culled.
The syllable file has one Hindi syllable per line,
and the corpus file has one entry per line: the word in English and its Hindi equivalent, with an equals sign (=) as the delimiter.
What I am trying to get is a structure where each syllable is listed with a corresponding example from the corpus file.
Basically this amounts to a concordance of syllables. I tried to grep from the files and get the results, but the output is too voluminous and the search is pretty slow.
I would really appreciate it if a script in AWK or Perl could do the job.
I work in Windows under DOS, so piping is not available to me with AWK.
A pseudo-data file (in English) is provided as a zip.
Many thanks in advance for the help.
# 2  
Old 05-14-2011
Hi,

I'm not sure of your required format however the Perl example below should provide the necessary structures.

It loads the entire corpus into a hash in memory. This is expensive with a large corpus; however, it need only be done once, and each lookup afterwards is very fast indeed.

The script sets the input record separator ($/) to "\r\n" because the files are DOS files. If you are running it on the DOS machine itself, you needn't set $/, as it defaults to the appropriate value for the current environment; remove that line there.
Code:
#! /usr/bin/perl

use strict;
use warnings;

my %corpus;  # Maps the text before '=' on each corpus line to the text after it
$/ = "\r\n"; # These files are DOS files read on Unix; remove this line when running under DOS/Windows
open(my $corpus, '<', 'Output') or die "Could not open Output file: $!";
while (<$corpus>) {
   chomp;
   if (length $_) { # Skip empty lines
      my ($syllable, $example) = split(/=/, $_, 2); # Split on the first '=' only
      $corpus{$syllable} = $example;                # and store in the global hash
   }
}
close $corpus;
open(my $syllable, '<', 'Syllables') or die "Could not open Syllables file: $!";
while (<$syllable>) {
   chomp;
   print $corpus{$_} ? "$_ has the example $corpus{$_} in the provided corpus\n" : "I do not have an example for $_\n";
}
close $syllable;


Last edited by Skrynesaver; 05-14-2011 at 03:26 AM.. Reason: added check for existence of entry in corpus file
# 3  
Old 05-14-2011
Many thanks for the kind help. Sorry for the delay in responding, but my broadband server was down.
At the outset, I am sorry, I forgot to zip the corpus.

The sample corpus is:
cat bracken amaze (one word on each line)

The syllables were:
ca bra maz (one on each line)

I ran the program and it gave the following output
I do not have an example for ca
bra
maz
Instead of the expected output:
ca=cat bra=bracken maz=amaze (one word on each line)
I tried with and without the $/ line, but in both cases the result was the same.
Your help would be really appreciated.
# 4  
Old 05-14-2011
Ah, I misunderstood your intent: I read the intended output as the corpus data.

Try the following. It doesn't create an index (though building one would be an interesting project, hmm); it simply scans the corpus for each syllable and stops at the first word that contains it.
Code:
#! /usr/bin/perl

use strict;   # These two lines save you endless trouble 
use warnings; # without them typos and such errors get missed

open(my $corpus_file, '<', 'Corpus') or die "Could not open Corpus file: $!"; # Test corpus, one word per line
$/ = "\r\n"; # Again with the DOS files; remove this line when running under DOS/Windows
chomp(my @corpus = (<$corpus_file>));  # Load the corpus file into an array for faster access
close $corpus_file;
open(my $syllables_file, '<', 'Syllables') or die "Could not open Syllables file: $!";
while (<$syllables_file>) {
    chomp(my $syllable = $_);
    next unless length $syllable; # Skip empty lines, which would match every word
    for my $word (@corpus) {
        if ($word =~ /\Q$syllable\E/) {  # \Q...\E matches the syllable literally, not as a pattern
            print "$syllable=$word\n";
            last; # Stop processing the array of words as we have an example
        }
    }
}
close $syllables_file;
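
On the index idea mentioned above, here is a minimal sketch, assuming both the syllable list and the corpus fit in memory. The helper name build_index is hypothetical and not part of the script above, and the sample words are hard-coded purely for illustration. It walks each corpus word once, records the first word seen for every wanted syllable, and then answers each lookup from the hash instead of rescanning the corpus.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not from the thread): returns a hash
# reference mapping each wanted syllable to the first corpus word containing it.
sub build_index {
    my ($syllables, $words) = @_;
    my %wanted = map { $_ => 1 } @$syllables;

    # Only substrings up to the longest syllable need to be examined.
    my $max_len = 0;
    for my $s (@$syllables) {
        $max_len = length($s) if length($s) > $max_len;
    }

    my %index;
    for my $word (@$words) {
        my $word_len = length($word);
        for my $start (0 .. $word_len - 1) {
            for my $len (1 .. $max_len) {
                last if $start + $len > $word_len;
                my $piece = substr($word, $start, $len);
                # Keep the first example only, and only for wanted syllables.
                $index{$piece} = $word
                    if $wanted{$piece} && !exists $index{$piece};
            }
        }
    }
    return \%index;
}

# Sample data taken from the thread, hard-coded for the sketch.
my @corpus    = qw(cat bracken amaze);
my @syllables = qw(ca bra maz chai);
my $index     = build_index(\@syllables, \@corpus);

for my $syllable (@syllables) {
    if (exists $index->{$syllable}) {
        print "$syllable=$index->{$syllable}\n";
    } else {
        print "$syllable wasn't matched in the supplied corpus\n";
    }
}
```

Whether this beats the nested loop depends on the data: the index costs one pass over every substring of every word, so it should pay off mainly when the syllable list is long.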

# 5  
Old 05-14-2011
Hello,
Am I still doing something wrong?
I used perl at the command line:

perl conc.pl corpus syllables

where corpus is the data from which syllables have to be found
syllables is the file which contains the syllables.
I even tried reversing the command-line order, but got no output at all.
Sorry for the hassle. I walked through the code and it should spew out the syllables. Is the command line wrong?
Many thanks
# 6  
Old 05-14-2011
Ahh, come on now, do some work here :)

The supplied code doesn't read the command line; it opens two files by name, 'Corpus' and 'Syllables'. You could change these to
Code:
 $ARGV[0]

and
Code:
 $ARGV[1]

if you wish to supply the file names on the command line.
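
A minimal sketch of that change, assuming the corpus file is given first and the syllables file second (as in `perl conc.pl corpus syllables`). The helpers read_lines and first_match are hypothetical names introduced only for this illustration. Line endings are stripped with a substitution instead of setting $/, so the same script should cope with both DOS and Unix files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read a file into a list of non-empty lines,
# removing either a DOS (\r\n) or a Unix (\n) line ending.
sub read_lines {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Could not open $path: $!";
    my @lines = <$fh>;
    close $fh;
    s/\r?\n\z// for @lines;
    return grep { length } @lines;
}

# Hypothetical helper: return the first word containing the syllable as a
# literal substring, or nothing if no word contains it.
sub first_match {
    my ($syllable, $words) = @_;
    for my $word (@$words) {
        return $word if index($word, $syllable) >= 0;
    }
    return;
}

sub main {
    my ($corpus_path, $syllables_path) = @_;
    my @corpus = read_lines($corpus_path);
    for my $syllable (read_lines($syllables_path)) {
        my $word = first_match($syllable, \@corpus);
        if (defined $word) {
            print "$syllable=$word\n";
        } else {
            print "$syllable wasn't matched in the supplied corpus\n";
        }
    }
}

if (@ARGV == 2) {
    main(@ARGV);    # $ARGV[0] is the corpus file, $ARGV[1] the syllables file
} else {
    warn "Usage: perl conc.pl corpus syllables\n";
}
```

Using index() rather than a regular expression avoids surprises if a syllable happens to contain a character that is special in patterns.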
Code:
cat Corpus;echo '_____________';cat Syllables;echo '_____________'; perl nested_loop.pl;echo '_____________';cat nested_loop.pl 
cat
bracken
amaze
_____________
ca
bra
maz
chai
_____________
ca=cat
bra=bracken
maz=amaze
chai wasn't matched in the supplied corpus
_____________
#! /usr/bin/perl

use strict;   # These two lines save you endless trouble 
use warnings; # without them typos and such errors get missed

open (my $corpus_file, '<', 'Corpus') or die "Could not open Corpus file: $!"; # Created a test corpus with just the contained lines
$/="\r\n"; # Again with the DOS files
chomp(my @corpus = (<$corpus_file>));  # Load the corpus file into an array for faster access
open (my $syllables_file, '<', 'Syllables') or die "Could not open Syllables file: $!";
while(<$syllables_file>){
    chomp(my $syllable = $_);
    my $found = 0;
    for my $word (@corpus){
        if ( $word =~ /\Q$syllable\E/){  # match the syllable literally rather than as a pattern
            print "$syllable=$word\n";
            $found = 1;
            last; #Stop processing the array of words as we have an example
        }
    }
    print "$syllable wasn't matched in the supplied corpus\n" if (! $found);
}

# 7  
Old 05-14-2011
Sorry,
I used to work in C and Java and am still learning Perl and Awk. The programs are faster and do much better work than a long program in C. Guess I still have a long way to go in Perl.