Creating a syllable concordance

05-14-2011

Registered User

303, 4

Join Date: Feb 2011

Last Activity: 3 March 2020, 10:05 PM EST

Posts: 303

Thanks Given: 145

Thanked 4 Times in 4 Posts

Hello,
I have two files. The first contains specific syllables of a language (Hindi), and the second is a large database (the corpus) from which these syllables have been culled.
The syllable file has one Hindi syllable per line.
The corpus file has one entry per line: the word in English and its Hindi equivalent, separated by an equals sign (=).
What I am trying to get is a listing in which each syllable is given together with a corresponding example from the corpus file.
Basically, this amounts to a concordance of syllables. I tried to grep from the file and got results, but the output is too voluminous and the search is pretty slow.
I would really appreciate it if a script in AWK or Perl could do the job.
I work on Windows under DOS, so piping is not available to me with AWK.
A pseudo-data file (in English) is provided as a zip.
Many thanks in advance for the help.

Data.zip (254 Bytes)

gimley

05-14-2011

Registered User

939, 225

Join Date: Mar 2011

Last Activity: 8 May 2020, 3:48 AM EDT

Location: Éire

Posts: 939

Thanks Given: 27

Thanked 225 Times in 219 Posts

Hi,

I'm not sure of your required format; however, the Perl example below should provide the necessary structures.

It loads the entire corpus into a hash. This is expensive with a large corpus, but it need only be done once, and each lookup afterwards is very fast indeed.

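If you are running this on a DOS machine, you needn't set the input record separator ($/), as it defaults to the appropriate value for the current environment. In that case, delete the $/ = "\r\n" line from the code: a native Windows Perl already converts CRLF to "\n" when it reads a text file, so "\r\n" would never be found and each file would be read as one single record.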

Code:

#! /usr/bin/perl

use strict;
use warnings;

my %corpus;  # Maps the text before '=' to the text after it
$/ = "\r\n"; # DOS files read on Unix; remove this line when running on Windows
open(my $corpus, '<', 'Output') or die "Could not open Output file: $!";
while (<$corpus>) {
   chomp;
   next unless length;  # Skip empty lines
   my ($syllable, $example) = split /=/, $_, 2;  # Split at the first '=' only
   $corpus{$syllable} = $example;                # and store in the hash
}
close $corpus;
open(my $syllable, '<', 'Syllables') or die "Could not open Syllables file: $!";
while (<$syllable>) {
   chomp;
   # defined() rather than plain truth, so an example of "0" still counts
   print defined $corpus{$_} ? "$_ has the example $corpus{$_} in the provided corpus\n" : "I do not have an example for $_\n";
}
close $syllable;

Last edited by Skrynesaver; 05-14-2011 at 03:26 AM.. Reason: added check for existence of entry in corpus file

Skrynesaver

05-14-2011

Many thanks for the kind help. Sorry for the delay in responding, but my broadband server was down.
At the outset, I am sorry, I forgot to zip the corpus.

The sample corpus is:
cat bracken amaze (one word on each line)

The syllables were:
ca bra maz (one on each line)

I ran the program and it gave the following output:
I do not have an example for ca
bra
maz
instead of the expected output:
ca=cat bra=bracken maz=amaze (one word on each line)
I tried with and without the $/ line, but the result was the same in both cases.
Your help would be really appreciated.

gimley

05-14-2011

Ah, I misunderstood your intent: I read the intended output as the corpus data.

Try the following. It doesn't create an index (though doing so would be an interesting project, hmm).

Code:

#! /usr/bin/perl

use strict;   # These two lines save you endless trouble
use warnings; # without them typos and such errors get missed

$/ = "\r\n"; # DOS files read on Unix; remove this line when running on Windows
open(my $corpus_file, '<', 'Corpus') or die "Could not open Corpus: $!"; # Test corpus, one word per line
chomp(my @corpus = <$corpus_file>);  # Load the corpus file into an array for faster access
close $corpus_file;
open(my $syllables_file, '<', 'Syllables') or die "Could not open Syllables: $!";
while (<$syllables_file>) {
    chomp(my $syllable = $_);
    next unless length $syllable;  # Skip blank lines
    for my $word (@corpus) {
        if ($word =~ /\Q$syllable\E/) {  # \Q...\E makes the regex match the syllable literally
            print "$syllable=$word\n";
            last; # Stop processing the array of words as we have an example
        }
    }
}
close $syllables_file;
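
On the index idea mentioned above, here is one rough sketch of how it might look, not a tested solution for the real corpus. It reads the corpus only once and records, for each wanted syllable, the first word that contains it. The sub name build_index and the in-memory sample lists are made up for illustration, and for Hindi data it assumes the text has been decoded as UTF-8 so that substr and length count characters rather than bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: map each syllable to the first corpus word containing it.
# Takes two array references and returns a hash reference.
sub build_index {
    my ($syllables, $corpus) = @_;
    my %wanted  = map { $_ => 1 } @$syllables;          # set of syllables to look for
    my %lengths = map { length($_) => 1 } @$syllables;  # distinct syllable lengths
    my %index;
    for my $word (@$corpus) {
        for my $len (keys %lengths) {
            # Slide a window of $len characters across the word
            for my $start (0 .. length($word) - $len) {
                my $piece = substr($word, $start, $len);
                $index{$piece} = $word
                    if $wanted{$piece} && !exists $index{$piece};
            }
        }
    }
    return \%index;
}

# Sample data from this thread, held in memory instead of files
my @syllables = qw(ca bra maz chai);
my @corpus    = qw(cat bracken amaze);
my $index     = build_index(\@syllables, \@corpus);
for my $syllable (@syllables) {
    my $line = exists $index->{$syllable}
        ? "$syllable=$index->{$syllable}\n"
        : "$syllable wasn't matched in the supplied corpus\n";
    print $line;
}
```

The nested loop in the script above rescans the whole corpus for every syllable, while this version does a hash lookup per substring, which may matter once the corpus gets large.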

Skrynesaver

05-14-2011

Hello,
Am I still doing something wrong?
I ran Perl at the command line:

perl conc.pl corpus syllables

where corpus is the data in which the syllables have to be found, and syllables is the file which contains the syllables.
I even tried reversing the command-line order, but got no output at all.
Sorry for the hassle. I walked through the code and it should spew out the syllables. Is the command line wrong?
Many thanks

gimley

05-14-2011

Ahh, come on now, do some work here

The supplied code doesn't read the command line; it opens two files by name, 'Corpus' and 'Syllables'. You could change these to

Code:

 $ARGV[0]

and

Code:

 $ARGV[1]

if you wished to supply the file names on the command line.
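
For reference, a minimal sketch of that change, assuming the script is saved as conc.pl and called as perl conc.pl corpus syllables, as in the post above. The sub name concordance is made up for illustration. Line endings are stripped with a substitution rather than by setting $/, which is a swap from the posted code, so that the same file should behave the same way on Unix and on Windows.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(first);   # core module; first() stops at the first hit

# Hypothetical helper: returns one result line per syllable.
sub concordance {
    my ($corpus_path, $syllables_path) = @_;

    open(my $corpus_fh, '<', $corpus_path)
        or die "Could not open $corpus_path: $!";
    my @corpus = <$corpus_fh>;
    close $corpus_fh;
    s/\r?\n\z// for @corpus;    # strip LF or CRLF line endings

    open(my $syllables_fh, '<', $syllables_path)
        or die "Could not open $syllables_path: $!";
    my @results;
    while (my $syllable = <$syllables_fh>) {
        $syllable =~ s/\r?\n\z//;
        next unless length $syllable;    # skip blank lines
        # index() does a literal substring search, so no regex escaping is needed
        my $example = first { index($_, $syllable) >= 0 } @corpus;
        push @results, defined $example
            ? "$syllable=$example"
            : "$syllable wasn't matched in the supplied corpus";
    }
    close $syllables_fh;
    return @results;
}

# $ARGV[0] is the corpus file, $ARGV[1] the syllables file.
if (@ARGV == 2) {
    print "$_\n" for concordance($ARGV[0], $ARGV[1]);
}
else {
    warn "Usage: perl conc.pl corpus syllables\n";
}
```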

Code:

cat Corpus;echo '_____________';cat Syllables;echo '_____________'; perl nested_loop.pl;echo '_____________';cat nested_loop.pl 
cat
bracken
amaze
_____________
ca
bra
maz
chai
_____________
ca=cat
bra=bracken
maz=amaze
chai wasn't matched in the supplied corpus
_____________
#! /usr/bin/perl

use strict;   # These two lines save you endless trouble
use warnings; # without them typos and such errors get missed

$/ = "\r\n"; # DOS files read on Unix; remove this line when running on Windows
open(my $corpus_file, '<', 'Corpus') or die "Could not open Corpus: $!"; # Test corpus, one word per line
chomp(my @corpus = <$corpus_file>);  # Load the corpus file into an array for faster access
close $corpus_file;
open(my $syllables_file, '<', 'Syllables') or die "Could not open Syllables: $!";
while (<$syllables_file>) {
    chomp(my $syllable = $_);
    next unless length $syllable;  # Skip blank lines
    my $found = 0;
    for my $word (@corpus) {
        if ($word =~ /\Q$syllable\E/) {  # \Q...\E makes the regex match the syllable literally
            print "$syllable=$word\n";
            $found = 1;
            last; # Stop processing the array of words as we have an example
        }
    }
    print "$syllable wasn't matched in the supplied corpus\n" if !$found;
}
close $syllables_file;
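
The test corpus above holds bare words, whereas the real corpus described in the first post has English=Hindi on each line. A possible adaptation is sketched below, assuming the syllable should be looked for only on the Hindi side. The sub name first_example and the three sample entries are made up for illustration. When reading from real files, the handles would presumably need to be opened with an '<:encoding(UTF-8)' layer, on the assumption that the data is stored as UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # this source file contains UTF-8 literals
binmode(STDOUT, ':encoding(UTF-8)');   # print Hindi text as UTF-8

# Hypothetical helper: return the first "English=Hindi" entry whose
# Hindi side contains the syllable, or undef if there is none.
sub first_example {
    my ($syllable, $entries) = @_;
    for my $entry (@$entries) {
        my ($english, $hindi) = split /=/, $entry, 2;
        next unless defined $hindi;    # skip lines without '='
        return $entry if index($hindi, $syllable) >= 0;
    }
    return;
}

# Made-up sample entries in the English=Hindi format
my @entries = ('water=पानी', 'house=घर', 'name=नाम');
for my $syllable ('पा', 'ना', 'की') {
    my $example = first_example($syllable, \@entries);
    my $line = defined $example
        ? "$syllable\t$example\n"
        : "$syllable\tno example found\n";
    print $line;
}
```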

Skrynesaver

05-14-2011

Sorry,
I used to work in C and Java and am still learning Perl and Awk. The programs are faster and do much better work than a long program in C. Guess I still have a long way to go in Perl.

gimley
