Hello,
I have two files. The first file contains specific syllables of a language (Hindi) and the second file contains a large database from which these syllables have been culled.
The syllable file has one Hindi syllable per line,
and the corpus file gives each word in English followed by its Hindi equivalent, with an equals sign (=) as the delimiter.
What I am trying to get is an output in which each syllable is listed with a corresponding example from the corpus file.
In effect, this is a concordance of syllables. I tried grepping the corpus with the syllable file, but the output is too voluminous and the search is pretty slow.
I would really appreciate a script in AWK or Perl that could do the job.
I work on Windows under DOS, so piping is not available to me with AWK.
A pseudo-data file (in English) is provided as a zip.
Many thanks in advance for the help.
I'm not sure of your required format; however, the Perl example below should provide the necessary structures.
It loads the entire corpus into a hash. This is expensive with a large corpus, but it need only be done once, and each lookup afterwards is very fast.
If you are running this on a DOS machine, you needn't set the input record separator ($/), as it defaults to the appropriate value for the current environment.
Code:
#! /usr/bin/perl
use strict;
use warnings;
my %corpus; # The hash we will use to store a map of the corpus in
$/ = "\r\n"; # These files are DOS files, so set the end of line accordingly (not needed when running on DOS itself)
open(my $corpus, '<', 'Output') or die "Could not open Output file, $!"; # double quotes so that $! is interpolated
while (<$corpus>) {
    chomp;
    if (length $_) { # Avoid empty lines
        my ($syllable, $example) = split(/=/, $_); # extract the values
        $corpus{$syllable} = $example; # and store in the global hash
    }
}
close $corpus;
open(my $syllable, '<', 'Syllables') or die "Could not open Syllables file, $!";
while (<$syllable>) {
    chomp;
    print defined($corpus{$_}) ? "$_ has the example $corpus{$_} in the provided corpus\n" : "I do not have an example for $_\n";
}
close $syllable;
Last edited by Skrynesaver; 05-14-2011 at 03:26 AM..
Reason: added check for existence of entry in corpus file
Many thanks for the kind help. Sorry for the delay in responding, but my broadband server was down.
At the outset, I am sorry, I forgot to zip the corpus.
The sample corpus is:
cat bracken amaze (one word on each line)
The syllables were:
ca bra maz (one on each line)
I ran the program and it gave the following output:
I do not have an example for ca
bra
maz
Instead of the expected output:
ca=cat bra=bracken maz=amaze (one word on each line)
I tried with and without the $ but in both cases, the result was the same.
Your help would be really appreciated.
Ah, I misunderstood your intent; I read the intended output as the corpus data.
Try the following. It doesn't create an index (though that would be an interesting project, hmmn).
Code:
#! /usr/bin/perl
use strict; # These two lines save you endless trouble
use warnings; # without them typos and such errors get missed
open(my $corpus_file, '<', 'Corpus') or die "Could not open Corpus file, $!"; # Created a test corpus with just the contained lines
$/ = "\r\n"; # Again with the DOS files
chomp(my @corpus = (<$corpus_file>)); # Load the corpus file into an array for faster access
close $corpus_file;
open(my $syllables_file, '<', 'Syllables') or die "Could not open Syllables file, $!";
while (<$syllables_file>) {
    chomp(my $syllable = $_);
    for my $word (@corpus) {
        if ($word =~ /\Q$syllable\E/) { # use a regular expression to find a literal match for the syllable
            print "$syllable=$word\n";
            last; # Stop processing the array of words as we have an example
        }
    }
}
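For what it's worth, the index idea mentioned in passing above could be sketched roughly as follows. This is only an illustration under my own assumptions, not a tested tool: `build_concordance` is a made-up helper name, the data is held in memory instead of being read from the 'Corpus' and 'Syllables' files, and the sample words are the pseudo-English ones from this thread. Each corpus word is scanned once and only substring lengths that actually occur among the syllables are tried, so the run time should not grow with the number of syllables the way the nested loop does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: map each wanted syllable to the first corpus word
# that contains it. Takes two array references (syllables, words).
sub build_concordance {
    my ($syllables, $words) = @_;
    my %wanted  = map { $_ => 1 } @$syllables;          # syllables we care about
    my %lengths = map { length($_) => 1 } @$syllables;  # substring lengths to try
    my %example;                                        # syllable => first matching word

    for my $word (@$words) {
        for my $len (keys %lengths) {
            # The range is empty when the word is shorter than $len
            for my $start (0 .. length($word) - $len) {
                my $piece = substr($word, $start, $len);
                $example{$piece} = $word
                    if $wanted{$piece} && !exists $example{$piece};
            }
        }
    }
    return \%example;
}

# Sample data taken from the thread (assumed to be already split into lines)
my @syllables = qw(ca bra maz chai);
my @corpus    = qw(cat bracken amaze);
my $example   = build_concordance(\@syllables, \@corpus);

for my $syllable (@syllables) {
    my $line = exists $example->{$syllable}
        ? "$syllable=$example->{$syllable}\n"
        : "$syllable wasn't matched in the supplied corpus\n";
    print $line;
}
```

With real Hindi data the files would presumably need to be opened with a UTF-8 layer, so that `length` and `substr` count characters and not bytes; that part is left out of the sketch.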
Hello,
Am I still doing something wrong?
I ran perl at the command line:
perl conc.pl corpus syllables
where corpus is the data in which the syllables have to be found and
syllables is the file which contains the syllables.
I even tried reversing the command line order, but got no output at all.
Sorry for the hassle. I walked through the code and it should print the syllables. Is the command line wrong?
Many thanks
The supplied code doesn't read the command line; it opens two files by name, 'Corpus' and 'Syllables'. You could change these to
Code:
$ARGV[0]
and
Code:
$ARGV[1]
if you wish to supply the file names on the command line.
Code:
cat Corpus;echo '_____________';cat Syllables;echo '_____________'; perl nested_loop.pl;echo '_____________';cat nested_loop.pl
cat
bracken
amaze
_____________
ca
bra
maz
chai
_____________
ca=cat
bra=bracken
maz=amaze
chai wasn't matched in the supplied corpus
_____________
#! /usr/bin/perl
use strict; # These two lines save you endless trouble
use warnings; # without them typos and such errors get missed
open(my $corpus_file, '<', 'Corpus') or die "Could not open Corpus file, $!"; # Created a test corpus with just the contained lines
$/ = "\r\n"; # Again with the DOS files
chomp(my @corpus = (<$corpus_file>)); # Load the corpus file into an array for faster access
open(my $syllables_file, '<', 'Syllables') or die "Could not open Syllables file, $!";
while (<$syllables_file>) {
    chomp(my $syllable = $_);
    my $found = 0;
    for my $word (@corpus) {
        if ($word =~ /\Q$syllable\E/) { # use a regular expression to find a literal match for the syllable
            print "$syllable=$word\n";
            $found = 1;
            last; # Stop processing the array of words as we have an example
        }
    }
    print "$syllable wasn't matched in the supplied corpus\n" if (! $found);
}
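If you would rather keep your original invocation (perl conc.pl corpus syllables), the $ARGV change suggested above might look something like the sketch below. The helper names `read_lines` and `first_example` are my own inventions for illustration. Line endings are stripped with a substitution instead of setting $/, on the assumption that the same script should cope with both DOS and Unix files wherever it runs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read a file into a list of non-empty lines,
# stripping either DOS (\r\n) or Unix (\n) line endings.
sub read_lines {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Could not open $path: $!";
    my @lines;
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;
        push @lines, $line if length $line;
    }
    close $fh;
    return @lines;
}

# Hypothetical helper: return the first word containing the syllable
# as a literal substring, or undef when there is none.
sub first_example {
    my ($syllable, @words) = @_;
    for my $word (@words) {
        return $word if index($word, $syllable) >= 0;
    }
    return;
}

# Run only when both file names are supplied: perl conc.pl corpus syllables
if (@ARGV == 2) {
    my ($corpus_name, $syllables_name) = @ARGV;
    my @corpus = read_lines($corpus_name);
    for my $syllable (read_lines($syllables_name)) {
        my $word = first_example($syllable, @corpus);
        my $line = defined $word
            ? "$syllable=$word\n"
            : "$syllable wasn't matched in the supplied corpus\n";
        print $line;
    }
} else {
    warn "usage: perl conc.pl corpus syllables\n";
}
```

Here `index` does a plain substring search, so characters in a syllable that are special in regular expressions need no escaping.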
Sorry,
I used to work in C and Java and am still learning Perl and Awk. The programs are faster and do much better work than a long program in C. Guess I still have a long way to go in Perl.