USING A PERL SCRIPT FOR BUCKETING N-GRAMS


 
# 1  
Old 04-13-2011

I am trying to separate names from the initials that are prefixed to them, with no separator in between.
The prefixing can be of three major types:
A single letter prefixed: jsmith
Two letters prefixed: jksmith
Three letters prefixed: jkdsmith

The algorithm I had in mind is something like this:
BASIC PREMISE:
The data contains both correct names and garbage, where initial letters that are not part of the name have been prepended to it.
The prepending is of the 3 types mentioned above.
BASIC INPUTS:
For each type there is an N-gram look-up list: unigrams, digrams and trigrams.
There is a basic dictionary which contains all the valid words.
HOW TO SET ABOUT IT:
The program takes a word from the input file and checks it against the dictionary.
If the word is found, it goes on to the next word.
If the word is not found, it checks the start of the word against the digram list, strips off the digram and checks whether the residue exists in the dictionary.
If the residue maps to the dictionary, it shows the input as PREFIX + word.
If the residue is not found, it flags the word as a possible error.
Bucketing would be the ideal solution: start with the largest set, the trigrams, and work down to the unigrams.
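The steps above can be sketched as one function that checks the whole word first and then tries the longest prefix before the shorter ones. This is only a minimal Python sketch of that idea, not the poster's program: the names `split_prefix`, `dictionary`, `ngrams` and `max_len` are assumptions made for illustration, and both look-ups are assumed to fit in memory as sets.

```python
def split_prefix(word, dictionary, ngrams, max_len=3):
    """Return (prefix, residue); prefix is "" for a plain dictionary word.

    Returns None when no known prefix leaves a dictionary word behind.
    """
    # Step 1: a word that is already valid needs no stripping.
    if word in dictionary:
        return ("", word)
    # Step 2: try trigrams, then digrams, then unigrams.
    for n in range(max_len, 0, -1):
        prefix, residue = word[:n], word[n:]
        if prefix in ngrams and residue in dictionary:
            return (prefix, residue)
    # Step 3: nothing matched, so the caller can flag a possible error.
    return None


# Hypothetical in-memory copies of the two look-ups.
dictionary = {"smith", "green", "brown", "black"}
ngrams = {"j", "k", "d", "jd", "jl", "jk", "jdm", "jlk", "jkv"}

for word in ["jsmith", "jdblack", "jkvblack", "green", "xyzzy"]:
    result = split_prefix(word, dictionary, ngrams)
    if result is None:
        print("possible error: " + word)
    elif result[0]:
        print(result[0] + "+" + result[1])
    else:
        print(word)
```

Because the loop runs from the longest prefix down, all three prefix types are handled in a single pass over the input.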

To solve the issue I wrote a Perl program, with help from a colleague, which is given below:
Code:
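#!/usr/bin/perl
# Usage: -d <dictionary> -p <prefixes [=digrams]> [input files]
use strict;
use warnings;
use Getopt::Std;

our ($opt_d, $opt_p);
getopts('d:p:');

# Load both lists (one entry per line) into hashes so that look-ups are exact.
open my $dict_fh, '<', $opt_d or die "Cannot open dictionary: $!";
chomp(my @dict = <$dict_fh>);
my %dict = map { $_ => 1 } @dict;
open my $pref_fh, '<', $opt_p or die "Cannot open prefix list: $!";
chomp(my @pref = <$pref_fh>);
my %pref = map { $_ => 1 } @pref;

while (my $word = <>) {
    chomp $word;
    # Handles one prefix length only: two dots = digram.
    next unless $word =~ /^(..)(.*)$/;
    my ($p, $r) = ($1, $2);
    print "$p + $r\n" if $pref{$p} && $dict{$r};
}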

DICT is the dictionary and PREF is the N-gram file.
However, my main issue is that at any given time the program can handle only one prefix length, say digrams (or trigrams, by adding one more dot to the pattern).
What I need is a program that starts with the trigrams, then takes on the digrams and finally identifies the unigrams; in other words, one that buckets the data and sieves through it.

A small database is given below:
DICTIONARY:
Code:
smith
green
brown
black

NGrams
Code:
Unigram:j,k,d
Digrams:jd,jl,jk
Trigrams:jdm,jlk,jkv
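The thread does not say whether the real NGrams file carries the `Unigram:`/`Digrams:`/`Trigrams:` labels shown above. If it does, the labels have to be stripped before the entries can be used as look-up keys. A minimal Python sketch, assuming the label is everything up to the last colon on a line (`parse_ngram_lines` is a hypothetical helper name):

```python
def parse_ngram_lines(lines):
    """Collect comma-separated n-grams into one set.

    An optional 'Label:' at the start of a line is ignored.
    """
    ngrams = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Keep only the text after the last colon; a line without a colon is kept whole.
        _, _, values = line.rpartition(":")
        ngrams.update(v.strip() for v in values.split(",") if v.strip())
    return ngrams


sample = ["Unigram:j,k,d", "Digrams:jd,jl,jk", "Trigrams:jdm,jlk,jkv"]
print(sorted(parse_ngram_lines(sample)))
```

Putting all three lengths into one set works because a prefix is looked up by its exact text, so a digram can never be mistaken for a unigram.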

Testdatabase:
Code:
jsmith
kbrown
jdblack
jlgreen
jdmsmith
jlkbrown
jkvblack

Expected result:
Code:
j+smith
k+brown
jd+black
jl+green
jdm+smith
jlk+brown
jkv+black

This would be possible only if the dictionary is checked each time and the residue is flagged as such (as I have shown in the algorithm).
The script works perfectly for a single prefix type, but I don't know how to bucket the data using multiple N-grams in one pass. Does Perl support this kind of chaining and filtering? This is too complex and beyond my scripting abilities.

Any solutions would be highly appreciated. Many thanks in anticipation.

Last edited by fpmurphy; 04-13-2011 at 11:56 AM.. Reason: code tags please!
# 2  
Old 04-15-2011
Don't know about using perl, but this is pretty easy in awk:

Code:
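awk -F',' '
# F counts the input files: 1 = DICTIONARY, 2 = NGrams, 3 = Testdatabase
FNR==1{F++}
# Dictionary: one word per line
F==1{D[$0]++}
# NGrams: comma-separated prefixes
F==2{for(i=1;i<=NF;i++) G[$i]++}
# Test words: try the longest prefix (3 letters) first, then 2, then 1
F==3{for(i=3;i>0;i--)
   if(substr($0,1,i) in G && substr($0,i+1) in D) {
      print substr($0,1,i) "+" substr($0,i+1);
      next;
   }
   print "!!" $0;   # no prefix/residue pair found: flag as a possible error
}' DICTIONARY NGrams Testdatabase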


Last edited by Chubler_XL; 04-15-2011 at 12:39 AM..
# 3  
Old 04-15-2011
Many thanks. It works just fine, and fast: it took barely a second to run through a dictionary of around 200,000 words and an NGram list of around 200 entries.

---------- Post updated at 11:26 PM ---------- Previous update was at 11:09 PM ----------

Hello,
While testing the script I found one small "bug". Suppose the dictionary contains both smith and jsmith, and j is also listed as an NGram. When jsmith appears in the Testdatabase it should pass unchanged; instead it is shown as j+smith. I suppose this is because the longest string, the whole word, is not checked first. I don't know whether this hypothesis is correct.
Otherwise the script runs like a charm.
Many thanks.
# 4  
Old 04-15-2011
Yep, I didn't consider that case. This update checks for it:

Code:
awk -F',' 'FNR==1{F++}
F==1{D[$0]++}
F==2{for(i=1;i<=NF;i++) G[$i]++}
F==3{
   # A word that is already in the dictionary is printed unchanged
   if($0 in D) print $0; else {
      for(i=3;i>0;i--)
      if(substr($0,1,i) in G && substr($0,i+1) in D) {
         print substr($0,1,i) "+" substr($0,i+1);
         next;
      }
      print "!!" $0;
   }
}' DICTIONARY NGrams Testdatabase
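The effect of the added `$0 in D` test can be seen in a small Python sketch that runs the reported jsmith case with and without the whole-word check. It mirrors the awk logic but is not a translation of it; `classify` and the `whole_word_first` flag are hypothetical names introduced for this illustration.

```python
def classify(word, dictionary, ngrams, whole_word_first=True):
    """Return the word, 'prefix+residue', or '!!word' for a possible error."""
    # The fix: a word that is itself in the dictionary is never split.
    if whole_word_first and word in dictionary:
        return word
    # Longest prefix first, as in the awk loop for(i=3;i>0;i--).
    for n in (3, 2, 1):
        prefix, residue = word[:n], word[n:]
        if prefix in ngrams and residue in dictionary:
            return prefix + "+" + residue
    return "!!" + word


dictionary = {"smith", "jsmith"}
ngrams = {"j"}

print(classify("jsmith", dictionary, ngrams, whole_word_first=False))  # j+smith
print(classify("jsmith", dictionary, ngrams))                          # jsmith
```

The only difference between the two calls is whether the dictionary is consulted before any prefix is stripped, which is what the updated script adds.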

# 5  
Old 04-15-2011
Works great and gives the right output. Many thanks for a fabulous script.