I am trying to separate names from the initials that have been prefixed to them and run together with them.
The prefixing can be of three major types:
A single letter prefixed: jsmith
Two letters prefixed: jksmith
Three letters prefixed: jkdsmith
The algorithm which I had in mind was something like this:
BASIC PREMISE:
The data contains both correct names and garbage, where initial letters that are not part of the name have been prefixed to the name.
The prefixing is of the 3 types mentioned above.
BASIC INPUTS
For each type there is an N-gram look-up list: unigrams, digrams and trigrams.
There is a basic dictionary which contains all the possible valid combos
HOW TO SET ABOUT IT.
The program takes a word from the input file and checks it against the dictionary.
If the word is found, it goes on to the next word.
If the word is not found, it checks the start of the word against the digram list, strips off the digram and validates whether the residue exists in the dictionary.
If the residue maps to the dictionary, show the input as PREF + word.
If the residue is not found either, flag the word as a possible error.
Bucketing would be the ideal solution, starting with the largest set (the trigrams) and then working down to the unigrams.
To solve the issue I wrote a Perl program with help from a colleague, which is appended below:
DICT is the dictionary and PREF is the N-Gram file
However, my main problem is that at any given time the program can handle only one prefix type, say digrams or trigrams (by adding one more).
What I need is a program which could start off with the trigrams, take on the digrams and finally identify the unigrams; in other words, bucketing the data and sieving through it.
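A minimal sketch of the sieve described above, not the program referred to in this post. It assumes the dictionary and the N-gram list have already been loaded into two hashes; the names %dict, %pref and sieve() are hypothetical, and the sample words are only the ones used as examples earlier in this post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for the DICT and PREF files.
my %dict = map { $_ => 1 } qw(smith);
my %pref = map { $_ => 1 } qw(j jk jkd);

# Classify one word: whole-word dictionary check first,
# then try prefixes of length 3, 2 and 1 in that order.
sub sieve {
    my ($word) = @_;
    return $word if $dict{$word};            # valid as it stands
    for my $len (3, 2, 1) {                  # largest bucket first
        next if length($word) <= $len;       # no residue would be left
        my $prefix  = substr($word, 0, $len);
        my $residue = substr($word, $len);
        # Accept the split only if the prefix is a listed N-gram
        # and the residue is a dictionary word.
        return "$prefix+$residue"
            if $pref{$prefix} && $dict{$residue};
    }
    return "$word (possible error)";         # no split could be validated
}

print sieve($_), "\n" for qw(smith jsmith jksmith jkdsmith);
```

Because the prefix length is just a loop variable, the three buckets are handled in one pass over the input rather than in three separate runs.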
A small database is given below:
DICTIONARY:
NGrams
Testdatabase:
Expected result:
This would be possible only if the dictionary is checked each time and the residue is flagged as such (as I have shown in the algorithm).
The script works perfectly for a single type, but I don't know how to bucket the data using multiple N-grams in one pass. Does Perl support this kind of chaining and filtering? This is too complex and beyond my scripting abilities.
Any solutions would be highly appreciated. Many thanks in anticipation.
Last edited by fpmurphy; 04-13-2011 at 11:56 AM..
Reason: code tags please!
Many thanks. It works just fine, and it is fast: it took hardly one second to run through a dictionary of around 200,000 (2 lakh) words and an N-gram list of around 200 entries.
---------- Post updated at 11:26 PM ---------- Previous update was at 11:09 PM ----------
Hello,
On testing the script, I found one small "bug". Suppose the dictionary contains both smith and jsmith, and j is also listed as an N-gram. When jsmith appears in the test database it should pass as it is; instead it is shown as j+smith. I suppose this is because the longest string is not checked first, but I don't know whether this hypothesis is correct.
Otherwise the script runs like a charm.
Many thanks
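The hypothesis about jsmith can be tried out with a small sketch. This is not the script under discussion, only an illustration with hypothetical names (%dict, %pref, strip_prefix(), prefix_first(), whole_first()): it compares stripping prefixes before the dictionary check with looking up the whole word first. It needs Perl 5.10 or later for the // operator.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical data reproducing the reported case.
my %dict = map { $_ => 1 } qw(smith jsmith);
my %pref = map { $_ => 1 } qw(j);

# Return "prefix+residue" if a listed prefix can be split off and the
# residue is a dictionary word; return undef otherwise.
sub strip_prefix {
    my ($word) = @_;
    for my $len (3, 2, 1) {
        next if length($word) <= $len;
        my ($p, $r) = (substr($word, 0, $len), substr($word, $len));
        return "$p+$r" if $pref{$p} && $dict{$r};
    }
    return undef;
}

# Prefix stripping tried before the whole-word lookup.
sub prefix_first {
    my ($w) = @_;
    return strip_prefix($w) // ($dict{$w} ? $w : "$w (possible error)");
}

# Whole word looked up before any prefix is stripped.
sub whole_first {
    my ($w) = @_;
    return $dict{$w} ? $w : (strip_prefix($w) // "$w (possible error)");
}

print "prefix first: ", prefix_first('jsmith'), "\n";
print "whole first:  ", whole_first('jsmith'), "\n";
```

With this data the first variant reports j+smith and the second lets jsmith pass unchanged, which is consistent with the hypothesis. Whether the actual script behaves this way for the same reason is an assumption.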