Hello,
I work under Windows Vista and I am compiling an open-source stemmer dictionary for English and eventually for Indian languages. The engine I have written has output all the lemmatised/expanded forms of the words: nouns, adjectives, adverbs, etc. Each set of expanded forms is separated from the next by a hard return. Since each root word was treated as a separate entity according to its grammatical function, the output sometimes contains duplicate sets for the same root word.
An example will make this clear:
As can be seen the two sets for
have been created. Since they share the same root word, they should have been merged, but for the reason given above they are treated as separate entities.
Is it possible to write a script that goes through the sets and, whenever a word is found in both set A and set B, merges the two sets? If possible, the merged set should also be sorted and have its duplicate forms removed.
The output of the above would look something like this:
The sets are not necessarily contiguous and at times could be separated by another set of words.
Since the data is huge, a Perl or awk script would go a long way in speeding up the process.
Many thanks in advance for helping with work that will aid researchers in creating better stemming for English and other languages.
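One possible way to approach the merging described above is a union-find (disjoint-set) pass over the words, so that sets sharing any word end up in one group even when they are not contiguous. The sketch below is in Python rather than the Perl or awk asked for, and it rests on an assumption about the file layout that the post does not confirm: one form per line, with a blank line between sets. The function names `read_sets` and `merge_sets` are made up for this illustration.

```python
def read_sets(lines):
    """Yield each set as a list of forms.

    Assumed layout (not confirmed by the post): one form per line,
    with one or more blank lines ending a set.
    """
    current = []
    for line in lines:
        form = line.strip()
        if form:
            current.append(form)
        elif current:
            yield current
            current = []
    if current:
        yield current


def merge_sets(sets):
    """Merge all sets that share at least one form, directly or through
    a chain of other sets. Returns sorted, duplicate-free lists."""
    parent = {}  # form -> parent form in the union-find forest

    def find(form):
        # Walk up to the representative, shortening the path on the way.
        while parent[form] != form:
            parent[form] = parent[parent[form]]
            form = parent[form]
        return form

    for forms in sets:
        if not forms:
            continue
        for form in forms:
            parent.setdefault(form, form)
        first = find(forms[0])
        # Attach every other form's group to the group of the first form.
        for form in forms[1:]:
            root = find(form)
            if root != first:
                parent[root] = first

    # Collect forms by representative; a Python set drops duplicates.
    groups = {}
    for form in parent:
        groups.setdefault(find(form), set()).add(form)
    return sorted(sorted(group) for group in groups.values())
```

With a file handle `fh` opened on the engine's output, `merge_sets(read_sets(fh))` would return the merged groups, each of which can then be printed one form per line followed by a blank line. Memory use grows with the number of distinct forms rather than with the number of sets, which may matter for a large dictionary. If each set actually sits on a single line, only `read_sets` would need to change.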