Cleaning through perl or awk a Stemmer dictionary


 
# 1  
Old 05-26-2013

Hello,
I work under Windows Vista and am compiling an open-source stemmer dictionary for English and, eventually, for Indian languages as well. The engine I have written outputs all lemmatised/expanded forms of each word: nouns, adjectives, adverbs, etc. Each set of expanded forms is separated from the next by a blank line. Since each root word was treated as a separate entity according to its grammatical function, the same root can produce several overlapping sets.
An example will make this clear:
Code:
coil
coiled
coiling
coils

coil
coils

coin's
coin
coins
coins'

coin
coined
coining
coins

As can be seen, two sets each have been created for
Code:
coil and coin

Since the sets in each pair share the same root word, they should have been merged, but for the reason given above they are treated as separate entities.
Is it possible to write a script that goes through the sets and, whenever a word is found in both set A and set B, merges the two sets, ideally sorting the result and removing the duplicate forms?
The output of the above would look something like this:
Code:
coil
coiled
coiling
coils

coin's
coin
coins
coins'
coined
coining

The sets are not necessarily contiguous and at times could be separated by another set of words.
Since the data is huge, a perl or awk script would go a long way in speeding up the process.
Many thanks in advance for helping with work that will aid researchers in creating better stemmers for English and other languages.
# 2  
Old 05-26-2013
No sorting, but this should merge your forms and remove duplicates:

Code:
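awk '
BEGIN { RS = "" }                 # paragraph mode: each blank-line-separated set is one record
{
  root = $1                       # default root: first word of the set
  # if any word already belongs to a stored set, reuse that root
  for (i = 1; i <= NF; i++) if ($i in LEM) root = LEM[$i]
  # file every unseen word under the chosen root
  for (i = 1; i <= NF; i++) if (!($i in LEM)) {
    LEM[$i] = root
    base[root] = base[root] OFS $i
  }
}
END {
  for (w in base) {
    forms = split(base[w], form)  # split() fills form[1]..form[forms]
    for (i = 1; i <= forms; i++)  # start at 1 so the last form is not dropped
      print form[i]
    print ""
  }
}' infile > outfile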
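
Note on chained sets: the script above picks a single root per record, so a later set that shares words with two sets already stored under different roots will not join them. If such bridging sets occur in the data (an assumption; the sample in post #1 does not contain one), a union-find pass is one possible sketch. The names merge_sets and find are hypothetical, and entries are assumed to be ordinary words rather than numbers.

```shell
# Hypothetical helper: merge blank-line-separated sets that share any word,
# including sets that are only linked through a third set.
merge_sets() {
  awk '
  # find(): follow parent links up to the representative word of a set
  function find(w) {
    while (parent[w] != w) {
      parent[w] = parent[parent[w]]   # path halving keeps chains short
      w = parent[w]
    }
    return w
  }
  BEGIN { RS = "" }                   # one record per set
  {
    for (i = 1; i <= NF; i++) {
      if (!($i in parent)) { parent[$i] = $i; order[++n] = $i }
      a = find($1); b = find($i)
      if (a != b) parent[b] = a       # union: every word joins the set of $1
    }
  }
  END {
    # group words by representative, keeping first-appearance order
    for (k = 1; k <= n; k++) {
      r = find(order[k])
      if (!(r in group)) roots[++m] = r
      group[r] = group[r] order[k] "\n"
    }
    for (k = 1; k <= m; k++)
      printf "%s%s", group[roots[k]], (k < m ? "\n" : "")
  }' "$@"
}
```

Usage would be along the lines of merge_sets infile > outfile. Unlike the for (w in base) loop, this version emits sets and forms in order of first appearance; it does not sort them.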

This User Gave Thanks to Chubler_XL For This Post:
# 3  
Old 05-26-2013
Many thanks. It worked beautifully. No hassles about the sort; I can do that very easily by creating a new script.
# 4  
Old 05-26-2013
Here is a version that sorts:

Code:
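awk '
BEGIN { RS = "" }                 # paragraph mode: each blank-line-separated set is one record
{
  root = $1                       # default root: first word of the set
  for (i = 1; i <= NF; i++) if ($i in LEM) root = LEM[$i]
  for (i = 1; i <= NF; i++) if (!($i in LEM)) {
    LEM[$i] = root
    base[root] = base[root] OFS $i
  }
}
END {
  for (w in base) {
    forms = split(base[w], form)  # split() fills form[1]..form[forms]
    for (i = 1; i <= forms; i++)  # start at 1 so the last form is not dropped
      print w "," form[i]         # tag each form with its root so sort keeps a set together
    print w "?"                   # separator line; the final awk turns it into a blank line
  }
}' infile | sort | awk -F, '{ print $2 }'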
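
Note on the sort step: the pipeline above relies on how sort orders "," and "?" relative to the characters in the words, and that ordering depends on the locale. A sketch of an alternative that sorts the forms inside each already merged set, using the set number as the key, might look like the following. The name sort_sets is hypothetical, and LC_ALL=C (plain byte order) is an assumption about the ordering wanted.

```shell
# Hypothetical helper: sort the forms within each blank-line-separated set,
# keeping the sets themselves in their original order.
sort_sets() {
  # 1. prefix every form with the number of the set it belongs to
  awk 'BEGIN { RS = "" }
       { for (i = 1; i <= NF; i++) print NR "\t" $i }' "$@" |
  # 2. sort by set number, then by form in byte order
  LC_ALL=C sort -t "$(printf '\t')" -k1,1n -k2,2 |
  # 3. drop the key and put a blank line back between sets
  awk -F '\t' 'NR > 1 && $1 != prev { print "" } { print $2; prev = $1 }'
}
```

It could be run on the merged output, for example sort_sets outfile > sorted. In byte order an apostrophe sorts before letters, so coin's comes directly after coin.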

This User Gave Thanks to Chubler_XL For This Post:
# 5  
Old 05-26-2013
Many thanks. I tested it out on a small sample and it sorts just great.