Identifying single words in a dictionary database Post: 302981083


Top Forums Shell Programming and Scripting Identifying single words in a dictionary database Post 302981083 by gimley on Wednesday 7th of September 2016 06:34:13 AM

09-07-2016


I am reworking a Marathi-English dictionary to be released as open source. Each entry has the headword in Marathi, followed by its part of speech and then its English glosses, as in the examples below:

Code:

अकरसणें v i To contract, shrink.
अकरा a Eleven.
अकराळ a Frightful, terrible.
विकराळ a Frightful, terrible.
अकरोट m A walnut.
अकरोड m A walnut.
अकर्त्तव्य a That which is not proper to be done; improper.
अकर्त्ता a Incapable, incompetent.
अकर्तृत्व n Impotence, incapability.
अकर्म n A bad action, sin. 
अकर्मी a Wicked.
कुकर्मी a Wicked.
अकर्मक a Intransitive or neuter.
अकलंक a Exempt from stain, blemish.
अंकलिपि f Figure-writing.
अकलेचा खंदक m A wiseacre.
अकल्पनीय a Inconceivable; undevisable
अकल्पित a Unexpected; unthought of.
अकल्मष a Sinless.
अकल्याण n Infelicity; injury.
अकस m Malice, spite, grudge.
अकसखोर a Spiteful, malicious
अकसी a Spiteful, malicious
अकस्मात् ad Suddenly, unexpectedly, inconsiderately.
अकस्मात ad Suddenly, unexpectedly, inconsiderately.
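For illustration, here is a minimal sketch of how one such entry might be split into headword, part of speech and gloss text. It is written in Python rather than awk or perl, purely as an example. The names `ENTRY_RE` and `parse_entry` are hypothetical, and the pattern assumes that the part of speech consists of lowercase Latin tokens and that the gloss text starts with an ASCII capital letter, which holds for the sample above but may not hold for the whole database.

```python
import re

# Assumed layout (not confirmed for the full database):
#   headword (may contain spaces), then one or more lowercase Latin
#   part-of-speech tokens, then gloss text starting with an ASCII capital.
ENTRY_RE = re.compile(
    r"^(?P<head>.+?)\s+(?P<pos>[a-z]+(?:\s+[a-z]+)*)\s+(?P<gloss>[A-Z].*)$"
)


def parse_entry(line):
    """Return (headword, part_of_speech, gloss_text), or None if no match."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    return match.group("head"), match.group("pos"), match.group("gloss")
```

Lines that do not fit the assumed layout return `None`, so they can be collected and checked by hand instead of being silently changed.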

As can be seen, the glosses are delimited by either a comma or a semicolon.
To make the dictionary more accessible, readable and concise, I want to retain only glosses that are single words, or single words preceded by one of the following:

Code:

to 
A
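The rule described above could be sketched roughly as follows, again in Python as an illustration only. The helper names `split_glosses` and `is_single_word` are hypothetical. The sketch assumes that "to" and "A" are the only allowed prefixes, matched case-insensitively, and that a hyphenated form such as "Figure-writing" counts as one word.

```python
import re

# A gloss qualifies if it is one token, optionally preceded by "to" or "a"
# (case-insensitive). The prefix list is an assumption taken from the post.
SINGLE_WORD_RE = re.compile(r"^(?:(?:to|a)\s+)?\S+$", re.IGNORECASE)


def split_glosses(gloss_text):
    """Split gloss text on commas and semicolons, dropping the final period."""
    cleaned = (
        part.strip().rstrip(".").strip()
        for part in re.split(r"[,;]", gloss_text)
    )
    return [part for part in cleaned if part]


def is_single_word(gloss):
    """True for a single word, or a single word preceded by 'to' or 'a'."""
    return SINGLE_WORD_RE.match(gloss) is not None
```

With these definitions, "To contract" and "A walnut" pass the check, while "A bad action" and "unthought of" do not.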

At present I have written a macro in UltraEdit, a text editor, to do the job. It uses a regex to identify all glosses consisting of a single word (with the condition specified above).
However, macros are slow, and since the database has more than 400,000 words, I wonder whether such single-word glosses could be identified with an awk or perl script. Thus, in the example above, only the following would be retained:

Code:

अकरसणें v i To contract, shrink.
अकरा a Eleven.
अकराळ a Frightful, terrible.
विकराळ a Frightful, terrible.
अकरोट m A walnut.
अकरोड m A walnut.
अकर्त्तव्य a That which is not proper to be done; improper.
अकर्त्ता a Incapable, incompetent.
अकर्तृत्व n Impotence, incapability.
अकर्म n A bad action, sin. 
अकर्मी a Wicked.
कुकर्मी a Wicked.
अकर्मक a Intransitive,neuter.
अकलंक a Exempt from stain, blemish.
अंकलिपि f Figure-writing.
अकलेचा खंदक m A wiseacre.
अकल्पनीय a Inconceivable; undevisable
अकल्पित a Unexpected; unthought of.
अकल्मष a Sinless.
अकल्याण n Infelicity; injury.
अकस m Malice, spite, grudge.
अकसखोर a Spiteful, malicious
अकसी a Spiteful, malicious
अकस्मात् ad Suddenly, unexpectedly, inconsiderately.
अकस्मात ad Suddenly, unexpectedly, inconsiderately.

I normally have the painful task of further removing the long definition in cases where a word has both a single-word gloss and a long definition, as in:

Code:

अकर्त्तव्य a That which is not proper to be done; improper.

which is reduced to

Code:

अकर्त्तव्य a improper.
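This second stage might be sketched as below, in Python as an illustration rather than awk or perl. The function name `reduce_entry` and both patterns are hypothetical. The sketch assumes lowercase Latin part-of-speech tokens, gloss text starting with an ASCII capital, and "to"/"A" as the only allowed prefixes. Entries with no single-word gloss at all are left unchanged, which is an assumption, since the post does not say what should happen to them.

```python
import re

# Assumed entry layout: headword, lowercase part-of-speech tokens, gloss text.
ENTRY_RE = re.compile(
    r"^(?P<head>.+?)\s+(?P<pos>[a-z]+(?:\s+[a-z]+)*)\s+(?P<gloss>[A-Z].*)$"
)
# One token, optionally preceded by "to" or "a" (assumed prefix list).
SINGLE_WORD_RE = re.compile(r"^(?:(?:to|a)\s+)?\S+$", re.IGNORECASE)


def reduce_entry(line):
    """Keep only the single-word glosses of an entry, if it has any."""
    line = line.strip()
    match = ENTRY_RE.match(line)
    if match is None:
        return line  # layout not recognised: leave the line alone
    glosses = [
        g.strip().rstrip(".").strip()
        for g in re.split(r"[,;]", match.group("gloss"))
    ]
    short = [g for g in glosses if g and SINGLE_WORD_RE.match(g)]
    if not short:
        return line  # assumption: entries without short glosses stay as they are
    # Simplification: kept glosses are rejoined with ", " and one final period,
    # so an original semicolon between two kept glosses is not preserved.
    return "{} {} {}.".format(match.group("head"), match.group("pos"), ", ".join(short))
```

To run this over the whole database, each line of the file would be passed through `reduce_entry`. On Windows it is safer to open both the input and the output file with an explicit `encoding="utf-8"`, because the default encoding depends on the system locale.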

If both stages can be handled, it would really help a lot.
I work in a Windows environment. Many thanks for any help, which will be gratefully acknowledged by the community once this simplified dictionary is put online.

gimley


