Identifying single words in a dictionary database

09-07-2016

I am reworking a Marathi-English dictionary to be released as open source. Each entry has the headword in Marathi, followed by its part of speech and then its English glosses, as in the examples below:

Code:

अकरसणें v i To contract, shrink.
अकरा a Eleven.
अकराळ a Frightful, terrible.
विकराळ a Frightful, terrible.
अकरोट m A walnut.
अकरोड m A walnut.
अकर्त्तव्य a That which is not proper to be done; improper.
अकर्त्ता a Incapable, incompetent.
अकर्तृत्व n Impotence, incapability.
अकर्म n A bad action, sin. 
अकर्मी a Wicked.
कुकर्मी a Wicked.
अकर्मक a Intransitive or neuter.
अकलंक a Exempt from stain, blemish.
अंकलिपि f Figure-writing.
अकलेचा खंदक m A wiseacre.
अकल्पनीय a Inconceivable; undevisable
अकल्पित a Unexpected; unthought of.
अकल्मष a Sinless.
अकल्याण n Infelicity; injury.
अकस m Malice, spite, grudge.
अकसखोर a Spiteful, malicious
अकसी a Spiteful, malicious
अकस्मात् ad Suddenly, unexpectedly, inconsiderately.
अकस्मात ad Suddenly, unexpectedly, inconsiderately.

As can be seen, the glosses are delimited by either a comma or a semicolon.
To make the dictionary more accessible, readable and concise, I want to retain only glosses that are single words, or single words preceded by one of the following:

Code:

to 
A

At present I have written a macro in UltraEdit, a text editor, to do the job. It uses a regex to identify all glosses consisting of a single word (under the condition specified above).
However, macros are slow, and since the database has 400,000+ words, I wonder whether such single-word glosses could be identified with an awk or Perl script. Thus, in the example above, only the following would be retained:

Code:

अकरसणें v i To contract, shrink.
अकरा a Eleven.
अकराळ a Frightful, terrible.
विकराळ a Frightful, terrible.
अकरोट m A walnut.
अकरोड m A walnut.
अकर्त्तव्य a That which is not proper to be done; improper.
अकर्त्ता a Incapable, incompetent.
अकर्तृत्व n Impotence, incapability.
अकर्म n A bad action, sin. 
अकर्मी a Wicked.
कुकर्मी a Wicked.
अकर्मक a Intransitive,neuter.
अकलंक a Exempt from stain, blemish.
अंकलिपि f Figure-writing.
अकलेचा खंदक m A wiseacre.
अकल्पनीय a Inconceivable; undevisable
अकल्पित a Unexpected; unthought of.
अकल्मष a Sinless.
अकल्याण n Infelicity; injury.
अकस m Malice, spite, grudge.
अकसखोर a Spiteful, malicious
अकसी a Spiteful, malicious
अकस्मात् ad Suddenly, unexpectedly, inconsiderately.
अकस्मात ad Suddenly, unexpectedly, inconsiderately.

I then normally have the painful task of further trimming entries in which a word has both a single-word gloss and a long definition, as in

Code:

अकर्त्तव्य a That which is not proper to be done; improper.

which is reduced to

Code:

अकर्त्तव्य a improper.

If both stages can be handled, it would help a lot.
I work in a Windows environment. Many thanks for any help, which will be gratefully acknowledged by the community once this simplified dictionary is put online.

gimley

09-07-2016

This is a *nix forum. What *nix tools are available in your Windows environment?
The only difference I see in the output file is:

Code:

अकर्मक a Intransitive,neuter.

What exactly is the script supposed to do?

RudiC

09-07-2016

I have Awk/Gawk and Perl.

Maybe I did not explain myself clearly. What I need is to remove all glosses that have two or more words and retain only the single-word ones.

This implies a two-stage operation. In stage 1, at present, I use a regex to identify entries that contain such single-word glosses and store each matching entry in a separate file. However, the same entry may also contain glosses of more than one word:

Code:

अकर्त्तव्य a That which is not proper to be done; improper.

In stage 2 I write a second regex to identify every gloss that is delimited by

Code:

, ; .

and contains more than one word, and remove it, resulting in

Code:

अकर्त्तव्य a improper.
It works, but the two-stage operation is long and tedious, and I was wondering whether an awk or Perl script could do the trick.
Thanks a lot.

gimley
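
The two stages described above could in principle be combined into a single pass over the file. What follows is only a minimal sketch, written in Python rather than the awk or Perl asked for, and it rests on several assumptions that are not stated in the thread: the gloss is taken to start at the first ASCII capital letter on the line (everything before it is treated as headword plus part of speech), the leading words "to" and "A" are matched case-insensitively, surviving glosses are rejoined with a comma, and entries with no surviving gloss are dropped. It does not split glosses on "or", so an entry such as "Intransitive or neuter." is removed rather than rewritten. The function names and file arguments are hypothetical.

```python
import re

# Words that may precede a single-word gloss ("to" and "A" in the post).
# Matching them case-insensitively is an assumption.
ALLOWED_LEAD = {"to", "a"}


def is_single_word(gloss):
    """True for a one-word gloss, optionally preceded by "to" or "A"."""
    words = gloss.split()
    if len(words) == 1:
        return True
    return len(words) == 2 and words[0].lower() in ALLOWED_LEAD


def simplify_entry(line):
    """Return the entry without its multi-word glosses, or None if none survive."""
    line = line.strip()
    # Assumption: the gloss starts at the first ASCII capital letter;
    # the text before it is the headword plus the part of speech.
    match = re.search(r"[A-Z]", line)
    if match is None:
        return None
    prefix = line[:match.start()]
    body = line[match.start():]
    # Split on comma or semicolon, then trim spaces and the final period.
    glosses = [g.strip().rstrip(".").strip() for g in re.split(r"[,;]", body)]
    kept = [g for g in glosses if g and is_single_word(g)]
    if not kept:
        return None
    # Design choice: rejoin with ", " (the original ";" is not preserved).
    return prefix + ", ".join(kept) + "."


def simplify_file(src, dst):
    """Stream src line by line and write the simplified entries to dst."""
    with open(src, encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            result = simplify_entry(line)
            if result is not None:
                fout.write(result + "\n")
```

Because the file is read one line at a time, the size of the database should not matter much for memory use. The explicit `encoding="utf-8"` is there because the default encoding on Windows is often not UTF-8; it assumes the dictionary file is stored as UTF-8.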
