Regex to identify unique words in a dictionary database


 
# 1  
Old 02-21-2017

Hello,
I have a dictionary which I am building for the Open Source Community. The data structure is as follows:
Code:
HEADWORD[S]=PARTOFSPEECH=ENGLISH MEANING

as shown in the example below:
Code:
अ=m=Prefix signifying negation.
अँहँ=ind=Interjection expressing disapprobation.
अं=int=An interjection expressing contempt,unconcern,disbelief.
अंक=m=A figure;a mark.The thigh.An act of a play.
अंकगणित=n=Arithmetic.
अंकडा=m=A number,figure.A hook.
अंकडी=f=A pole with a hook at the extremity.
अंकणी=f=A ruler,a division.
अंकणें=vt=Mark;rule;sketch.
अंकन=n=Marking;numbering.
अंकपट्टी=f=A ticket,label showing price of an article.
अंकपाश=m=Permutations.
अंकमोडणी=f=Operations with figures.
अंकमोडी=f=Operations with figures.
अंकलिपि=f=Figure-writing.
अंकित असणें=vi=To be completely subject to the authority of.
अंकित=a=Marked;defined.Circumscribed,limited.
अंकी=a=Figured,numbered.
अंकुठित=a=Unhesitating;unstopped.
अंकुर दिसणें=vi=To give indications of future character.
अंकुर=m=A sprout.Germination.
अंकुश=m=An elephant goad.
अंकोल=m=Alangium Lamarku.
अंखणी=f=A ruler.Marking;a division.
अंखणे=f=A ruler.Marking;a division.
अंग असणें=E=To have a hand in.
अंग उडणें=vt=tremble,shiver,quake
अंग काढणें=E=Withdraw one's self from.
अंग चोरणें=E=Contract one's self,evade.
अंग चोरणें=vi=To spare one's self;to work lazily.
अंग झाडणें=E=Decline,disallow vehemently.
अंग टाकणें=E=Lose flesh;lie down carelessly.
अंग धरणें=vt=Be seized with rheumatic affection;gain flesh.
अंग धरणें=vt=To gain flesh,to have cramps.
अंग धुणें=n=Bathing,ablution-used by,of women.
अंग परिवर्तन=n=Turning over from one side to the other.
अंग मोडून काम करणें=E=To be unstinted in effort,work strenuously.
अंग मोडून येणें=E=Have the aching premonitory symptoms of fever.
अंग=n=Body.Limb.Side.Concern in.Ability.Support.

I need to identify only those entries whose headword is a single word, not those whose headword is made up of more than one word.
Thus
Code:
अ=m=Prefix signifying negation.
अँहँ=ind=Interjection expressing disapprobation.
अं=int=An interjection expressing contempt,unconcern,disbelief.
अंक=m=A figure;a mark.The thigh.An act of a play.
अंकगणित=n=Arithmetic.
अंकडा=m=A number,figure.A hook.
अंकडी=f=A pole with a hook at the extremity.
अंकणी=f=A ruler,a division.
अंकणें=vt=Mark;rule;sketch.
अंकन=n=Marking;numbering.
अंकपट्टी=f=A ticket,label showing price of an article.
अंकपाश=m=Permutations.
अंकमोडणी=f=Operations with figures.
अंकमोडी=f=Operations with figures.
अंकलिपि=f=Figure-writing.
अंग=n=Body.Limb.Side.Concern in.Ability.Support.

should be identified, but the following, which have more than one word in the headword, are not valid and should not be identified:
Code:
अंग असणें=E=To have a hand in.
अंग उडणें=vt=tremble,shiver,quake
अंग काढणें=E=Withdraw one's self from.
अंग चोरणें=E=Contract one's self,evade.
अंग चोरणें=vi=To spare one's self;to work lazily.
अंग झाडणें=E=Decline,disallow vehemently.
अंग टाकणें=E=Lose flesh;lie down carelessly.
अंग धरणें=vt=Be seized with rheumatic affection;gain flesh.
अंग धरणें=vt=To gain flesh,to have cramps.
अंग धुणें=n=Bathing,ablution-used by,of women.
अंग परिवर्तन=n=Turning over from one side to the other.
अंग मोडून काम करणें=E=To be unstinted in effort,work strenuously.
अंग मोडून येणें=E=Have the aching premonitory symptoms of fever.

A regex in Perl or Unix would be really useful. Thanks a lot.
# 2  
Old 02-21-2017
The regular expression (RE) you need depends on what tool you're using and what you want the RE to do.

If you were using awk and wanted an ERE to select lines from your file that just have one headword, you might try:
Code:
awk '/^[^ =]*=/' file

As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
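To spell out how the pattern above works: ^ anchors the match at the start of the line, [^ =]* consumes any run of characters that are neither a space nor =, and the final = must then be the first separator on the line. A headword containing a space therefore cannot match. Below is a minimal sketch of that idea; the function names single_headwords and multi_headwords and the sample variable are made up for illustration, and the sample lines are copied from the first post.

```shell
# Keep only entries whose headword (the text before the first "=") has no space.
single_headwords() { awk '/^[^ =]*=/'; }

# Complement: entries whose headword contains a space. Note that the negation
# also passes through any line with no "=" at all, such as a blank line.
multi_headwords() { awk '!/^[^ =]*=/'; }

# Illustrative sample taken from the data in the first post.
sample='अंक=m=A figure;a mark.The thigh.An act of a play.
अंग असणें=E=To have a hand in.
अंग=n=Body.Limb.Side.Concern in.Ability.Support.
अंग मोडून येणें=E=Have the aching premonitory symptoms of fever.'

# Print the single-headword entries from the sample.
printf '%s\n' "$sample" | single_headwords
```

On these four sample lines only the अंक and अंग entries are printed; piping the same input through multi_headwords should give the other two.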
# 3  
Old 02-21-2017
Many thanks. I am sorry, I should have specified that I work in a Windows environment.
Playing around with the awk regex, I found that it worked beautifully as a Unix regex. My text editor lets me choose Unix/Perl as the regex flavour. All I had to do was strip off the forward slashes, and I could identify all the singletons.
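For anyone who wants the slash-free pattern outside an editor, here is a rough sketch that passes it to awk as a plain string and prints just the headword column. The names pattern and list_single_headwords are hypothetical, chosen only for this example, and the two input lines come from the sample data in the first post.

```shell
# The pattern from the awk answer, without the enclosing slashes.
pattern='^[^ =]*='

# Print only the headword (first "="-separated field) of single-word entries.
# list_single_headwords is a hypothetical helper that reads from stdin.
list_single_headwords() {
    awk -F= -v re="$pattern" '$0 ~ re { print $1 }'
}

printf '%s\n' 'अंकन=n=Marking;numbering.' 'अंग उडणें=vt=tremble,shiver,quake' |
    list_single_headwords
```

Only the first of the two input lines has a single-word headword, so only अंकन is printed.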