Regex to identify unique words in a dictionary database
Hello,
I have a dictionary which I am building for the open-source community. The data structure is as follows:
Code:
HEADWORD[S]=PARTOFSPEECH=ENGLISH MEANING
as shown in the example below
Code:
अ=m=Prefix signifying negation.
अँहँ=ind=Interjection expressing disapprobation.
अं=int=An interjection expressing contempt,unconcern,disbelief.
अंक=m=A figure;a mark.The thigh.An act of a play.
अंकगणित=n=Arithmetic.
अंकडा=m=A number,figure.A hook.
अंकडी=f=A pole with a hook at the extremity.
अंकणी=f=A ruler,a division.
अंकणें=vt=Mark;rule;sketch.
अंकन=n=Marking;numbering.
अंकपट्टी=f=A ticket,label showing price of an article.
अंकपाश=m=Permutations.
अंकमोडणी=f=Operations with figures.
अंकमोडी=f=Operations with figures.
अंकलिपि=f=Figure-writing.
अंकित असणें=vi=To be completely subject to the authority of.
अंकित=a=Marked;defined.Circumscribed,limited.
अंकी=a=Figured,numbered.
अंकुठित=a=Unhesitating;unstopped.
अंकुर दिसणें=vi=To give indications of future character.
अंकुर=m=A sprout.Germination.
अंकुश=m=An elephant goad.
अंकोल=m=Alangium Lamarku.
अंखणी=f=A ruler.Marking;a division.
अंखणे=f=A ruler.Marking;a division.
अंग असणें=E=To have a hand in.
अंग उडणें=vt=tremble,shiver,quake
अंग काढणें=E=Withdraw one's self from.
अंग चोरणें=E=Contract one's self,evade.
अंग चोरणें=vi=To spare one's self;to work lazily.
अंग झाडणें=E=Decline,disallow vehemently.
अंग टाकणें=E=Lose flesh;lie down carelessly.
अंग धरणें=vt=Be seized with rheumatic affection;gain flesh.
अंग धरणें=vt=To gain flesh,to have cramps.
अंग धुणें=n=Bathing,ablution-used by,of women.
अंग परिवर्तन=n=Turning over from one side to the other.
अंग मोडून काम करणें=E=To be unstinted in effort,work strenuously.
अंग मोडून येणें=E=Have the aching premonitory symptoms of fever.
अंग=n=Body.Limb.Side.Concern in.Ability.Support.
I need to identify only those entries whose headword is a single word, and not those whose headword is made up of more than one word.
Thus, these entries
Code:
अ=m=Prefix signifying negation.
अँहँ=ind=Interjection expressing disapprobation.
अं=int=An interjection expressing contempt,unconcern,disbelief.
अंक=m=A figure;a mark.The thigh.An act of a play.
अंकगणित=n=Arithmetic.
अंकडा=m=A number,figure.A hook.
अंकडी=f=A pole with a hook at the extremity.
अंकणी=f=A ruler,a division.
अंकणें=vt=Mark;rule;sketch.
अंकन=n=Marking;numbering.
अंकपट्टी=f=A ticket,label showing price of an article.
अंकपाश=m=Permutations.
अंकमोडणी=f=Operations with figures.
अंकमोडी=f=Operations with figures.
अंकलिपि=f=Figure-writing.
अंग=n=Body.Limb.Side.Concern in.Ability.Support.
should be identified, but the following entries, whose headwords contain more than one word, are not valid and should not be identified:
Code:
अंग असणें=E=To have a hand in.
अंग उडणें=vt=tremble,shiver,quake
अंग काढणें=E=Withdraw one's self from.
अंग चोरणें=E=Contract one's self,evade.
अंग चोरणें=vi=To spare one's self;to work lazily.
अंग झाडणें=E=Decline,disallow vehemently.
अंग टाकणें=E=Lose flesh;lie down carelessly.
अंग धरणें=vt=Be seized with rheumatic affection;gain flesh.
अंग धरणें=vt=To gain flesh,to have cramps.
अंग धुणें=n=Bathing,ablution-used by,of women.
अंग परिवर्तन=n=Turning over from one side to the other.
अंग मोडून काम करणें=E=To be unstinted in effort,work strenuously.
अंग मोडून येणें=E=Have the aching premonitory symptoms of fever.
A regex in Perl or Unix would be really useful. Thanks a lot.
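One possible approach, sketched here in Python rather than Perl: treat everything before the first `=` as the headword and accept the line only if that part contains no whitespace. The pattern `^[^\s=]+=` does this, and it should carry over largely unchanged to Perl or to `grep -E`. The function name `single_headword_entries` and the sample list are made up for illustration, and the sketch assumes every entry follows the HEADWORD=PARTOFSPEECH=MEANING layout shown above.

```python
import re

# The headword is everything before the first '='.
# Accept it only if it is non-empty and contains no whitespace.
SINGLE_HEADWORD = re.compile(r'^[^\s=]+=')

def single_headword_entries(lines):
    """Return the entries whose headword is a single word (hypothetical helper)."""
    return [line for line in lines if SINGLE_HEADWORD.match(line)]

# A few entries taken from the sample data in the question.
sample = [
    "अंक=m=A figure;a mark.The thigh.An act of a play.",
    "अंग असणें=E=To have a hand in.",
    "अंग मोडून काम करणें=E=To be unstinted in effort,work strenuously.",
    "अंग=n=Body.Limb.Side.Concern in.Ability.Support.",
]

for entry in single_headword_entries(sample):
    print(entry)
```

Spaces inside the English meaning do not matter, because the pattern stops at the first `=` and never looks at the rest of the line.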