Reducing multiple entries in a tri-lingual dictionary to single entries


# 1  

Dear all,
I am editing a tri-lingual dictionary for open-source release. Each entry has the following structure:
Code:
English headwords<Tab>Devanagari headwords<Tab>Perso-Arabic headwords

as in the example below
Code:
to mark, to number	अंगणु	(اَنگَڻُ)

The English headword field sometimes holds more than one headword, separated by commas or semicolons, as in the example above and the one below:
Code:
number, numeral, digit; limb; component; tear in cloth	अंगु	(اَنگُ)

For purposes of editing the dictionary, I need to reduce the multiple headwords in English to single entries. Thus the data in the example above would be reduced to the following entries:
Code:
number	अंगु	(اَنگُ)
numeral	अंगु	(اَنگُ)
digit	अंगु	(اَنگُ)
limb	अंगु	(اَنگُ)
component	अंगु	(اَنگُ)
tear in cloth	अंगु	(اَنگُ)

I could handle two mappings, but I do not know how to handle such a complicated data structure in Perl or Awk. Any help would be gratefully acknowledged.
I should add that I work under Windows Vista and Windows 7, so Linux-only solutions unfortunately do not help.

I am providing a sample for testing below:
Code:
suddenly,unexpectedly	अचानकि	(اَچانَڪِ)
surprise,wonder	अचिरजु	(اَچرجُ)
to wonder	  अचिरजु खाइणु	(اَچرجُ کائِڻُ)
to get surprised	  अचिरजु लॻणु	(اَچرجُ لَڳَڻُ)
to wonder	  अचिरजु थियणु	(اَچرجُ ٿِيَڻُ)
to be surprised	  अचिरज में पवणु	(اَچرج ۾ پَوَڻُ)
unfailing,unerring;sure	अचूकु	(اَچوُڪُ)
inanimate	अचेतनु	(اَچيتَنُ)
unconsciousness,senselessness	अचेताई	(اَچيتائيِ)
unconscious,senseless	अचेतु	(اَچيتُ)
a flood of water;a body of clouds;a large desolated plain	अछ	(اَڇَ)
to be flooded due to heavy rain	  अछ थियणु	(اَڇَ ٿِيَڻُ)
the hair to turn grey,to become old	अछाइणु	(اَڇائِڻُ)
whiteness,clearness	अछाणि	(ِاَڇاڻ)
untouchable	अछूतु	(اَڇوُتُ)
untouched;unpolluted	अछूतो	(اَڇوُتو)
whitish	अछेरो	(اَڇيرو)
white;clean	अछो	(اَڇو)
neat and clean	  अछो उजिरो	(اَڇو اُجِرو)
to age,to become old,to turn into grey hair 	  अछो मथो थियणु	(اَڇو مَٿو ٿِيَڻُ)
to disgrace,to make ashamed	  अछो मुंहुं करणु	(اَڇو مُنهُن ڪَرڻُ)
to be disgraced,to do a shameful act	  अछो मुंहुं थियणु	(اَڇو مُنهنُ ٿِيَڻُ)
to respect the elderliness,to have regard for an old person	  अछनि ॾे ॾिसणु	(اَڇنِ ڏي ڏِسَڻُ)
to be spoilt in the old age	  अछनि में खरणु	(اَڇن ۾ کَرَڻُ)
to be exposed,truth to be known	  अछा कारा पधिरा थियणु	(اَڇا ڪارا پَڌِرا ٿِيَڻُ)
to turn into grey hair ,to become old	  अछा पवणु	(اَڇا پَوَڻُ)
to gain without much effort	  अछा बि तांहिरी थियणु	(اَڇا بِە تانهِريِ ٿِيَڻُ)
to do shameful act in the old age	  अछा लॼाइणु	(اَڇا لَڄائِڻُ)
to enter a false amount in the account	  अछे ते कारो लिखणु	(اَڇي تي ڪارو لِکَڻُ)
python,dragon	अजगरु	(اَجگَرُ)
stranger,unknown person	अजनबी	(اَجنَبيِ)
wonderful,surprising	अजबाइतो	(اَجَبائِتو)
wonder,astonishment	अजबु	(عَجَبُ)
to be surprised	  अजबु लॻणु	(عَجَبُ لَڳَڻُ)
to wonder	  अजबु खाइणु	(عَجَبُ کائِڻُ)
not liable to decay or old age	अजरु	(اَجَرُ)
to live forever,to be immortal	  अजरु  अमरु थियणु	(اَجَرُ اَمَرُ ٿيڻُ)
death,the appointed hour of death	अजलु	(اَجَلُ)
disgrace,infamy,dishonor	अजसु	(اَجسُ)
museum	अजाइबघरु	(عَجائِب گهَرُ)
unnecessary,useless	अजायो	(اَجايو)
a kind of fancy coloured sheet or shawl worn over shoulder	अजिरक	(اَجِرڪَ)
strange,wonderful,surprising	अजीबु	(عَجيِبُ)
very strange,awkward	अजीबो ग़रीबु	(عَجيِب و غَريِبُ)
wonder	अजूबो	(عَجوُبو)
unsuitability	अजोॻाई	(اَجوڳائيِ)
improper,unsuitable	अजोॻो	(اَجوڳو)
unknown,unacquainted,ignorant	अॼाणु	(اَڄاڻُ)
today	अॼु	(اَڄُ)
to complete a work in time	  अॼु जो कमु सुभाणे ते न रखणु	(اَڄُ جو ڪَمُ سُڀاڻي تي نەَ رکَڻُ)

# 2  
Hi, try:
Code:
awk '{n=split($1,F,/[,;]/); for(i=1; i<=n; i++) print F[i],$2,$3}' FS='\t' OFS='\t' file

--edit--
This will work on Linux / Unix. I just noticed that it needs to work under Windows.

I can't help you much there. I know there can be quoting issues on the Windows command line, and maybe CR/LF-related issues as well.

Perhaps you could put the script in a file and execute that:

keyword_split.awk:
Code:
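# Split comma/semicolon-separated English headwords into one entry per line.
BEGIN {
  FS=OFS="\t"                # fields are tab-separated on input and output
}
{
  n=split($1,F,/[,;]/)       # split the English field on commas and semicolons
  for(i=1; i<=n; i++) print F[i],$2,$3   # repeat the other two fields for each headword
}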

And execute with
Code:
awk -f keyword_split.awk file

Or use Cygwin or some other Unix-like environment on Windows...
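
Two details from the question are not covered by the script above: the first example has spaces after the separators ("number, numeral, digit; limb"), and files edited on Windows often end each line with CR/LF. Below is a minimal sketch that handles both, assuming a POSIX-style awk and a Unix-like shell such as the one Cygwin provides. The function name split_headwords is made up for illustration and does not appear in the thread.

```sh
# Hypothetical helper (not from the thread): one English headword per output line.
split_headwords() {
  awk '
    BEGIN { FS = OFS = "\t" }
    {
      sub(/\r$/, "")                        # drop a trailing CR from Windows line endings
      n = split($1, F, /[,;]/)              # split the English field on commas and semicolons
      for (i = 1; i <= n; i++) {
        gsub(/^[ \t]+|[ \t]+$/, "", F[i])   # trim blanks around each headword
        if (F[i] != "") print F[i], $2, $3  # skip empty pieces such as "a,,b"
      }
    }
  ' "$@"
}
```

It can be called as split_headwords file, or fed through a pipe. On a plain Windows command prompt, the awk program between the single quotes could instead be saved to a file and run with awk -f, in the same way as keyword_split.awk.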

Last edited by Scrutinizer; 04-24-2015 at 01:45 PM..
# 3  
Many thanks. It worked perfectly. The tri-lingual dictionary was generated very well. Thanks also for noting that the delimiter could be a semicolon; I had missed that.