Deleting duplicate glosses in a dictionary entry


 
# 1  
Old 08-19-2013

I am working on an Urdu to Hindi dictionary and I have created the following file structure:
Code:
Headword=Gloss1,Gloss2,Gloss3

i.e. glosses delimited by a comma.

In some entries (more than 6,000 in a file of over 200,000), the glosses are duplicated.
Since this is likely to recur, could a macro or a script check the glosses on the right-hand side of each entry and, where there are duplicates, remove them so that each gloss appears only once?
An example will make this clear:
Input
Code:
a=b,c,b
d=p,q,p
e=z,y,g,z,g,y

The expected output would be:
Code:
a=b,c
d=p,q
e=g,y,z

In case live data is needed, here is a sample:
Code:
آبادِیوں=आबादिओं,आबादियों
آبادی=जनसंख्या,आबादी
آبجیکشن=ऑबजेक्शन,ऑब्जेक्शन
آبلا=अबला,उबला
آبو=आबू,आबो
آتشک=आतशक,आतिशक
آتم=आतम,आतम,आत्म,आत्म
آتون=आतून,आतोन
آتیں=आतीं,आतें,आतें,आतीं
آجا=आ जा,आजा
آجاتی=आ जाती,आजाती
آجانا=आ जाना,आजाना
آجکل=आज कल,आजकल
آخری=अंतिम,आख़री
آد=आद,आद,आदि

An Awk or Perl script would be of help. I am on Windows Vista and have no access to Unix.
I tried the following script posted on the site, but it does not give the expected results:
Code:
{
for (I=1;I<NF;I++)
{
for (J=I+1;J<=NF;J++)
{
if ($I == $J ) { print $I": " $0 }
}
}
}

Many thanks
# 2  
Old 08-19-2013
Here's a Perl program, though I couldn't test it with the actual data (Urdu and Hindi characters). It works for ASCII input (a=b,c,b, etc.).
Code:
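#! /usr/bin/perl

use warnings;
use strict;

my ($line, @lr, %hindi_words);
# Three-argument open with an error check; file.txt is the dictionary file.
open(my $in, '<', 'file.txt') or die "Cannot open file.txt: $!";
while ($line = <$in>) {
    chomp ($line);
    %hindi_words = ();                  # start each entry with an empty set
    @lr = split ('=', $line, 2);        # headword, then the comma-separated glosses
    unless (defined $lr[1]) {           # no '=' on this line: pass it through unchanged
        print "$line\n";
        next;
    }
    for (split(',', $lr[1])) {
        $hindi_words{$_} = 1;           # hash keys keep one copy of each gloss
    }
    # keys() returns the glosses in no particular order.
    print "$lr[0]=", join(',', keys(%hindi_words)), "\n";
}
close $in;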

By the way, this program compares glosses as exact strings, so spelling variants like आबादिओं,आबादियों or आज कल,आजकल or ऑबजेक्शन,ऑब्जेक्शन are treated as different glosses and are all kept.
# 3  
Old 08-19-2013
Code:
awk -F "[=,]" '{delete a
                printf "%s=", $1
                for (i=2;i<=NF;i++) a[$i]
                for (i in a) printf "%s,", i
                printf "\n"}' infile |sed 's/,$//'
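
Both programs above take the glosses out of a hash, so the order of the surviving glosses is not guaranteed. If the order matters, a variant that keeps each gloss at its first occurrence could look like the sketch below. The function name `dedupe_glosses` is only an illustrative assumption, and the sketch assumes that neither headwords nor glosses contain `=` or `,`. Note that it prints `e=z,y,g` for the third sample line, not the sorted `e=g,y,z` shown in the question; sorting would need an extra step.

```shell
# dedupe_glosses: hypothetical helper name, not part of the original thread.
# Reads Headword=Gloss1,Gloss2,... lines from the given files (or stdin) and
# prints each line with repeated glosses removed, keeping first-occurrence order.
dedupe_glosses() {
    awk -F'[=,]' '{
        split("", seen)                 # portable way to empty the array for each line
        out = $1 "="                    # headword is the first field
        sep = ""
        for (i = 2; i <= NF; i++) {
            if (!($i in seen)) {        # first time this gloss appears on the line
                seen[$i] = 1
                out = out sep $i
                sep = ","
            }
        }
        print out
    }' "$@"
}

# Usage with the sample input from the question:
printf 'a=b,c,b\nd=p,q,p\ne=z,y,g,z,g,y\n' | dedupe_glosses
```

Because the separator is added only between glosses, no trailing comma is produced and the `sed` step is not needed.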

# 4  
Old 08-19-2013
Many thanks. The programs worked beautifully. I hope someone else will also find the programs useful.