Identifying dupes within a database and creating unique sub-sets
Hello,
I have a database of name variants with the following structure:
The number of variants can be as many as thirty to forty.
Since the database is quite large (at present around 60,000 lines) duplicate sets of variants creep in. Thus
and some hundred lines on
What I need is a script (Perl or AWK, since I work under Windows) which could do the following:
1. Identify such duplicates. Thus in the example above
is a duplicate entry
2. Merge both entries into one single entry:
3. Clean up the dupe(s) and provide one single set of Unique name variants.
The script, I am sure, would also prove useful for others who face similar problems of duplication in their databases.
I am giving below a pseudo example as input:
The expected output would be:
Many thanks in advance for your help.
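The three steps requested above (identify duplicates, merge them, output unique sets) can be sketched roughly as follows. This is only a minimal sketch in Python, not the Perl or AWK script asked for, and it rests on assumptions that the post does not confirm: each line is taken to hold one set of variants, the separator (`=` here) is a hypothetical choice, and two lines are treated as duplicates when they share at least one variant. The function name `merge_variant_sets` is made up for illustration.

```python
def merge_variant_sets(lines, sep="="):
    """Merge entries that share at least one variant into single entries.

    Assumptions (not from the original post): one variant set per line,
    variants separated by `sep`, and "duplicate" means "shares a variant".
    """
    parent = {}  # union-find forest: variant -> parent variant
    order = []   # variants in first-seen order, to keep output stable

    def find(x):
        # Walk up to the representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for line in lines:
        variants = [v.strip() for v in line.split(sep) if v.strip()]
        for v in variants:
            if v not in parent:
                parent[v] = v
                order.append(v)
        # Link every variant on this line to the first one on the line.
        for v in variants[1:]:
            root_a, root_b = find(variants[0]), find(v)
            if root_a != root_b:
                parent[root_b] = root_a

    # Collect variants by representative; each variant appears exactly once.
    groups = {}
    for v in order:
        groups.setdefault(find(v), []).append(v)
    return [sep.join(group) for group in groups.values()]
```

Reading the file line by line and passing the stripped lines to `merge_variant_sets` would give one merged line per group. Because merging is transitive, a line that overlaps two earlier groups joins them into one, which may or may not be what is wanted for name variants.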
Many thanks. It was pretty fast. Zipped through 20,000 lines in a few seconds. I doubt that there are any issues, since I tested the output file for dupes and there were none.
Many thanks.
Hello,
I have a dictionary which I am building for the Open Source Community. The data structure is as under
HEADWORD=PARTOFSPEECH=ENGLISH MEANING
as shown in the example below
अ=m=Prefix signifying negation.
अँहँ=ind=Interjection expressing disapprobation.
अं=int=An interjection... (2 Replies)
I am reworking a Marathi-English dictionary to be released as open source. My dictionary has the headword in Marathi, followed by its part of speech and then by its English glosses, as in the examples below:
अकरसणें v i To contract, shrink.
अकरा a Eleven.
अकराळ a Frightful, terrible.
विकराळ... (2 Replies)
Dear all,
I have a large dictionary database which has the following structure
source word=target word
e.g.
book=livre
Since the database is very large, it so happens that, in spite of all the care taken, the source word is at times repeated
e.g.
book=livre
book=tome
Since I want to... (7 Replies)
I'm trying to solve the below problem for a number:
Enter a number; if all of its digits are unique, print "unique number", otherwise print "non-unique number".
Eg:
Input=123; Output=unique number
Input=112; Output=Non-unique number
What I tried is splitting the number into digits by using the % operator... (2 Replies)
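The digit-splitting idea mentioned above can be sketched like this. It is a minimal sketch in Python (the thread excerpt does not say which language was used), and the function name `has_unique_digits` is hypothetical. It peels off digits with `%` and remembers the ones already seen.

```python
def has_unique_digits(n):
    """Return True if no decimal digit of n occurs more than once."""
    n = abs(n)      # ignore a leading minus sign
    seen = set()    # digits encountered so far
    while True:
        digit = n % 10          # last digit
        if digit in seen:
            return False        # repeated digit found
        seen.add(digit)
        n //= 10                # drop the last digit
        if n == 0:
            return True         # all digits checked, none repeated


for number in (123, 112):
    label = "unique number" if has_unique_digits(number) else "Non-unique number"
    print(f"Input={number}; Output={label}")
```

The loop runs at least once so that an input of 0 is treated as a single digit.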
I am compiling a synonym dictionary which has the following structure
Headword=Synonym1,Synonym2 and so on, with each synonym separated by a comma.
As is usual in such cases, manual preparation of synonyms leads to synonyms being repeated, which results in dupes as in the example below:... (3 Replies)
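For the `Headword=Synonym1,Synonym2` structure described above, removing repeated synonyms within one entry might look like the following. This is a minimal sketch in Python under stated assumptions: the first `=` separates the headword from the synonyms, comparison is exact (case-sensitive) after trimming spaces, and the first occurrence of each synonym is kept. The function name `dedupe_synonyms` is hypothetical.

```python
def dedupe_synonyms(line):
    """Drop repeated synonyms from one 'Headword=Syn1,Syn2,...' entry."""
    headword, sep, rest = line.partition("=")
    if not sep:
        return line  # no '=' found: leave the line untouched
    seen = set()
    unique = []
    for synonym in rest.split(","):
        synonym = synonym.strip()
        if synonym and synonym not in seen:
            seen.add(synonym)
            unique.append(synonym)  # keep first occurrence, preserve order
    return headword + "=" + ",".join(unique)
```

Applying this to every line of the file leaves the headwords untouched and only shortens the synonym lists. It does not merge entries whose headword is repeated on several lines.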
Hi all,
This is a basic question.
I have read many books which advise avoiding the creation of subshells.
e.g. use wc -l < filename
rather than cat file | wc -l.
So, how can I identify whether a command creates a subshell or not?
So, is it better to use tail -n+1 file instead of using cat.... (3 Replies)
This may sound like a trivial problem, but I still need some help:
I have a file with ids and I want to split it 'n' ways (could be any number) into files:
1
1
1
2
2
3
3
4
5
5
Let's assume 'n' is 3, and we cannot have the same id in two different partitions. So the partitions may... (8 Replies)
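One way to meet the constraint above (all occurrences of an id stay in the same partition) is to count the ids first and then hand out whole groups. The following is a minimal sketch in Python. The greedy "largest group to the currently smallest partition" rule is an assumption chosen to keep the partitions roughly even, not something the thread states, and `partition_ids` is a hypothetical name.

```python
from collections import Counter


def partition_ids(ids, n):
    """Split ids into n lists so that equal ids always share a list."""
    counts = Counter(ids)                 # id -> number of occurrences
    partitions = [[] for _ in range(n)]
    # Place the largest groups first; each goes to the shortest partition.
    for id_, count in sorted(counts.items(), key=lambda kv: -kv[1]):
        smallest = min(partitions, key=len)
        smallest.extend([id_] * count)
    return partitions
```

With the ten ids listed above and `n` set to 3, each id ends up in exactly one of the three lists. Writing each list to its own file is then a simple loop. The greedy rule does not guarantee the most even split possible.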
timbass
Sat, 28 Jul 2007 10:07:53 +0000
Originally posted in Yahoo! CEP-Interest
Here is my follow-up note on posets (partially ordered sets) and tosets (totally or linearly ordered sets) as background set theory for event processing, and in particular CEP and ESP.
In my last note, we... (0 Replies)
Hi,
How do you actually create a unique ID on a distributed system? I looked at gethostid but the man page says that it's not guaranteed to be unique. Also, using the IP address does not seem to be a feasible solution. Is there a function call or mechanism by which this is possible when even the... (4 Replies)