Identifying dupes within a database and creating unique sub-sets
Post 302879975 by Chubler_XL on Monday 16th of December 2013 10:01:56 PM
You could try this, but I'm not sure how quick it will be:

Code:
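awk '
# Return list (keys joined by "=") with repeated keys dropped, first occurrence kept.
function remove_dups(list,    have, num, keys, i, new) {
    have[""]                       # pre-seed so that empty fields are skipped
    num = split(list, keys, "=")
    for (i = 1; i <= num; i++) {
        if (!(keys[i] in have)) new = new "=" keys[i]
        have[keys[i]]
    }
    return substr(new, 2)          # strip the leading "="
}
# Fold one input line into List, absorbing every stored set that shares a key with it.
function merge(list,    num, keys, i, new, master) {
    new = remove_dups(list)
    num = split(new, keys, "=")
    master = keys[1]               # the first key of the line names the merged set
    for (i = 1; i <= num; i++)
        if (keys[i] in Found) {
            new = remove_dups(List[Found[keys[i]]] "=" new)
            delete List[Found[keys[i]]]
        }
    num = split(new, keys, "=")
    List[master] = new
    for (i = 1; i <= num; i++) Found[keys[i]] = master   # key -> name of owning set
}
{ merge($0) }
END { for (l in List) print List[l] }' infile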
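
To see what the script does, here is a small test run. The sample lines are made up for illustration (the real database is not shown in this post); the only assumption is one set of "="-separated keys per line. The output is piped through sort only because the order of `for (l in List)` is unspecified in awk.

```shell
#!/bin/sh
# Hypothetical sample data: each line is one set of "="-separated keys.
infile=$(mktemp)
printf '%s\n' 'a=b' 'c=d' 'b=c' 'e=f' 'e=f=e' > "$infile"

# Same awk program as in the post, run against the sample file.
result=$(awk '
function remove_dups(list, have, num, keys, i, new) {
    have[""]
    num=split(list, keys, "=")
    for(i=1;i<=num;i++) {
       if(!(keys[i] in have)) new=new "=" keys[i]
       have[keys[i]]
    }
    return substr(new,2)
}
function merge(list, num, keys, i, new, master) {
   new=remove_dups(list)
   num=split(new, keys, "=")
   master=keys[1]
   for(i=1;i<=num;i++)
      if(keys[i] in Found) {
          new = remove_dups(List[Found[keys[i]]] "=" new)
          delete List[Found[keys[i]]]
      }
   num=split(new, keys, "=")
   List[master]=new
   for(i=1;i<=num;i++) Found[keys[i]]=master
}
{merge($0)}
END { for (l in List) print List[l] }' "$infile" | sort)

rm -f "$infile"
printf '%s\n' "$result"
# Prints:
# c=d=a=b
# e=f
```

The line b=c links the first two sets, so a, b, c and d end up in one sub-set, while e=f=e collapses to e=f. Key order inside a merged line follows the order in which the sets were absorbed, so it may not match the input order.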


Last edited by Chubler_XL; 12-16-2013 at 11:22 PM. Reason: Standardise variable names
 
