Duplicate identification using partial matches


 
# 1  
Old 12-11-2014

Hi,

I have a column of names. I want to find the names that match each other, either completely or partially, and list the matches in a separate column, like below.

Code:
Input		
Abc		
dbc		
abc xyz		
def		
bcd		
abc ggg	
xxx abc xxx

Output| Duplicate	
Abc|abc xyz
     |abc ggg
     |xxx abc xxx
dbc|	
def|	
bcd|

Thanks for your help!!
# 2  
Old 12-12-2014
What have you tried? I'm asking mainly to understand the intended logic, i.e. how the matching should work.

Anyway, after some guessing, I can offer a potential solution.

The awk code flow is as follows:
  1. Read file (first run)
    Read the file line by line. If a line contains only a single word, store it in array A; otherwise ignore the line. Each single word is stored as a pair: the word in all lowercase characters (key) and the word in its original format (value).
  2. Read file (second run)
    Read the file line by line, this time ignoring single-word lines. For each line with more than one word, check whether it contains any of the words stored in array A. The check is a case-insensitive regular-expression match, so a word also matches inside a longer word. On a match, store a "word in original format" | "whole line the word appears in" pair as a key in another array (B).

END section: print every key of array B, split each pair at the vertical bar, and store the word part in array D.
Then look up every word from array A in array D; if there is no match, print that word followed by a vertical bar.

The gsub calls simply delete any trailing horizontal tabs.
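
To see the tab stripping in isolation, here is a minimal sketch. The function name strip_trailing_tabs is made up for illustration and is not part of the solution. With awk's default field splitting, trailing tabs do not change NF, so the gsub mainly keeps the tabs out of the stored words and the printed lines.

```shell
# Hypothetical helper (not from the thread): strip trailing tabs, then show
# the cleaned line in brackets together with its field count.
strip_trailing_tabs() {
  awk '{ gsub(/\t+$/, "", $0); print "[" $0 "] NF=" NF }'
}

printf 'Abc\t\t\nabc xyz\t\t\n' | strip_trailing_tabs
# prints:
# [Abc] NF=1
# [abc xyz] NF=2
```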

Code:
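awk '
NR==FNR { gsub(/\t+$/,"",$0); if (NF==1) A[tolower($0)]=$0; next }  # run 1: store single words, keyed by lowercase form
{ gsub(/\t+$/,"",$0); if (NF==1) next                               # run 2: skip single-word lines
  for (i in A) if (tolower($0) ~ i) B[A[i]"|"$0]++ }                # remember "word|line" for every stored word found
END { for (b in B) { split(b,C,"|"); print b; D[C[1]]++ }           # print the pairs and note which words matched
      for (a in A) if (!(A[a] in D)) print A[a]"|" }                # words without any match get an empty column
' file file | sort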

Demo:
Code:
$ awk 'NR==FNR{ gsub(/\t+$/,"",$0); if (NF==1) A[tolower($0)]=$0; next }
{ gsub(/\t+$/,"",$0); if (NF==1) next; for (i in A) if (tolower($0) ~ i) B[A[i]"|"$0]++
} END { for (b in B) {split(b,C,"|"); print b; D[C[1]]++}
        for (a in A) if (!(A[a] in D)) print A[a]"|" }
' file file | sort
Abc|abc ggg
Abc|abc xyz
Abc|xxx abc xxx
bcd|
dbc|
def|
$
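
One caveat with the approach above: the stored word is used as a regular expression, so a name such as a.c would also match the line abc xyz. If the names should be compared literally, a possible variant (a sketch only, not tested beyond the small sample shown) swaps the regex match for index(), which does a plain substring search. The wrapper name find_partial_dups is hypothetical.

```shell
# Sketch of a variant: same two-pass flow as above, but index() performs a
# literal, case-insensitive substring test instead of a dynamic regex match.
# find_partial_dups is a hypothetical wrapper name, not from the thread.
find_partial_dups() {
  awk '
    NR==FNR { gsub(/\t+$/, "", $0); if (NF==1) A[tolower($0)]=$0; next }
    { gsub(/\t+$/, "", $0); if (NF==1) next
      for (i in A) if (index(tolower($0), i)) B[A[i] "|" $0]++ }
    END { for (b in B) { split(b, C, "|"); print b; D[C[1]]++ }
          for (a in A) if (!(A[a] in D)) print A[a] "|" }
  ' "$1" "$1" | sort
}

# Small made-up sample: "a.c" must not match "abc xyz".
sample=$(mktemp)
printf 'a.c\nabc xyz\na.c foo\n' > "$sample"
find_partial_dups "$sample"    # prints: a.c|a.c foo
rm -f "$sample"
```

On the sample input from the first post this should give the same six lines as the demo, since none of those names contain regex metacharacters.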


Last edited by rbatte1; 12-12-2014 at 11:38 AM.. Reason: Set up formatted list for clarity
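
The demo output repeats the matched name on every line and is sorted, while the first post asked for the name once, followed by indented continuation lines, in input order. A rough sketch of how that layout could be produced is below. It keeps the regex match from the solution above; the function name group_partial_dups, the helper arrays order and M, and the fixed five-space indent are assumptions made for illustration.

```shell
# Sketch only: group matches under each name, in input order.
# group_partial_dups, order[] and M[] are hypothetical names.
group_partial_dups() {
  awk '
    NR==FNR { gsub(/\t+$/, "", $0)
              key = tolower($0)
              # remember each single word once, in the order it appears
              if (NF==1 && !(key in A)) { A[key] = $0; order[++n] = key }
              next }
    { gsub(/\t+$/, "", $0); if (NF==1) next
      # append the line to the newline-separated match list of every word it contains
      for (k = 1; k <= n; k++)
        if (tolower($0) ~ order[k]) M[order[k]] = M[order[k]] "\n" $0 }
    END { for (k = 1; k <= n; k++) {
            cnt = split(substr(M[order[k]], 2), L, "\n")
            if (cnt == 0) { print A[order[k]] "|"; continue }
            print A[order[k]] "|" L[1]
            for (j = 2; j <= cnt; j++) print "     |" L[j]
          } }
  ' "$1" "$1"
}

# Usage: group_partial_dups file
```

No sort is needed here because the order array already preserves the input order.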