Duplicate identification using partial matches
Post 302928524 by junior-helper on Friday 12th of December 2014 10:27:27 AM
What have you tried? I'm asking mainly so that I can understand the intended logic, i.e. how the matching is supposed to work.

Anyway, after some guesswork, I can offer a possible solution.

The awk code flow is as follows:
  1. Read file (first run)
    Read the file line by line. If a line contains only a single word, store it in array A; otherwise ignore the line. Each single word is stored as a pair: the word in all-lowercase characters as the key, the word in its original form as the value.
  2. Read file (second run)
    Read the file line by line again, this time ignoring single-word lines. For each line with more than one word, check whether any word stored in array A occurs in it. If so, record the pair "word in original form" | "whole line the word appears in" as a key in another array (B).

END section: print every entry of array B, split each entry at the vertical bar, and store the word part in array D.
Then look up every word from array A in array D; if there is no match, print that word followed by a vertical bar.
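
The two runs work because the same file is named twice on the command line: NR==FNR is true only while awk is reading the first copy. A minimal sketch of just that idiom, assuming a throwaway file called sample.txt (hypothetical, not the original poster's data):

```shell
# Hypothetical sample data: two single-word lines and one multi-word line.
tmp=$(mktemp -d)
printf 'Abc\nabc xyz\ndef\n' > "$tmp/sample.txt"

# NR counts lines across all inputs, FNR restarts at 1 for each file,
# so NR==FNR holds only during the first pass over the file.
passes=$(awk 'NR==FNR { print "pass 1: " $0; next }
                      { print "pass 2: " $0 }' "$tmp/sample.txt" "$tmp/sample.txt")
printf '%s\n' "$passes"
rm -rf "$tmp"
```

Every line is printed once with each label. Note that the idiom assumes the first file is not empty; with an empty first file, NR==FNR would also be true for the second one.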

The gsub calls simply strip any trailing horizontal tabs from each line.
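
With default field splitting, trailing tabs do not change NF, but they would otherwise stay in $0 and therefore end up in the stored word and in the printed line. A small sketch of the trimming on its own, using a made-up input line:

```shell
# "Abc" followed by two tab characters: 5 characters before trimming, 3 after.
trimmed=$(printf 'Abc\t\t\n' |
  awk '{ before = length($0); gsub(/\t+$/, "", $0); print before, length($0) }')
printf '%s\n' "$trimmed"
```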

Code:
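awk '
# First run: remember every single-word line, keyed by its lowercase form.
NR==FNR { gsub(/\t+$/,"",$0); if (NF==1) A[tolower($0)]=$0; next }
# Second run: skip single-word lines; record "word|line" for each stored word found in the line.
{ gsub(/\t+$/,"",$0); if (NF==1) next
  for (i in A) if (tolower($0) ~ i) B[A[i]"|"$0]++ }
# END: print all pairs, then every stored word that matched no line.
END { for (b in B) { split(b,C,"|"); print b; D[C[1]]++ }
      for (a in A) if (!(A[a] in D)) print A[a]"|" }
' file file | sort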
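
One thing to be aware of: the test tolower($0) ~ i treats each stored word as a regular expression, so a word containing characters such as . or * would match more than its literal text. If literal substring matching is what is wanted (an assumption on my part), index() can be used instead. The sketch below compares both on hypothetical data; match_lines and meta.txt are names made up for this illustration, and the trailing-tab handling is left out for brevity.

```shell
# Hypothetical data: the stored word "a.c" contains a regex metacharacter.
tmp=$(mktemp -d)
printf 'a.c\nabc xyz\nxx a.c yy\n' > "$tmp/meta.txt"

# Same two-pass structure; the comparison in the second pass is selectable:
# mode=regex uses the "~" test, mode=literal uses index().
match_lines() {
  awk -v mode="$1" '
    NR==FNR { if (NF==1) A[tolower($0)]=$0; next }
    NF > 1  { line = tolower($0)
              for (i in A) {
                hit = (mode == "regex") ? (line ~ i) : (index(line, i) > 0)
                if (hit) print A[i] "|" $0
              } }
  ' "$tmp/meta.txt" "$tmp/meta.txt" | sort
}

regex_hits=$(match_lines regex)
literal_hits=$(match_lines literal)
printf 'regex:\n%s\nliteral:\n%s\n' "$regex_hits" "$literal_hits"
rm -rf "$tmp"
```

In regex mode the word a.c also matches the line "abc xyz", because the dot stands for any character; in literal mode only "xx a.c yy" is reported.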

Demo:
Code:
$ awk 'NR==FNR{ gsub(/\t+$/,"",$0); if (NF==1) A[tolower($0)]=$0; next }
{ gsub(/\t+$/,"",$0); if (NF==1) next; for (i in A) if (tolower($0) ~ i) B[A[i]"|"$0]++
} END { for (b in B) {split(b,C,"|"); print b; D[C[1]]++}
        for (a in A) if (!(A[a] in D)) print A[a]"|" }
' file file | sort
Abc|abc ggg
Abc|abc xyz
Abc|xxx abc xxx
bcd|
dbc|
def|
$
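
The input file is not shown in this post. To try the command yourself, the following sketch builds a hypothetical input that is consistent with the demo output above (it is a reconstruction, not the original poster's file) and runs the same awk program against it. LC_ALL=C is added only to pin the sort order.

```shell
# Hypothetical input, reconstructed to be consistent with the demo output;
# it is not the original poster's file.
tmp=$(mktemp -d)
cat > "$tmp/file" <<'EOF'
Abc
abc xyz
bcd
xxx abc xxx
def
abc ggg
dbc
EOF

result=$(awk 'NR==FNR{ gsub(/\t+$/,"",$0); if (NF==1) A[tolower($0)]=$0; next }
{ gsub(/\t+$/,"",$0); if (NF==1) next; for (i in A) if (tolower($0) ~ i) B[A[i]"|"$0]++
} END { for (b in B) {split(b,C,"|"); print b; D[C[1]]++}
        for (a in A) if (!(A[a] in D)) print A[a]"|" }
' "$tmp/file" "$tmp/file" | LC_ALL=C sort)
printf '%s\n' "$result"
rm -rf "$tmp"
```

With this input the pipeline prints the same six lines as in the demo: three Abc| pairs, followed by bcd|, dbc| and def| for the words that matched no line.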


Last edited by rbatte1; 12-12-2014 at 11:38 AM.. Reason: Set up formatted list for clarity
 
