Duplicate identification using partial matches: Post 302928524 by junior-helper on Friday 12th of December 2014 10:27:27 AM
What have you tried? I'm asking mainly so that I can understand the logic, i.e. how it is supposed to work.

Anyway, after some guessing, I can offer a possible solution.

The awk code flow is as follows:
  1. Read file (first run)
    Read the file line by line. If a line contains only a single word, store it in array A; otherwise ignore the line. Each single word is stored as a pair: "word in all lowercase characters" (key) - "word in original format" (value).
  2. Read file (second run)
    Read the file line by line again, this time skipping lines with only a single word. For each line with more than one word, check whether any word stored in array A occurs in it (case-insensitively, partial matches included). If so, add an entry to a second array (B), stored as a "word in original format" - "whole line the word appears in" pair joined by a vertical bar.

END section: print every entry of array B, split each entry at the vertical bar, and store the word part in array D.
Then look up each word from array A in array D; if there is no match, print that word followed by a vertical bar.

The gsub calls simply delete any trailing horizontal tabs.
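
In case the two "runs" look odd: the file name is passed to awk twice, and NR==FNR is only true while the first copy is being read (NR counts all records read so far, FNR restarts at 1 for each file argument). A minimal sketch of just that idiom, using a throwaway sample file whose contents are my own assumption:

```shell
# Hypothetical three-line sample file, only for illustration.
tmp=$(mktemp)
printf 'one\ntwo words\nthree\n' > "$tmp"

# First pass (NR==FNR): count the lines holding a single word.
# Second pass: print every multi-word line together with that count.
passes=$(awk 'NR==FNR { if (NF==1) singles++; next }
NF>1 { print singles " single-word lines found; multi-word line: " $0 }' "$tmp" "$tmp")

echo "$passes"
rm -f "$tmp"
```

The count is already complete when the second pass starts, which is the same reason array A is fully populated before any multi-word line is checked.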

Code:
awk 'NR==FNR{ gsub(/\t+$/,"",$0); if (NF==1) A[tolower($0)]=$0; next }
{ gsub(/\t+$/,"",$0); if (NF==1) next; for (i in A) if (tolower($0) ~ i) B[A[i]"|"$0]++
} END { for (b in B) {split(b,C,"|"); print b; D[C[1]]++}
        for (a in A) if (!(A[a] in D)) print A[a]"|" }
' file file | sort
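
One caveat with tolower($0) ~ i: the stored word is used as a regular expression, so a word containing characters such as . or + may match differently than a plain text comparison would. If that matters for your data, a literal substring test with index() is a possible swap. The sketch below only demonstrates the difference; the sample strings are made up:

```shell
# Made-up strings: the stored word "a.c" and a line that does not literally contain it.
word='a.c'
line='xxx abc xxx'

# Regex match: the dot matches any character, so "a.c" matches "abc".
regex_hit=$(awk -v w="$word" -v l="$line" 'BEGIN { print ((l ~ w) ? "match" : "no match") }')

# Literal substring match: index() returns 0 when w does not occur in l.
literal_hit=$(awk -v w="$word" -v l="$line" 'BEGIN { print ((index(l, w) > 0) ? "match" : "no match") }')

echo "regex:   $regex_hit"
echo "literal: $literal_hit"
```

In the code above that would mean replacing tolower($0) ~ i with index(tolower($0), i). Partial matches are still found either way.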

Demo:
Code:
$ awk 'NR==FNR{ gsub(/\t+$/,"",$0); if (NF==1) A[tolower($0)]=$0; next }
{ gsub(/\t+$/,"",$0); if (NF==1) next; for (i in A) if (tolower($0) ~ i) B[A[i]"|"$0]++
} END { for (b in B) {split(b,C,"|"); print b; D[C[1]]++}
        for (a in A) if (!(A[a] in D)) print A[a]"|" }
' file file | sort
Abc|abc ggg
Abc|abc xyz
Abc|xxx abc xxx
bcd|
dbc|
def|
$
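
The input file is not shown in the demo. A file along the following lines (my guess, reconstructed from the output, not the original poster's data) should reproduce it. LC_ALL=C is only there to make the sort order predictable:

```shell
# Assumed input, reconstructed from the demo output; not the original data.
file=$(mktemp)
printf '%s\n' 'Abc' 'bcd' 'dbc' 'def' 'abc ggg' 'abc xyz' 'xxx abc xxx' > "$file"

# Same awk program as in the demo, run against the assumed file.
result=$(awk 'NR==FNR{ gsub(/\t+$/,"",$0); if (NF==1) A[tolower($0)]=$0; next }
{ gsub(/\t+$/,"",$0); if (NF==1) next; for (i in A) if (tolower($0) ~ i) B[A[i]"|"$0]++
} END { for (b in B) {split(b,C,"|"); print b; D[C[1]]++}
        for (a in A) if (!(A[a] in D)) print A[a]"|" }
' "$file" "$file" | LC_ALL=C sort)

echo "$result"
rm -f "$file"
```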


Last edited by rbatte1; 12-12-2014 at 11:38 AM.. Reason: Set up formatted list for clarity
 
