

Delete duplicate like pattern lines

 
# 1  
Old 05-03-2017

Hi

I need to delete duplicate-like pattern lines from a text file, preferably using sed or awk. Duplicates occur in pairs only, one line being a subset (prefix) of the other, and only the longer line should be kept.

Input:
Code:
FM:Chicago:Development
FM:Chicago:Development:Score
SR:Cary:Testing:Testcases
PM:Newyork:Scripting
PM:Newyork:Scripting:Audit

Output:
Code:
FM:Chicago:Development:Score
SR:Cary:Testing:Testcases
PM:Newyork:Scripting:Audit

# 2  
Old 05-03-2017
Any attempts / ideas / thoughts from your side?
# 3  
Old 05-03-2017
Hi RudiC,

I tried the following command, but it isn't working. Thank you.

Code:
awk '!seen[$0]++' file

Moderator's Comments:
Mod Comment Please use CODE tags as required by forum rules!

Last edited by RudiC; 05-03-2017 at 05:14 AM.. Reason: Added CODE tags.
# 4  
Old 05-03-2017
No surprise this won't work, as it compares entire lines only. You want partial lines suppressed? Try
Code:
awk '
        {T[$0]                                  # collect every line as an array index
        }
END     {for (t1 in T)
           for (t2 in T)                        # t2 is used as an ERE here
             if (t1 ~ t2 && length(t1) != length(t2)) delete T[t2]
         for (t in T) print t                   # retrieval order is unspecified
        }
' file
PM:Newyork:Scripting:Audit
FM:Chicago:Development:Score
SR:Cary:Testing:Testcases

The order in which for (t in T) retrieves T's elements is unspecified; if you need e.g. the order of occurrence, additional measures need to be taken.
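If the order of occurrence does matter, one possible measure (a sketch of my own, assuming the input is in a file named file, not part of RudiC's posted solution) is a two-pass script: the first pass remembers every line, the second prints a line only if no longer remembered line starts with it. Using index() instead of ~ also avoids surprises should the data ever contain regex metacharacters.

```shell
# Sample input from the thread (assumption: file named "file")
cat > file <<'EOF'
FM:Chicago:Development
FM:Chicago:Development:Score
SR:Cary:Testing:Testcases
PM:Newyork:Scripting
PM:Newyork:Scripting:Audit
EOF

# Two passes over the same file: remember every line, then print a
# line only when no longer remembered line has it as a prefix.
awk '
NR == FNR  {T[$0]; next}                  # pass 1: remember every line
seen[$0]++ {next}                         # pass 2: skip exact repeats
           {for (t in T)                  # skip lines that are a prefix
              if (length(t) > length($0) && index(t, $0) == 1) next
            print
           }
' file file
```

This prints the surviving lines in their original input order, at the cost of reading the file twice.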
# 5  
Old 05-03-2017
For some reason I am getting the following error.

Code:
awk: {T[$0]}END{for (t1 in T)for(t2 in T)if(t1 ~ t2 && length(t1)!=length(t2))delete T[t2] for (t in T) print }
awk:                                                                                       ^ syntax error


Moderator's Comments:
Mod Comment Please use CODE tags as required by forum rules!

Last edited by RudiC; 05-03-2017 at 05:29 AM.. Reason: Added CODE tags.
# 6  
Old 05-03-2017
You just can't cast a multiline script onto a single line; at least a semicolon separator is needed in certain places (here, after delete T[t2], to terminate the nested loops before the final for).
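For reference, a corrected one-line form of that script (my sketch, assuming the input is in a file named file) adds the semicolon after delete T[t2] and restores the explicit print t, since a bare print inside END would print the last input record instead:

```shell
# Sample input from the thread (assumption: file named "file")
cat > file <<'EOF'
FM:Chicago:Development
FM:Chicago:Development:Score
SR:Cary:Testing:Testcases
PM:Newyork:Scripting
PM:Newyork:Scripting:Audit
EOF

# The semicolon after "delete T[t2]" ends the nested for loops;
# "print t" (not bare "print") emits each surviving line.
awk '{T[$0]} END {for (t1 in T) for (t2 in T) if (t1 ~ t2 && length(t1) != length(t2)) delete T[t2]; for (t in T) print t}' file
```

As with the multiline version, the output order is unspecified.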
# 7  
Old 05-03-2017
Hi.

This is similar in spirit to RudiC's solution. However, it uses a local version of uniq that includes several features beyond the system uniq. To treat the first two fields as a single comparison key, the first separator is changed to an underscore. It appears that the last of the duplicate lines is desired. The local utility does not require the file to be sorted. Here is the script:
Code:
#!/usr/bin/env bash

# @(#) s1       Demonstrate elimination of duplicate lines, local uniq.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C sed my-uniq dixf

FILE=${1-data1}
E=expected-output.txt

pl " Input data file $FILE:"
cat $FILE

pl " Input data file transform separator:"
sed 's/:/_/' $FILE |
tee t1

pl " Expected output:"
cat $E

pl " Results, re-transform separator:"
my-uniq --separator=":" --last --field=1 t1 |
sed 's/_/:/' |
tee f1

pl " Verify results if possible:"
C=$HOME/bin/pass-fail
[ -f $C ] && $C || ( pe; pe " Results cannot be verified." ) >&2

pl " Help in my-uniq:"
my-uniq -h

dixf my-uniq

exit 0

producing:
Code:
$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.7 (jessie) 
bash GNU bash 4.3.30
sed (GNU sed) 4.2.2
my-uniq (local) 1.11
dixf (local) 1.42

-----
 Input data file data1:
FM:Chicago:Development
FM:Chicago:Development:Score
SR:Cary:Testing:Testcases
PM:Newyork:Scripting
PM:Newyork:Scripting:Audit

-----
 Input data file transform separator:
FM_Chicago:Development
FM_Chicago:Development:Score
SR_Cary:Testing:Testcases
PM_Newyork:Scripting
PM_Newyork:Scripting:Audit

-----
 Expected output:
FM:Chicago:Development:Score
SR:Cary:Testing:Testcases
PM:Newyork:Scripting:Audit

-----
 Results, re-transform separator:
FM:Chicago:Development:Score
SR:Cary:Testing:Testcases
PM:Newyork:Scripting:Audit

-----
 Verify results if possible:

-----
 Comparison of 3 created lines with 3 lines of desired results:
 Succeeded -- files (computed) f1 and (standard) expected-output.txt have same content.

-----
 Help in my-uniq:

my-uniq - print or omit unique lines in non-sorted file

Synopsis

This code exists because one common requirement of a task is to
find (or omit) unique (or replicated) lines in a file, but also
to preserve the original order of the lines.  Standard versions
of "uniq" have usually required a sorted input file.

An additional common requirement is to consider only the content
of one field in each line rather than the entire line.  my-uniq
satisfies these requirements.

Usage: my-uniq options files

options:

--count
place count on each processed line, default is off.

--duplicate
print items that have more than one occurrence, default off.

--unique
print items that have only one occurrence, default is off.

--field=n
select a specific field, delimited by the separator, to be
used for the comparison, the default is the entire line.

--separator=string
choose an alternate separator, such as "|", or ",", the
default separator is "whitespace".

--last
allows over-writing, effectively keeping the most-recently
seen instance. Some versions of uniq on other *nix systems use
the most recent (Solaris), the default is compatibility with
GNU/Linux uniq, which keeps the first occurrence.

--quick
omit the operation that prints the lines in the order that
they were read. This prints according to a hash order,
therefore somewhat random -- a quick way to re-order a
file. This also requires less storage, a consideration for
large-volume files.

--help
print this and quit.

--version
print version number and quit.

my-uniq Like GNU/Linux uniq, but files need not be sorted. (what)
Path    : ~/bin/my-uniq
Version : 1.11
Length  : 282 lines
Type    : Perl script, ASCII text executable
Shebang : #!/usr/bin/perl
Help    : probably available with --help
Modules : (for perl codes)
 warnings       1.23
 strict 1.08
 Carp   1.3301
 Data::Dumper   2.151_01
 Getopt::Long   2.42

We often create work-alikes of system utilities, incorporating options that seem obviously useful (to us). We currently don't publish the code, but perhaps the documentation will help others develop similar tools for their shops.

I think the technique of joining fields could also be used with the system uniq. Which duplicate is kept would then depend on the OS in use.
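A related sketch that needs only the system sort and awk (my assumption, not drl's my-uniq): after an LC_ALL=C sort, each shorter line lands directly before any line it is a prefix of, so printing only the lines whose successor does not start with them keeps the longer duplicates. The output comes out in sorted rather than original order, and it assumes no unrelated line can sort between a prefix and its extension, which holds for this data:

```shell
# Sample input from the thread (assumption: file named "file")
cat > file <<'EOF'
FM:Chicago:Development
FM:Chicago:Development:Score
SR:Cary:Testing:Testcases
PM:Newyork:Scripting
PM:Newyork:Scripting:Audit
EOF

# Sort so a prefix immediately precedes its extensions, then drop
# every line that the following line starts with.
LC_ALL=C sort file |
awk 'NR > 1 && index($0, prev) != 1 {print prev}
     {prev = $0}
     END {print prev}'
```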

Best wishes ... cheers, drl

Last edited by drl; 05-03-2017 at 12:10 PM.. Reason: Correct minor typo (spelling).
