

Delete duplicate like pattern lines

 
# 1  
Old 05-03-2017

Hi

I need to delete duplicate-like pattern lines from a text file, preferably using sed or awk. Duplicates occur in pairs only, one line being a subset (prefix) of the other, and only the longer line should be kept.

Input:
Code:
FM:Chicago:Development
FM:Chicago:Development:Score
SR:Cary:Testing:Testcases
PM:Newyork:Scripting
PM:Newyork:Scripting:Audit

Output:
Code:
FM:Chicago:Development:Score
SR:Cary:Testing:Testcases
PM:Newyork:Scripting:Audit

# 2  
Old 05-03-2017
Any attempts / ideas / thoughts from your side?
# 3  
Old 05-03-2017
Hi RudiC,

I tried the following command, but it isn't working. Thank you.

Code:
awk '!seen[$0]++' file

Moderator's Comments:
Mod Comment Please use CODE tags as required by forum rules!

Last edited by RudiC; 05-03-2017 at 05:14 AM.. Reason: Added CODE tags.
# 4  
Old 05-03-2017
No surprise this won't work, as it compares entire lines only. You want partial lines suppressed? Try
Code:
awk '
        {T[$0]                                  # collect every line as an array index
        }
END     {for (t1 in T)
           for (t2 in T)                        # t2 is used as an ERE here
             if (t1 ~ t2 && length(t1) != length(t2)) delete T[t2]
         for (t in T) print t                   # retrieval order is unspecified
        }
' file
PM:Newyork:Scripting:Audit
FM:Chicago:Development:Score
SR:Cary:Testing:Testcases

The order in which for (t in T) retrieves T's elements is unspecified; if you need e.g. the order of occurrence, additional measures need to be taken.
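If the order of occurrence does matter, one possible measure (a sketch of my own, assuming the input is in a file named file, not part of RudiC's posted solution) is a two-pass script: the first pass remembers every line, the second prints a line only if no longer remembered line starts with it. Using index() instead of ~ also avoids surprises should the data ever contain regex metacharacters.

```shell
# Sample input from the thread (assumption: file named "file")
cat > file <<'EOF'
FM:Chicago:Development
FM:Chicago:Development:Score
SR:Cary:Testing:Testcases
PM:Newyork:Scripting
PM:Newyork:Scripting:Audit
EOF

# Two passes over the same file: remember every line, then print a
# line only when no longer remembered line has it as a prefix.
awk '
NR == FNR  {T[$0]; next}                  # pass 1: remember every line
seen[$0]++ {next}                         # pass 2: skip exact repeats
           {for (t in T)                  # skip lines that are a prefix
              if (length(t) > length($0) && index(t, $0) == 1) next
            print
           }
' file file
```

This prints the surviving lines in their original input order, at the cost of reading the file twice.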
# 5  
Old 05-03-2017
For some reason I am getting the following error.

Code:
awk: {T[$0]}END{for (t1 in T)for(t2 in T)if(t1 ~ t2 && length(t1)!=length(t2))delete T[t2] for (t in T) print }
awk:                                                                                       ^ syntax error


Moderator's Comments:
Mod Comment Please use CODE tags as required by forum rules!

Last edited by RudiC; 05-03-2017 at 05:29 AM.. Reason: Added CODE tags.
# 6  
Old 05-03-2017
You just can't cast a multiline script onto a single line; at least a semicolon separator is needed in certain places (here, after delete T[t2], to terminate the nested loops before the final for).
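For reference, a corrected one-line form of that script (my sketch, assuming the input is in a file named file) adds the semicolon after delete T[t2] and restores the explicit print t, since a bare print inside END would print the last input record instead:

```shell
# Sample input from the thread (assumption: file named "file")
cat > file <<'EOF'
FM:Chicago:Development
FM:Chicago:Development:Score
SR:Cary:Testing:Testcases
PM:Newyork:Scripting
PM:Newyork:Scripting:Audit
EOF

# The semicolon after "delete T[t2]" ends the nested for loops;
# "print t" (not bare "print") emits each surviving line.
awk '{T[$0]} END {for (t1 in T) for (t2 in T) if (t1 ~ t2 && length(t1) != length(t2)) delete T[t2]; for (t in T) print t}' file
```

As with the multiline version, the output order is unspecified.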
# 7  
Old 05-03-2017
Hi.

This is similar in spirit to RudiC's solution. However, it uses a local version of uniq that includes several features beyond the system uniq. To treat the first two fields as a single comparison key, the first separator is changed to an underscore. It appears that the last of the duplicate lines is desired. The local utility does not require the file to be sorted. Here is the script:
Code:
#!/usr/bin/env bash

# @(#) s1       Demonstrate elimination of duplicate lines, local uniq.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C sed my-uniq dixf

FILE=${1-data1}
E=expected-output.txt

pl " Input data file $FILE:"
cat $FILE

pl " Input data file transform separator:"
sed 's/:/_/' $FILE |
tee t1

pl " Expected output:"
cat $E

pl " Results, re-transform separator:"
my-uniq --separator=":" --last --field=1 t1 |
sed 's/_/:/' |
tee f1

pl " Verify results if possible:"
C=$HOME/bin/pass-fail
[ -f $C ] && $C || ( pe; pe " Results cannot be verified." ) >&2

pl " Help in my-uniq:"
my-uniq -h

dixf my-uniq

exit 0

producing:
Code:
$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.7 (jessie) 
bash GNU bash 4.3.30
sed (GNU sed) 4.2.2
my-uniq (local) 1.11
dixf (local) 1.42

-----
 Input data file data1:
FM:Chicago:Development
FM:Chicago:Development:Score
SR:Cary:Testing:Testcases
PM:Newyork:Scripting
PM:Newyork:Scripting:Audit

-----
 Input data file transform separator:
FM_Chicago:Development
FM_Chicago:Development:Score
SR_Cary:Testing:Testcases
PM_Newyork:Scripting
PM_Newyork:Scripting:Audit

-----
 Expected output:
FM:Chicago:Development:Score
SR:Cary:Testing:Testcases
PM:Newyork:Scripting:Audit

-----
 Results, re-transform separator:
FM:Chicago:Development:Score
SR:Cary:Testing:Testcases
PM:Newyork:Scripting:Audit

-----
 Verify results if possible:

-----
 Comparison of 3 created lines with 3 lines of desired results:
 Succeeded -- files (computed) f1 and (standard) expected-output.txt have same content.

-----
 Help in my-uniq:

my-uniq - print or omit unique lines in non-sorted file

Synopsis

This code exists because one common requirement of a task is to
find (or omit) unique (or replicated) lines in a file, but also
to preserve the original order of the lines.  Standard versions
of "uniq" have usually required a sorted input file.

An additional common requirement is to consider only the content
of one field in each line rather than the entire line.  my-uniq
satisfies these requirements.

Usage: my-uniq options files

options:

--count
place count on each processed line, default is off.

--duplicate
print items that have more than one occurrence, default off.

--unique
print items that have only one occurrence, default is off.

--field=n
select a specific field, delimited by the separator, to be
used for the comparison, the default is the entire line.

--separator=string
choose an alternate separator, such as "|", or ",", the
default separator is "whitespace".

--last
allows over-writing, effectively keeping the most-recently
seen instance. Some versions of uniq on other *nix systems use
the most recent (Solaris), the default is compatibility with
GNU/Linux uniq, which keeps the first occurrence.

--quick
omit the operation that prints the lines in the order that
they were read. This prints according to a hash order,
therefore somewhat random -- a quick way to re-order a
file. This also requires less storage, a consideration for
large-volume files.

--help
print this and quit.

--version
print version number and quit.

my-uniq Like GNU/Linux uniq, but files need not be sorted. (what)
Path    : ~/bin/my-uniq
Version : 1.11
Length  : 282 lines
Type    : Perl script, ASCII text executable
Shebang : #!/usr/bin/perl
Help    : probably available with --help
Modules : (for perl codes)
 warnings       1.23
 strict 1.08
 Carp   1.3301
 Data::Dumper   2.151_01
 Getopt::Long   2.42

We often create work-alikes of system utilities, incorporating options that seem obviously useful (to us). We currently don't publish the code, but perhaps the documentation will help others develop similar tools for their shops.

I think the technique of joining fields could also be used with the system uniq. Which duplicate is kept would then depend on the OS in use.
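A related sketch that needs only the system sort and awk (my assumption, not drl's my-uniq): after an LC_ALL=C sort, each shorter line lands directly before any line it is a prefix of, so printing only the lines whose successor does not start with them keeps the longer duplicates. The output comes out in sorted rather than original order, and it assumes no unrelated line can sort between a prefix and its extension, which holds for this data:

```shell
# Sample input from the thread (assumption: file named "file")
cat > file <<'EOF'
FM:Chicago:Development
FM:Chicago:Development:Score
SR:Cary:Testing:Testcases
PM:Newyork:Scripting
PM:Newyork:Scripting:Audit
EOF

# Sort so a prefix immediately precedes its extensions, then drop
# every line that the following line starts with.
LC_ALL=C sort file |
awk 'NR > 1 && index($0, prev) != 1 {print prev}
     {prev = $0}
     END {print prev}'
```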

Best wishes ... cheers, drl

Last edited by drl; 05-03-2017 at 12:10 PM.. Reason: Correct minor typo (spelling).
