Removing duplicates except the last occurrence


# 1  
Old 11-05-2014
Removing duplicates except the last occurrence

Hi All,

I have a file like below:
Code:
@DB_FCTS\src\Data\Scripts\Delete_CU_OM_BIL_PRT_STMT_TYP.sql
@DB_FCTS\src\Data\Scripts\Delete_CDP_BILL_LBL_MSG.sql
@DB_FCTS\src\Data\Scripts\Delete_OM_BIDDR.sql
@DB_FCTS\src\Data\Scripts\Insert_CU_OM_LBL_MSG.sql
@DB_FCTS\src\Data\Scripts\Insert_CU_OM_BT_STMT_TYP.sql
@DB_FCTS\src\Data\Scripts\Insert_OM_BIL_T_ADDR.sql
@DB_FCTS\src\Data\Scripts\Delete_CU_OM_BIL_PRT_STMT_TYP.sql
@DB_FCTS\src\Scripts\MC400_PreDb_Script.sql
@DB_FCTS\src\Data\Scripts\Delete_OM_BIDDR.sql
@DB_FCTS\src\Data\Scripts\Delete_OM_BIDDR.sql

In the file, line 1 is repeated at line 7, and line 3 is repeated at lines 9 and 10.

My requirement is to remove the duplicate lines and keep only the last occurrence of each.

The output should be like below:
Code:
@DB_FCTS\src\Data\Scripts\Delete_CDP_BILL_LBL_MSG.sql
@DB_FCTS\src\Data\Scripts\Insert_CU_OM_LBL_MSG.sql
@DB_FCTS\src\Data\Scripts\Insert_CU_OM_BT_STMT_TYP.sql
@DB_FCTS\src\Data\Scripts\Insert_OM_BIL_T_ADDR.sql
@DB_FCTS\src\Data\Scripts\Delete_CU_OM_BIL_PRT_STMT_TYP.sql
@DB_FCTS\src\Scripts\MC400_PreDb_Script.sql
@DB_FCTS\src\Data\Scripts\Delete_OM_BIDDR.sql

My environment details:

SunOS sasbsd27c1 5.10 Generic_150400-10 sun4u sparc SUNW,SPARC-Enterprise

Please suggest a script to achieve this; I have been trying since morning, but nothing works.

Thanks in advance

Last edited by vbe; 11-05-2014 at 11:27 AM. Reason: code tags
# 2  
Old 11-05-2014
Please use code tags as required by forum rules!

How about looking into existing solutions on this site first? For example:
Removing duplicates
Help in removing duplicates
removing duplicates.

etc ...
# 3  
Old 11-05-2014
The suggested solutions remove subsequent duplicates, keeping the first instance.
The requirement here, keeping the last instance, is more complex.
A compact Perl solution records each line's last input line number in %s and, at end-of-file, prints the distinct lines sorted by that number:
Code:
perl -ne '$s{$_}=++$i; if (eof()){print sort {$s{$a}<=>$s{$b}} keys %s}' file

Another is an awk | sort | cut pipeline: awk records the last line number of each distinct line and, at the end, prints each line tagged with that number; sort -n restores the input order and cut removes the tag:
Code:
awk '{ 
      x[$0] = NR
     }
 END {
      for ( l in x ) printf "%d\t%s\n", x[l], l
     }' file | sort -n | cut -f2-

Another, less efficient, solution would be tac | awk 'remove subsequent duplicates' | tac:
Code:
tac file | awk '!($0 in S) {print; S[$0]}' | tac

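If tac is unavailable (it comes from GNU coreutils, and a stock Solaris install may not have it), the reversal can be done with the classic sed hold-space idiom. A minimal sketch, assuming the file fits in memory, since sed accumulates all lines in its hold space:
Code:
# reverse, keep first occurrence (= last in original order), reverse back
sed -n '1!G;h;$p' file | awk '!($0 in S) {print; S[$0]}' | sed -n '1!G;h;$p'
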
This User Gave Thanks to MadeInGermany For This Post:
# 4  
Old 11-05-2014
Doing it entirely in awk isn't that hard (on Solaris use /usr/xpg4/bin/awk or nawk, as the old /usr/bin/awk predates constructs like delete and the in operator):
Code:
/usr/xpg4/bin/awk '
$0 in N {	# line seen before: forget the earlier occurrence
	delete O[N[$0]]
}
{	N[$0] = NR	# last line number seen for this text
	O[NR] = $0	# text stored under its line number
}
END {	for(i = 1; i <= NR; i++)
		if(i in O)	# only last occurrences remain
			print O[i]
}' file

This User Gave Thanks to Don Cragun For This Post:
# 5  
Old 11-05-2014
And also, by reading the file twice: the first pass records the last line number of each distinct line, and the second pass prints a line only when its line number matches that record. Note the file name is given twice, so this needs a regular file; a single-pass variant for streams is sketched after the code:
Code:
awk 'NR==FNR{L[$0]=FNR; next} L[$0]==FNR' infile infile

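Since that reads the file twice, it works only on a regular file. If the input arrives on a pipe and can be read just once, the same idea works in a single pass by buffering the lines; a minimal sketch, holding the whole input in memory:
Code:
awk '{ last[$0] = NR; line[NR] = $0 }   # record last occurrence; keep every line
 END { for (i = 1; i <= NR; i++)        # replay in input order,
           if (last[line[i]] == i)      # printing only last occurrences
               print line[i]
     }' file
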
This User Gave Thanks to Scrutinizer For This Post:
# 6  
Old 11-06-2014
Hi.

If you were to run out of memory, you could use tac file | awk '!($0 in S) {print; S[$0]}' | tac, posted by MadeInGermany.

Similar code in shell, with the filename in the variable FILE:
Code:
nl $FILE |
tee f1 |
sort -k 2 -k 1,1rn |
tee f2 |
uniq --skip-fields=1 |
tee f3 |
sort -k 1,1n |
tee f4 |
sed 's/^.*\t//'

Line numbers are added by nl; the body is then sorted on the line text, with the line number as a secondary reverse-numeric key, so that within each group of identical lines the last occurrence comes first. GNU uniq allows the number field to be skipped, keeping only that first (i.e., last-occurring) copy; a final numeric sort restores the original order, after which the line number is stripped. Before stripping, this looks like (a POSIX-only variant is sketched after the sample):
Code:
     2	@DB_FCTS\src\Data\Scripts\Delete_CDP_BILL_LBL_MSG.sql
     4	@DB_FCTS\src\Data\Scripts\Insert_CU_OM_LBL_MSG.sql
     5	@DB_FCTS\src\Data\Scripts\Insert_CU_OM_BT_STMT_TYP.sql
     6	@DB_FCTS\src\Data\Scripts\Insert_OM_BIL_T_ADDR.sql
     7	@DB_FCTS\src\Data\Scripts\Delete_CU_OM_BIL_PRT_STMT_TYP.sql
     8	@DB_FCTS\src\Scripts\MC400_PreDb_Script.sql
    10	@DB_FCTS\src\Data\Scripts\Delete_OM_BIDDR.sql
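
The --skip-fields long option is GNU-specific; on the poster's Solaris box the same pipeline should be expressible with POSIX options only (uniq -f 1 is the POSIX spelling of --skip-fields=1). A sketch, untested on Solaris, with the tee taps removed:
Code:
nl "$FILE" |
sort -k 2 -k 1,1rn |   # group identical lines, last occurrence first in each group
uniq -f 1 |            # keep the first of each group, i.e. the last occurrence
sort -k 1,1n |         # restore the original order by line number
cut -f 2-              # strip the line-number tag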

Pipelines are useful for doing large-granularity parallel computing, and the pipes themselves are just memory buffers (usually 64 KiB), so they never touch the disk; the tee stages above, which do write files f1 through f4, are there only so the intermediate results can be inspected, and can be dropped.

I have run across some uniq versions that keep the most recent version of a duplicate (Solaris, if memory serves).

This was done on:
Code:
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian 5.0.8 (lenny, workstation) 
bash GNU bash 3.2.39
nl (GNU coreutils) 6.10
sort (GNU coreutils) 6.10
uniq (GNU coreutils) 6.10
sed GNU sed version 4.1.5

Best wishes ... cheers, drl
# 7  
Old 11-06-2014
If there are dupes, why does it matter which one is kept, the first or the last?