Removing duplicates except the last occurrence


# 1  
Old 11-05-2014
Removing duplicates except the last occurrence

Hi All,

I have a file like below:
Code:
@DB_FCTS\src\Data\Scripts\Delete_CU_OM_BIL_PRT_STMT_TYP.sql
@DB_FCTS\src\Data\Scripts\Delete_CDP_BILL_LBL_MSG.sql
@DB_FCTS\src\Data\Scripts\Delete_OM_BIDDR.sql
@DB_FCTS\src\Data\Scripts\Insert_CU_OM_LBL_MSG.sql
@DB_FCTS\src\Data\Scripts\Insert_CU_OM_BT_STMT_TYP.sql
@DB_FCTS\src\Data\Scripts\Insert_OM_BIL_T_ADDR.sql
@DB_FCTS\src\Data\Scripts\Delete_CU_OM_BIL_PRT_STMT_TYP.sql
@DB_FCTS\src\Scripts\MC400_PreDb_Script.sql
@DB_FCTS\src\Data\Scripts\Delete_OM_BIDDR.sql
@DB_FCTS\src\Data\Scripts\Delete_OM_BIDDR.sql

In the file, line 1 is repeated at line 7, and line 3 is repeated at lines 9 and 10.

My requirement is to remove the duplicate lines and keep only the last occurrence of each.

The output should be like below:
Code:
@DB_FCTS\src\Data\Scripts\Delete_CDP_BILL_LBL_MSG.sql
@DB_FCTS\src\Data\Scripts\Insert_CU_OM_LBL_MSG.sql
@DB_FCTS\src\Data\Scripts\Insert_CU_OM_BT_STMT_TYP.sql
@DB_FCTS\src\Data\Scripts\Insert_OM_BIL_T_ADDR.sql
@DB_FCTS\src\Data\Scripts\Delete_CU_OM_BIL_PRT_STMT_TYP.sql
@DB_FCTS\src\Scripts\MC400_PreDb_Script.sql
@DB_FCTS\src\Data\Scripts\Delete_OM_BIDDR.sql

My environment details:

SunOS sasbsd27c1 5.10 Generic_150400-10 sun4u sparc SUNW,SPARC-Enterprise

Please suggest a script to achieve this; I have been trying since morning, but nothing works.

Thanks in advance

Last edited by vbe; 11-05-2014 at 11:27 AM. Reason: code tags
# 2  
Old 11-05-2014
Please use code tags as required by forum rules!

How about looking into existing solutions on this site first? For example:
Removing duplicates
Help in removing duplicates
removing duplicates.

etc ...
# 3  
Old 11-05-2014
The suggested solutions remove subsequent duplicates, keeping the first instance.
The requirement here, keeping the last instance, is more complex.
A compact Perl solution records each line's last input line number in %s and, at end-of-file, prints the distinct lines sorted by that number:
Code:
perl -ne '$s{$_}=++$i; if (eof()){print sort {$s{$a}<=>$s{$b}} keys %s}' file

Another is an awk | sort | cut pipeline: awk records the last line number of each distinct line and, at the end, prints each line tagged with that number; sort -n restores the input order and cut removes the tag:
Code:
awk '{ 
      x[$0] = NR
     }
 END {
      for ( l in x ) printf "%d\t%s\n", x[l], l
     }' file | sort -n | cut -f2-

Another, less efficient, solution would be tac | awk 'remove subsequent duplicates' | tac:
Code:
tac file | awk '!($0 in S) {print; S[$0]}' | tac

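If tac is unavailable (it comes from GNU coreutils, and a stock Solaris install may not have it), the reversal can be done with the classic sed hold-space idiom. A minimal sketch, assuming the file fits in memory, since sed accumulates all lines in its hold space:
Code:
# reverse, keep first occurrence (= last in original order), reverse back
sed -n '1!G;h;$p' file | awk '!($0 in S) {print; S[$0]}' | sed -n '1!G;h;$p'
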
This User Gave Thanks to MadeInGermany For This Post:
# 4  
Old 11-05-2014
Doing it entirely in awk isn't that hard (on Solaris use /usr/xpg4/bin/awk or nawk, as the old /usr/bin/awk predates constructs like delete and the in operator):
Code:
/usr/xpg4/bin/awk '
$0 in N {	# line seen before: forget the earlier occurrence
	delete O[N[$0]]
}
{	N[$0] = NR	# last line number seen for this text
	O[NR] = $0	# text stored under its line number
}
END {	for(i = 1; i <= NR; i++)
		if(i in O)	# only last occurrences remain
			print O[i]
}' file

This User Gave Thanks to Don Cragun For This Post:
# 5  
Old 11-05-2014
And also, by reading the file twice: the first pass records the last line number of each distinct line, and the second pass prints a line only when its line number matches that record. Note the file name is given twice, so this needs a regular file; a single-pass variant for streams is sketched after the code:
Code:
awk 'NR==FNR{L[$0]=FNR; next} L[$0]==FNR' infile infile

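Since that reads the file twice, it works only on a regular file. If the input arrives on a pipe and can be read just once, the same idea works in a single pass by buffering the lines; a minimal sketch, holding the whole input in memory:
Code:
awk '{ last[$0] = NR; line[NR] = $0 }   # record last occurrence; keep every line
 END { for (i = 1; i <= NR; i++)        # replay in input order,
           if (last[line[i]] == i)      # printing only last occurrences
               print line[i]
     }' file
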
This User Gave Thanks to Scrutinizer For This Post:
# 6  
Old 11-06-2014
Hi.

If you were to run out of memory, you could use tac file | awk '!($0 in S) {print; S[$0]}' | tac, posted by MadeInGermany.

Similar code in shell, with the filename in the variable FILE:
Code:
nl $FILE |
tee f1 |
sort -k 2 -k 1,1rn |
tee f2 |
uniq --skip-fields=1 |
tee f3 |
sort -k 1,1n |
tee f4 |
sed 's/^.*\t//'

Line numbers are added by nl; the body is then sorted on the line text, with the line number as a secondary reverse-numeric key, so that within each group of identical lines the last occurrence comes first. GNU uniq allows the number field to be skipped, keeping only that first (i.e., last-occurring) copy; a final numeric sort restores the original order, after which the line number is stripped. Before stripping, this looks like (a POSIX-only variant is sketched after the sample):
Code:
     2	@DB_FCTS\src\Data\Scripts\Delete_CDP_BILL_LBL_MSG.sql
     4	@DB_FCTS\src\Data\Scripts\Insert_CU_OM_LBL_MSG.sql
     5	@DB_FCTS\src\Data\Scripts\Insert_CU_OM_BT_STMT_TYP.sql
     6	@DB_FCTS\src\Data\Scripts\Insert_OM_BIL_T_ADDR.sql
     7	@DB_FCTS\src\Data\Scripts\Delete_CU_OM_BIL_PRT_STMT_TYP.sql
     8	@DB_FCTS\src\Scripts\MC400_PreDb_Script.sql
    10	@DB_FCTS\src\Data\Scripts\Delete_OM_BIDDR.sql
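
The --skip-fields long option is GNU-specific; on the poster's Solaris box the same pipeline should be expressible with POSIX options only (uniq -f 1 is the POSIX spelling of --skip-fields=1). A sketch, untested on Solaris, with the tee taps removed:
Code:
nl "$FILE" |
sort -k 2 -k 1,1rn |   # group identical lines, last occurrence first in each group
uniq -f 1 |            # keep the first of each group, i.e. the last occurrence
sort -k 1,1n |         # restore the original order by line number
cut -f 2-              # strip the line-number tag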

Pipelines are useful for doing large-granularity parallel computing, and the pipes themselves are just memory buffers (usually 64 KiB), so they never touch the disk; the tee stages above, which do write files f1 through f4, are there only so the intermediate results can be inspected, and can be dropped.

I have run across some uniq versions that keep the most recent version of a duplicate (Solaris, if memory serves).

This was done on:
Code:
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian 5.0.8 (lenny, workstation) 
bash GNU bash 3.2.39
nl (GNU coreutils) 6.10
sort (GNU coreutils) 6.10
uniq (GNU coreutils) 6.10
sed GNU sed version 4.1.5

Best wishes ... cheers, drl
# 7  
Old 11-06-2014
If there are dupes, why does it matter which one is kept, the first or the last?