Find and remove duplicate record and print list


 
# 8  
Old 11-07-2012
Use this.

Just change $1 to whichever column you want to check for duplicates.

Code:
awk '{if(!X[$1]++){print > "clean.txt"}else{print > "remove.txt"}}' file

for multiple columns

Code:
if(!X[$2,$3]++)
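
For example (a sketch, assuming the duplicate key is in fields 2 and 3), the full command would be:

Code:
awk '{if(!X[$2,$3]++){print > "clean.txt"}else{print > "remove.txt"}}' file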

Hope this helps you.

pamu
# 9  
Old 11-07-2012
Hi Pamu

I tried:
Code:
awk '{if(!X[$1,$2]++){print > "clean.txt"}else{print > "remove.txt"}}' file

but I got the error: X[ Event not found?

Please advise,

Thanks

---------- Post updated at 03:17 AM ---------- Previous update was at 02:08 AM ----------

Dear Pamu

I would like to always keep the last value found; it looks like the code always keeps the first one. Please advise.

---------- Post updated at 03:54 AM ---------- Previous update was at 03:17 AM ----------

Please help me to get the two output files as I show below. The objective is to delete the duplicate records, always keeping the last one.

The columns where the duplicates are located are $2 and $3 (character columns 2-25), and they have an index identity (character column 26). For example, these values appear in the input file:

Code:
S3355.0            7275.01               0     418624.2 2588774.2 156.0311   224
S3355.0            7275.02               0     418376.0 2588775.5 156.6311   733
S3355.0            7275.03               0     418875.1 2588777.6 156.4311  1034

Therefore I should get

file # 1 cleaned.txt
Code:
S3355.0            7275.03               0     418875.1 2588777.6 156.4311  1034


file # 2 removed.txt
Code:
S3355.0            7275.01               0     418624.2 2588774.2 156.0311   224
S3355.0            7275.02               0     418376.0 2588775.5 156.6311   733


Below is the complete input file:
Code:
S3033.0            7305.01               0     420123.8 2580723.8 151.9311    18
S3035.0            7305.01               0     420123.3 2580773.9 151.6311   130
S3355.0            7275.01               0     418624.2 2588774.2 156.0311   224
S3353.0            7275.01               0     418624.5 2588726.2 156.3311   336
S3033.0            7305.02               0     418623.9 2588674.7 156.8311   430
S3349.0            7275.01               0     418623.1 2588627.5 157.0311   542
S3349.0            7280.01               0     418874.1 2588631.6 156.0311   657
S3355.0            7275.02               0     418376.0 2588775.5 156.6311   733
S3349.0            7280.02               0     418874.4 2588677.4 156.1311   809
S3353.0            7270.01               0     418375.3 2588718.0 156.9311   846
S3353.0            7280.01               0     418874.8 2588727.6 156.3311   922
S3351.0            7270.01               0     418375.6 2588675.5 157.3311   958
S3355.0            7275.03               0     418875.1 2588777.6 156.4311  1034


Desired Output 2 files
file # 1 cleaned.txt

Code:
S3035.0            7305.01               0     420123.3 2580773.9 151.6311   130
S3033.0            7305.02               0     418623.9 2588674.7 156.8311   430
S3349.0            7275.01               0     418623.1 2588627.5 157.0311   542
S3349.0            7280.02               0     418874.4 2588677.4 156.1311   809
S3353.0            7270.01               0     418375.3 2588718.0 156.9311   846
S3353.0            7280.01               0     418874.8 2588727.6 156.3311   922
S3351.0            7270.01               0     418375.6 2588675.5 157.3311   958
S3355.0            7275.03               0     418875.1 2588777.6 156.4311  1034

file # 2 removed.txt

Code:
S3033.0            7305.01               0     420123.8 2580723.8 151.9311    18
S3355.0            7275.01               0     418624.2 2588774.2 156.0311   224
S3353.0            7275.01               0     418624.5 2588726.2 156.3311   336
S3349.0            7280.01               0     418874.1 2588631.6 156.0311   657
S3355.0            7275.02               0     418376.0 2588775.5 156.6311   733

Thanks in advance.
# 10  
Old 11-07-2012
Try this..

Code:
awk '{if(!X[$1]){X[$1]=$0}else{print X[$1] > "remove.txt";X[$1]=$0}}END{for(i in X)print X[i] >"clean.txt"}' file

You may need to sort the output later.
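
For readability, the same keep-last logic can be written as a multi-line awk script (only a sketch of the one-liner above, using the in operator to test for a previously seen key, with $1 as the duplicate key):

Code:
awk '
{
    if (!($1 in X)) {
        # first occurrence of this key: just remember the line
        X[$1] = $0
    } else {
        # key seen before: the previously stored line is now a duplicate,
        # send it to remove.txt and remember the newer line instead
        print X[$1] > "remove.txt"
        X[$1] = $0
    }
}
END {
    # whatever remains in X is the last occurrence of each key
    for (i in X) print X[i] > "clean.txt"
}' file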

Quote:
Originally Posted by jiam912
The columns where the duplicates are located are $2 and $3 (character columns 2-25), and they have an index identity (character column 26),
I don't think there are any duplicates in $2 and $3 in the first sample, and there are only 7 columns.
# 11  
Old 11-07-2012
Dear Pamu

The duplicate values are in $1 and $2 (awk fields); in a text editor, that is character columns 2-25.

I will try it and let you know.

Thanks for your help

---------- Post updated at 06:50 AM ---------- Previous update was at 04:10 AM ----------

Dear Pamu

using the code

Code:
awk '{if(X[$1]){X[$1]=$0}else{print X[$1] > "remove.txt";X[$1]=$0}}END{for(i in X)print X[i] >"clean.txt"}' file

input file
Code:
S3033.0            7305.01               0     420123.8 2580723.8 151.9311    18 303373051 30337305
S3033.0            7305.02               0     418623.9 2588674.7 156.8311   430 303373052 30337305
S3035.0            7305.01               0     420123.3 2580773.9 151.6311   130 303573051 30357305
S3349.0            7275.01               0     418623.1 2588627.5 157.0311   542 334972751 33497275
S3349.0            7280.01               0     418874.1 2588631.6 156.0311   657 334972801 33497280
S3349.0            7280.02               0     418874.4 2588677.4 156.1311   809 334972802 33497280
S3351.0            7270.01               0     418375.6 2588675.5 157.3311   958 335172701 33517270
S3353.0            7270.01               0     418375.3 2588718.0 156.9311   846 335372701 33537270
S3353.0            7275.01               0     418624.5 2588726.2 156.3311   336 335372751 33537275
S3353.0            7280.01               0     418874.8 2588727.6 156.3311   922 335372801 33537280
S3355.0            7275.01               0     418624.2 2588774.2 156.0311   224 335572751 33557275
S3355.0            7275.02               0     418376.0 2588775.5 156.6311   733 335572752 33557275
S3355.0            7275.03               0     418875.1 2588777.6 156.4311  1034 335572753 33557275

I have sorted the values on column #9.

Then I got the following

clean.txt
Code:
S3349.0            7275.01               0     418623.1 2588627.5 157.0311   542 334972751 33497275
S3035.0            7305.01               0     420123.3 2580773.9 151.6311   130 303573051 30357305
S3353.0            7270.01               0     418375.3 2588718.0 156.9311   846 335372701 33537270
S3355.0            7275.03               0     418875.1 2588777.6 156.4311  1034 335572753 33557275
S3353.0            7280.01               0     418874.8 2588727.6 156.3311   922 335372801 33537280
S3033.0            7305.02               0     418623.9 2588674.7 156.8311   430 303373052 30337305
S3351.0            7270.01               0     418375.6 2588675.5 157.3311   958 335172701 33517270
S3353.0            7275.01               0     418624.5 2588726.2 156.3311   336 335372751 33537275
S3349.0            7280.02               0     418874.4 2588677.4 156.1311   809 334972802 33497280

but the file remove.txt is empty.

Please can you advise me where the problem is?

I should get this
Code:
S3033.0            7305.01               0     420123.8 2580723.8 151.9311    18
S3355.0            7275.01               0     418624.2 2588774.2 156.0311   224
S3353.0            7275.01               0     418624.5 2588726.2 156.3311   336
S3355.0            7275.02               0     418376.0 2588775.5 156.6311   733

Thanks for your help and time

---------- Post updated at 06:52 AM ---------- Previous update was at 06:50 AM ----------

Dear Pamu,

This is the code that I am using:

Code:
awk '{if(X[$9]){X[$9]=$0}else{print X[$9] > "remove.txt";X[$9]=$0}}END{for(i in X)print X[i] >"clean.txt"}' file

I have only changed the column number.
# 12  
Old 11-07-2012
I'm not sure I understand your "duplicate" criterion. In one post, it's 7275.01 (6 digits + "."), in the other it's just 4 digits before the period. On top, your input files vary from post to post. This does not help us to help you.

Try this; you may want to sort both files afterwards:
Code:
$ awk '{Ar[$2]=$0} END{for (i in Ar) print Ar[i]}' inputfile >clean.txt
S3351.0            7270.01               0     418375.6 2588675.5 157.3311   958
S3349.0            7275.01               0     418623.1 2588627.5 157.0311   542
S3355.0            7275.02               0     418376.0 2588775.5 156.6311   733
S3355.0            7275.03               0     418875.1 2588777.6 156.4311  1034
S3353.0            7280.01               0     418874.8 2588727.6 156.3311   922
S3349.0            7280.02               0     418874.4 2588677.4 156.1311   809
S3035.0            7305.01               0     420123.3 2580773.9 151.6311   130
S3033.0            7305.02               0     418623.9 2588674.7 156.8311   430
$ grep -vfclean.txt inputfile >removed.txt
S3033.0            7305.01               0     420123.8 2580723.8 151.9311    18
S3355.0            7275.01               0     418624.2 2588774.2 156.0311   224
S3353.0            7275.01               0     418624.5 2588726.2 156.3311   336
S3349.0            7280.01               0     418874.1 2588631.6 156.0311   657
S3353.0            7270.01               0     418375.3 2588718.0 156.9311   846
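
If you need the original ordering back, one possible way (a sketch, assuming the last of the 7 fields holds the original record index) is to sort both files numerically on that field:

Code:
sort -k7,7n clean.txt -o clean.txt
sort -k7,7n removed.txt -o removed.txt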


Last edited by RudiC; 11-07-2012 at 08:12 AM..
# 13  
Old 11-07-2012
Dear RudiC

The only change that I made to the file was to add two more columns, concatenating 4 digits from column 1 with 4 digits from column 2 and saving the result in column 9, to use as the reference for finding duplicated records. That is why I have used column 9.

Please can you let me know where the error is in the code that I am using, and why the remove.txt file is empty?

Thanks a lot
# 14  
Old 11-07-2012
I'd prefer pamu to explain his code to you. Did you give my proposal a try? The removed file is not empty with that approach. Right now, it is using the full 7270.01 for testing uniqueness; could be adapted to 4 digits by minor modifications.
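
For instance (only a sketch of such a "minor modification", not tested against your data), keying on just the first 4 digits of field 2 could look like the line below; the grep step from post #12 would stay the same:

Code:
awk '{Ar[substr($2,1,4)]=$0} END{for (i in Ar) print Ar[i]}' inputfile >clean.txt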