need to remove duplicates based on key in first column and pattern in last column


 
# 8  
Old 09-01-2010
Quote:
Originally Posted by script_op2a
Is there a reason to have a[4]a[5]? Why not just use a[4], so awk doesn't even have to remove anything?
Using both allows the timestamp to be used in determining the most recent entry, which is necessary when two entries share the same date.
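For illustration, a minimal sketch of that tie-break (the input lines and the layout of field 4 are hypothetical; the date is assumed to be the 4th underscore-separated part of field 4 and the timestamp the 5th):

Code:
printf '%s\n' "A x y rec_a_b_20100901_093000" "A x y rec_a_b_20100901_174500" |
awk '{split($4,a,"_"); if (b[$1]<=a[4]a[5]) {b[$1]=a[4]a[5];c[$1]=$0}}
     END{for (i in b) print c[i]}'

Both records carry the same date, but concatenating the timestamp ("20100901174500" vs "20100901093000") lets the comparison pick the later one.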
# 9  
Old 11-23-2010
Hello,

Could anyone help me modify this script to output one file containing the unwanted duplicates and another containing just the desired records?
# 10  
Old 11-23-2010
A simple solution, based on the previous script:

Code:
# for each key in column 1, keep the record whose date+time
# (4th and 5th underscore-separated parts of field 4) is the largest
awk '{split($4,a,"_"); if (b[$1]<=a[4]a[5]) {b[$1]=a[4]a[5];c[$1]=$0}}
END{for (i in b) print c[i]}' infile  |sort > desired_records

sort infile > temp2

# records present only in the full sorted file are the unwanted duplicates
diff desired_records temp2 |awk '/^>/ {$1="";print}'> unwanted_duplicates_records
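
To illustrate what the two output files end up containing, here is a tiny made-up example (the records and their layout are hypothetical; field 4 is assumed to end in _YYYYMMDD_HHMMSS):

Code:
$ cat infile
K1 x y rec_a_b_20101101_090000
K1 x y rec_a_b_20101122_090000
K2 x y rec_a_b_20101105_120000

$ cat desired_records
K1 x y rec_a_b_20101122_090000
K2 x y rec_a_b_20101105_120000

$ cat unwanted_duplicates_records
 K1 x y rec_a_b_20101101_090000

The leading blank in the duplicates file comes from awk rebuilding the record after $1 is cleared; the next posts deal with that.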

# 11  
Old 11-24-2010
Code:
sort -t_ -k1,1 -k4rn < infile | awk -F_ 'NF{if(A[$1,$2,$3]++)print>"dups.out";else print}' > recs.out
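
For readability, the same one-liner written out with comments (a sketch; it assumes the records are entirely underscore-delimited, with the de-duplication key in the first three fields and a numeric date in the fourth):

Code:
# sort: 1st underscore-separated field ascending, 4th field (the date)
# reverse numeric, so the newest record for each key comes first
sort -t_ -k1,1 -k4rn < infile |
awk -F_ 'NF {
    if (A[$1,$2,$3]++)        # this key combination was already seen:
        print > "dups.out"    #   an older duplicate
    else
        print                 #   first occurrence, i.e. the newest record
}' > recs.out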

# 12  
Old 11-24-2010
Quote:
Originally Posted by rdcwayx
A simple solution, based on the previous script:

Code:
awk '{split($4,a,"_"); if (b[$1]<=a[4]a[5]) {b[$1]=a[4]a[5];c[$1]=$0}}
END{for (i in b) print c[i]}' infile  |sort > desired_records

sort infile > temp2

diff desired_records temp2 |awk '/^>/ {$1="";print}'> unwanted_duplicates_records

This almost works. The only problem is that it puts a blank space in front of every line in the unwanted_duplicates_records file.

Is there any way we can fix this?
# 13  
Old 11-24-2010
Add a sed to strip the leading space:

Code:
diff desired_records temp2 |awk '/^>/ {$1="";print}'|sed 's/^ //' > unwanted_duplicates_records

# 14  
Old 05-18-2011
This is what I finally used:

Code:
diff $outfile $temp_sort_file |awk '/^>/ {print $0}'|sed 's/^> //' > $temp_dups_file

It works with most of my files.

I'm trying to understand it.

What is the awk part doing exactly? What does

Code:
^>

match?

And what is the sed part doing? What does

Code:
's/^> //'

do?
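A quick demonstration of both pieces, using two throwaway files (the data is made up purely for illustration): in diff's default output, lines that exist only in the second file are printed with a leading "> ", so /^>/ selects exactly those lines, and sed 's/^> //' removes that two-character prefix.

Code:
$ printf 'a\n' > old ; printf 'a\nb\n' > new
$ diff old new
1a2
> b
$ diff old new | awk '/^>/ {print $0}' | sed 's/^> //'
b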