remove duplicates based on a field and criteria

03-22-2012

Hi,

I have a file with fields like below:

A;XYZ;102345;222
B;XYZ;123243;333
C;ABC;234234;444
D;MNO;103345;222
E;DEF;124243;333

desired output:

C;ABC;234234;444
D;MNO;103345;222
E;DEF;124243;333

That is, if the 4th field is a duplicate, I need only the record where the 3rd field value is greater. More specifically, I need the record where the 2nd and 3rd digits of the 3rd field are greater. Can we do this with awk?

Please help me find a solution that can quickly process a file of around 50000 records.

Thanks a lot in advance

wanderingmind16

03-22-2012

awk

Hi,

Try this one,

Code:

awk 'BEGIN{FS=OFS=";"} {split(c[$4],a,FS); if (a[3] < $3) c[$4]=$0} END{for (i in c) print c[i]}' file

Cheers,
Ranga
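The one-liner above compares the whole 3rd field. As a minimal sketch of the narrower criterion from the question (only the 2nd and 3rd digits of the 3rd field), something like the following might work. The temporary file and the array names `max` and `best` are illustrative assumptions, and the trailing `sort` is only there because `for (k in best)` visits keys in an unspecified order.

```shell
# Sample data from the question, written to a temporary file (hypothetical path).
input=$(mktemp)
cat > "$input" <<'EOF'
A;XYZ;102345;222
B;XYZ;123243;333
C;ABC;234234;444
D;MNO;103345;222
E;DEF;124243;333
EOF

# For each value of field 4, keep the record whose field 3 has the largest
# 2nd and 3rd digits. substr($3, 2, 2) extracts those two digits; adding 0
# forces a numeric comparison.
result=$(awk -F';' '
{
    key = substr($3, 2, 2) + 0
    if (!($4 in best) || key > max[$4]) {
        max[$4]  = key
        best[$4] = $0
    }
}
END { for (k in best) print best[k] }
' "$input" | sort)

printf '%s\n' "$result"
rm -f "$input"
```

This reads the file once and holds one record per distinct 4th field in memory, so a file of around 50000 records should not be a problem.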

rangarasan

03-22-2012

Hi, try:

Code:

sort -t \; -k3.2,3.3rn infile | awk -F\; '!A[$4]++'

Make sure there are no spaces in the input file.
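To see what each stage of that pipeline contributes, here is a small sketch run against the sample data from the question. The temporary file and the variable names `sorted` and `deduped` are assumptions made for illustration.

```shell
# Sample data from the question, written to a temporary file (hypothetical path).
infile=$(mktemp)
cat > "$infile" <<'EOF'
A;XYZ;102345;222
B;XYZ;123243;333
C;ABC;234234;444
D;MNO;103345;222
E;DEF;124243;333
EOF

# Stage 1: order lines by characters 2-3 of field 3, numerically, largest first.
sorted=$(sort -t ';' -k3.2,3.3rn "$infile")
printf '%s\n' "$sorted"
# C;ABC;234234;444
# E;DEF;124243;333
# B;XYZ;123243;333
# D;MNO;103345;222
# A;XYZ;102345;222

# Stage 2: print only the first line seen for each value of field 4.
deduped=$(printf '%s\n' "$sorted" | awk -F';' '!A[$4]++')
printf '%s\n' "$deduped"
# C;ABC;234234;444
# E;DEF;124243;333
# D;MNO;103345;222

rm -f "$infile"
```

Note that the result comes out in sorted order rather than in the original file order.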

Scrutinizer

03-22-2012

Thanks. I did a random check and it looks to be working perfectly. Could you please tell me what the "!A[$4]++" does?

Thanks again..

wanderingmind16

03-22-2012

Hi, for every line it references an (associative) array element with field 4 as the index, creating it if it does not exist yet, without a value (or 0, if you will). The exclamation mark negates that value, so the outcome is 1 (true). A true pattern with no action in awk means perform the default action, which is {print $0}, so the entire line gets printed.

Afterwards the ++ comes into action and 1 is added to the array value, which now becomes 1. The next time a line with the same value in $4 is encountered, the value returned by the array is 1, which is then negated to 0 by the exclamation mark, so nothing gets printed.
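As a rough illustration of that explanation, the sketch below spells out the value the array returns before the increment and whether the line would be printed. The variable names `before` and `trace` are made up for this example, and the input lines are the sample records in the order the sort would deliver them.

```shell
# Feed the sample records, already sorted, to an awk script that reports
# what !A[$4]++ would decide for each line instead of printing the line.
trace=$(printf '%s\n' \
    'C;ABC;234234;444' \
    'E;DEF;124243;333' \
    'B;XYZ;123243;333' \
    'D;MNO;103345;222' \
    'A;XYZ;102345;222' |
awk -F';' '{
    before = A[$4]++    # post-increment: yields the old count, then adds 1
    printf "%s key=%s before=%d printed=%s\n", $1, $4, before, (!before ? "yes" : "no")
}')

printf '%s\n' "$trace"
# C key=444 before=0 printed=yes
# E key=333 before=0 printed=yes
# B key=333 before=1 printed=no
# D key=222 before=0 printed=yes
# A key=222 before=1 printed=no
```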

Scrutinizer

03-22-2012

Thanks a lot Scrutinizer and Rangarasan

wanderingmind16
