Find lines with duplicate values in a particular column

10-10-2019

I have a file with 5 columns. I want to pull out all records where the value in column 4 is not unique. For example, in the sample below I would want it to print all lines except the last two.

Code:

40991764	2419	724	47182	Cand A
40992936	3591	724	47182	Cand B
40993016	3671	724	47182	Cand C
40993876	4531	724	10154	Strep A
40993878	4533	724	10154	Strep B
40993990	4645	724	58899	Cala A
40993991	4646	724	63849	Myco A

I tried this:

Code:

awk -F '\t' 'a=x[$4]{print a"\n"$0;} {x[$4]=$0;}'

It works well if a value occurs exactly twice (10154 above), but if a value occurs more than twice (47182 above), it prints one of the matching lines twice (Cand B):

Code:

40991764	2419	724	47182	Cand A
40992936	3591	724	47182	Cand B
40992936	3591	724	47182	Cand B
40993016	3671	724	47182	Cand C
40993876	4531	724	10154	Strep A
40993878	4533	724	10154	Strep B

How can I get it to print each matching line only once?

kaktus

10-10-2019

Try this:

Code:

awk -F '\t' '{ if($4 in x){ print (x[$4]?x[$4]"\n":"")$0;x[$4]=""} else x[$4]=$0}'

edit: or this

Code:

awk -F '\t' '{if($4 in x){if(x[$4]) print x[$4]; print;x[$4]=""} else x[$4]=$0}'

Chubler_XL

10-10-2019

Thanks! Both of these get me the desired output. I don't fully understand how it works though. Would you mind breaking it down?

kaktus

10-10-2019

No worries, it works like this.

On the first occurrence of a new $4 value, ($4 in x) is false, so x[$4] is set to the whole record.

On the second occurrence, $4 is in x (we assigned it on the first occurrence) and x[$4] is non-blank, so print x[$4] prints the saved first record. Then print outputs the current record and x[$4] is set to blank.

On the third and later occurrences, $4 is still in x but the array item is now blank, so we just print the current record and set x[$4] to blank again.
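The three cases above can also be written as separate awk rules, which may be easier to follow than the one-liner. This is only a sketch of the same logic: the function name dup_col4 and the array name first are made up for illustration, and the input is assumed to be tab-separated text on stdin.

```shell
# dup_col4 (hypothetical name): print every line whose column-4 value
# occurs more than once, keeping the input order.
dup_col4() {
  awk -F '\t' '
    !($4 in first) {        # first occurrence: hold the line back, print nothing
      first[$4] = $0
      next
    }
    first[$4] != "" {       # second occurrence: release the held-back line once
      print first[$4]
      first[$4] = ""        # blank it so later occurrences skip this rule
    }
    { print }               # second and later occurrences: print the current line
  '
}

# Sample data from the first post; printf reuses the format for each group of 5 arguments.
printf '%s\t%s\t%s\t%s\t%s\n' \
  40991764 2419 724 47182 'Cand A' \
  40992936 3591 724 47182 'Cand B' \
  40993016 3671 724 47182 'Cand C' \
  40993876 4531 724 10154 'Strep A' \
  40993878 4533 724 10154 'Strep B' \
  40993990 4645 724 58899 'Cala A' \
  40993991 4646 724 63849 'Myco A' | dup_col4
```

Run on the sample, this should print the three Cand lines and the two Strep lines, each exactly once.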

Edit:
One thing to be careful of is that awk will create an array item as soon as it's referenced, for example:

Code:

$ awk 'BEGIN { print T["test"]; print ("test" in T) }'

1

Testing with (key in array) is safe and does not create an item:

Code:

$ awk 'BEGIN { print ("test" in T); print ("test" in T) }'
0
0
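This is also why the solution tests ($4 in x) before it reads x[$4]. A small sketch that counts the array items after each kind of access makes the difference visible. The function name ref_demo and the keys "a" and "b" are made up for illustration; the count uses a for-in loop so it does not depend on length(array) being supported.

```shell
# ref_demo (hypothetical name): show that a membership test leaves the array
# empty, while merely reading an element creates it.
ref_demo() {
  awk 'BEGIN {
    if ("a" in T) print "unexpected"       # membership test only
    n = 0; for (k in T) n++
    print "after in test: " n

    if (T["b"] != "") print "unexpected"   # plain reference creates T["b"]
    n = 0; for (k in T) n++
    print "after reference: " n
  }'
}

ref_demo
# prints:
# after in test: 0
# after reference: 1
```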

Last edited by Chubler_XL; 10-10-2019 at 11:05 PM..

Chubler_XL

10-10-2019

Thanks for the additional explanation. I can follow it now.

kaktus

10-11-2019

Given your uniq provides all the options shown (-D and -w are not in POSIX), try

Code:

sort -k4,4 file | uniq -D -f3 -w5
40993876    4531    724    10154    Strep A
40993878    4533    724    10154    Strep B
40991764    2419    724    47182    Cand A
40992936    3591    724    47182    Cand B
40993016    3671    724    47182    Cand C
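Note that sorting changes the order of the records. If the original order matters and the input is a regular file, a two-pass awk run is another option. This is a different technique from the ones posted above, sketched under the assumption that the file is tab-separated and can be read twice; the function name dup_col4_twopass and the array name count are made up for illustration.

```shell
# dup_col4_twopass (hypothetical name): read the file twice.
# Pass 1 (NR == FNR, i.e. still in the first file argument) counts each column-4 value.
# Pass 2 prints the lines whose column-4 value was seen more than once.
dup_col4_twopass() {
  awk -F '\t' '
    NR == FNR { count[$4]++; next }
    count[$4] > 1
  ' "$1" "$1"
}

# Usage, with "file" being the input file from the first post:
#   dup_col4_twopass file
```

One detail worth checking in the uniq version: after -f3 the comparison appears to start at the blank in front of field 4, so -w5 would cover that separator plus four digits. That is enough for the sample, but -w6 may be needed to compare all five digits.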

RudiC
