Find lines with duplicate values in a particular column


 
# 1  
Old 10-10-2019

I have a file with 5 columns. I want to pull out all records where the value in column 4 is not unique. For example, for the sample below, I would want it to print all lines except the last two.

Code:
40991764	2419	724	47182	Cand A
40992936	3591	724	47182	Cand B
40993016	3671	724	47182	Cand C
40993876	4531	724	10154	Strep A
40993878	4533	724	10154	Strep B
40993990	4645	724	58899	Cala A
40993991	4646	724	63849	Myco A

I tried this:
Code:
awk -F '\t' 'a=x[$4]{print a"\n"$0;} {x[$4]=$0;}'

It works well when a value appears exactly twice (10154 above), but when a value appears more than twice (47182 above), it prints one of the matching lines twice (Cand B):

Code:
40991764	2419	724	47182	Cand A
40992936	3591	724	47182	Cand B
40992936	3591	724	47182	Cand B
40993016	3671	724	47182	Cand C
40993876	4531	724	10154	Strep A
40993878	4533	724	10154	Strep B

How can I get it to print each matching line only once?
# 2  
Old 10-10-2019
Try this:

Code:
awk -F '\t' '{ if($4 in x){ print (x[$4]?x[$4]"\n":"")$0;x[$4]=""} else x[$4]=$0}'

edit: or this
Code:
awk -F '\t' '{if($4 in x){if(x[$4]) print x[$4]; print;x[$4]=""} else x[$4]=$0}'

# 3  
Old 10-10-2019
Thanks! Both of these get me the desired output. I don't fully understand how it works though. Would you mind breaking it down?
# 4  
Old 10-10-2019
No worries, it works like this.

On the first occurrence of a new $4 value, ($4 in x) is false, so x[$4] is assigned the whole record ($0).

On the second occurrence, $4 is in x (we assigned it on the first occurrence) and x[$4] is non-blank, so print x[$4] outputs the first record; then a plain print outputs the current record, and x[$4] is set to blank.

On the third and later occurrences, $4 is still in x but the array item is now blank, so we just print the current record and set x[$4] to blank again.
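
For anyone who prefers to read it spread out, here is a commented sketch of the second one-liner, run against the sample from post #1. The function name dup_rows and the temporary file are made up for this illustration, and the test x[$4] != "" is used in place of the bare if(x[$4]) so that a held record is never mistaken for a false value; treat it as one possible layout, not the only way to write it.

```shell
# Temporary tab-separated sample: the seven rows from post #1.
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
    40991764 2419 724 47182 'Cand A' \
    40992936 3591 724 47182 'Cand B' \
    40993016 3671 724 47182 'Cand C' \
    40993876 4531 724 10154 'Strep A' \
    40993878 4533 724 10154 'Strep B' \
    40993990 4645 724 58899 'Cala A' \
    40993991 4646 724 63849 'Myco A' \
    > "$sample"

# dup_rows is a hypothetical helper name used only in this sketch.
dup_rows() {
    awk -F '\t' '
    {
        if ($4 in x) {            # column 4 seen before, so this row is a duplicate
            if (x[$4] != "")      # is the first row for this value still held back?
                print x[$4]       # release it exactly once
            print                 # print the current row
            x[$4] = ""            # blank marker: the first row is already printed
        } else {
            x[$4] = $0            # first occurrence: hold the row, print nothing yet
        }
    }' "$@"
}

dup_rows "$sample"
```

With the sample above this should print the three Cand rows followed by the two Strep rows, each once.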


Edit:
One thing to be careful of is that awk will create an array item as soon as it is referenced, for example:

Code:
$ awk 'BEGIN { print T["test"]; print ("test" in T) }'

1

Using key in array is safe and does not create an item:
Code:
$ awk 'BEGIN { print ("test" in T); print ("test" in T) }'
0
0
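
The same point can be seen by counting array elements. This is only a small sketch; the array names T and U and the variable report are arbitrary, and the loops count elements by hand so nothing beyond basic awk is assumed.

```shell
report=$(awk 'BEGIN {
    if (T["a"] == "") n = 0          # reading T["a"] creates the element
    for (k in T) n++                 # count what is now stored in T
    print "elements in T after a plain reference:", n

    if (!("b" in U)) m = 0           # a membership test leaves U untouched
    for (k in U) m++                 # nothing to count here
    print "elements in U after an in-test:", m
}')
printf '%s\n' "$report"
```

This is why the scripts above test ($4 in x) before touching x[$4]: the blank marker only works if "never seen" and "seen, already printed" stay distinguishable.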


Last edited by Chubler_XL; 10-10-2019 at 11:05 PM..
# 5  
Old 10-10-2019
Thanks for the additional explanation. I can follow it now.
# 6  
Old 10-11-2019
Provided your uniq supports all the options shown (-D and -w are GNU extensions), try
Code:
sort -k4,4 file | uniq -D -f3 -w5
40993876    4531    724    10154    Strep A
40993878    4533    724    10154    Strep B
40991764    2419    724    47182    Cand A
40992936    3591    724    47182    Cand B
40993016    3671    724    47182    Cand C
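
Where uniq lacks those options, or when the original line order should be kept, a two-pass awk over the same file is another common approach. It is not taken from the posts above; it is a sketch that assumes the input is a regular file that can be read twice (so not a pipe). The names sample, dups and count are arbitrary.

```shell
# Temporary tab-separated sample: the seven rows from post #1.
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
    40991764 2419 724 47182 'Cand A' \
    40992936 3591 724 47182 'Cand B' \
    40993016 3671 724 47182 'Cand C' \
    40993876 4531 724 10154 'Strep A' \
    40993878 4533 724 10154 'Strep B' \
    40993990 4645 724 58899 'Cala A' \
    40993991 4646 724 63849 'Myco A' \
    > "$sample"

# Pass 1 (NR == FNR, first reading of the file): count each column-4 value.
# Pass 2 (second reading): print the rows whose value was counted more than once.
dups=$(awk -F '\t' 'NR == FNR { count[$4]++; next } count[$4] > 1' "$sample" "$sample")
printf '%s\n' "$dups"
```

The trade-off is that the file is read twice, but only the counts are kept in memory and the rows come out in their original order.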
