Remove duplicates according to their frequency in column


 
# 1  
Old 10-19-2015

Hi all,

I have a huge tab-delimited file in the following format, and I want to remove duplicates according to their frequency, based on Column2 and Column3.

Code:
Column1 Column2 Column3 Column4 Column5 Column6 Column7
1    user1    access1    word    word    3    2
2    user2    access2    word    word    5    3
3    user1    access1    word    word    3    1
4    user1    access2    word    word    2    1


In this case, the result should be:

Code:
1    user1    access1    word    word    3    2
2    user2    access2    word    word    5    3

because user1 with access1 occurs twice. Moreover, in case the original list also contains the following entry:

Code:
5    user1    access2    word    word    2    1

The result should be

Code:
2    user2    access2    word    word    5    3
5    user1    access2    word    word    2    1

because user1 with access1 and user1 with access2 each occur twice, so the smaller numbers in Column6 and Column7 should decide between them.
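To make the grouping concrete, the per-pair frequencies can be checked with a short awk sketch (assuming the file really is tab-delimited; file is a placeholder name):

Code:
awk -F'\t' 'NR > 1 { cnt[$2 FS $3]++ }                # count each Column2/Column3 pair
            END    { for (k in cnt) print k "\t" cnt[k] }' file

For the extended sample this prints a count of 2 for user1/access1 and user1/access2, and 1 for user2/access2.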

Thanks in advance for your time and consideration.
# 2  
Old 10-19-2015
Any attempts from your side?


And, why should line 5 be preferred to line 4? Except for field 1, they're identical.
# 3  
Old 10-19-2015
Hi,

Thank you for your reply. Lines 4 and 5 are identical, so either one is fine; extracting line 4 would also be correct.

I am not familiar with awk, but I found the following command in a similar post; it doesn't seem to work in my case.

Code:
awk '(NR==1);a[$2]<$3||d[$2]<$4{a[$2]=$3;d[$2]=$4;b[$2]=$0};END{for(i in b)if(b[i] !~ /ID/){print b[i]}}'
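From what I can tell, it keeps, per key in field 2, the line on which field 3 or field 4 last grew, and skips a header matching /ID/. It compares the values of fields 3 and 4, not how often they occur, which is probably why it does not fit this data. The same command, only reformatted with comments:

Code:
awk '(NR==1)                                  # print the first (header) line
     a[$2]<$3 || d[$2]<$4 {                   # when field 3 or field 4 grows for this key
         a[$2]=$3; d[$2]=$4; b[$2]=$0         # remember the new values and the whole line
     }
     END { for (i in b) if (b[i] !~ /ID/) print b[i] }' file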

Thanks
# 4  
Old 10-19-2015
Well, try:
Code:
awk '
NR==1           {print                  # pass the header line through unchanged
                 next
                }

                {LINE[$2,$3]=$0         # last line seen for this user/access pair
                 FREQ[$2,$3]++          # occurrence count of the pair
                 SUM[$2,$3]=$6+$7       # Column6+Column7 sum (last line of the pair wins)
                 if (FREQ[$2,$3] > MAX[$2]) MAX[$2] = FREQ[$2,$3]       # top pair count per user
                 if (MIN[$2] == 0 ||
                     SUM[$2,$3]  < MIN[$2]) MIN[$2] = SUM[$2,$3]        # smallest sum per user
                }
END             {for (f in FREQ)        {split (f, TMP, SUBSEP)
                                         if     (FREQ[f] == MAX[TMP[1]] &&
                                                 SUM[f] == MIN[TMP[1]])
                                            print LINE[f]       # most frequent AND cheapest
                                        }
                }
' FS="\t" SUBSEP="\t" file
Column1    Column2    Column3    Column4    Column5    Column6    Column7
5    user1    access2    word    word    2    1
2    user2    access2    word    word    5    3

# 5  
Old 10-19-2015
Hi, thanks for your prompt reply!

Unfortunately, it seems the result is not completely correct. Each user should appear only once in Column2: for every user, keep the Column3 value that occurs most often, and when two Column3 values occur equally often, take the line with the smaller numbers in Column6 and Column7.
I am sorry, it is complicated, and maybe I didn't express my thought clearly.

thanks
# 6  
Old 10-19-2015
There is only a single user1 and a single user2 in the above result!?
# 7  
Old 10-19-2015
Hi,

The above result is correct, but when I tried with:

Code:
1    user1    access1    word    word    3    2
2    user2    access2    word    word    5    3
3    user1    access1    word    word    3    1
4    user1    access2    word    word    2    1

I got:

Code:
1    user1    access1    word    word    3    2
4    user1    access2    word    word    2    1

What is going wrong?

Best regards,

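For what it's worth, one likely cause: the script keeps MIN[$2] across all of a user's lines, while SUM[$2,$3] holds only the sum from the last line of each user/access pair, and the two tests in the END block are applied independently. A user's most frequent pair can therefore fail the SUM == MIN test and vanish from the output. Below is a minimal sketch that keeps all bookkeeping per pair and uses the sum only to break frequency ties (assuming tab-delimited input, with file as a placeholder name; the first line of the winning pair is used as its representative):

Code:
awk -F'\t' '
NR == 1 { print; next }                            # pass the header through
{
    key = $2 SUBSEP $3                             # group by Column2 + Column3
    freq[key]++                                    # occurrences of this pair
    sum = $6 + $7
    if (!(key in first)) first[key] = $0           # remember the first line of the pair
    if (!(key in minsum) || sum < minsum[key])
        minsum[key] = sum                          # smallest Column6+Column7 in the pair
    user[key] = $2
}
END {
    # per user, keep the most frequent pair; break ties by the smaller sum
    for (k in freq) {
        u = user[k]
        if (!(u in pick) || freq[k] > topfreq[u] ||
            (freq[k] == topfreq[u] && minsum[k] < topsum[u])) {
            topfreq[u] = freq[k]; topsum[u] = minsum[k]; pick[u] = first[k]
        }
    }
    for (u in pick) print pick[u]                  # order is unspecified; sort if needed
}' file

On the four-line sample this prints lines 1 and 2; with line 5 added it prints lines 2 and 4 (line 4 standing in for the identical line 5).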