Remove duplicates according to their frequency in column


 
# 1  
Old 10-19-2015

Hi all,

I have a huge tab-delimited file in the following format, and I want to remove duplicates according to their frequency, based on Column2 and Column3.

Code:
Column1 Column2 Column3 Column4 Column5 Column6 Column7
1    user1    access1    word    word    3    2
2    user2    access2    word    word    5    3
3    user1    access1    word    word    3    1
4    user1    access2    word    word    2    1


In this case, the result should be:

Code:
1    user1    access1    word    word    3    2
2    user2    access2    word    word    5    3

because user1 with access1 occurs twice. Moreover, in case the original list also contains the following entry:

Code:
5    user1    access2    word    word    2    1

The result should be

Code:
2    user2    access2    word    word    5    3
5    user1    access2    word    word    2    1

because user1 with access1 and user1 with access2 each occur twice, so the smaller numbers in Column6 and Column7 should decide between them.
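To make the grouping concrete, the per-pair frequencies can be checked with a short awk sketch (assuming the file really is tab-delimited; file is a placeholder name):

Code:
awk -F'\t' 'NR > 1 { cnt[$2 FS $3]++ }                # count each Column2/Column3 pair
            END    { for (k in cnt) print k "\t" cnt[k] }' file

For the extended sample this prints a count of 2 for user1/access1 and user1/access2, and 1 for user2/access2.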

Thanks in advance for your time and consideration.
# 2  
Old 10-19-2015
Any attempts from your side?


And, why should line 5 be preferred to line 4? Except for field 1, they're identical.
# 3  
Old 10-19-2015
Hi,

Thank you for your reply. Lines 4 and 5 are identical, so either one is fine; extracting line 4 would also be correct.

I am not familiar with awk, but I found the following command in a similar post; it doesn't seem to work in my case.

Code:
awk '(NR==1);a[$2]<$3||d[$2]<$4{a[$2]=$3;d[$2]=$4;b[$2]=$0};END{for(i in b)if(b[i] !~ /ID/){print b[i]}}'
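From what I can tell, it keeps, per key in field 2, the line on which field 3 or field 4 last grew, and skips a header matching /ID/. It compares the values of fields 3 and 4, not how often they occur, which is probably why it does not fit this data. The same command, only reformatted with comments:

Code:
awk '(NR==1)                                  # print the first (header) line
     a[$2]<$3 || d[$2]<$4 {                   # when field 3 or field 4 grows for this key
         a[$2]=$3; d[$2]=$4; b[$2]=$0         # remember the new values and the whole line
     }
     END { for (i in b) if (b[i] !~ /ID/) print b[i] }' file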

Thanks
# 4  
Old 10-19-2015
Well, try:
Code:
awk '
NR==1           {print                  # pass the header line through unchanged
                 next
                }

                {LINE[$2,$3]=$0         # last line seen for this user/access pair
                 FREQ[$2,$3]++          # occurrence count of the pair
                 SUM[$2,$3]=$6+$7       # Column6+Column7 sum (last line of the pair wins)
                 if (FREQ[$2,$3] > MAX[$2]) MAX[$2] = FREQ[$2,$3]       # top pair count per user
                 if (MIN[$2] == 0 ||
                     SUM[$2,$3]  < MIN[$2]) MIN[$2] = SUM[$2,$3]        # smallest sum per user
                }
END             {for (f in FREQ)        {split (f, TMP, SUBSEP)
                                         if     (FREQ[f] == MAX[TMP[1]] &&
                                                 SUM[f] == MIN[TMP[1]])
                                            print LINE[f]       # most frequent AND cheapest
                                        }
                }
' FS="\t" SUBSEP="\t" file
Column1    Column2    Column3    Column4    Column5    Column6    Column7
5    user1    access2    word    word    2    1
2    user2    access2    word    word    5    3

# 5  
Old 10-19-2015
Hi, thanks for your prompt reply!

Unfortunately, it seems the result is not completely correct. Each user should appear only once in Column2: for every user, keep the Column3 value that occurs most often, and when two Column3 values occur equally often, take the line with the smaller numbers in Column6 and Column7.
I am sorry, it is complicated, and maybe I didn't express my thought clearly.

thanks
# 6  
Old 10-19-2015
There is only a single user1 and a single user2 in the above result!?
# 7  
Old 10-19-2015
Hi,

The above result is correct, but when I tried with:

Code:
1    user1    access1    word    word    3    2
2    user2    access2    word    word    5    3
3    user1    access1    word    word    3    1
4    user1    access2    word    word    2    1

I got:

Code:
1    user1    access1    word    word    3    2
4    user1    access2    word    word    2    1

What is going wrong?

Best regards,

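For what it's worth, one likely cause: the script keeps MIN[$2] across all of a user's lines, while SUM[$2,$3] holds only the sum from the last line of each user/access pair, and the two tests in the END block are applied independently. A user's most frequent pair can therefore fail the SUM == MIN test and vanish from the output. Below is a minimal sketch that keeps all bookkeeping per pair and uses the sum only to break frequency ties (assuming tab-delimited input, with file as a placeholder name; the first line of the winning pair is used as its representative):

Code:
awk -F'\t' '
NR == 1 { print; next }                            # pass the header through
{
    key = $2 SUBSEP $3                             # group by Column2 + Column3
    freq[key]++                                    # occurrences of this pair
    sum = $6 + $7
    if (!(key in first)) first[key] = $0           # remember the first line of the pair
    if (!(key in minsum) || sum < minsum[key])
        minsum[key] = sum                          # smallest Column6+Column7 in the pair
    user[key] = $2
}
END {
    # per user, keep the most frequent pair; break ties by the smaller sum
    for (k in freq) {
        u = user[k]
        if (!(u in pick) || freq[k] > topfreq[u] ||
            (freq[k] == topfreq[u] && minsum[k] < topsum[u])) {
            topfreq[u] = freq[k]; topsum[u] = minsum[k]; pick[u] = first[k]
        }
    }
    for (u in pick) print pick[u]                  # order is unspecified; sort if needed
}' file

On the four-line sample this prints lines 1 and 2; with line 5 added it prints lines 2 and 4 (line 4 standing in for the identical line 5).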