Remove duplicates in a dataframe (table) keeping all the different cells of just one of the columns

03-11-2019

Hello all,
I need to filter a dataframe with several columns, removing the rows that are duplicated in one of the columns. I have done that part with pandas. At the same time, I need the last column, whose values are all different (not redundant), to be preserved in the output, like this:

Code:

A     B     C     D
a1    b1    c1    d1
a2    b2    c2    d2

output:

Code:

A     B     C     D
ad    bd    cd    d1,d2

where ad, bd and cd are the deduplicated values, and for each unique row column D holds all of the original D values in one single cell, separated by commas.

pedro88

03-11-2019

You may want to try explaining that again.
I do not see how you get from that 3-line example to 2 lines.

joeyg

03-11-2019

Basically, I have a tabular file with 4 columns (A, B, C, D) and several rows (1, 2, 3, 4, 5, 6, 7, ...).
The data in column A are redundant, like this:

Code:

A        B    C     D
apple    15   aaa   agcacagcagc
apple    25   bbb   acgacgacgcga
banana   12   cccc  acagcgaagccga
cherry   36   ddd   actgctgtcgagtag
berry    55   eee   gactgatgctgtcgtc
banana   36   ffff  cacacgtgtgct

I need output like this:

Code:

A        B    C     D
apple    25   aaa   agcacagcagc;acgacgacgcga
banana   36   cccc  acagcgaagccga;cacacgtgtgct
cherry   36   ddd   actgctgtcgagtag
berry    55   eee   gactgatgctgtcgtc

I don't really mind about column C, so whatever it keeps in the output is OK. For column B, I keep the highest value (I managed to do that with pandas, but I'm not able to do the trick for column D).

Thanks

pedro88

03-11-2019

Code:

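awk '
# Needs GNU awk 4.0 or later: A[key][field] is an array of arrays.
($1 in A) {                                   # key in column 1 already seen
        if ($2 > A[$1][2]) A[$1][2] = $2      # keep the higher column B
        A[$1][4] = A[$1][4] ";" $4            # append column D
        next
}
{                                             # first row for this key: store every field
        for (n = split($0, M); n; n--) A[$1][n] = M[n]
}
END {                                         # output order of "for (i in A)" is unspecified
        for (i in A) {
                for (j = 1; j <= NF; j++) printf "%s ", A[i][j]
                print ""
        }
}' file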

nezabudka
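
Where GNU awk is not available, the same merge can probably be done with plain one-dimensional arrays. What follows is only a sketch under a few assumptions that are not stated in the thread: fields are separated by whitespace, line 1 is a header, and the names `merge_rows`, `sample`, `maxb`, `c`, `d` and `order` are made up for the illustration. It keeps the first C value seen for each key, the highest B, and joins the D values with `;`, printing keys in first-seen order.

```shell
# Hypothetical sample file holding the data from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
A B C D
apple 15 aaa agcacagcagc
apple 25 bbb acgacgacgcga
banana 12 cccc acagcgaagccga
cherry 36 ddd actgctgtcgagtag
berry 55 eee gactgatgctgtcgtc
banana 36 ffff cacacgtgtgct
EOF

# merge_rows FILE: one output row per distinct value of column 1.
merge_rows() {
  awk '
    NR == 1 { print; next }              # assumption: line 1 is a header, pass it through
    !($1 in maxb) {                      # first row for this key
        order[++n] = $1                  # remember first-seen order
        maxb[$1] = $2; c[$1] = $3; d[$1] = $4
        next
    }
    {                                    # later row for a key already seen
        if ($2 + 0 > maxb[$1] + 0) maxb[$1] = $2   # keep the higher B (numeric compare)
        d[$1] = d[$1] ";" $4                        # append this D value
    }
    END {
        for (i = 1; i <= n; i++) {
            k = order[i]
            print k, maxb[k], c[k], d[k]
        }
    }
  ' "$1"
}

merge_rows "$sample"
```

On the sample data this should print the header followed by one row each for apple, banana, cherry and berry, with single spaces between fields. Set `OFS` in the awk program if tab-separated output is preferred.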

03-14-2019

Moderator's Comments:

The title of this thread has been changed from:
Remove duplicates in a dataframe (table) keepping all the different cells of just one of the columns
to:
Remove duplicates in a dataframe (table) keeping all the different cells of just one of the columns
to make searches more likely to find desired threads.

Don Cragun

03-15-2019

Hello pedro88,

Could you please try the following too? I am reading Input_file twice here, and the output will be in the same order in which $1 first appears in Input_file.

Code:

awk '
FNR==NR{
  a[$1]=a[$1]>$2?a[$1]:$2
  b[$1]=a[$1]>$2?b[$1]?b[$1]:$0:$0
  next
}
($1 in a){
  print b[$1]
  delete a[$1]
}
'   Input_file  Input_file

Output will be as follows.

Code:

A        B    C     D
apple    25   bbb   acgacgacgcga
banana   36   ffff  cacacgtgtgct
cherry   36   ddd   actgctgtcgagtag
berry    55   eee   gactgatgctgtcgtc

Thanks,
R. Singh

RavinderSingh13
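
The two-pass script above keeps the row with the highest B but, as its output shows, does not join column D. One possible extension of the same two-pass idea is sketched below. It assumes whitespace-separated fields and a header on line 1, and `merge_two_pass`, `sample`, `maxb`, `d` and `seen` are hypothetical names chosen for the example. Pass 1 collects the highest B and the joined D values per key; pass 2 prints each key once, at its first row, so column C comes from that first row.

```shell
# Hypothetical sample file holding the data from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
A B C D
apple 15 aaa agcacagcagc
apple 25 bbb acgacgacgcga
banana 12 cccc acagcgaagccga
cherry 36 ddd actgctgtcgagtag
berry 55 eee gactgatgctgtcgtc
banana 36 ffff cacacgtgtgct
EOF

# merge_two_pass FILE: reads FILE twice, like the script above.
merge_two_pass() {
  awk '
    FNR == 1 {                           # assumption: line 1 is a header
        if (NR != FNR) print             # print it only during pass 2
        next
    }
    NR == FNR {                          # pass 1: collect per-key results
        if (!($1 in maxb) || $2 + 0 > maxb[$1] + 0) maxb[$1] = $2
        if ($1 in d) d[$1] = d[$1] ";" $4
        else         d[$1] = $4
        next
    }
    !seen[$1]++ {                        # pass 2: first row of each key only
        print $1, maxb[$1], $3, d[$1]
    }
  ' "$1" "$1"
}

merge_two_pass "$sample"
```

Because the file is read twice, this form needs a regular file rather than a pipe. In exchange, the output order follows the input without a separate order array.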
