Remove duplicates in a dataframe (table) keeping all the different cells of just one of the columns


# 1  
Old 1 Week Ago

Hello all,
I need to filter a dataframe with several columns of data, removing duplicate rows according to one of the columns. I did that with pandas. At the same time, I need the last column, whose values are all different (not redundant), to be preserved in the output, like this:
Code:
A   B   C   D
a1  b1  c1  d1
a2  b2  c2  d2

output:
Code:
A   B   C   D
ad  bd  cd  d1,d2

where ad, bd and cd are the deduplicated values, and column D holds, for each unique row, all of its original D values separated by commas in a single cell.
# 2  
Old 1 Week Ago
You may want to try explaining that again.
I do not see how you get from that 3-line example to 2 lines.
# 3  
Old 1 Week Ago
Basically, I have a tabular file with 4 columns (A, B, C, D) and several rows (1, 2, 3, 4, 5, 6, 7, ...).
The data in column A are redundant, like this:
Code:
A       B   C     D
apple   15  aaa   agcacagcagc
apple   25  bbb   acgacgacgcga
banana  12  cccc  acagcgaagccga
cherry  36  ddd   actgctgtcgagtag
berry   55  eee   gactgatgctgtcgtc
banana  36  ffff  cacacgtgtgct

I need output like this:
Code:
A       B   C     D
apple   25  aaa   agcacagcagc;acgacgacgcga
banana  36  cccc  acagcgaagccga;cacacgtgtgct
cherry  36  ddd   actgctgtcgagtag
berry   55  eee   gactgatgctgtcgtc

I don't really mind column C, so whatever is kept in the output is OK. For column B I keep the highest value (I managed to do that with pandas, but I'm not able to do the trick on column D).

thanks
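
For reference, the column D step can probably be done inside pandas with a `groupby` that applies a different aggregation to each column. This is only a sketch under assumptions: the columns are taken to be literally named A, B, C and D as in the example, and the frame is built inline from the sample rows rather than read from the real file.

```python
import pandas as pd

# Sample rows copied from the example above; the real table would be read from a file.
df = pd.DataFrame(
    [
        ("apple", 15, "aaa", "agcacagcagc"),
        ("apple", 25, "bbb", "acgacgacgcga"),
        ("banana", 12, "cccc", "acagcgaagccga"),
        ("cherry", 36, "ddd", "actgctgtcgagtag"),
        ("berry", 55, "eee", "gactgatgctgtcgtc"),
        ("banana", 36, "ffff", "cacacgtgtgct"),
    ],
    columns=["A", "B", "C", "D"],
)

merged = (
    df.groupby("A", sort=False)      # sort=False keeps groups in first-appearance order
      .agg(
          B=("B", "max"),            # highest B within each group
          C=("C", "first"),          # any C is acceptable, so take the first one
          D=("D", ";".join),         # every D value of the group, joined with ";"
      )
      .reset_index()
)

print(merged.to_string(index=False))
```

Named aggregation keeps the three rules (max, first, join) side by side, so changing the separator or the rule for column C is a one-line edit.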
# 4  
Old 1 Week Ago
Code:
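awk '
# Needs GNU awk 4.0 or later (arrays of arrays).
($1 in A)       { # key seen before: keep the larger column 2, append column 4
                  if ($2 > A[$1][2]) A[$1][2] = $2
                  A[$1][4] = A[$1][4] ";" $4
                  next
                }
                { # first time this key appears: store every field of the row
                  for (n = split($0, M); n; n--) A[$1][n] = M[n]
                }
END             { # NF still holds the field count of the last line read;
                  # the order of "for (i in A)" is not guaranteed
                  for (i in A) {
                        for (j = 1; j <= NF; j++) printf "%s ", A[i][j]
                        print ""
                  }
                }' file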
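
The same group-and-append idea can be sketched in plain Python, which may help when comparing against a pandas result. This is an illustration, not a drop-in replacement: `merge_rows` and `sample` are hypothetical names, the input is assumed to be whitespace-separated with at least 4 fields per row, and column 2 is assumed to be numeric on every data row.

```python
def merge_rows(lines):
    """Group whitespace-separated rows by column 1, keeping first-seen order."""
    groups = {}  # regular dicts preserve insertion order (Python 3.7+)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        key = fields[0]
        if key not in groups:
            groups[key] = fields  # first row for this key is stored whole
            continue
        kept = groups[key]
        if float(fields[1]) > float(kept[1]):
            kept[1] = fields[1]  # keep the larger column 2
        kept[3] = kept[3] + ";" + fields[3]  # append column 4
    return [" ".join(row) for row in groups.values()]


# Data rows from post #3, without the header line.
sample = """\
apple 15 aaa agcacagcagc
apple 25 bbb acgacgacgcga
banana 12 cccc acagcgaagccga
cherry 36 ddd actgctgtcgagtag
berry 55 eee gactgatgctgtcgtc
banana 36 ffff cacacgtgtgct
""".splitlines()

for row in merge_rows(sample):
    print(row)
```

Unlike the `for (i in A)` loop, the dictionary here returns keys in the order they were first seen, so the rows come out in input order.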

# 5  
Old 4 Days Ago
Moderator's Comments:
Mod Comment The title of this thread has been changed from:
Remove duplicates in a dataframe (table) keepping all the different cells of just one of the columns
to:
Remove duplicates in a dataframe (table) keeping all the different cells of just one of the columns
to make searches more likely to find desired threads.
# 6  
Old 4 Days Ago
Hello pedro88,

Could you please try the following too? I am reading Input_file twice here, and the output will be in the same order in which $1 first appears in Input_file.

Code:
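awk '
FNR==NR{                             # first pass over Input_file
  a[$1]=a[$1]>$2?a[$1]:$2            # a: largest $2 seen so far for this key
  b[$1]=a[$1]>$2?b[$1]?b[$1]:$0:$0   # b: the line that carries that largest $2
  next
}
($1 in a){                           # second pass: first occurrence of each key
  print b[$1]
  delete a[$1]                       # so each key is printed only once
}
'   Input_file  Input_file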

Output will be as follows.

Code:
A       B   C     D
apple   25  bbb   acgacgacgcga
banana  36  ffff  cacacgtgtgct
cherry  36  ddd   actgctgtcgagtag
berry   55  eee   gactgatgctgtcgtc

Thanks,
R. Singh