Merging rows based on same ID in First column.

# 1  
I have a tab-delimited file with 3 columns:

BINPACKER.13259.1.p2    SSF48239    
BINPACKER.13259.1.p2    PF13243    
BINPACKER.13259.1.p2    G3DSA:
BINPACKER.13259.2.p2    SSF48239    
BINPACKER.13259.2.p2    PF13243    
BINPACKER.13259.2.p2    G3DSA:
BINPACKER.31705.4.p1    PF00176    GO:0005524
BINPACKER.31705.4.p1    SM00490    
BINPACKER.31705.4.p1    SSF52540    
BINPACKER.31705.4.p1    G3DSA:
BINPACKER.31705.4.p1    mobidb-lite
BINPACKER.31705.4.p1    SM00487    
BINPACKER.31705.4.p1    PS51194    
BINPACKER.31705.4.p1    cd00079    
BINPACKER.31705.4.p1    PF00271    
BINPACKER.31705.4.p1    PS51192    
BINPACKER.31705.4.p1    cd00046
BINPACKER.31705.4.p1    G3DSA:    
BINPACKER.31705.4.p1    SSF52540    
BINPACKER.9719.7.p1    PF00443    GO:0016579|GO:0036459
BINPACKER.9719.7.p1    SSF57850
BINPACKER.9719.7.p1    PS50235    
BINPACKER.9719.7.p1    mobidb-lite
BINPACKER.9719.7.p1    PF02148    GO:0008270
BINPACKER.9719.7.p1    SSF54001    
BINPACKER.9719.7.p1    mobidb-lite
BINPACKER.9719.7.p1    cd02669    GO:0000245|GO:0006397
BINPACKER.9719.7.p1    PS50271    GO:0008270
BINPACKER.9719.7.p1    SM00290    GO:0008270
BINPACKER.9719.7.p1    mobidb-lite
BINPACKER.9719.7.p1    mobidb-lite
BINPACKER.9719.7.p1    G3DSA:    
BINPACKER.9719.7.p1    G3DSA:
BINPACKER.937.4.p1    PS51032    GO:0003700|GO:0006355
BINPACKER.937.4.p1    PIRSF038123    GO:0003700
BINPACKER.937.4.p1    cd00018    GO:0003700|GO:0006355
BINPACKER.937.4.p1    SSF54171    GO:0003677
BINPACKER.937.4.p1    G3DSA:3.30.730.10    GO:0003700|GO:0006355
BINPACKER.937.4.p1    PR00367    GO:0003700|GO:0006355

I want to merge the rows that share the same ID in the first column. In column 2, I want only the IDs starting with PF, and in column 3 I want all GO terms concatenated, separated by commas. In each column there should be no duplicates, e.g.:

BINPACKER.13259.1.p2    PF13243    NA
BINPACKER.13259.2.p2    PF13243    NA
BINPACKER.31705.4.p1    PF00176,PF00271    GO:0005524
BINPACKER.9719.7.p1    PF00443,PF02148    GO:0016579,GO:0036459,GO:0008270,GO:0000245,GO:0006397
BINPACKER.937.4.p1    NA    GO:0003700,GO:0006355,GO:0003677


Last edited by anjaliANJALI; 08-13-2019 at 01:43 PM..
# 2  
Great and thanks for posting.

Please show the code you have written so far and share your platform details.

# 3  
According to requirements PF02148 should be in column 2 of the output shown (line 3). Also, there should be in the output: BINPACKER.13259.2.p2 PF13243 NA
# 4  
Thank you, you are right. I have edited my output.
# 5  
awk -F'\t' '
{column_one[$1]=$1; gsub(" *[,|] *", ",");   # turn "|" separators into commas
 if ($2 ~ /^PF/) {
    # append a PF ID only the first time it is seen for this ID
    if (! length(pf_string[$1,$2])) out_pf_string[$1]=out_pf_string[$1] $2 ",";
    pf_string[$1,$2]=$2;
 }
 c=split($3, column_three, " *, *");
 for (i=1; i<=c; i++) {
     if (column_three[i] ~ /^GO/) {
        # append a GO term only the first time it is seen for this ID
        if (! length(go_string[$1,column_three[i]])) out_go_string[$1]=out_go_string[$1] column_three[i] ",";
        go_string[$1,column_three[i]]=column_three[i];
     }
 }
}
END {
   for (i in column_one) {
      sub(",*$", "", out_pf_string[i]);   # strip the trailing comma
      sub(",*$", "", out_go_string[i]);
      out_pf_string[i]=(length(out_pf_string[i])) ? out_pf_string[i] : "NA";
      out_go_string[i]=(length(out_go_string[i])) ? out_go_string[i] : "NA";
      print i, out_pf_string[i], out_go_string[i];
   }
}' OFS='\t' infile

This User Gave Thanks to rdrtx1 For This Post:
# 6  
perl -lane'
  ($n, $p) = @F;
  $s{$n}++ or push @r, $n;
  $c{$n}{$p}++ or push @{$h{$n}}, $p;
  END {
    $" = ",\t";
    print "$_\t@{$h{$_}}" for @r;
  }' infile

This concatenates all the values and separates them with commas, but I don't know how to remove duplicate entries. I also need NA in columns that have no entries, as shown in my desired output.
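The de-duplication asked about here is often done in awk with a "seen" array keyed on the ID and the value. The following is only a minimal sketch of that idiom, assuming tab-delimited input on standard input; the function name `dedupe_pairs` and the sample rows are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print each (column 1, column 2) pair only once.
dedupe_pairs() {
    awk -F'\t' -v OFS='\t' '
        # seen[] counts every ID/value pair; the pattern is true only
        # the first time a pair appears, so repeats are dropped.
        $2 != "" && !seen[$1, $2]++ { print $1, $2 }
    '
}

# Made-up sample rows: the second "a PF1" row is a duplicate.
printf 'a\tPF1\na\tPF1\na\tPF2\nb\tPF1\n' | dedupe_pairs
```

The same test (`!seen[key]++`) can guard the step that appends a value to a per-ID string, which is how it would fit into a full merge.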
# 7  
Try also
awk -F"[|\t]" '
# fields are split on tab or "|", so every GO term is its own field
$2 ~ /^PF/ &&
PF[$1] !~ $2    {PF[$1] = PF[$1] DLP[$1] $2
                 DLP[$1] = ","
                }
NF > 2          {for (i=3; i<=NF; i++) if (GO[$1] !~ $i)        {GO[$1] = GO[$1] DLG[$1] $i
                                                                 DLG[$1] = ","
                                                                }
                }
END             {for (p in PF)  {print p, PF[p], GO[p]?GO[p]:"NA"
                                 delete GO[p]
                                }
                 # IDs left in GO have no PF entry
                 for (g in GO)   print g, "NA", GO[g]
                }
' OFS="\t" file
BINPACKER.13259.2.p2    PF13243    NA
BINPACKER.31705.4.p1    PF00176,PF00271    GO:0005524
BINPACKER.9719.7.p1    PF00443,PF02148    GO:0016579,GO:0036459,GO:0008270,GO:0000245,GO:0006397
BINPACKER.13259.1.p2    PF13243    NA
BINPACKER.937.4.p1    NA    GO:0003700,GO:0006355,GO:0003677
