Merge rows in big data file


 
# 1  
Old 03-08-2012

Dear all,
Please help me.
I have an input file like this:


Code:
A_AA960715   leucine-rich repeat-containing protein   GO:0006952   defense response     P
 A_AA960715   leucine-rich repeat-containing protein   GO:0008152   metabolic process              P 
      A_AA960715   leucine-rich repeat-containing protein   GO:0016491   oxidoreductase activity        F 
      A_AA960715   leucine-rich repeat-containing protein   GO:0007165   signal transduction            P

and I want output like this:
Code:
A_AA960715 |   leucine-rich repeat-containing protein |  GO:0006952  GO:0008152 GO:0016491   GO:0007165   | defense response, metabolic  process, oxidoreductase activity,   signal transduction  | P, P, F, P



Waiting for your reply.

AAWT

Last edited by pludi; 03-08-2012 at 07:43 AM..
# 2  
Old 03-08-2012
Hi AAWT,

I think the input example is not enough to approach the problem:

1.- Is the input data well formatted? Are there spaces at the beginning of some lines?
2.- Do all lines begin with the same string?
3.- Could the second field contain different strings?
4.- Some values are joined with spaces and others with commas. Is that correct?
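
The formatting questions above can also be checked directly on the file. Here is a minimal sketch, assuming the data is meant to be tab-delimited; the function names `strip_leading` and `field_counts` are made up for illustration and are not from the thread.

```shell
# Hypothetical helpers; pass the data file as an argument or pipe it in.

# Remove leading spaces and tabs from every line.
# Assumption: the first column is never legitimately empty.
strip_leading() {
    sed 's/^[[:space:]]*//' "$@"
}

# Count how many lines have each number of tab-separated fields.
# Each output line is "<number of lines> <fields per line>".
field_counts() {
    awk -F '\t' '{ n[NF]++ } END { for (k in n) print n[k], k }' "$@"
}
```

If `field_counts` reports more than one field count, the file is probably not consistently tab-delimited and needs cleaning before any merge is attempted.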
# 3  
Old 03-08-2012
Yes, I think you are right.
This is a tab-delimited file. Not all lines begin with the same string in the first column; it is a big file containing data like this:
Code:
A_AA960715   leucine-rich repeat-containing protein   GO:0006952   defense response                        P 
 A_AA960715   leucine-rich repeat-containing protein   GO:0008152   metabolic process                      P
 A_AA960715   leucine-rich repeat-containing protein   GO:0016491   oxidoreductase activity              F
 A_AA960715   leucine-rich repeat-containing protein   GO:0007165   signal transduction                     P
A_AA960716   protein                                GO:0007075   defense response                        P  
A_AA960716    protein                               GO:0008082   metabolic process                       P
 A_AA960716    protein                               GO:0016583   oxidoreductase activity               F        
A_AA960716    protein                                GO:0009295   signal transduction                     P

I hope this makes clear what I need in the output.

Regards

AAWT

Moderator's Comments:
Please use code tags for your code and data next time.


---------- Post updated at 03:12 PM ---------- Previous update was at 02:05 PM ----------

So the output I need should look like this:
Code:
A_AA960715 |   leucine-rich repeat-containing protein |  GO:0006952  GO:0008152 GO:0016491   GO:0007165   | defense response, metabolic  process, oxidoreductase activity,   signal transduction  | P, P, F, P
A_AA960716 |    protein                               |  GO:0007075  GO:0008082 GO:0016583   GO:0009295   | defense response, metabolic  process, oxidoreductase activity,   signal transduction  | P, P, F, P

Waiting for your reply.

AAWT

Last edited by AAWT; 03-08-2012 at 10:04 AM..
# 4  
Old 03-09-2012
If your file is tab-delimited, as described, and records are grouped together in fours, then try...
Code:
$ cat file1
A_AA960715      leucine-rich repeat-containing protein  GO:0006952      defense response        P
A_AA960715      leucine-rich repeat-containing protein  GO:0008152      metabolic process       P
A_AA960715      leucine-rich repeat-containing protein  GO:0016491      oxidoreductase activity F
A_AA960715      leucine-rich repeat-containing protein  GO:0007165      signal transduction     P
A_AA960716      protein GO:0007075      defense response        P
A_AA960716      protein GO:0008082      metabolic process       P
A_AA960716      protein GO:0016583      oxidoreductase activity F
A_AA960716      protein GO:0009295      signal transduction     P

$ paste - - - - < file1 | awk -F \\t '{printf "%s | %-40s | %s %s %s %s | %s, %s, %s, %s | %s, %s, %s, %s \n", $1, $2, $3, $8, $13, $18, $4, $9, $14, $19, $5, $10, $15, $20}'
A_AA960715 | leucine-rich repeat-containing protein   | GO:0006952 GO:0008152 GO:0016491 GO:0007165 | defense response, metabolic process, oxidoreductase activity, signal transduction | P, P, F, P
A_AA960716 | protein                                  | GO:0007075 GO:0008082 GO:0016583 GO:0009295 | defense response, metabolic process, oxidoreductase activity, signal transduction | P, P, F, P
$
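
The `paste - - - -` approach above only lines up if every code in column one fills exactly four consecutive rows. Here is a small sketch for checking that first, assuming a tab-delimited file whose codes contain no spaces; the function name `check_groups_of_four` is hypothetical.

```shell
# Hypothetical helper: list every run of identical column-1 codes
# whose length is not exactly four. No output means the
# "paste - - - -" grouping is safe for this file.
check_groups_of_four() {
    cut -f1 "$@" | uniq -c | awk '$1 != 4 { print $2 ": " $1 " rows" }'
}
```

Run it as `check_groups_of_four file1`. Any line it prints names a code that would be split or mixed with its neighbours by the fixed four-line grouping.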

# 5  
Old 03-12-2012
Dear Ygor,
Thanks for your reply. What I got with this is not the required output:
Code:
       A_AA960715 | leucine-rich repeat-containing protein   | GO:0006952 GO:0008152 GO:0016491   GO:0007165 | defense response, metabolic process, oxidoreductase activity,   signal transduction | P   
      , P   
     , F     
      , P      
  
      A_AA960715   | leucine-rich repeat-containing protein     | GO:0005618 GO:0006952 GO:0005618 GO:0055114 | cell wall, defense   response, cell wall, oxidation reduction | C  
  
      , P   
     , C   
      , P    
       A_AA960716   | leucine-rich repeat-containing protein     | GO:0032440 GO:0016023 GO:0016310 GO:0016301 | 2-alkenal reductase   activity, cytoplasmic membrane-bounded vesicle, phosphorylation, kinase   activity | F       , C   
        , P   
        , F    
  A_A960716   | leucine-rich repeat-containing protein     | GO:0007165 GO:0003746 GO:0006184 GO:0003924 | signal transduction,   translation elongation factor activity, GTP catabolic process, GTPase   activity | P       , F   
     , P   
 , F

First, a code is not always repeated on four lines: sometimes the code in column one appears only once, and sometimes more than 10 times. Also, the | sign was only there to show the columns; it should not be in the output file.
Waiting for some help.

Regards

Last edited by AAWT; 03-13-2012 at 10:07 AM..
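
The requirement described above (any number of rows per code, and no | in the output) could be sketched in awk roughly as follows. This is not the method posted by anyone in the thread. It assumes the file is tab-delimited with five columns, that rows sharing a code are adjacent, and that stray carriage returns or padding spaces may be present; the function name `merge_rows` is made up for illustration.

```shell
# Hypothetical helper: merge consecutive rows that share the code in
# column 1. Output is tab-separated: code, name, GO ids joined by
# spaces, terms joined by ", ", categories joined by ", ".
merge_rows() {
    awk -F '\t' -v OFS='\t' '
    # Print the group collected so far, if there is one.
    function emit() {
        if (n) print id, name, gos, terms, cats
    }
    {
        sub(/\r$/, "")          # drop a DOS carriage return, if any
        sub(/^[ \t]+/, "")      # drop leading blanks on the line
        for (i = 1; i <= NF; i++)
            gsub(/^ +| +$/, "", $i)   # trim padding inside each field
        if ($1 == "") next      # skip empty lines
        if ($1 != id) {         # a new code starts a new group
            emit()
            id = $1; name = $2; gos = $3; terms = $4; cats = $5; n = 1
        } else {                # same code: append to the group
            gos = gos " " $3
            terms = terms ", " $4
            cats = cats ", " $5
            n++
        }
    }
    END { emit() }
    ' "$@"
}
```

Run it as `merge_rows file1 > merged.txt`. Because only one group is held in memory at a time, it should cope with a big file. If rows for the same code are scattered through the file, sort on column one first (for example with `sort -t "$(printf '\t')" -k1,1`).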
# 6  
Old 03-16-2012
Try this:
Code:
# cat infile
A_AA960715   leucine-rich repeat-containing protein   GO:0006952   defense response                        A
 A_AA960715   leucine-rich repeat-containing protein   GO:0016491   oxidoreductase activity              C
A_AA960716   protein                                F   defense response                        E
A_AA960716    protein                                F   SON BU2                     GG
A_AA960716    protein                                F   SON BU2                     GG
A_AA960718    protein                                T   SON BU2                     GG
A_AA960718    protein                                T   SON BU2                     GG
A_AA960718    protein                                T   SON BU2                     GG
A_AA960717   protein                                X   defense response                        H
A_AA960717    protein                               X   metabolic process                       I
 A_AA960717    protein                               X   oxidoreductase activity               J
A_AA960717    protein                                X   signal transduction                     K
A_AA960719    protein                                Y   signal transduction                     WW
A_AA960719    protein                                Y   signal transduction                     WW
A_AA960717    protein                                Q   signal transduction                     K
A_AA960717    protein                                Q   signal transduction                     K
A_AA960717    protein                                Q   signal transduction                     K

Code:
# awk -vfs="  |  " -vf1=", " -vlast=3 '
function sumforsum(c){lc=0;for(i=c;i>=(c-last);i--){d[x++]=$i;lc++}}
function sumforcheck(j){cmp="";for(i=1;i<(j-last);i++)if(i==1)cmp=cmp $i;else cmp=cmp FS $i;}
function compx(){sumforcheck(NF);sumforsum(NF);fcmp=cmp;if(getline){sumforcheck(NF)} else {writex();printf "%s","\n";exit}}
function combinedprntf(i,f){for(j=i;j<x;j+=last+1)printf "%s%s%s%s",d[j],FS,d[j-1],f;printf "%s%s%s",d[j],FS,d[j-1]}
function stdprntf(i,f){if(i==last){x=x-lc;l="ok"}for(j=i;j<x;j+=last+1)printf "%s%s",d[j],f;;if(l=="ok")printf "%s",d[j];}
function loopx(){while(fcmp==cmp){compx();;};writex();;x=0;;;sumforsum(NF)}
function writex(){split(fcmp,a);for(i=1;i<=length(a);i++){printf "%s%s",a[i],FS;if(i==1||i==length(a))printf "%s",fs;}
for(k=0;k<last;k++){if(k==0){f=FS;stdprntf(last,f);printf "%s",fs}if(k==1){;combinedprntf(2,f1);printf "%s",fs}if(k==last-1){stdprntf(0,f1);}}}
{compx();loopx();printf "%s","\n";}' infile
A_AA960715   |  leucine-rich repeat-containing protein   |  GO:0006952 GO:0016491  |  defense response, oxidoreductase activity  |  A, C
A_AA960716   |  protein   |  F F F  |  defense response, SON BU2, SON BU2  |  E, GG, GG
A_AA960718   |  protein   |  T T T  |  SON BU2, SON BU2, SON BU2  |  GG, GG, GG
A_AA960717   |  protein   |  X X X X  |  defense response, metabolic process, oxidoreductase activity, signal transduction  |  H, I, J, K
A_AA960719   |  protein   |  Y Y  |  signal transduction, signal transduction  |  WW, WW
A_AA960717   |  protein   |  Q Q Q  |  signal transduction, signal transduction, signal transduction  |  K, K, K

Keep in mind that this code works for your input file as posted.
$7 and $8 are the two values the code combines; if the input has more columns than that, it generates an incorrect sequence.

regards
ygemici
 