Combine multiple rows based on selected column keys


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Combine multiple rows based on selected column keys
# 1  
Old 08-09-2016
Combine multiple rows based on selected column keys

Hello
I want to collapse a file with multiple rows into consolidated lines of entries based on selected columns as the 'key'.

Example:

Code:
1 2 3 Abc def ghi
1 2 3 jkl mno p qrts
6 9 0 mno def Abc
7 8 4 Abc mno mno abc 
7 8 9 mno mno abc
7 8 9  mno j k

So if columns 1, 2 and 3 are designated as keys, output should be -
Code:
1 2 3 Abc def ghi jkl mno p qrts
6 9 0 mno def Abc
7 8 4 Abc mno mno abc 
7 8 9 mno mno abc mno j k

Thanks!

Last edited by rbatte1; 08-09-2016 at 04:36 AM.. Reason: Added CODE tags
# 2  
Old 08-09-2016
Hello linuxlearner123,

A warm welome to forums, hope you will enjoy learning here. Please use code tags as per forum rules for code/commands/Inputs you are using in your post. Could you please try following and let me know if this helps.

i- If you are not bothered about the order of lines in output then following may help you in same.
Code:
awk '{Q=$1 OFS $2 OFS $3;P=$0;$1=$2=$3="\b";A[Q]=A[Q]?A[Q] OFS $0:P} END{for(i in A){print A[i]}}'  Input_file

ii- If you want the output as sequence as Input_file then following may help you where I am reading Input_file 2 times.
Code:
awk 'FNR==NR{Q=$1 OFS $2 OFS $3;P=$0;$1=$2=$3="\b";A[Q]=A[Q]?A[Q] OFS $0:P;next} (($1 OFS $2 OFS $3) in A){print A[$1 OFS $2 OFS $3]}' Input_file  Input_file

I haven't tested both commands as I am not on a box right now, please check them and let us know how it goes then.

Thanks,
R. Singh
# 3  
Old 08-09-2016
Thank you for the prompt reply. I tried both the commands.
(i)
Code:
awk '{Q=$1 OFS $2 OFS $3;P=$0;$1=$2=$3="\b";A[Q]=A[Q]?A[Q] OFS $0:P} END{for(i in A){print A[i]}}' input_file
6 9 0 mno def Abc
7 8 4 Abc mno mno abc
7 8 9 mno mno abc mno j k
1 2 3 Abc def ghi jkl mno p qrts


(ii)
Code:
awk 'FNR==NR{Q=$1 OFS $2 OFS $3;P=$0;$1=$2=$3="\b";A[Q]=A[Q]?A[Q] OFS $0:P;next} (($1 OFS $2 OFS $3) in A){print A[$1 OFS $2 OFS $3]}' input_file input_file

1 2 3 Abc def ghi jkl mno p qrts
1 2 3 Abc def ghi jkl mno p qrts
6 9 0 mno def Abc
7 8 4 Abc mno mno abc
7 8 9 mno mno abc mno j k
7 8 9 mno mno abc mno j k

So while the order of the lines would be helpful, I think the first command should help. I can then sort the lines. Or I could use `uniq` on the second command.

Do you mind explaining either of the two commands a bit, so I can use and modify in the future? Thanks!

Last edited by rbatte1; 08-09-2016 at 04:35 AM.. Reason: Adjusted CODE tags to include the output
# 4  
Old 08-09-2016
Hello linuxlearner123,

For second command you could use following where it will NOT show duplicate lines.
Code:
awk 'FNR==NR{Q=$1 OFS $2 OFS $3;P=$0;$1=$2=$3="\b";A[Q]=A[Q]?A[Q] OFS $0:P;next} (($1 OFS $2 OFS $3) in A){print A[$1 OFS $2 OFS $3];delete A[$1 OFS $2 OFS $3]}' Input_file Input_file

Output will be as follows.
Code:
1 2 3 Abc def ghi jkl mno p qrts
6 9 0 mno def Abc
7 8 4 Abc mno mno abc
7 8 9 mno mno abc mno j k

Now coming to explanation part on same. Following may help you in same.
Code:
awk 'FNR==NR{               #### Here FNR and NR are the awk variables default ones. Where FNR and NR both represents the number of line for Input_file. Difference between FNR and NR is, FNR's value will be RESET whenever awk has completed a Input_file reading and reading another one. 
                                 So FNR==NR condition will be TRUE only when 1st file is getting read because NR's value will be increased till second time Input_file is being read.
Q=$1 OFS $2 OFS $3;         #### Creating a variable named Q, whose value is $1 OFS $2 OFS $3, means field 1st space 2nd field space 3rd field. Where OFS is output field seprator whose defaulkt value is SPACE.
P=$0;                       #### Creating a variable named P whose value is $0(current line's value).
$1=$2=$3="\b";              #### Here I am removing the value of 1st, 2nd and 3rd field is to backspace or setting it to NULL.
A[Q]=A[Q]?A[Q] OFS $0:P;    #### Creating an array here whose index is variable Q($1 $2 $3, as per your requirement), so first value for any index will be complete line and after that it will keep on concatenate the values to it's previous values.
next}                       #### Skip all next statements by awk's in built variable named next.
(($1 OFS $2 OFS $3) in A){  #### Now when 2nd Input_file is being read then this condition will be executed. Where I am checking $1 OFS $2 OFS $3 if they are present in array A then execute following statements.
print A[$1 OFS $2 OFS $3];  #### printing the value of array A whose index is $1 OFS $2 OFS $3, whose value we already got while reading previous Input_file.
delete A[$1 OFS $2 OFS $3]  #### Deleting the value of array A whose index is $1 OFS $2 OFS $3, so that we could avoid duplicate printing the lines.
}'  Input_file   Input_file #### Mentioning the Input_file 2 times to be read.

Thanks,
R. Singh

Last edited by RavinderSingh13; 08-09-2016 at 04:35 AM..
# 5  
Old 08-09-2016
Assingning a backspace character to a field does NOT remove it. Your above line will look like
Code:
<BS> FS <BS> FS <BS> FS <rest of record>

. While this may not be apparent when printing the line to a terminal, using it in further processing (analysing it, putting it into a DB) it may not react the way you would expect.
# 6  
Old 08-09-2016
Hello linuxlearner123,

Could you please try following too.
Code:
awk 'FNR==NR{for(i=4;i<=NF;i++){Q=Q?Q OFS $i:$i};A[$1 OFS $2 OFS $3]=A[$1 OFS $2 OFS $3]?A[$1 OFS $2 OFS $3] OFS Q:Q;Q="";next} (($1 OFS $2 OFS $3) in A){print $1 OFS $2 OFS $3 OFS A[$1 OFS $2 OFS $3];delete A[$1 OFS $2 OFS $3]}'  Input_file  Input_file

Output will be as follows.
Code:
1 2 3 Abc def ghi jkl mno p qrts
6 9 0 mno def Abc
7 8 4 Abc mno mno abc
7 8 9 mno mno abc mno j k

Following is a non-one liner form of above solution too.
Code:
awk 'FNR==NR{for(i=4;i<=NF;i++){
                                Q=Q?Q OFS $i:$i
                               };
             A[$1 OFS $2 OFS $3]=A[$1 OFS $2 OFS $3]?A[$1 OFS $2 OFS $3] OFS Q:Q;
             Q="";
             next
            }
     (($1 OFS $2 OFS $3) in A) {
                                print $1 OFS $2 OFS $3 OFS A[$1 OFS $2 OFS $3];
                                delete A[$1 OFS $2 OFS $3]
                               }
    '  Input_file  Input_file

EDIT: Adding explanation of above code too.
Code:
awk 'FNR==NR{for(i=4;i<=NF;i++){                                                     #### Mentioning condition FNR==NR, which will be TRUE when first Input_file is being read. If condition is TRUE then executing the statements into the braces. Where a for loop stars from the variable named i's value from 4 to till the number of fields into that line. 
                                                                                          We are starting from i's value as 4 because first 3 fields are indexes as per your requirement. 
                                Q=Q?Q OFS $i:$i                                      #### We are storing values of all fields apart from $1,$2 and $3 into a variable named Q.
                               };
             A[$1 OFS $2 OFS $3]=A[$1 OFS $2 OFS $3]?A[$1 OFS $2 OFS $3] OFS Q:Q;    #### Creating an array named A whose index is $1 OFS $2 OFS $3 and it's value is concatenating values for common indexes, so for very first time for any index it will be value of Q and then for next time(means when next line comes) it will be concatenated to previous values of array A.
             Q="";                                                                   #### Nullifying the value of variable Q so that it shouldn't be keep on increasing it's value and in next line it's value should be started from scratch only.
             next                                                                    #### Skipping all next statements by using next keyword here.
            }
     (($1 OFS $2 OFS $3) in A) {                                                     #### Now these statements will be executed when second Input_file is being read. If ($1 OFS $2 OFS $3)'s value is present as an index in array A then execute following statements.
                                print $1 OFS $2 OFS $3 OFS A[$1 OFS $2 OFS $3];      #### print the first, second and third field along with the value of array A whose index is current 1st, 2nd and 3rd fields.
                                delete A[$1 OFS $2 OFS $3]                           #### Delete the array A's value whose index is $1 OFS $2 OFS $3, so that duplicate results shouldn't be printed on same indexes for next lines.
                               }
    ' Input_file   Input_file                                                        #### Mentioning the Input_file here 2 times.

Thanks,
R. Singh

Last edited by RavinderSingh13; 08-09-2016 at 10:33 AM.. Reason: Added explanation for solution too now.
This User Gave Thanks to RavinderSingh13 For This Post:
# 7  
Old 08-10-2016
Quote:
Originally Posted by RudiC
Assingning a backspace character to a field does NOT remove it. Your above line will look like
Code:
<BS> FS <BS> FS <BS> FS <rest of record>

. While this may not be apparent when printing the line to a terminal, using it in further processing (analysing it, putting it into a DB) it may not react the way you would expect.
Yes, this is correct. When I tried the command with another file, it is essentially pasting rest of the record.

---------- Post updated at 02:20 AM ---------- Previous update was at 02:06 AM ----------

@R.Singh:
The updated solution seems to be working correctly, for now. Will update thread if I run into any issues. Thank you Smilie
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Find All duplicates based on multiple keys

Hi All, Input.txt 123,ABC,XYZ1,A01,IND,I68,IND,NN 123,ABC,XYZ1,A01,IND,I67,IND,NN 998,SGR,St,R834,scot,R834,scot,NN 985,SGR0399,St,R180,T15,R180,T1,YY 985,SGR0399,St,R180,T15,R180,T1,NN 985,SGR0399,St,R180,T15,R180,T1,NN 2943,SGR?99,St,R68,Scot,R77,Scot,YY... (2 Replies)
Discussion started by: unme
2 Replies

2. Shell Programming and Scripting

Converting Single Column into Multiple rows, but with strings to specific tab column

Dear fellows, I need your help. I'm trying to write a script to convert a single column into multiple rows. But it need to recognize the beginning of the string and set it to its specific Column number. Each Line (loop) begins with digit (RANGE). At this moment it's kind of working, but it... (6 Replies)
Discussion started by: AK47
6 Replies

3. Shell Programming and Scripting

Please Help!!!! Awk for summing columns based on selected column value

a,b,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,aa,bb,cc,dd,ee,ff,gg,hh,ii a thru ii are digits and strings.... The awk needed....if coloumn 9 == i (coloumn 9 is string ), output the sum of x's(coloumn 22 ) in all records and sum of y's (coloumn 23 ) in all records in a file (records.txt).... (6 Replies)
Discussion started by: BrownBob
6 Replies

4. UNIX for Dummies Questions & Answers

merging rows into new file based on rows and first column

I have 2 files, file01= 7 columns, row unknown (but few) file02= 7 columns, row unknown (but many) now I want to create an output with the first field that is shared in both of them and then subtract the results from the rest of the fields and print there e.g. file 01 James|0|50|25|10|50|30... (1 Reply)
Discussion started by: A-V
1 Replies

5. Shell Programming and Scripting

awk command to print only selected rows in a particular column specified by column name

Dear All, I have a data file input.csv like below. (Only five column shown here for example.) Data1,StepNo,Data2,Data3,Data4 2,1,3,4,5 3,1,5,6,7 3,2,4,5,6 5,3,5,5,6 From this I want the below output Data1,StepNo,Data2,Data3,Data4 2,1,3,4,5 3,1,5,6,7 where the second column... (4 Replies)
Discussion started by: ks_reddy
4 Replies

6. Shell Programming and Scripting

Compare files column to column based on keys

Here is my situation. I need to compare two tab separated files (diff is not useful since there could be known difference between files). I have found similar posts , but not fully matching.I was thinking of writing a shell script using cut and grep and while loop but after going thru posts it... (2 Replies)
Discussion started by: blackjack101
2 Replies

7. Shell Programming and Scripting

Sum a column value based on multiple keys

Hi, I have below as i/p file: 5ABC 36488989 K 000010000ASB BYTRES 5PQR 45757754 K 000200005KPC HGTRET 5ABC 36488989 K 000045000ASB HGTRET 5GTH 36488989 K 000200200ASB BYTRES 5FTU ... (2 Replies)
Discussion started by: nirnkv
2 Replies

8. Shell Programming and Scripting

sorting csv file based on column selected

Hi all, in my csv file it'll look like this, and of course it may have more columns US to UK;abc-hq-jcl;multimedia UK to CN;def-ny-jkl;standard DE to DM;abc-ab-klm;critical FD to YM;la-yr-tym;standard HY to MC;la-yr-ytm;multimedia GT to KJ;def-ny-jrt;critical I would like to group... (4 Replies)
Discussion started by: tententen
4 Replies

9. UNIX for Dummies Questions & Answers

Joining files based on multiple keys

I need a script (perl or awk..anything is fine) to join 3 files based on three key columns. The no of non-key columns can vary in each file. The columns are delimited by semicolon. For example, File1 Dim1;Dim2;Dim3;Fact1;Fact2;Fact3;Fact4;Fact5 ---- data delimited by semicolon --- ... (1 Reply)
Discussion started by: Sebben
1 Replies

10. Shell Programming and Scripting

flags to suppress column output, # of rows selected in db2 sql in UNIX

Hello, I am new to db2 SQL in unix so bear with me while I try to explain the situation. I have a text file that has the contents of the where condition that I am using for a db2 SQL in UNIX ksh. Here is the snippet. if ; then echo "Begin processing VALUEs" ... (1 Reply)
Discussion started by: jerardfjay
1 Replies
Login or Register to Ask a Question