Combine multiple rows based on selected column keys

08-09-2016

Registered User

3, 0

Join Date: Aug 2016

Last Activity: 10 August 2016, 3:20 AM EDT

Posts: 3

Thanks Given: 1

Thanked 0 Times in 0 Posts

Combine multiple rows based on selected column keys

Hello
I want to collapse a file with multiple rows into consolidated lines of entries based on selected columns as the 'key'.

Example:

Code:

1 2 3 Abc def ghi
1 2 3 jkl mno p qrts
6 9 0 mno def Abc
7 8 4 Abc mno mno abc 
7 8 9 mno mno abc
7 8 9  mno j k

So if columns 1, 2 and 3 are designated as keys, output should be -

Code:

1 2 3 Abc def ghi jkl mno p qrts
6 9 0 mno def Abc
7 8 4 Abc mno mno abc 
7 8 9 mno mno abc mno j k

Thanks!

Last edited by rbatte1; 08-09-2016 at 04:36 AM.. Reason: Added CODE tags

linuxlearner123

View Public Profile for linuxlearner123

Find all posts by linuxlearner123

08-09-2016

Moderator

3,105, 1,603

Join Date: May 2013

Last Activity: 31 August 2020, 1:46 AM EDT

Location: Chennai

Posts: 3,105

Thanks Given: 1,269

Thanked 1,603 Times in 1,369 Posts

Hello linuxlearner123,

A warm welome to forums, hope you will enjoy learning here. Please use code tags as per forum rules for code/commands/Inputs you are using in your post. Could you please try following and let me know if this helps.

i- If you are not bothered about the order of lines in output then following may help you in same.

Code:

awk '{Q=$1 OFS $2 OFS $3;P=$0;$1=$2=$3="\b";A[Q]=A[Q]?A[Q] OFS $0:P} END{for(i in A){print A[i]}}'  Input_file

ii- If you want the output as sequence as Input_file then following may help you where I am reading Input_file 2 times.

Code:

awk 'FNR==NR{Q=$1 OFS $2 OFS $3;P=$0;$1=$2=$3="\b";A[Q]=A[Q]?A[Q] OFS $0:P;next} (($1 OFS $2 OFS $3) in A){print A[$1 OFS $2 OFS $3]}' Input_file  Input_file

I haven't tested both commands as I am not on a box right now, please check them and let us know how it goes then.

Thanks,
R. Singh

RavinderSingh13

View Public Profile for RavinderSingh13

Find all posts by RavinderSingh13

08-09-2016

Registered User

3, 0

Join Date: Aug 2016

Last Activity: 10 August 2016, 3:20 AM EDT

Posts: 3

Thanks Given: 1

Thanked 0 Times in 0 Posts

Thank you for the prompt reply. I tried both the commands.
(i)

Code:

awk '{Q=$1 OFS $2 OFS $3;P=$0;$1=$2=$3="\b";A[Q]=A[Q]?A[Q] OFS $0:P} END{for(i in A){print A[i]}}' input_file
6 9 0 mno def Abc
7 8 4 Abc mno mno abc
7 8 9 mno mno abc mno j k
1 2 3 Abc def ghi jkl mno p qrts

(ii)

Code:

awk 'FNR==NR{Q=$1 OFS $2 OFS $3;P=$0;$1=$2=$3="\b";A[Q]=A[Q]?A[Q] OFS $0:P;next} (($1 OFS $2 OFS $3) in A){print A[$1 OFS $2 OFS $3]}' input_file input_file

1 2 3 Abc def ghi jkl mno p qrts
1 2 3 Abc def ghi jkl mno p qrts
6 9 0 mno def Abc
7 8 4 Abc mno mno abc
7 8 9 mno mno abc mno j k
7 8 9 mno mno abc mno j k

So while the order of the lines would be helpful, I think the first command should help. I can then sort the lines. Or I could use `uniq` on the second command.

Do you mind explaining either of the two commands a bit, so I can use and modify in the future? Thanks!

Last edited by rbatte1; 08-09-2016 at 04:35 AM.. Reason: Adjusted CODE tags to include the output

linuxlearner123

View Public Profile for linuxlearner123

Find all posts by linuxlearner123

08-09-2016

Moderator

3,105, 1,603

Join Date: May 2013

Last Activity: 31 August 2020, 1:46 AM EDT

Location: Chennai

Posts: 3,105

Thanks Given: 1,269

Thanked 1,603 Times in 1,369 Posts

Hello linuxlearner123,

For second command you could use following where it will NOT show duplicate lines.

Code:

awk 'FNR==NR{Q=$1 OFS $2 OFS $3;P=$0;$1=$2=$3="\b";A[Q]=A[Q]?A[Q] OFS $0:P;next} (($1 OFS $2 OFS $3) in A){print A[$1 OFS $2 OFS $3];delete A[$1 OFS $2 OFS $3]}' Input_file Input_file

Output will be as follows.

Code:

1 2 3 Abc def ghi jkl mno p qrts
6 9 0 mno def Abc
7 8 4 Abc mno mno abc
7 8 9 mno mno abc mno j k

Now coming to explanation part on same. Following may help you in same.

Code:

awk 'FNR==NR{               #### Here FNR and NR are the awk variables default ones. Where FNR and NR both represents the number of line for Input_file. Difference between FNR and NR is, FNR's value will be RESET whenever awk has completed a Input_file reading and reading another one. 
                                 So FNR==NR condition will be TRUE only when 1st file is getting read because NR's value will be increased till second time Input_file is being read.
Q=$1 OFS $2 OFS $3;         #### Creating a variable named Q, whose value is $1 OFS $2 OFS $3, means field 1st space 2nd field space 3rd field. Where OFS is output field seprator whose defaulkt value is SPACE.
P=$0;                       #### Creating a variable named P whose value is $0(current line's value).
$1=$2=$3="\b";              #### Here I am removing the value of 1st, 2nd and 3rd field is to backspace or setting it to NULL.
A[Q]=A[Q]?A[Q] OFS $0:P;    #### Creating an array here whose index is variable Q($1 $2 $3, as per your requirement), so first value for any index will be complete line and after that it will keep on concatenate the values to it's previous values.
next}                       #### Skip all next statements by awk's in built variable named next.
(($1 OFS $2 OFS $3) in A){  #### Now when 2nd Input_file is being read then this condition will be executed. Where I am checking $1 OFS $2 OFS $3 if they are present in array A then execute following statements.
print A[$1 OFS $2 OFS $3];  #### printing the value of array A whose index is $1 OFS $2 OFS $3, whose value we already got while reading previous Input_file.
delete A[$1 OFS $2 OFS $3]  #### Deleting the value of array A whose index is $1 OFS $2 OFS $3, so that we could avoid duplicate printing the lines.
}'  Input_file   Input_file #### Mentioning the Input_file 2 times to be read.

Thanks,
R. Singh

Last edited by RavinderSingh13; 08-09-2016 at 04:35 AM..

RavinderSingh13

View Public Profile for RavinderSingh13

Find all posts by RavinderSingh13

08-09-2016

Registered User

15,129, 5,008

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

Assingning a backspace character to a field does NOT remove it. Your above line will look like

Code:

<BS> FS <BS> FS <BS> FS <rest of record>

. While this may not be apparent when printing the line to a terminal, using it in further processing (analysing it, putting it into a DB) it may not react the way you would expect.

RudiC

View Public Profile for RudiC

Find all posts by RudiC

08-09-2016

Moderator

3,105, 1,603

Join Date: May 2013

Last Activity: 31 August 2020, 1:46 AM EDT

Location: Chennai

Posts: 3,105

Thanks Given: 1,269

Thanked 1,603 Times in 1,369 Posts

Hello linuxlearner123,

Could you please try following too.

Code:

awk 'FNR==NR{for(i=4;i<=NF;i++){Q=Q?Q OFS $i:$i};A[$1 OFS $2 OFS $3]=A[$1 OFS $2 OFS $3]?A[$1 OFS $2 OFS $3] OFS Q:Q;Q="";next} (($1 OFS $2 OFS $3) in A){print $1 OFS $2 OFS $3 OFS A[$1 OFS $2 OFS $3];delete A[$1 OFS $2 OFS $3]}'  Input_file  Input_file

Output will be as follows.

Code:

1 2 3 Abc def ghi jkl mno p qrts
6 9 0 mno def Abc
7 8 4 Abc mno mno abc
7 8 9 mno mno abc mno j k

Following is a non-one liner form of above solution too.

Code:

awk 'FNR==NR{for(i=4;i<=NF;i++){
                                Q=Q?Q OFS $i:$i
                               };
             A[$1 OFS $2 OFS $3]=A[$1 OFS $2 OFS $3]?A[$1 OFS $2 OFS $3] OFS Q:Q;
             Q="";
             next
            }
     (($1 OFS $2 OFS $3) in A) {
                                print $1 OFS $2 OFS $3 OFS A[$1 OFS $2 OFS $3];
                                delete A[$1 OFS $2 OFS $3]
                               }
    '  Input_file  Input_file

EDIT: Adding explanation of above code too.

Code:

awk 'FNR==NR{for(i=4;i<=NF;i++){                                                     #### Mentioning condition FNR==NR, which will be TRUE when first Input_file is being read. If condition is TRUE then executing the statements into the braces. Where a for loop stars from the variable named i's value from 4 to till the number of fields into that line. 
                                                                                          We are starting from i's value as 4 because first 3 fields are indexes as per your requirement. 
                                Q=Q?Q OFS $i:$i                                      #### We are storing values of all fields apart from $1,$2 and $3 into a variable named Q.
                               };
             A[$1 OFS $2 OFS $3]=A[$1 OFS $2 OFS $3]?A[$1 OFS $2 OFS $3] OFS Q:Q;    #### Creating an array named A whose index is $1 OFS $2 OFS $3 and it's value is concatenating values for common indexes, so for very first time for any index it will be value of Q and then for next time(means when next line comes) it will be concatenated to previous values of array A.
             Q="";                                                                   #### Nullifying the value of variable Q so that it shouldn't be keep on increasing it's value and in next line it's value should be started from scratch only.
             next                                                                    #### Skipping all next statements by using next keyword here.
            }
     (($1 OFS $2 OFS $3) in A) {                                                     #### Now these statements will be executed when second Input_file is being read. If ($1 OFS $2 OFS $3)'s value is present as an index in array A then execute following statements.
                                print $1 OFS $2 OFS $3 OFS A[$1 OFS $2 OFS $3];      #### print the first, second and third field along with the value of array A whose index is current 1st, 2nd and 3rd fields.
                                delete A[$1 OFS $2 OFS $3]                           #### Delete the array A's value whose index is $1 OFS $2 OFS $3, so that duplicate results shouldn't be printed on same indexes for next lines.
                               }
    ' Input_file   Input_file                                                        #### Mentioning the Input_file here 2 times.

Thanks,
R. Singh

Last edited by RavinderSingh13; 08-09-2016 at 10:33 AM.. Reason: Added explanation for solution too now.

This User Gave Thanks to RavinderSingh13 For This Post:

RavinderSingh13

View Public Profile for RavinderSingh13

Find all posts by RavinderSingh13

08-10-2016

Registered User

3, 0

Join Date: Aug 2016

Last Activity: 10 August 2016, 3:20 AM EDT

Posts: 3

Thanks Given: 1

Thanked 0 Times in 0 Posts

Quote:

Originally Posted by RudiC

Assingning a backspace character to a field does NOT remove it. Your above line will look like

Code:

<BS> FS <BS> FS <BS> FS <rest of record>

. While this may not be apparent when printing the line to a terminal, using it in further processing (analysing it, putting it into a DB) it may not react the way you would expect.

Yes, this is correct. When I tried the command with another file, it is essentially pasting rest of the record.

---------- Post updated at 02:20 AM ---------- Previous update was at 02:06 AM ----------

@R.Singh:
The updated solution seems to be working correctly, for now. Will update thread if I run into any issues. Thank you

linuxlearner123

View Public Profile for linuxlearner123

Find all posts by linuxlearner123

Shell Programming and Scripting

Combine multiple rows based on selected column keys

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Find All duplicates based on multiple keys

Discussion started by: unme

2. Shell Programming and Scripting

Converting Single Column into Multiple rows, but with strings to specific tab column

Discussion started by: AK47

3. Shell Programming and Scripting

Please Help!!!! Awk for summing columns based on selected column value

Discussion started by: BrownBob

4. UNIX for Dummies Questions & Answers

merging rows into new file based on rows and first column

Discussion started by: A-V

5. Shell Programming and Scripting

awk command to print only selected rows in a particular column specified by column name

Discussion started by: ks_reddy

6. Shell Programming and Scripting

Compare files column to column based on keys

Discussion started by: blackjack101

7. Shell Programming and Scripting

Sum a column value based on multiple keys

Discussion started by: nirnkv

8. Shell Programming and Scripting

sorting csv file based on column selected

Discussion started by: tententen

9. UNIX for Dummies Questions & Answers

Joining files based on multiple keys

Discussion started by: Sebben

10. Shell Programming and Scripting

flags to suppress column output, # of rows selected in db2 sql in UNIX

Discussion started by: jerardfjay