Awk: group multiple fields from different records

07-06-2018

Hi,

My input looks like this:

Code:

A|123|qwer
A|456|tyui
A|456|wsxe
B|789|dfgh

Using awk, I am trying to get:

Code:

A|123;456|qwer;tyui;wsxe
B|789|dfgh

For records with the same $1, collect all the $2 values into one field (without duplicates), and all the $3 values into another field (also without duplicates).

What I have tried:

Code:

echo -e "A|123|qwer\nA|456|tyui\nA|456|wsxe\nB|789|dfgh" | gawk 'BEGIN{FS=OFS="|"}{a[$1]=sprintf("%s%s", a[$1], a[$1] ~ /$2/ ? "":";"$2); b[$1]=sprintf("%s%s", b[$1], b[$1] ~ /$3/ ? "":";"$3)}END{for(i in a){print i FS a[i] FS b[i]}}'

(Wrong) output:

Code:

A|;123;456;456|;qwer;tyui;wsxe
B|;789|;dfgh

However, I still cannot manage to remove the duplicated strings inside fields $2 and $3.

Last edited by beca123456; 07-06-2018 at 09:12 AM..

beca123456

07-06-2018

Do yourself a favour and start indenting and structuring your code so that it is easier to read and understand. Try:

Code:

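awk -F\| '
        {if (!(a[$1] ~ $2)) a[$1] = a[$1] DL[$1] $2    # append $2 unless a[$1] already matches it
         if (!(b[$1] ~ $3)) b[$1] = b[$1] DL[$1] $3    # same for $3
         DL[$1] = ";"                                  # delimiter stays empty until the first value is stored
        }
END     {for (i in a) print i, a[i], b[i]
        }
' OFS="|"  file
A|123;456|qwer;tyui;wsxe
B|789|dfgh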
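
The reason the original attempt never suppressed anything seems to be the /$2/ literal: between slashes, $2 is not a field reference, and the "$" is read as part of the regular expression, so the pattern cannot match a stored value such as 456. Using the field as a dynamic regex (~ $2) is what makes the lookup work. A minimal sketch to check this, where the string s and the one-record input are made up for illustration:

```shell
# Hypothetical one-record input; s stands in for an already collected a[$1].
result=$(printf 'A|456|tyui\n' | awk -F'|' '{
    s = "123;456"
    print (s ~ /$2/)   # regex literal: the characters $ and 2, not field 2
    print (s ~ $2)     # dynamic regex: the contents of field 2, here 456
}')
printf '%s\n' "$result"
# prints 0, then 1
```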

RudiC

07-06-2018

The following variant does a precise lookup (to suppress duplicates),
and does not need an array of delimiters:

Code:

awk '
BEGIN {
  FS=OFS="|"
  dl=";"
}
function strjoin(i, j) {
  if (i=="") return j  # first element
  if (index((dl i dl), (dl j dl))) return i # duplicate
  return (i dl j) # join element
}
{
  # no space between a user-defined function name and "(" (required by POSIX awk)
  s2[$1]=strjoin(s2[$1], $2)
  s3[$1]=strjoin(s3[$1], $3)
}
END {
  for (i in s2) print i, s2[i], s3[i]
}
' file

This is a good demonstration of a function.
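
To see what strjoin does step by step, it can be called on its own in a BEGIN block. This is only a sketch: the sample values and the names v, n, k and s are invented for the trace and are not part of the solution above.

```shell
trace=$(awk '
function strjoin(i, j) {
  if (i == "") return j                       # first element
  if (index((dl i dl), (dl j dl))) return i   # already present as a whole element
  return (i dl j)                             # append new element
}
BEGIN {
  dl = ";"
  n = split("123 456 456 45", v, " ")         # made-up sample values
  for (k = 1; k <= n; k++) {
    s = strjoin(s, v[k])
    print v[k] " -> " s
  }
}')
printf '%s\n' "$trace"
# 123 -> 123
# 456 -> 123;456
# 456 -> 123;456
# 45 -> 123;456;45
```

Wrapping both the collected string and the candidate in the delimiter is what stops 45 from being mistaken for part of 456.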

MadeInGermany

07-06-2018

Quote:

The following variant does a precise lookup (to suppress duplicates)

I don't understand this statement. Both solutions seem to work just fine.
Is one more prone to errors than the other?

Last edited by beca123456; 07-06-2018 at 03:53 PM..

beca123456

07-06-2018

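The regular expression match ~ is different from the string search via index(): ~ interprets the field as a pattern and matches it anywhere in the collected string, whereas index() on the delimiter-wrapped strings only finds a complete, literal element.
You'll see differences, e.g., with the following input files: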

Code:

A|123|qwer
A|456|tyui
A|45|wsxe
B|789|dfgh

Code:

A|123|qwer
A|455|tyui
A|45*|wsxe
B|789|dfgh
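
A small sketch of the difference on the first input file, reduced to field 2 and the A records. The variable names (input, regex_out, index_out, dl, k) are chosen here for illustration, and the index lookup is written inline rather than through the strjoin function:

```shell
# Sample data taken from the first test file above (only the A records).
input='A|123|qwer
A|456|tyui
A|45|wsxe'

# Regex lookup: "123;456" ~ "45" is true, so 45 is wrongly treated as a duplicate.
regex_out=$(printf '%s\n' "$input" | awk -F'|' '
    { if (!(a[$1] ~ $2)) a[$1] = a[$1] dl[$1] $2; dl[$1] = ";" }
    END { for (k in a) print k "|" a[k] }')

# String lookup on delimiter-wrapped values: ";123;456;" does not contain ";45;".
index_out=$(printf '%s\n' "$input" | awk -F'|' '
    { if (!index(";" a[$1] ";", ";" $2 ";")) a[$1] = a[$1] dl[$1] $2; dl[$1] = ";" }
    END { for (k in a) print k "|" a[k] }')

printf 'regex: %s\nindex: %s\n' "$regex_out" "$index_out"
# regex: A|123;456
# index: A|123;456;45
```

With the second file, the regex lookup drops 45* for a different reason: 45* is a valid pattern (a 4 followed by any number of 5s) and it matches 455.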

MadeInGermany

07-06-2018

Very good point!
I get it now, thanks!

beca123456

07-06-2018

You can "sharpen" or "narrow down" the regex so that it only matches whole elements and avoids false positives, like this:

Code:

awk -F\| '
        {if (!(a[$1] ~ "(^|;)" $2 "(;|$)")) a[$1] = a[$1] DL[$1] $2    # match whole elements only
         if (!(b[$1] ~ "(^|;)" $3 "(;|$)")) b[$1] = b[$1] DL[$1] $3
         DL[$1] = ";"
        }
END     {for (i in a) print i, a[i], b[i]
        }
' OFS="|"  file
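
One caveat, as far as I can tell: the anchors remove the substring false positives, but the field is still interpreted as a regular expression, so a value containing metacharacters (such as 45* from the second test file) can still match a different stored value. A minimal sketch, reduced to field 2 and the A records; the shell function name anchored and the variables first and second are made up for this check:

```shell
# Anchored-regex lookup from the post above, reduced to field 2.
anchored() {
    awk -F'|' '
        { if (!(a[$1] ~ "(^|;)" $2 "(;|$)")) a[$1] = a[$1] dl[$1] $2
          dl[$1] = ";" }
        END { for (k in a) print k "|" a[k] }'
}

# First test file: 45 is no longer confused with 456.
first=$(printf 'A|123|qwer\nA|456|tyui\nA|45|wsxe\n' | anchored)
# Second test file: the pattern 45* still matches the stored 455.
second=$(printf 'A|123|qwer\nA|455|tyui\nA|45*|wsxe\n' | anchored)

printf '%s\n%s\n' "$first" "$second"
# A|123;456;45
# A|123;455
```

If the values may contain regex metacharacters, the index-based lookup is probably the safer choice.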

RudiC
