awk script to (un)/concatenate fields in file

03-22-2010

Hi everyone,

I'm trying to use the "join" command on more than 1 field. Since join only accepts a single join field, I want to take my input files and concatenate the joining fields into 1 field (separated by "|"). I wrote 2 awk scripts to do and undo this (see below). However, I'm new to awk and I'm certain it could be done much more efficiently.

I found various topics around the question but often the syntax proposed is a bit of a mystery to me. For instance someone posted this:

BEGIN{FS=OFS="\t"}NR==FNR{a[$1$2]=$4;b[$1$2]=$5;c[$1$2]=$6;next}{$4=$4-a[$1$2];$5=$5-b[$1$2];$6=$6-c[$1$2]}1

What does the trailing '1' mean? Why are there 2 separate {} blocks and what distinguishes them? Finally, where can I find documentation on that kind of question (googling "awk trailing digit" didn't help me much!)?
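For reference, an awk program is a list of "pattern { action }" rules, and the term to search for in the awk man page is "pattern-action". A rule with no pattern runs for every record, and a rule with no action prints the record, so a bare 1 (always true) simply prints. NR==FNR is true only while the first input file is being read. Here is a minimal sketch of the same idiom on made-up data; the file names and the array name seen are hypothetical and not taken from the quoted one-liner:

```shell
# Made-up data for illustration; the file names are hypothetical.
tmp=$(mktemp -d)
printf 'a\t1\nb\t2\n'   > "$tmp/first.tsv"
printf 'a\t10\nb\t20\n' > "$tmp/second.tsv"

result=$(awk '
BEGIN { FS = OFS = "\t" }           # BEGIN rule: runs once, before any input
NR == FNR { seen[$1] = $2; next }   # true only while reading the first file
{ $2 = $2 - seen[$1] }              # no pattern: runs for every record that gets here
1                                   # pattern only: always true, default action prints $0
' "$tmp/first.tsv" "$tmp/second.tsv")

printf '%s\n' "$result"
rm -r "$tmp"
```

The "next" in the second rule is what keeps records of the first file from reaching the last two rules, so only the second file is modified and printed (here: a 9 and b 18, tab-separated).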

Here are my scripts. I don't care much about syntax shortcuts, I only care about speed of execution!

Any help would be greatly appreciated.

To concatenate:

Code:

#!/bin/sh
#
# usage:
#     nawk -F$'\t' -v JF=3,5 -f concatene.awk ~/tmp/tmp15
#     nawk -F$'\t' -v JF=15,16,17,18 -f concatene.awk split/snp_j > concat
#
# JF stands for "join fields"
BEGIN { FS="\t";OFS="\t" }
{ 
    if (NR==1) {    # to do it only once (NR starts at 1)
        N=split(JF,JFS,",");
        for (i=1;i<=N;i++) {    # reverse it
            RJFS[JFS[i]] = i;
        }
    }

    LINE="";
    for (FIELD_INDEX=1 ; FIELD_INDEX<=N ; FIELD_INDEX++ ) {
        LINE=(FIELD_INDEX==1 ? "" : LINE"|")$JFS[FIELD_INDEX];
    }
    for (FIELD_INDEX=1 ; FIELD_INDEX<=NF ; FIELD_INDEX++ ) {
        if (!RJFS[FIELD_INDEX]) {
            LINE=LINE"\t"$FIELD_INDEX;
        }
    }
    print LINE;
}

example (with JF=3,5, fields separated by tabs):
input: a b c d e f
output: c|e a b d f

To "un"concatenate:

Code:

#!/bin/sh
# nawk -F$'\t' -v JF=3,5 -f unconcatene.awk test
BEGIN { FS="\t";OFS="\t" }
{ 
    if (NR==1) {    # to do it only once (NR starts at 1)
        N=split(JF,JFS,",");
        for (i=1;i<=N;i++) {    # reverse it
            RJFS[JFS[i]] = i;
        }
    }

    N2=split($1,JFS2,"|");    # N=N2
    for (i=1;i<=N;i++) {    # reverse it
        RJFS[JFS[i]] = JFS2[i];
    }

    SIZE=NF-1+N;
    FIELD_INDEX=2;
    LINE="";
    for (NEW_FIELD_INDEX=1 ; NEW_FIELD_INDEX<=SIZE ; NEW_FIELD_INDEX++ ) {
        LINE=LINE(NEW_FIELD_INDEX==1 ? "" : "\t");
        if (RJFS[NEW_FIELD_INDEX]) {
            LINE=(LINE)RJFS[NEW_FIELD_INDEX];
        } else {
            LINE=(LINE)$FIELD_INDEX;        
            FIELD_INDEX++;
        }
    }
    print LINE;
}

Thanks!!

example (with JF=3,5, fields separated by tabs):
input: c|e a b d f
output: a b c d e f

Anthony

anthony.cros

03-23-2010

Can you explain a little more about how you want to get this?

Code:

input: a b c d e f
output: c|e a b d f

malcomex999

03-23-2010

Hi,

First, thanks for responding!
I'm not sure what you mean by "how" I want to get this but I'll give you a more thorough example:

I have this file for instance (TSV):

Code:

a    b    c    d    e    f
g    h    i    j    k    l
m    n    o    p    q    r

Say I want to join fields 2, 3 and 6 with 3 columns of another file. Because join uses only 1 field, I want to put fields 2, 3 and 6 together, separated only by pipes (as opposed to my other fields, which are separated by tabs). So the concatene.awk script will give me the following:

Code:

b|c|f    a    d    e
h|i|l    g    j    k
n|o|r    m    p    q

To do so in the current script, I pass "2,3,6" as a parameter and build two arrays from it (awk's split() numbers elements from 1):
JFS[1] = 2, JFS[2] = 3, JFS[3] = 6
RJFS[2] = 1, RJFS[3] = 2, RJFS[6] = 3
so that for the first line $JFS[1] = b, $JFS[2] = c, $JFS[3] = f.
From there I rebuild each line by first going through JFS with a pipe separation, then adding the other fields with a tab separation by going through the NF fields and skipping the ones for which RJFS[field] exists.

Hope this makes more sense! I bet there is a much more optimized way to do it, though!

anthony.cros

03-23-2010

Can you post some lines of your input files and the desired output?

Franklin52

03-23-2010

Code:

nawk -f anthony.awk myFile
OR
nawk -v jf='2,3,6' -f anthony.awk myFile

anthony.awk:

Code:

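BEGIN {
  FS=OFS="\t"

  SEP_jf="|"
  if (!jf) jf="3,5"            # default join fields when -v jf=... is not given

  n=split(jf, jfA, ",")
  for(i=1;i<=n;i++)
    jfO[jfA[i]]                # referencing an element creates it; only the key matters
}
{
  line=jfS=""
  nj=nl=0                      # how many fields went into jfS and line so far
  for(i=1;i<=NF;i++)
    if (i in jfO)
       # counters instead of testing the string, so a leading "0" or empty field is kept
       jfS=(nj++ ? jfS SEP_jf : "") $i
    else
      line=(nl++ ? line OFS : "") $i
  # note: the key is built in field order, not in the order listed in jf
  print jfS, line
}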
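The reverse direction could follow the same pattern. This is only a sketch, under two assumptions: the key sits in field 1, and its parts are in ascending field order (which is what anthony.awk emits, but not necessarily what concatene.awk emits when JF is unsorted). The function name unjoin is hypothetical.

```shell
# unjoin: hypothetical inverse of anthony.awk; reads "k1|k2|...<TAB>rest" on stdin
unjoin() {
  awk -v jf="$1" '
  BEGIN {
    FS = OFS = "\t"
    n = split(jf, jfA, ",")
    for (i = 1; i <= n; i++)
      jfO[jfA[i]]               # mark the original positions of the key fields
  }
  {
    nk = split($1, key, "|")    # key parts, assumed to be in ascending field order
    total = NF - 1 + nk         # number of fields in the restored record
    k = 1; f = 2; line = ""
    for (i = 1; i <= total; i++) {
      val = (i in jfO) ? key[k++] : $(f++)   # take from the key or from the rest
      line = (i > 1 ? line OFS : "") val
    }
    print line
  }'
}

result=$(printf 'b|c|f\ta\td\te\n' | unjoin 2,3,6)
printf '%s\n' "$result"
```

Because the test is "i in jfO" rather than the truth value of the stored string, key parts that are 0 or empty are restored correctly as well.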

vgersh99

03-23-2010

Thanks a lot for your response.

I see that using BEGIN is cleaner than my "if (NR==1)", and it's good to know that "if (i in jfO)" exists!

Franklin, my post from 9:55 describes it pretty well; what info are you missing?

anthony.cros
