Join files on multiple fields

07-28-2015

Hello all,

I want to join two tab-separated files on the first two fields, filling missing values with 0. The 3rd column in each file is constant for the entire file.

file1

Code:

12658699	ST5	XX2720	0	1	0	1					
53039541	ST5	XX2720	1	0	1.5	1

file2

Code:

53039541	ST5	X23	0	1	0	1					
1267456	ST1	X23	1	0	1.4	1

Desired output

Code:

12658699	ST5	XX2720	0	1	0	1	X23	0	0	0	0
53039541	ST5	XX2720	1	0	1.5	1	X23	0	1	0	1
1267456	ST1	XX2720	0	0	0	0	X23	1	0	1.4	1

It throws an error when I run the following after sorting:

Code:

join  -a1 -a2  -t$'\t'   -1 1,2 -2 1,2 file1 file2
join: invalid field number: `1,2'

Please help!

sheetalk

07-28-2015

Hello sheetalk,

I haven't tested this across different scenarios, but please check whether the following helps.

Code:

awk -F"\t" 'FNR==NR{A[$1]=$0;next} ($1 in A){Q=$1;$1="";gsub(/^[[:space:]]+/,X,$0);print A[Q] OFS $0;delete A[Q];next} !($1 in A){$3="XX2720" OFS 0 OFS 0 OFS 0 OFS 0 OFS "X23";print $0} END{for(i in A){gsub(/[[:space:]]+$/,X,A[i]);print A[i] OFS "X23" OFS 0 OFS 0 OFS 0 OFS 0}}' OFS="\t" file1 file2

Output will be as follows.

Code:

53039541 ST5    XX2720  1       0       1.5     1       X23     0       1       0       1
1267456  ST1    XX2720  0       0       0       0       X23     1       0       1.4     1
12658699 ST5    XX2720  0       1       0       1       X23     0       0       0       0

EDIT: Adding a multi-line form of the same solution.

Code:

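awk -F"\t" '
FNR==NR {                       # first file: store each line under its first field
    A[$1]=$0
    next
}
($1 in A) {                     # second file, key also seen in the first file
    Q=$1
    $1=""
    gsub(/^[[:space:]]+/,X,$0)  # strip the leading separator left by blanking $1
    print A[Q] OFS $0
    delete A[Q]
    next
}
!($1 in A) {                    # second file, key missing from the first file
    $3="XX2720" OFS 0 OFS 0 OFS 0 OFS 0 OFS "X23"
    print $0
}
END {                           # keys that appeared only in the first file
    for(i in A) {
        gsub(/[[:space:]]+$/,X,A[i])
        print A[i] OFS "X23" OFS 0 OFS 0 OFS 0 OFS 0
    }
}
' OFS="\t" file1 file2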

Thanks,
R. Singh

Last edited by RavinderSingh13; 07-28-2015 at 12:42 PM.. Reason: Added a non-one liner form for solution too now

RavinderSingh13

07-28-2015

Thanks! Can the 3rd column be picked up on the fly rather than hard-coded? I have multiple pairs of files to join, and I want to process them in a loop rather than hard-coding the names.

sheetalk

07-28-2015

The error message is self-explanatory: join accepts only a single join field per file.
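A possible workaround, sketched here rather than taken from the thread, is to fuse the two key fields into one temporary key so that join sees a single join field, then split the key again afterwards. The helper name `join2`, the `:` used as glue, and the fixed layout of exactly seven meaningful tab-separated columns are all assumptions; the sketch also assumes field 1 never contains a `:`.

```shell
# join2 FILE1 FILE2 : hypothetical helper, not part of the original thread.
# Assumes 7 meaningful tab-separated columns per file and no ":" in field 1.
join2() {
  tab=$(printf '\t')
  tmp=$(mktemp -d) || return 1
  # Fuse fields 1 and 2 into one key; any trailing empty fields are dropped.
  fuse='{ print $1 ":" $2, $3, $4, $5, $6, $7 }'
  awk -F'\t' -v OFS='\t' "$fuse" "$1" | LC_ALL=C sort -t "$tab" -k1,1 > "$tmp/a"
  awk -F'\t' -v OFS='\t' "$fuse" "$2" | LC_ALL=C sort -t "$tab" -k1,1 > "$tmp/b"
  # The 3rd column is constant per file, so read it from the first line.
  c1=$(awk -F'\t' 'NR == 1 { print $3; exit }' "$1")
  c2=$(awk -F'\t' 'NR == 1 { print $3; exit }' "$2")
  # Full outer join on the fused key; -e 0 fills fields of unpaired lines.
  LC_ALL=C join -t "$tab" -a 1 -a 2 -e 0 \
      -o 0,1.2,1.3,1.4,1.5,1.6,2.2,2.3,2.4,2.5,2.6 "$tmp/a" "$tmp/b" |
    awk -F'\t' -v OFS='\t' -v c1="$c1" -v c2="$c2" \
      '{ $2 = c1; $7 = c2; sub(/:/, OFS); print }'   # restore labels, split key
  rm -rf "$tmp"
}
```

Called as `join2 file1 file2`, this prints the merged lines sorted by the fused key (in the C locale), so the order may differ from the desired output shown in the first post.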

RudiC

07-28-2015

Quote:

Originally Posted by RudiC

The error message is self-explanatory: join accepts only a single join field per file.

Yes, but is there an alternative? I have to join on the first two fields, not just the first. Also, I believe the solution provided joins only on the first field?

sheetalk

07-28-2015

See if this is of any use.

Code:

#!/usr/bin/perl
#
use strict;
use warnings;

# two files must be given at command line
my $first_file = shift or die;
my $second_file = shift or die;

my %data;
my $f;

# process lines from first file
open $f, "<", $first_file or die "$!\n";
while(<$f>) {
    # split line into fields
    my @fields = split;
    # create a key based on first two fields separated by tab
    my $key = join "\t", @fields[0..1];
    # add to data structure and append a list as place holder
    # for second file data
    $data{$key} = [@fields[2..$#fields], ("X23", 0, 0, 0, 0)];
}
close $f;

# process all lines from second file
open $f, "<", $second_file or die "$!\n";
while(<$f>) {
    my @fields = split;
    my $key = join "\t", @fields[0..1];
    # the same key exists in the first and second file:
    # replace the placeholder data
    if(exists $data{$key}){
        $data{$key} = [@{$data{$key}}[0..4], @fields[2..$#fields]];
    # key exists only in the second file: add padding in front
    }else{
        $data{$key} = [("XX2720", 0, 0, 0, 0), @fields[2..$#fields]];
    }
}
close $f;

for my $k (keys %data) {
    print join "\t", ($k, @{$data{$k}});
    print "\n";
}

Save: mergex.pl
Run: perl mergex.pl file1 file2

Aia

07-29-2015

Note that this is technically not a join operation since you are also merging on fields that the files do not have in common, so the join command would not work anyway.

Also, these are TAB separated files and you are leaving out the last 5 empty fields in each file.

One way to cut off the last 5 fields is to assume that there are no spaces in the fields and no other empty fields, use the default FS instead of \t, and use $1=$1 so that the empty fields are discarded. If that is OK, then this could work:

Code:

awk '
  BEGIN {
    OFS="\t"
  }
  FNR==1{
    X[++c]=$3                             # save the third field: X[1] for the first file, X[2] for the second
  }
  {
    $1=$1                                 # Discard last 5 trailing fields because FS is default
    i=$1 OFS $2                           # set i to the first two fields, separated by OFS
  } 
  NR==FNR {                               # process first file
    A[i]=$0                               # Put record from first file into array A with first two fields as index
    next
  }
                                          # process second file
  i in A {                                # if the record exist in A, print joined record
    print A[i],$3,$4,$5,$6,$7
    delete A[i]                           # delete record since index was matched
    next
  }
  {                                       # if the record does not exist in A
    print i,X[1],0,0,0,0,$3,$4,$5,$6,$7   # print it with the remaining fields zeroed and the saved 3rd field of file 1
  } 
  END {
    for(i in A)                           # For the remaining records in file 1 that were not matched
      print A[i],X[2],0,0,0,0             # print them with the remaining fields zeroed and the saved 3rd field of file 2
  }
' file1 file2                             # process file1 first and then file 2

If that is not OK, then the fields need to be assigned differently (or the trailing fields need to be discarded in a different way) and FS should be set to \t.
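For that second case, a minimal sketch with FS set to a tab might look like the following. It is only an illustration under assumptions: the helper name `merge2` is hypothetical, each file is taken to have exactly seven meaningful columns (key, key, label, four values), keys are assumed unique within each file, and both files are assumed non-empty. Fields 4 to 7 are picked explicitly, so any trailing empty fields are simply never read.

```shell
# merge2 FILE1 FILE2 : hypothetical helper, not part of the original thread.
merge2() {
  awk -F'\t' -v OFS='\t' '
    BEGIN { zeros = 0 OFS 0 OFS 0 OFS 0 }
    FNR == 1 { label[++n] = $3 }          # constant 3rd column of each file
    {
      key  = $1 OFS $2                    # composite key: fields 1 and 2
      vals = $4 OFS $5 OFS $6 OFS $7      # only fields 4-7; trailing empties ignored
    }
    NR == FNR { a[key] = vals; next }     # first file: remember values per key
    key in a  {                           # key present in both files
      print key, label[1], a[key], label[2], vals
      delete a[key]
      next
    }
    { print key, label[1], zeros, label[2], vals }   # key only in second file
    END {                                 # keys only in first file
      for (key in a) print key, label[1], a[key], label[2], zeros
    }
  ' "$1" "$2"
}
```

Because the labels are read from the files themselves, the same function could be called in a loop over file pairs, for example `for a in *.left; do merge2 "$a" "${a%.left}.right"; done`, where the `.left`/`.right` naming is purely illustrative.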

Last edited by Scrutinizer; 07-29-2015 at 08:54 AM..

Scrutinizer
