Merging two text files by a column and filling in the missing values

06-16-2012

Registered User

2,288, 480

Join Date: Apr 2007

Last Activity: 3 May 2020, 8:28 AM EDT

Location: Saint Paul, MN USA / BSD, CentOS, Debian, OS X, Solaris

Posts: 2,288

Thanks Given: 430

Thanked 480 Times in 395 Posts

Hi.

Here is a generalization of the solution to the two-file problem:

Code:

#!/usr/bin/env bash

# @(#) s1	Demonstrate multi-file join, marked for missing data.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
edges() { local _f _n _l;: ${1?"edges: need file"}; _f=$1;_l=$(wc -l $_f);
  head -${_n:=3} $_f ; pe "--- ( $_l: lines total )" ; tail -$_n $_f ; }
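db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
# Debugging is off: this second definition overrides the one above.
# Comment it out to see the db messages on stderr.
db() { : ; }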
C=$HOME/bin/context && [ -f $C ] && $C join multi-join

# Set up working files from sacred files.
s=( data? )
for ((i=0;i<5;i++ ))
do
  cp ${s[$i]} w${i}
done

# Display working files.
pl " Input data files:"
head w* expected-output.txt

for (( i=0;i<5;i++))
do
  # Compare working file i with all files j, i != j
  for ((j=0;j<5;j++))
  do
    [[ $i -eq $j ]] && continue
    db "comparing i=$i to j=$j"
    # Keys present only in file j are appended to file i with the filler "X".
    join -v 2 <( sort -k1,1 w$i ) <( sort -k1,1 w$j ) |
    awk '{print $1,"X"}' >> w$i
  done
done

# Join all 5 augmented, working files.
pl " Results of multi-join:"
# multi-join -d " " w* |
mjb w* |
sort -k1,1 |
tee f1

# Compare computed with standard.
C=$HOME/bin/pass-fail
[ -f $C ] && $C || ( pe;  pe " Results not verifiable." >&2 )

exit 0

producing:

Code:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
join (GNU coreutils) 6.10
multi-join (local) 1.6

-----
 Input data files:
==> w0 <==
AB 1
CD 1
EF 1

==> w1 <==
EF 2
GH 2
IJ 2

==> w2 <==
IJ 3
KL 3
MN 3

==> w3 <==
MN 4
OP 4
RS 4

==> w4 <==
RS 5
TU 5
VW 5

==> expected-output.txt <==
AB 1 X X X X
CD 1 X X X X
EF 1 2 X X X
GH X 2 X X X
IJ X 2 3 X X
KL X X 3 X X
MN X X 3 4 X
OP X X X 4 X
RS X X X 4 5
TU X X X X 5

-----
 Results of multi-join:
AB 1 X X X X
CD 1 X X X X
EF 1 2 X X X
GH X 2 X X X
IJ X 2 3 X X
KL X X 3 X X
MN X X 3 4 X
OP X X X 4 X
RS X X X 4 5
TU X X X X 5
VW X X X X 5

-----
 Comparison of 11 created lines with 11 lines of desired results:
 Succeeded -- files have same content.

The code mjb is a Perl script that does the multiple-file join, but without any frills: it has no option processing and very little error handling. It is meant to illustrate the idea rather than to serve as production code.
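For readers without the attachment, here is a minimal sketch of the same idea in awk. It is not the mjb script itself: the function name multi_join_fill and the scratch file names are made up for illustration, and it assumes two-column, whitespace-separated, non-empty input files. It also fills the gaps itself, so the files do not need to be padded first.

```shell
# Hypothetical stand-in for a multi-file join (not the attached mjb.pl).
# Joins any number of two-column files on column 1 and writes "X"
# wherever a key is absent from a file.
multi_join_fill() {
  awk '
    FNR == 1 { nfiles++ }            # first line of each (non-empty) file
    {
      keys[$1] = 1                   # remember every key seen
      val[$1, nfiles] = $2           # value of this key in this file
    }
    END {
      for (k in keys) {
        line = k
        for (f = 1; f <= nfiles; f++)
          line = line " " (((k, f) in val) ? val[k, f] : "X")
        print line
      }
    }
  ' "$@" | sort -k1,1                # awk visits keys in no fixed order
}

# Small demonstration with scratch copies of the first three sample files.
work=$(mktemp -d)
printf 'AB 1\nCD 1\nEF 1\n' > "$work/w0"
printf 'EF 2\nGH 2\nIJ 2\n' > "$work/w1"
printf 'IJ 3\nKL 3\nMN 3\n' > "$work/w2"
multi_join_fill "$work/w0" "$work/w1" "$work/w2"
```

Because the whole table is held in memory, this suits small to medium files; for very large inputs the sort-and-join approach above may be the better fit.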

See man pages for details on other utilities.
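As a quick illustration of the one option the padding loop depends on, the sketch below runs join -v 2 on two of the sample files. The scratch file names a and b are hypothetical; the behavior shown (print only the lines of the second file that have no partner in the first) is what the join man page describes for -v.

```shell
# Scratch files with made-up names; join needs input sorted on the key.
tmp=$(mktemp -d)
printf 'AB 1\nCD 1\nEF 1\n' | sort -k1,1 > "$tmp/a"
printf 'EF 2\nGH 2\nIJ 2\n' | sort -k1,1 > "$tmp/b"

# -v 2 prints only the lines of the second file whose key has no match
# in the first file; awk then turns each into a "key X" placeholder.
missing=$(join -v 2 "$tmp/a" "$tmp/b" | awk '{ print $1, "X" }')
printf '%s\n' "$missing"
```

Here the placeholders are GH X and IJ X, which are exactly the lines the script appends to the first working file.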

Best wishes ... cheers, drl

mjb.pl (1.2 KB)


07-31-2012

Registered User

193, 0

Join Date: May 2011

Last Activity: 21 September 2015, 10:44 PM EDT

Posts: 193

Thanks Given: 94

Thanked 0 Times in 0 Posts

Thank you so much for the mjb.pl script. It works wonderfully. Is there a way to modify the script so that it joins files with two or more data columns and fills in the missing columns with X? Example input:

File 1:

Code:

AA 11 22
BB 12 23
CC 32 23

File 2:

Code:

BB 12 23 
CC 34 56
DD 11 22

File 3:

Code:

AA 12 21
CC 87 90
DD 10 20

Output:

Code:

AA 11 22 X X 12 21
BB 12 23 12 23 X X 
CC 32 23 34 56 87 90
DD X X 11 22 10 20

Thanks!

evelibertine


08-01-2012

Hi.

Here is the script modified to treat n=2 fields as a group: the two tokens are combined into a single string, separated by an underscore ("_"), so the utilities treat the group as a single field. A final sed step turns the underscores back into spaces:

Code:

#!/usr/bin/env bash

# @(#) s1	Demonstrate multi-file join, marked for missing data, 2 tokens.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
edges() { local _f _n _l;: ${1?"edges: need file"}; _f=$1;_l=$(wc -l $_f);
  head -${_n:=3} $_f ; pe "--- ( $_l: lines total )" ; tail -$_n $_f ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C join multi-join mjb pass-fail sed

# Remove working files from an earlier run; -f keeps rm quiet if none exist.
rm -f w*
# Set up working files from sacred files.
# join n=2 fields with "_" to keep them together.
s=( data? )
# for ((i=0;i<5;i++ ))
for ((i=0;i<3;i++ ))
do
  # cp ${s[$i]} w${i}
  awk '{ print $1,$2"_"$3 }' ${s[$i]} > w${i}
done

# Display working files.
pl " Input data files:"
head w* expected-output.txt

# for (( i=0;i<5;i++))
for (( i=0;i<3;i++))
do
  # Compare working file i with all files j, i != j
  # for ((j=0;j<5;j++))
  for ((j=0;j<3;j++))
  do
    [[ $i -eq $j ]] && continue
    db "comparing i=$i to j=$j"
    join -v 2 <( sort -k1,1 w$i ) <( sort -k1,1 w$j ) |
    # awk '{print $1,"X"}' >> w$i
    awk '{print $1,"X_X"}' >> w$i
  done
done

# Join all 3 augmented, working files.
pl " Results of multi-join:"
# multi-join -d " " w* |
mjb w* |
sort -k1,1 |
sed 's/_/ /g' |
tee f1

# Compare computed with standard.
C=$HOME/bin/pass-fail
[ -f $C ] && $C || ( pe;  pe " Results not verifiable." >&2 )

exit 0

producing:

Code:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
join (GNU coreutils) 6.10
multi-join (local) 1.6
mjb - ( local: RepRev 1.2, ~/bin/mjb, 2012-03-06 )
pass-fail - ( local: RepRev 1.2, ~/bin/pass-fail, 2012-06-14 )
sed GNU sed version 4.1.5

-----
 Input data files:
==> w0 <==
AA 11_22
BB 12_23
CC 32_23

==> w1 <==
BB 12_23
CC 34_56
DD 11_22

==> w2 <==
AA 12_21
CC 87_90
DD 10_20

==> expected-output.txt <==
AA 11 22 X X 12 21
BB 12 23 12 23 X X 
CC 32 23 34 56 87 90
DD X X 11 22 10 20

-----
 Results of multi-join:
AA 11 22 X X 12 21
BB 12 23 12 23 X X
CC 32 23 34 56 87 90
DD X X 11 22 10 20

-----
 Comparison of 4 created lines with 4 lines of desired results:
expected-output.txt f1 differ: char 38, line 2
 Failed -- files not identical -- detailed comparison follows.
 Succeeded by ignoring whitespace differences.
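The two-stage verdict above comes from the local pass-fail utility, which is not shown. A rough sketch of a similar check with standard tools might look like this; the scratch file names are hypothetical, and the sample line is the one from the request that ends in a blank.

```shell
# Hypothetical re-creation of the mismatch on line 2 of the expected output.
tmp=$(mktemp -d)
printf 'BB 12 23 12 23 X X \n' > "$tmp/expected"   # trailing blank
printf 'BB 12 23 12 23 X X\n'  > "$tmp/actual"     # no trailing blank

if cmp -s "$tmp/expected" "$tmp/actual"; then
  verdict="identical"
elif diff -b "$tmp/expected" "$tmp/actual" > /dev/null; then
  verdict="same except for whitespace"              # diff -b ignores blanks at line end
else
  verdict="different"
fi
printf '%s\n' "$verdict"
```

With these two files the byte-for-byte comparison fails and the whitespace-tolerant one succeeds, matching the report above.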

For groups of a size other than 2, modify the corresponding lines: the awk command that builds each w file and the "X_X" filler string. Making the script handle an arbitrary number of tokens per line automatically will require some thought.
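One possible direction for the arbitrary-n case, offered as a sketch rather than a finished solution: let awk take the number of data columns in each file from that file's first line and emit that many X fillers when a key is missing. The function name multi_join_wide and the scratch file names are invented for this example, and it assumes every line within a file has the same number of fields.

```shell
# Hypothetical sketch: join files on column 1, any number of data columns
# per file, filling absent keys with one "X" per data column of that file.
multi_join_wide() {
  awk '
    FNR == 1 { nfiles++; width[nfiles] = NF - 1 }   # data columns in this file
    {
      keys[$1] = 1
      rest = ""
      for (c = 2; c <= NF; c++) rest = rest " " $c  # everything after the key
      val[$1, nfiles] = rest
    }
    END {
      for (k in keys) {
        line = k
        for (f = 1; f <= nfiles; f++) {
          if ((k, f) in val) {
            line = line val[k, f]
          } else {
            for (c = 1; c <= width[f]; c++) line = line " X"
          }
        }
        print line
      }
    }
  ' "$@" | sort -k1,1
}

# The three sample files from the request, in scratch copies.
work=$(mktemp -d)
printf 'AA 11 22\nBB 12 23\nCC 32 23\n'  > "$work/f1"
printf 'BB 12 23 \nCC 34 56\nDD 11 22\n' > "$work/f2"
printf 'AA 12 21\nCC 87 90\nDD 10 20\n'  > "$work/f3"
multi_join_wide "$work/f1" "$work/f2" "$work/f3"
```

This avoids the underscore trick, so data that already contain underscores are left alone, and the trailing blank in the second file is dropped by awk's field splitting.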

Best wishes ... cheers, drl

