Merging two text files by a column and filling in the missing values


 
Thread Tools Search this Thread
Top Forums UNIX for Dummies Questions & Answers Merging two text files by a column and filling in the missing values
# 8  
Old 06-16-2012
Hi.

Here is a generalization of the solution to the two-file problem:
Code:
#!/usr/bin/env bash

# @(#) s1	Demonstrate multi-file join, marked for missing data.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
edges() { local _f _n _l;: ${1?"edges: need file"}; _f=$1;_l=$(wc -l $_f);
  head -${_n:=3} $_f ; pe "--- ( $_l: lines total )" ; tail -$_n $_f ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C join multi-join

# Set up working files from sacred files.
s=( data? )
for ((i=0;i<5;i++ ))
do
  cp ${s[$i]} w${i}
done

# Display working files.
pl " Input data files:"
head w* expected-output.txt

for (( i=0;i<5;i++))
do
  # Compare working file i with all files j, i != j
  for ((j=0;j<5;j++))
  do
    [[ $i -eq $j ]] && continue
	db "comparing i=$i to j=$j"
    join -v 2 <( sort -k1,1 w$i ) <( sort -k1,1 w$j ) |
    awk '{print $1,"X"}' >> w$i
  done
done

# Join all 5 augmented, working files.
pl " Results of multi-join:"
# multi-join -d " " w* |
mjb w* |
sort -k1,1 |
tee f1

# Compare computed with standard.
C=$HOME/bin/pass-fail
[ -f $C ] && $C || ( pe;  pe " Results not verifiable." >&2 )

exit 0

producing:
Code:
% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
join (GNU coreutils) 6.10
multi-join (local) 1.6

-----
 Input data files:
==> w0 <==
AB 1
CD 1
EF 1

==> w1 <==
EF 2
GH 2
IJ 2

==> w2 <==
IJ 3
KL 3
MN 3

==> w3 <==
MN 4
OP 4
RS 4

==> w4 <==
RS 5
TU 5
VW 5

==> expected-output.txt <==
AB 1 X X X X
CD 1 X X X X
EF 1 2 X X X
GH X 2 X X X
IJ X 2 3 X X
KL X X 3 X X
MN X X 3 4 X
OP X X X 4 X
RS X X X 4 5
TU X X X X 5

-----
 Results of multi-join:
AB 1 X X X X
CD 1 X X X X
EF 1 2 X X X
GH X 2 X X X
IJ X 2 3 X X
KL X X 3 X X
MN X X 3 4 X
OP X X X 4 X
RS X X X 4 5
TU X X X X 5
VW X X X X 5

-----
 Comparison of 11 created lines with 11 lines of desired results:
 Succeeded -- files have same content.

The code mjb is a perl script that does the multiple-file join, but without any frills: no option processing, very little error processing, used to illustrate the idea, rather than a production code.

See man pages for details on other utilities.

Best wishes ... cheers, drl
These 2 Users Gave Thanks to drl For This Post:
# 9  
Old 07-31-2012
Thank you so much for the mjb.pl script. It works wonderfully. I was wondering if there was a way to modify the script so that it is able to join files with multiple columns (2+) and fill in the missing columns with X. Example input:

File 1:
Code:
AA 11 22
BB 12 23
CC 32 23

File 2:
Code:
BB 12 23 
CC 34 56
DD 11 22

File 3:
Code:
AA 12 21
CC 87 90
DD 10 20

Output:
Code:
AA 11 22 X X 12 21
BB 12 23 12 23 X X 
CC 32 23 34 56 87 90
DD X X 11 22 10 20

Thanks!
# 10  
Old 08-01-2012
Hi.

Here is the script modified to treat n=2 fields as a group by making them into a single string, tokens separated by an underscore, "_". The utilities will then treat the group as a single entity:
Code:
#!/usr/bin/env bash

# @(#) s1	Demonstrate multi-file join, marked for missing data, 2 tokens.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
edges() { local _f _n _l;: ${1?"edges: need file"}; _f=$1;_l=$(wc -l $_f);
  head -${_n:=3} $_f ; pe "--- ( $_l: lines total )" ; tail -$_n $_f ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C join multi-join mjb pass-fail sed

rm w*
# Set up working files from sacred files.
# join n=2 fields with "_" to keep them together.
s=( data? )
# for ((i=0;i<5;i++ ))
for ((i=0;i<3;i++ ))
do
  # cp ${s[$i]} w${i}
  awk '{ print $1,$2"_"$3 }' ${s[$i]} > w${i}
done

# Display working files.
pl " Input data files:"
head w* expected-output.txt

# for (( i=0;i<5;i++))
for (( i=0;i<3;i++))
do
  # Compare working file i with all files j, i != j
  # for ((j=0;j<5;j++))
  for ((j=0;j<3;j++))
  do
    [[ $i -eq $j ]] && continue
	db "comparing i=$i to j=$j"
    join -v 2 <( sort -k1,1 w$i ) <( sort -k1,1 w$j ) |
    # awk '{print $1,"X"}' >> w$i
    awk '{print $1,"X_X"}' >> w$i
  done
done

# Join all 5 augmented, working files.
pl " Results of multi-join:"
# multi-join -d " " w* |
mjb w* |
sort -k1,1 |
sed 's/_/ /g' |
tee f1

# Compare computed with standard.
C=$HOME/bin/pass-fail
[ -f $C ] && $C || ( pe;  pe " Results not verifiable." >&2 )

exit 0

producing:
Code:
% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
join (GNU coreutils) 6.10
multi-join (local) 1.6
mjb - ( local: RepRev 1.2, ~/bin/mjb, 2012-03-06 )
pass-fail - ( local: RepRev 1.2, ~/bin/pass-fail, 2012-06-14 )
sed GNU sed version 4.1.5

-----
 Input data files:
==> w0 <==
AA 11_22
BB 12_23
CC 32_23

==> w1 <==
BB 12_23
CC 34_56
DD 11_22

==> w2 <==
AA 12_21
CC 87_90
DD 10_20

==> expected-output.txt <==
AA 11 22 X X 12 21
BB 12 23 12 23 X X 
CC 32 23 34 56 87 90
DD X X 11 22 10 20

-----
 Results of multi-join:
AA 11 22 X X 12 21
BB 12 23 12 23 X X
CC 32 23 34 56 87 90
DD X X 11 22 10 20

-----
 Comparison of 4 created lines with 4 lines of desired results:
expected-output.txt f1 differ: char 38, line 2
 Failed -- files not identical -- detailed comparison follows.
 Succeeded by ignoring whitespace differences.

For groups other than 2, modify the appropriate lines. To make it automatically handle n=arbitrary tokens on each line will require some thought.

Best wishes ... cheers, drl
These 2 Users Gave Thanks to drl For This Post:
 
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Extracting values based on line-column numbers from multiple text files

Dear All, I have to solve the following problems with multiple tab-separated text file but I don't know how. Any help would be greatly appreciated. I have access to Linux mint (but not as a professional). I have multiple tab-delimited files with the following structure: file1: 1 44 2 ... (5 Replies)
Discussion started by: Bastami
5 Replies

2. UNIX for Dummies Questions & Answers

Dynamically merging 2 files on header values

Hi All, I have 2 files which i need to merge together based on the column names provided in the file. The first line in both files are header records. The first file has fixed columns but second file can have subset of the columns from file 1 File 1: ... (6 Replies)
Discussion started by: kushagra
6 Replies

3. Shell Programming and Scripting

Merging 2 text files when there is a common time stamp column in them

Dear Unix experts and users I have 2 kinds of files like below, of which I need to merge them in the order of time. File1: Date_Time Context D1 D2 04/19/2013_23:48:54.819 ABCD x x 04/19/2013_23:48:55.307 ABCD x x 04/19/2013_23:48:55.823 ABCD x ... (7 Replies)
Discussion started by: ks_reddy
7 Replies

4. UNIX Desktop Questions & Answers

merging files and add missing rows

hello all, I have files that have a specific way for naming the first column they are make of five names in Pattern of 3 Y = (no case sensitive) so the files are names $Y-$Y-$Y or $X-$Y-$Z depending how we look they only exist of the pattern exist now I want to create a file from them that... (9 Replies)
Discussion started by: A-V
9 Replies

5. UNIX for Dummies Questions & Answers

Match values/IDs from column and text files

Hello, I am trying to modify 2 files, to yield results in a 3rd file. File-1 is a 8-columned file, separted with tab. 1234:1 xyz1234 blah blah blah blah blah blah 1234:1 xyz1233 blah blah blah blah blah blah 1234:1 abc1234 blah blah blah blah blah blah n/a RRR0000 blah blah blah... (1 Reply)
Discussion started by: ad23
1 Replies

6. UNIX for Dummies Questions & Answers

Comparing two text files by a column and printing values that do not match

I have two text files where the first three columns are exactly the same. I want to compare the fourth column of the text files and if the values are different, print that row into a new output file. How do I go about doing that? File 1: 100 rs3794811 0.01 0.3434 100 rs8066551 0.01... (8 Replies)
Discussion started by: evelibertine
8 Replies

7. UNIX for Dummies Questions & Answers

Merging two text files by a column

I have two text files. One has two columns and looks like below: rs# otherallele_freq rs10399749 0 rs4030303 0 rs4030300 0 rs940550 1.000 rs13328714 0 rs11490937 0 rs6683466 0 rs12025928 1.000 rs6650104 0 rs11240781 0... (5 Replies)
Discussion started by: evelibertine
5 Replies

8. UNIX for Dummies Questions & Answers

Merging two text files by a column

So I have two text files. The first one looks like this: refsnp_id chr_name chrom_start 1 rs1000000 12 126890980 2 rs10000010 4 21618674 3 rs10000012 4 1357325 4 rs10000013 4 37225069 5 rs1000002 3 183635768 And the second one looks like this: AUC rs1000000 0.03 0.1240 AUC ... (4 Replies)
Discussion started by: evelibertine
4 Replies

9. Shell Programming and Scripting

Filling in missing columns

Hi all, I have a file that contains about 1000 rows and 800 columns. Nearly every row has 800 columns but some DONT. I want to extend the rows that dont have values with NA's. Here is an example: my file bob 2 4 5 6 8 9 4 5 tar 2 4 5 4 3 2 9 1 bro 3 5 3 4 yar 2 ... (7 Replies)
Discussion started by: gisele_l
7 Replies

10. Shell Programming and Scripting

Merging column files

Hi,Iam new to Unix.I have a file FileA which is a variable length file where each column is seperated by delimitter "|". FileA: SrNo Name Address 1-234|name1|Addr1 1-34|name2|Addr2 1-2345|name3|Addr3 FileB: SrNo Address 1-34<<06 SPACES>>Addr1<<8 spaces>> 1-234<<05... (1 Reply)
Discussion started by: swapna321
1 Replies
Login or Register to Ask a Question