Merging two text files by a column and filling in the missing values

06-14-2012

Registered User

193, 0

Join Date: May 2011

Last Activity: 21 September 2015, 10:44 PM EDT

Posts: 193

Thanks Given: 94

Thanked 0 Times in 0 Posts

Merging two text files by a column and filling in the missing values

Hi,

I have to text files that I want to merge by the first column. The values in the first column pretty much match for the first part. However there are some values that are present in column 1 and not present in column 2 or vice versa. For such values I would like to substitute X for the corresponding columns. Example:

Code:

File 1:
AB   1
CD   2
EF   3
GH   4

File 2: 
CD  4
EF  5
GH  6
IJ  7

Output
AB 1 X
CD 2 4
EF 3 6
GH 4 6
IJ X 7

Thanks!

evelibertine

View Public Profile for evelibertine

Find all posts by evelibertine

06-14-2012

Registered User

2,524, 241

Join Date: Dec 2007

Last Activity: 17 March 2020, 2:04 PM EDT

Posts: 2,524

Thanks Given: 173

Thanked 241 Times in 206 Posts

just thinking...

If you used the COMM command, after cutting the first column of both files, you would know records that are in common. Save to FileCommon.
Then, maybe a PASTE command to create those in common. Your output lines 2-3-4, to FileB.
GREP -V what is in-common (your FileCommon) against the first file; with AWK for example to create those X entries. Your output line 1, to FileA.
Do the same to create FileC, from the 2nd file. Your output line 5 to FileC.
Finally combine the three files.

joeyg

View Public Profile for joeyg

Find all posts by joeyg

06-14-2012

Registered User

2,288, 480

Join Date: Apr 2007

Last Activity: 3 May 2020, 8:28 AM EDT

Location: Saint Paul, MN USA / BSD, CentOS, Debian, OS X, Solaris

Posts: 2,288

Thanks Given: 430

Thanked 480 Times in 395 Posts

Hi.

This probably follows the thought train of joeyg, but uses join to find the initial missing items. The key option for this is "-v". The missing items are appended to the other file with a little awk script. The files are sorted and joined again to produce the result. The data files are changed, so they are copied from a "sacred" version. The commands context, specimen, and pass-fail are local, so not available outside my group, but the script is designed so the absence will not cause an error.

Code:

#!/usr/bin/env bash

# @(#) s1	Demonstrate join with missing values.

# Section 1, setup, pre-solution, $Revision: 1.26 $".
# Infrastructure details, environment, debug commands for forum posts. 
# Uncomment export command to run as external user: not context, pass-fail.
# export PATH="/usr/local/bin:/usr/bin:/bin"
set +o nounset
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
LC_ALL=C ; LANG=C ; export LC_ALL LANG
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
edges() { local _f _n _l;: ${1?"edges: need file"}; _f=$1;_l=$(wc -l $_f);
  head -${_n:=3} $_f ; pe "--- ( $_l: lines total )" ; tail -$_n $_f ; }
C=$HOME/bin/context && [ -f $C ] && $C join awk
set -o nounset

cp sacred1 data1
FILE1=data1
cp sacred2 data2
FILE2=data2

# Display sample data files.
pe
specimen $FILE1 $FILE2 expected-output.txt || (
head $FILE1
pe
head $FILE2 
pe 
head expected-output.txt
)

# Section 2, solution.
pl " Results:"
db " Section 2: solution."
join -v 1 <( sort -k1,1 $FILE1 ) <( sort -k1,1 $FILE2 ) |
awk '{print $1,"X"}' >> $FILE2 

join -v 2 <( sort -k1,1 $FILE1 ) <( sort -k1,1 $FILE2 ) |
awk '{print $1,"X"}' >> $FILE1 

join <( sort -k1,1 $FILE1 ) <( sort -k1,1 $FILE2 ) |
tee f1

# Section 3, post-solution, check results, clean-up, etc.
db " Section 3: check solution."
(C=$HOME/bin/pass-fail && [ -f $C ] && $C;exit 0) 

exit 0

producing:

Code:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
join (GNU coreutils) 6.10
awk GNU Awk 3.1.5

Whole: 5:0:5 of 4 lines in file "data1"
AB   1
CD   2
EF   3
GH   4

Whole: 5:0:5 of 4 lines in file "data2"
CD  4
EF  5
GH  6
IJ  7

Whole: 5:0:5 of 5 lines in file "expected-output.txt"
AB 1 X
CD 2 4
EF 3 5
GH 4 6
IJ X 7

-----
 Results:
AB 1 X
CD 2 4
EF 3 5
GH 4 6
IJ X 7

-----
 Comparison of 5 created lines with 5 lines of desired results:
 Succeeded -- files have same content.

See man pages for details.

Best wishes ... cheers, drl

These 2 Users Gave Thanks to drl For This Post:

drl

View Public Profile for drl

Find all posts by drl

06-15-2012

Registered User

193, 0

Join Date: May 2011

Last Activity: 21 September 2015, 10:44 PM EDT

Posts: 193

Thanks Given: 94

Thanked 0 Times in 0 Posts

Is there an easy way to alter this script so that instead of just two files, it merges five files by the first column filling in the blanks with X? Thanks!

evelibertine

View Public Profile for evelibertine

Find all posts by evelibertine

06-15-2012

Registered User

2,288, 480

Join Date: Apr 2007

Last Activity: 3 May 2020, 8:28 AM EDT

Location: Saint Paul, MN USA / BSD, CentOS, Debian, OS X, Solaris

Posts: 2,288

Thanks Given: 430

Thanked 480 Times in 395 Posts

Hi.

Quote:

Originally Posted by evelibertine

Is there an easy way to alter this script so that instead of just two files, it merges five files by the first column filling in the blanks with X? Thanks!

Please post samples and expected output ... cheers, drl

drl

View Public Profile for drl

Find all posts by drl

06-15-2012

Registered User

193, 0

Join Date: May 2011

Last Activity: 21 September 2015, 10:44 PM EDT

Posts: 193

Thanks Given: 94

Thanked 0 Times in 0 Posts

Code:

File 1:
AB 1
CD 1
EF 1

File 2:
EF 2
GH 2
IJ 2

File 3:
IJ 3
KL 3 
MN 3 

File 4:
MN 4
OP 4
RS 4

File 5:
RS 5
TU 5
VW 5

Output:
AB 1 X X X X
CD 1 X X X X
EF 1 2 X X X 
GH X 2 X X X
IJ X 2 3 X X 
KL X X 3 X X 
MN X X 3 4 X 
OP X X X 4 X 
RS X X X 4 5 
TU X X X X 5
VW X X X X 5

evelibertine

View Public Profile for evelibertine

Find all posts by evelibertine

06-16-2012

Registered User

18, 0

Join Date: Jul 2010

Last Activity: 7 December 2013, 12:35 PM EST

Posts: 18

Thanks Given: 7

Thanked 0 Times in 0 Posts

Hi evelibertine ,

Please hava a look at this code , fit for 2 files.

Quote:

(
for loop_1 in `cat file_1 file_2 | awk '{print $1}'| sort -u `
do
if [ `grep -c $loop_1 file_1` -gt 0 ]
then
col2=`grep $loop_1 file_1 | awk '{print $2}'`
else
col2=X
fi
if [ `grep -c $loop_1 file_2` -gt 0 ]
then
col3=`grep $loop_1 file_2 | awk '{print $2}'`
else
col3=X
fi
echo $loop_1 $col2 $col3
## or redirect it to an out file..
done
)

Last edited by dipanchandra; 06-16-2012 at 07:57 AM.. Reason: modifying

dipanchandra

View Public Profile for dipanchandra

Find all posts by dipanchandra

UNIX for Dummies Questions & Answers

Merging two text files by a column and filling in the missing values

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Extracting values based on line-column numbers from multiple text files

Discussion started by: Bastami

2. UNIX for Dummies Questions & Answers

Dynamically merging 2 files on header values

Discussion started by: kushagra

3. Shell Programming and Scripting

Merging 2 text files when there is a common time stamp column in them

Discussion started by: ks_reddy

4. UNIX Desktop Questions & Answers

merging files and add missing rows

Discussion started by: A-V

5. UNIX for Dummies Questions & Answers

Match values/IDs from column and text files

Discussion started by: ad23

6. UNIX for Dummies Questions & Answers

Comparing two text files by a column and printing values that do not match

Discussion started by: evelibertine

7. UNIX for Dummies Questions & Answers

Merging two text files by a column

Discussion started by: evelibertine

8. UNIX for Dummies Questions & Answers

Merging two text files by a column

Discussion started by: evelibertine

9. Shell Programming and Scripting

Filling in missing columns

Discussion started by: gisele_l

10. Shell Programming and Scripting

Merging column files

Discussion started by: swapna321