Join multiple files by column with awk


 
# 8  
Old 09-18-2010
I have a similar situation to this, though mine should be simpler! But I can't seem to figure out how to solve the problem.

I have 100 files, each with a header of up to 11 lines; the number of columns and lines is the same in all files.

I want to take the first and second columns of the first file, then the 2nd column of each of the remaining files, and combine them all into one file.
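
For example, using the two sample files further down, the first lines of the combined output would be:

Code:
0.09 19.83 23.50
0.28 16.37 29.01
0.47 23.62 26.51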

Following this, I have the code below, which seems to work, but the problem is that it sorts the output with respect to the first column, and that's not what I want.

Code:
#!/bin/bash

WHINY_USERS=0 awk 'NR==FNR{ a[$1]=$2; s[$1]=$1 " " $2; next } {
  s[$1] = s[$1] " " $2; a[$1]=$2
}
END{for(i in s) {print s[i]}}' ~/test/*.txt

Here are two example files.

File_1
Code:
; xfAzisum output file
; fits: e20011730124_xform.fits
; date: Sun Aug 29 15:30:46 2010
; inner radius: 2.2600000
; thickness: 0.500000
; divisions: 128
; x-axis: magnetic local time
;
; MLT       value      std    num_pixels	num_ind_measurements
; --------- --------- --------- ----------	--------------------
       0.09     19.83     14.58     78.00     15.00
       0.28     16.37      8.71     88.00     16.00
       0.47     23.62     14.32     85.00     15.00
       0.66     21.05     15.55     87.00     18.00
       0.84     27.06     14.25     88.00     14.00
       1.03     40.82     16.26     85.00     16.00
       1.22     43.94     14.44     87.00     14.00
       1.41     57.34      8.14     88.00     15.00
       1.59     67.33     15.14     87.00     15.00
       1.78     59.81     14.15     87.00     15.00
       1.97     76.75     23.44     85.00     14.00
       2.16     81.19     33.64     89.00     14.00
       2.34     67.60     25.53     86.00     13.00
       2.53     88.59     27.84     87.00     14.00
       2.72     74.00     22.88     87.00     14.00
       2.91     95.32     32.64     81.00     14.00
       3.09     91.51     29.59     95.00     15.00
       3.28    108.04     20.41     87.00     13.00
       3.47     85.54     24.75     87.00     13.00
       3.66     90.88     32.68     86.00     13.00
       3.84     79.36     28.87     89.00     15.00
       4.03     85.57     31.73     85.00     13.00
       4.22     80.39     28.05     87.00     13.00
       4.41     80.41     27.46     87.00     15.00
       4.59     77.25     21.63     88.00     14.00
       4.78     72.69     23.48     87.00     14.00
       4.97     69.76     24.77     85.00     15.00

File_2
Code:
; xfAzisum output file
; fits: e20011730225_xform.fits
; date: Sun Aug 29 15:30:48 2010
; inner radius: 2.2600000
; thickness: 0.500000
; divisions: 128
; x-axis: magnetic local time
;
; MLT       value      std    num_pixels	num_ind_measurements
; --------- --------- --------- ----------	--------------------
       0.09     23.50     15.69     78.00     12.00
       0.28     29.01     13.76     88.00     12.00
       0.47     26.51     14.09     85.00     10.00
       0.66     27.74     14.19     87.00     12.00
       0.84     28.46     14.08     88.00     11.00
       1.03     31.00     19.09     85.00     10.00
       1.22     36.56     16.43     87.00     12.00
       1.41     41.90     16.05     88.00     12.00
       1.59     49.73     17.51     87.00     12.00
       1.78     67.46     21.26     87.00     13.00
       1.97     67.41     24.18     85.00     10.00
       2.16     66.96     22.83     89.00     13.00
       2.34     79.56     16.04     86.00     10.00
       2.53     75.30     14.85     87.00     11.00
       2.72     77.60     20.36     87.00     10.00
       2.91     75.49     21.37     81.00      9.00
       3.09     92.31     19.54     95.00     14.00
       3.28     83.30     19.47     87.00     11.00
       3.47     89.87     18.38     87.00     11.00
       3.66     80.11     22.17     86.00     11.00
       3.84     92.18     28.36     89.00     12.00
       4.03     96.61     27.01     85.00     14.00
       4.22     91.94     28.70     87.00     10.00
       4.41     95.22     32.53     87.00     11.00
       4.59     89.51     30.41     88.00     12.00
       4.78     79.13     21.77     87.00     13.00
       4.97     71.90     17.68     85.00     12.00
       5.16     75.75     13.20     88.00     10.00
       5.34     61.50     17.21     87.00     11.00
       5.53     62.85     15.60     85.00     11.00
       5.72     60.16     23.02     88.00     12.00
       5.91     58.88     12.69     78.00     12.00
       6.09     53.16     11.01     97.00     13.00
       6.28     59.17     17.71     88.00      9.00
       6.47     75.35     18.00     85.00     13.00
       6.66     85.04     18.50     87.00     14.00
       6.84     86.22     14.26     88.00     12.00
       7.03     94.68     17.87     85.00     10.00
       7.22    102.22     23.22     87.00     10.00
       7.41    108.77     20.58     88.00     11.00
       7.59    108.88     20.75     87.00     11.00
       7.78    105.19     20.57     87.00      9.00
       7.97    105.75     25.69     85.00     10.00
       8.16     98.74     24.04     89.00     12.00
       8.34    100.46     30.22     86.00     12.00
       8.53     97.77     27.85     87.00     11.00
       8.72    108.62     29.81     87.00     14.00
       8.91    105.22     29.87     81.00     12.00
       9.09    108.14     25.23     95.00     15.00
       9.28    116.98     23.84     87.00     13.00
       9.47    112.20     19.08     87.00     12.00
       9.66    112.63     32.53     86.00     13.00
       9.84    136.50     37.32     89.00     14.00
      10.03    135.01     26.41     85.00     12.00
      10.22    153.68     21.48     87.00     12.00
      10.41    147.13     19.67     87.00     12.00
      10.59    140.11     21.85     88.00     12.00
      10.78    124.04     25.96     87.00     12.00
      10.97    124.65     31.79     85.00     13.00

The script above almost does what I want; the problem is that it sorts the final output with respect to the first column.
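
Would something like this keep the original order? gawk's undocumented WHINY_USERS variable seems to enable sorted for-in traversal whenever it is set at all, regardless of its value, so WHINY_USERS=0 presumably doesn't switch the sorting off. An untested sketch that instead remembers the order in which keys first appear:

Code:
awk '/^;/ { next }                         # skip the comment header lines
NR==FNR { if (!($1 in s)) order[++n] = $1  # record key order from the first file
          s[$1] = $1 " " $2; next }
        { s[$1] = s[$1] " " $2 }           # append column 2 of the later files
END     { for (i = 1; i <= n; i++) print s[order[i]] }' ~/test/*.txt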

Please help!
Thanks

Last edited by Scott; 09-18-2010 at 06:03 PM.. Reason: Code tags, please...
# 9  
Old 09-18-2010
I usually prefer perl over any shell script for this kind of job, but I'm probably biased since I don't really know awk well enough.

However, here's a simple implementation in bash. It may need tweaking depending on the OS and tools you have at hand; if your tail doesn't understand the -n +N form, grep -v '^;' can be used as a substitute to drop the comment header.
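
For instance, that substitution would look something like this (an untested sketch, using the variable names from the script below):

Code:
RDATA=`<"$FIRST" grep -v '^;' | cut -d "$COL_SEP" -f -2`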

Quote:
I have 100 files, each with a header of up to 11 lines; the number of columns and lines is the same in all files.
The number of lines didn't match in the example files, but I assume that was a mistake. This script assumes they are the same length, or at least that the first file is the longest.

The script takes the filenames to process on stdin rather than on the command line, e.g.:

Code:
$ find ~/test/ -name '*.txt' | ./add_second_columns.sh > results.txt

Code:
#!/bin/bash
# add_second_columns.sh

# the sample files carry a 10-line comment header; tail -n +N prints from
# line N onwards, so +11 starts at the first data line
HEADER_LENGTH=11
# every file has columns separated by a single character
COL_SEP=" "
# output column separator
COL_OSEP=$'\t'
# if there's an error, using a number greater than 0 will exit with that
# number
EXIT_ON_ERROR=1

# we'll get the files from stdin, so we should never run into the argc/argv
# problem

# later on we'll avoid piping into the while loop by saving a file's output
# to a string and splitting it on newlines, so here we set the input field
# separator to a newline
IFS=$'\n'

# we'll handle the first file separately, since we want to get the
# first two fields, instead of the second field only
read FIRST
if [ -e "$FIRST" ] ; then
	I=0
	RDATA=`<"$FIRST" tail -n +$HEADER_LENGTH | cut -d "$COL_SEP" -f -2`
	for E in $RDATA ; do
		((++I))
		DATA[$I]=$(echo "$E" | tr "$COL_SEP" "$COL_OSEP")
	done
else
	echo "ERROR: Couldn't open file '$FIRST'" >&2
	if [ $EXIT_ON_ERROR -gt 0 ] ; then
		exit $EXIT_ON_ERROR
	fi
fi

# process the rest of files
while read ENTRY ; do
	if [ -e "$ENTRY" ] ; then
		I=0
		RDATA=`<"$ENTRY" tail -n +$HEADER_LENGTH | cut -d "$COL_SEP" -f 2`
		for E in $RDATA ; do
			((++I))
			# schlemiel the painter, anyone?
			DATA[$I]="${DATA[$I]}${COL_OSEP}$E"
		done
	else
		echo "ERROR: Couldn't open file '$ENTRY'" >&2
		if [ $EXIT_ON_ERROR -gt 0 ] ; then
			exit $EXIT_ON_ERROR
		fi
	fi
done

# print everything
I=0
while [ $I -lt ${#DATA[*]} ] ; do
	((++I))
	echo "${DATA[$I]}"
done


Last edited by ikki; 09-18-2010 at 05:34 PM.. Reason: fixed a typo x2
# 10  
Old 09-18-2010
Thanks ikki,

That worked, thanks! The fields in my data files are delimited by 5 spaces, so I couldn't get the 'cut' part to work correctly: when I changed COL_SEP to five spaces, cut complained that the delimiter must be a single character. So I used 'awk' in place of 'cut'. I'd still like to know how to use cut when the delimiter is more than one space.
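
For the record, the two cut calls became something like this (awk splits fields on runs of whitespace by default, so no delimiter gymnastics are needed):

Code:
RDATA=`<"$FIRST" tail -n +$HEADER_LENGTH | awk '{ print $1, $2 }'`  # first file: two columns
RDATA=`<"$ENTRY" tail -n +$HEADER_LENGTH | awk '{ print $2 }'`      # the rest: second column only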

Thanks again
# 11  
Old 09-18-2010
Quote:
Originally Posted by malandisa
So I used 'awk' in place of 'cut'. I'd still like to know how to use cut when the delimiter is more than one space.
It's not possible with cut; it's designed to work with a single character as the field delimiter, which is why awk is probably the better choice here. Another way, if one really wants to use cut, or perhaps just to sanitize the data, would be to replace every run of spaces with a single space (or tab). That of course assumes that no field is empty (or consists only of spaces), or the result would be skewed.



For future reference: there are at least two easy ways of squeezing multiple instances of a character down to a single one (below). sed is a powerful companion to awk, or so I'm told, but even on its own I find it a very useful tool:

Code:
sed -e 's/ \{1,\}/ /g' | cut... # I usually use the GNU version, since I have an easier time remembering which chars to escape
tr -s " " | cut...
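
Applied to the sample files in this thread, a complete pipeline might look like this (an untested sketch; the leading spaces on each data line have to go too, or the first field cut sees is empty):

Code:
tail -n +11 File_1 | tr -s ' ' | sed 's/^ //' | cut -d ' ' -f 1,2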
