Join multiple files by column with awk


 
# 8  
Old 09-18-2010
I have a similar situation to this, though mine should be simpler! But I can't seem to figure out how to solve the problem.

I have 100 files, each with a header of up to 11 lines; the number of columns and lines is the same in all files.

I want to take the first and second columns of the first file, then the 2nd column of each of the remaining files, and combine them all into one file.
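
For example, using the two sample files further down, the first lines of the combined output would be:

Code:
0.09 19.83 23.50
0.28 16.37 29.01
0.47 23.62 26.51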

Following this, I have the code below, which seems to work, but the problem is that it sorts the output with respect to the first column, and that's not what I want.

Code:
#!/bin/bash

WHINY_USERS=0 awk 'NR==FNR{ a[$1]=$2; s[$1]=$1 " " $2; next } {
  s[$1] = s[$1] " " $2; a[$1]=$2
}
END{for(i in s) {print s[i]}}' ~/test/*.txt

Here are two example files.

File_1
Code:
; xfAzisum output file
; fits: e20011730124_xform.fits
; date: Sun Aug 29 15:30:46 2010
; inner radius: 2.2600000
; thickness: 0.500000
; divisions: 128
; x-axis: magnetic local time
;
; MLT       value      std    num_pixels	num_ind_measurements
; --------- --------- --------- ----------	--------------------
       0.09     19.83     14.58     78.00     15.00
       0.28     16.37      8.71     88.00     16.00
       0.47     23.62     14.32     85.00     15.00
       0.66     21.05     15.55     87.00     18.00
       0.84     27.06     14.25     88.00     14.00
       1.03     40.82     16.26     85.00     16.00
       1.22     43.94     14.44     87.00     14.00
       1.41     57.34      8.14     88.00     15.00
       1.59     67.33     15.14     87.00     15.00
       1.78     59.81     14.15     87.00     15.00
       1.97     76.75     23.44     85.00     14.00
       2.16     81.19     33.64     89.00     14.00
       2.34     67.60     25.53     86.00     13.00
       2.53     88.59     27.84     87.00     14.00
       2.72     74.00     22.88     87.00     14.00
       2.91     95.32     32.64     81.00     14.00
       3.09     91.51     29.59     95.00     15.00
       3.28    108.04     20.41     87.00     13.00
       3.47     85.54     24.75     87.00     13.00
       3.66     90.88     32.68     86.00     13.00
       3.84     79.36     28.87     89.00     15.00
       4.03     85.57     31.73     85.00     13.00
       4.22     80.39     28.05     87.00     13.00
       4.41     80.41     27.46     87.00     15.00
       4.59     77.25     21.63     88.00     14.00
       4.78     72.69     23.48     87.00     14.00
       4.97     69.76     24.77     85.00     15.00

File_2
Code:
; xfAzisum output file
; fits: e20011730225_xform.fits
; date: Sun Aug 29 15:30:48 2010
; inner radius: 2.2600000
; thickness: 0.500000
; divisions: 128
; x-axis: magnetic local time
;
; MLT       value      std    num_pixels	num_ind_measurements
; --------- --------- --------- ----------	--------------------
       0.09     23.50     15.69     78.00     12.00
       0.28     29.01     13.76     88.00     12.00
       0.47     26.51     14.09     85.00     10.00
       0.66     27.74     14.19     87.00     12.00
       0.84     28.46     14.08     88.00     11.00
       1.03     31.00     19.09     85.00     10.00
       1.22     36.56     16.43     87.00     12.00
       1.41     41.90     16.05     88.00     12.00
       1.59     49.73     17.51     87.00     12.00
       1.78     67.46     21.26     87.00     13.00
       1.97     67.41     24.18     85.00     10.00
       2.16     66.96     22.83     89.00     13.00
       2.34     79.56     16.04     86.00     10.00
       2.53     75.30     14.85     87.00     11.00
       2.72     77.60     20.36     87.00     10.00
       2.91     75.49     21.37     81.00      9.00
       3.09     92.31     19.54     95.00     14.00
       3.28     83.30     19.47     87.00     11.00
       3.47     89.87     18.38     87.00     11.00
       3.66     80.11     22.17     86.00     11.00
       3.84     92.18     28.36     89.00     12.00
       4.03     96.61     27.01     85.00     14.00
       4.22     91.94     28.70     87.00     10.00
       4.41     95.22     32.53     87.00     11.00
       4.59     89.51     30.41     88.00     12.00
       4.78     79.13     21.77     87.00     13.00
       4.97     71.90     17.68     85.00     12.00
       5.16     75.75     13.20     88.00     10.00
       5.34     61.50     17.21     87.00     11.00
       5.53     62.85     15.60     85.00     11.00
       5.72     60.16     23.02     88.00     12.00
       5.91     58.88     12.69     78.00     12.00
       6.09     53.16     11.01     97.00     13.00
       6.28     59.17     17.71     88.00      9.00
       6.47     75.35     18.00     85.00     13.00
       6.66     85.04     18.50     87.00     14.00
       6.84     86.22     14.26     88.00     12.00
       7.03     94.68     17.87     85.00     10.00
       7.22    102.22     23.22     87.00     10.00
       7.41    108.77     20.58     88.00     11.00
       7.59    108.88     20.75     87.00     11.00
       7.78    105.19     20.57     87.00      9.00
       7.97    105.75     25.69     85.00     10.00
       8.16     98.74     24.04     89.00     12.00
       8.34    100.46     30.22     86.00     12.00
       8.53     97.77     27.85     87.00     11.00
       8.72    108.62     29.81     87.00     14.00
       8.91    105.22     29.87     81.00     12.00
       9.09    108.14     25.23     95.00     15.00
       9.28    116.98     23.84     87.00     13.00
       9.47    112.20     19.08     87.00     12.00
       9.66    112.63     32.53     86.00     13.00
       9.84    136.50     37.32     89.00     14.00
      10.03    135.01     26.41     85.00     12.00
      10.22    153.68     21.48     87.00     12.00
      10.41    147.13     19.67     87.00     12.00
      10.59    140.11     21.85     88.00     12.00
      10.78    124.04     25.96     87.00     12.00
      10.97    124.65     31.79     85.00     13.00

The script above almost does what I want; the problem is that it sorts the final output with respect to the first column.
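
Would something like this keep the original order? gawk's undocumented WHINY_USERS variable seems to enable sorted for-in traversal whenever it is set at all, regardless of its value, so WHINY_USERS=0 presumably doesn't switch the sorting off. An untested sketch that instead remembers the order in which keys first appear:

Code:
awk '/^;/ { next }                         # skip the comment header lines
NR==FNR { if (!($1 in s)) order[++n] = $1  # record key order from the first file
          s[$1] = $1 " " $2; next }
        { s[$1] = s[$1] " " $2 }           # append column 2 of the later files
END     { for (i = 1; i <= n; i++) print s[order[i]] }' ~/test/*.txt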

Please help!
Thanks

Last edited by Scott; 09-18-2010 at 06:03 PM.. Reason: Code tags, please...
# 9  
Old 09-18-2010
I usually prefer perl over any shell script for this kind of job, but I'm probably biased since I don't really know awk well enough.

However, here's a simple implementation in bash. It may need tweaking depending on the OS and tools you have at hand; if your tail doesn't understand the -n +N form, grep -v '^;' can be used as a substitute to drop the comment header.
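
For instance, that substitution would look something like this (an untested sketch, using the variable names from the script below):

Code:
RDATA=`<"$FIRST" grep -v '^;' | cut -d "$COL_SEP" -f -2`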

Quote:
I have 100 files, each with a header of up to 11 lines; the number of columns and lines is the same in all files.
The number of lines didn't match in the example files, but I assume that was a mistake. This script assumes they are the same length, or at least that the first file is the longest.

The script takes the filenames to process on stdin rather than on the command line, e.g.:

Code:
$ find ~/test/ -name '*.txt' | ./add_second_columns.sh > results.txt

Code:
#!/bin/bash
# add_second_columns.sh

# the sample files carry a 10-line comment header; tail -n +N prints from
# line N onwards, so +11 starts at the first data line
HEADER_LENGTH=11
# every file has columns separated by a single character
COL_SEP=" "
# output column separator
COL_OSEP=$'\t'
# if there's an error, using a number greater than 0 will exit with that
# number
EXIT_ON_ERROR=1

# we'll get the files from stdin, so we should never run into the argc/argv
# problem

# later on we'll avoid piping into the while loop by saving a file's output
# to a string and splitting it on newlines, so here we set the input field
# separator to a newline
IFS=$'\n'

# we'll handle the first file separately, since we want to get the
# first two fields, instead of the second field only
read FIRST
if [ -e "$FIRST" ] ; then
	I=0
	RDATA=`<"$FIRST" tail -n +$HEADER_LENGTH | cut -d "$COL_SEP" -f -2`
	for E in $RDATA ; do
		((++I))
		DATA[$I]=$(echo "$E" | tr "$COL_SEP" "$COL_OSEP")
	done
else
	echo "ERROR: Couldn't open file '$FIRST'" >&2
	if [ $EXIT_ON_ERROR -gt 0 ] ; then
		exit $EXIT_ON_ERROR
	fi
fi

# process the rest of files
while read ENTRY ; do
	if [ -e "$ENTRY" ] ; then
		I=0
		RDATA=`<"$ENTRY" tail -n +$HEADER_LENGTH | cut -d "$COL_SEP" -f 2`
		for E in $RDATA ; do
			((++I))
			# schlemiel the painter, anyone?
			DATA[$I]="${DATA[$I]}${COL_OSEP}$E"
		done
	else
		echo "ERROR: Couldn't open file '$ENTRY'" >&2
		if [ $EXIT_ON_ERROR -gt 0 ] ; then
			exit $EXIT_ON_ERROR
		fi
	fi
done

# print everything
I=0
while [ $I -lt ${#DATA[*]} ] ; do
	((++I))
	echo "${DATA[$I]}"
done


Last edited by ikki; 09-18-2010 at 05:34 PM.. Reason: fixed a typo x2
# 10  
Old 09-18-2010
Thanks ikki,

That worked, thanks! The fields in my data files are delimited by 5 spaces, so I couldn't get the 'cut' part to work correctly: when I changed COL_SEP to five spaces, cut complained that the delimiter must be a single character. So I used 'awk' in place of 'cut'. I'd still like to know how to use cut when the delimiter is more than one space.
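
For the record, the two cut calls became something like this (awk splits fields on runs of whitespace by default, so no delimiter gymnastics are needed):

Code:
RDATA=`<"$FIRST" tail -n +$HEADER_LENGTH | awk '{ print $1, $2 }'`  # first file: two columns
RDATA=`<"$ENTRY" tail -n +$HEADER_LENGTH | awk '{ print $2 }'`      # the rest: second column only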

Thanks again
# 11  
Old 09-18-2010
Quote:
Originally Posted by malandisa
So I used 'awk' in place of 'cut'. I'd still like to know how to use cut when the delimiter is more than one space.
It's not possible with cut; it's designed to work with a single character as the field delimiter, which is why awk is probably the better choice here. Another way, if one really wants to use cut, or perhaps just to sanitize the data, would be to replace every run of spaces with a single space (or tab). That of course assumes that no field is empty (or consists only of spaces), or the result would be skewed.



For future reference: there are at least two easy ways of squeezing multiple instances of a character down to a single one (below). sed is a powerful companion to awk, or so I'm told, but even on its own I find it a very useful tool:

Code:
sed -e 's/ \{1,\}/ /g' | cut... # I usually use the GNU version, since I have an easier time remembering which chars to escape
tr -s " " | cut...
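
Applied to the sample files in this thread, a complete pipeline might look like this (an untested sketch; the leading spaces on each data line have to go too, or the first field cut sees is empty):

Code:
tail -n +11 File_1 | tr -s ' ' | sed 's/^ //' | cut -d ' ' -f 1,2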
