Merge csvs with column headers


 
# 1  
Old 03-04-2015

hello gurus,

Somebody must have done this before, but I couldn't find anything. Please redirect me if this has been solved already; if not, please help.


Now to the problem: I have multiple CSV files (about 1000) which I need to concatenate by column header. The final file should contain the superset of all columns across the CSVs, with missing values left blank.

All files are in the same directory. Note the randomness and the spaces in the filenames.

File 2012 IBM 1
Code:
A,B,C
d1,d2,d3
d4,d5,d6


DELL2013 FILE_2
Code:
A,C,D
d12,d22,d32
d42,d52,d62


HP File 3
Code:
B,C,D,E
d22,d122,d132,
d32,d152,d262,142

My output should look like

Code:
A,B,C,D,E
d1,d2,d3,,
d4,d5,d6,,
d12,,d22,d32,
d42,,d52,d62,
,d22,d122,d132,
,d32,d152,d262,142


I can do this for files with common columns in the same order, but this can't incorporate new columns into the superfile:

Code:
awk 'FNR==1 && NR!=1{next;}{print}' *.csv
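For illustration (with invented filenames, not the OP's), here is a minimal sketch of why plain concatenation misaligns columns once headers differ:

```shell
# Sketch: concatenating two CSVs with different headers keeps every field in
# its original position, so y.csv's C value lands under B and its D value
# under C of the merged header.
tmp=$(mktemp -d)
printf 'A,B,C\nd1,d2,d3\n'    > "$tmp/x.csv"
printf 'A,C,D\nd12,d22,d32\n' > "$tmp/y.csv"

# Keep only the first file's header line, then append all data rows.
merged=$(awk 'FNR==1 && NR!=1 {next} {print}' "$tmp/x.csv" "$tmp/y.csv")
printf '%s\n' "$merged"
```

Under the retained header A,B,C, the second file's row d12,d22,d32 shows d22 (a C value) in the B column.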


# 2  
Old 03-05-2015
You need to process the 1st lines of the files 1st to get a list of the headings and then process the entire files a 2nd time to print the fields in the correct order in a file with all of the headings. Some implementations of the awk utility have a nextfile command that skips the rest of the file; others don't (and, therefore, need to read each line of each file both times the file is read). With your sample files, the difference won't be noticed; but with large files, implementations with nextfile will run faster.

You said your sample input files were named 2012 IBM 1, DELL2013 FILE_2, and HP File 3, but your sample code processes files with names ending with .csv. The following awk script assumes your sample input filenames are really 2012 IBM 1.csv, DELL2013 FILE_2.csv, and HP File 3.csv. Note that if you redirect the output of this script and you place the output in the same directory in which your input files are located you must not create the output file with a name ending in .csv.

Code:
awk '
BEGIN {	FS = OFS = ","
}
FNR == 1 {
	if(!(FILENAME in files)) {
		# This is the 1st time we have seen this file, print extra
		# heading names to the output header line and add new fields
		# from this file to headings list...
		for(i = 1; i <= NF; i++)
			if(!($i in h)) {
				printf("%s%s", nh ? "," : "", $i)
				h[$i] = ++nh
			}
		files[FILENAME]
		if(file1 == "")
			file1 = FILENAME
		# If your version of awk does not have a "nextfile" command,
		# comment out the "nextfile" line below AND uncomment every
		# line after this line containing the string "flushfile".
		nextfile
		# flushfile = 1
	} else {
		# This is the 2nd time we have seen this file.  Set the output
		# field positions for fields in this file...
		# Clear hf[]...
		split("", hf, "")
		for(i = 1; i <= NF; i++)
			hf[h[$i]] = i
		# Get output field order for this file...
		for(i = 1; i <= nh; i++) {
			of[i] = hf[i]
		}
		# If this is the 2nd time we have seen the 1st file in the list,
		# terminate the output file heading line...
		if(file1 == FILENAME)
			print ""
		# flushfile = 0
	}
	next
}
# flushfile { next }	# Flush lines 2 to the end of the file the 1st time we
			# see this file on systems that do not have an awk
			# "nextfile" command.
{	# Print contents of lines 2 through the end of the file the 2nd time we
	# read the file.
	for(i = 1; i <= nh; i++)
		printf("%s%s", of[i] ? $of[i] : "", (i == nh) ? ORS : OFS)
}' *.csv *.csv

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk.

This has been tested on a Mac OS X system as shown above and with the flushfile lines uncommented and the nextfile line commented. Either way it produced the output you said you were trying to get. (Note, however, that if the names of the input files were different, the order of the output fields might change.)

Note, however, that there are limits to the length of argument lists that can be handled when invoking commands like awk. If you get an argument list too long error, you can preprocess the 1st lines of your *.csv files to create a file with just the headings from the combined files. Then you can modify the script to work when you only pass awk operands naming the preprocessed file and the *.csv files just once instead of twice. If that argument list is still too long, you could modify the script to just read a file containing the names of the files to be processed.
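As a sketch of that preprocessing step (the function name and file layout here are assumptions, not from the thread), the union of header fields can be collected without a long argument list by streaming filenames from find, so awk itself receives no file operands:

```shell
# Hypothetical helper: print the union of header fields (in first-seen order)
# of every *.csv directly under a directory, reading filenames from a pipe
# instead of the argument list.
collect_headers() (
  find "$1" -maxdepth 1 -name '*.csv' -print |
  while IFS= read -r f; do
    head -n 1 "$f"            # header line only; one line per file
  done |
  awk -F, '
    { for (i = 1; i <= NF; i++)
        if (!($i in h)) { h[$i]; printf("%s%s", n++ ? "," : "", $i) } }
    END { print "" }'
)
```

The resulting line could be saved (e.g. collect_headers . > headers) and used in place of the first pass over the files.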

Hope this helps.

Note that if you tell us what OS and shell you're using when you ask for help, we can give you more personalized answers that will work on your system.
# 3  
Old 03-05-2015
The following solution reads the input files once, storing everything in memory, and prints the result in the END section.
Code:
awk -F, '
(FNR==1) {
  # header line
  for (i=1; i<=NF; i++) {
    if (!($i in H)) {
      # H[] is for this hash-lookup and O[] for in-order lookup
      ++nh
      H[$i]=nh
      O[nh]=$i
    }
    # column mapping for this file
    map[i]=H[$i]
  }
  next
}
{
  # store fields in F array
  ++nr
  for (i=1; i<=NF; i++) {F[nr,map[i]]=$i}
}
END {
  # print header line
  for (i=1; i<=nh; i++) {printf "%s", sep O[i]; sep=FS}
  printf RS
  # print F array
  for (n=1; n<=nr; n++) {
    # line n
    sep=""
    for (i=1; i<=nh; i++) {
      printf "%s", sep F[n,i]
      sep=FS 
    }
    printf RS
  }
}
' *.csv
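As a quick sanity check (the filenames are invented; the awk body is the script above, unchanged in logic), the sample data from the first post reproduces the expected superfile:

```shell
# Recreate the three sample files from the thread and run the in-memory merge.
tmp=$(mktemp -d)
printf 'A,B,C\nd1,d2,d3\nd4,d5,d6\n'                  > "$tmp/f1.csv"
printf 'A,C,D\nd12,d22,d32\nd42,d52,d62\n'            > "$tmp/f2.csv"
printf 'B,C,D,E\nd22,d122,d132,\nd32,d152,d262,142\n' > "$tmp/f3.csv"

result=$(awk -F, '
FNR==1 {
  # header line: extend the column superset, then map this file''s columns
  for (i=1; i<=NF; i++) {
    if (!($i in H)) { ++nh; H[$i]=nh; O[nh]=$i }
    map[i]=H[$i]
  }
  next
}
{ ++nr; for (i=1; i<=NF; i++) F[nr,map[i]]=$i }
END {
  for (i=1; i<=nh; i++) { printf "%s", sep O[i]; sep=FS }
  printf RS
  for (n=1; n<=nr; n++) {
    sep=""
    for (i=1; i<=nh; i++) { printf "%s", sep F[n,i]; sep=FS }
    printf RS
  }
}' "$tmp/f1.csv" "$tmp/f2.csv" "$tmp/f3.csv")

printf '%s\n' "$result"
```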


# 4  
Old 03-05-2015
Quote:
Originally Posted by MadeInGermany
The following solution reads the input file once, storing all in memory, and prints in the END section.

... ... ...
Hi MadeInGermany,
Good point. With about a thousand input files (and no indication of their sizes), I thought that storing all of the file data could consume a lot of memory and chose to read the files twice to reduce the memory footprint. I should have explicitly stated that when I posted my suggestion.

Depending on file naming conventions and file sizes, reducing the size of the argument list may well be more important than reducing the memory footprint.

- Don
# 5  
Old 03-05-2015
Thanks a lot to both of you.. this will help not only me, but several others in the future.