Merge csvs with column headers


 
# 1  
Old 03-04-2015

hello gurus,

Somebody must have done this before, but I couldn't find anything. Please redirect me if this has been solved already; if not, please help.


Now to the problem: I have multiple CSV files (about 1000) which I need to concatenate by column header. The final file should contain the superset of all columns across the CSVs, with missing values left blank.

All files are in the same directory. Note the randomness and the spaces in the filenames.

File 2012 IBM 1
Code:
A,B,C
d1,d2,d3
d4,d5,d6


DELL2013 FILE_2
Code:
A,C,D
d12,d22,d32
d42,d52,d62


HP File 3
Code:
B,C,D,E
d22,d122,d132,
d32,d152,d262,142

My output should look like

Code:
A,B,C,D,E
d1,d2,d3,,
d4,d5,d6,,
d12,,d22,d32,
d42,,d52,d62,
,d22,d122,d132,
,d32,d152,d262,142


I can do this for files with common columns in the same order, but this can't incorporate new columns into the superfile:

Code:
awk 'FNR==1 && NR!=1{next;}{print}' *.csv
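For illustration (with invented filenames, not the OP's), here is a minimal sketch of why plain concatenation misaligns columns once headers differ:

```shell
# Sketch: concatenating two CSVs with different headers keeps every field in
# its original position, so y.csv's C value lands under B and its D value
# under C of the merged header.
tmp=$(mktemp -d)
printf 'A,B,C\nd1,d2,d3\n'    > "$tmp/x.csv"
printf 'A,C,D\nd12,d22,d32\n' > "$tmp/y.csv"

# Keep only the first file's header line, then append all data rows.
merged=$(awk 'FNR==1 && NR!=1 {next} {print}' "$tmp/x.csv" "$tmp/y.csv")
printf '%s\n' "$merged"
```

Under the retained header A,B,C, the second file's row d12,d22,d32 shows d22 (a C value) in the B column.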


# 2  
Old 03-05-2015
You need to process the 1st lines of the files 1st to get a list of the headings and then process the entire files a 2nd time to print the fields in the correct order in a file with all of the headings. Some implementations of the awk utility have a nextfile command that skips the rest of the file; others don't (and, therefore, need to read each line of each file both times the file is read). With your sample files, the difference won't be noticed; but with large files, implementations with nextfile will run faster.

You said your sample input files were named 2012 IBM 1, DELL2013 FILE_2, and HP File 3, but your sample code processes files with names ending with .csv. The following awk script assumes your sample input filenames are really 2012 IBM 1.csv, DELL2013 FILE_2.csv, and HP File 3.csv. Note that if you redirect the output of this script and you place the output in the same directory in which your input files are located you must not create the output file with a name ending in .csv.

Code:
awk '
BEGIN {	FS = OFS = ","
}
FNR == 1 {
	if(!(FILENAME in files)) {
		# This is the 1st time we have seen this file, print extra
		# heading names to the output header line and add new fields
		# from this file to headings list...
		for(i = 1; i <= NF; i++)
			if(!($i in h)) {
				printf("%s%s", nh ? "," : "", $i)
				h[$i] = ++nh
			}
		files[FILENAME]
		if(file1 == "")
			file1 = FILENAME
		# If your version of awk does not have a "nextfile" command,
		# comment out the "nextfile" line below AND uncomment every
		# line after this line containing the string "flushfile".
		nextfile
		# flushfile = 1
	} else {
		# This is the 2nd time we have seen this file.  Set the output
		# field positions for fields in this file...
		# Clear hf[]...
		split("", hf, "")
		for(i = 1; i <= NF; i++)
			hf[h[$i]] = i
		# Get output field order for this file...
		for(i = 1; i <= nh; i++) {
			of[i] = hf[i]
		}
		# If this is the 2nd time we have seen the 1st file in the list,
		# terminate the output file heading line...
		if(file1 == FILENAME)
			print ""
		# flushfile = 0
	}
	next
}
# flushfile { next }	# Flush lines 2 to the end of the file the 1st time we
			# see this file on systems that do not have an awk
			# "nextfile" command.
{	# Print contents of lines 2 through the end of the file the 2nd time we
	# read the file.
	for(i = 1; i <= nh; i++)
		printf("%s%s", of[i] ? $of[i] : "", (i == nh) ? ORS : OFS)
}' *.csv *.csv

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk.

This has been tested on a Mac OS X system as shown above and with the flushfile lines uncommented and the nextfile line commented. Either way it produced the output you said you were trying to get. (Note, however, that if the names of the input files were different, the order of the output fields might change.)

Note, however, that there are limits to the length of argument lists that can be handled when invoking commands like awk. If you get an argument list too long error, you can preprocess the 1st lines of your *.csv files to create a file with just the headings from the combined files. Then you can modify the script to work when you only pass awk operands naming the preprocessed file and the *.csv files just once instead of twice. If that argument list is still too long, you could modify the script to just read a file containing the names of the files to be processed.
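As a sketch of that preprocessing step (the function name and file layout here are assumptions, not from the thread), the union of header fields can be collected without a long argument list by streaming filenames from find, so awk itself receives no file operands:

```shell
# Hypothetical helper: print the union of header fields (in first-seen order)
# of every *.csv directly under a directory, reading filenames from a pipe
# instead of the argument list.
collect_headers() (
  find "$1" -maxdepth 1 -name '*.csv' -print |
  while IFS= read -r f; do
    head -n 1 "$f"            # header line only; one line per file
  done |
  awk -F, '
    { for (i = 1; i <= NF; i++)
        if (!($i in h)) { h[$i]; printf("%s%s", n++ ? "," : "", $i) } }
    END { print "" }'
)
```

The resulting line could be saved (e.g. collect_headers . > headers) and used in place of the first pass over the files.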

Hope this helps.

Note that if you tell us what OS and shell you're using when you ask for help, we can give you more personalized answers that will work on your system.
# 3  
Old 03-05-2015
The following solution reads the input files once, storing everything in memory, and prints the result in the END section.
Code:
awk -F, '
(FNR==1) {
  # header line
  for (i=1; i<=NF; i++) {
    if (!($i in H)) {
      # H[] is for this hash-lookup and O[] for in-order lookup
      ++nh
      H[$i]=nh
      O[nh]=$i
    }
    # column mapping for this file
    map[i]=H[$i]
  }
  next
}
{
  # store fields in F array
  ++nr
  for (i=1; i<=NF; i++) {F[nr,map[i]]=$i}
}
END {
  # print header line
  for (i=1; i<=nh; i++) {printf "%s", sep O[i]; sep=FS}
  printf RS
  # print F array
  for (n=1; n<=nr; n++) {
    # line n
    sep=""
    for (i=1; i<=nh; i++) {
      printf "%s", sep F[n,i]
      sep=FS 
    }
    printf RS
  }
}
' *.csv
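As a quick sanity check (the filenames are invented; the awk body is the script above, unchanged in logic), the sample data from the first post reproduces the expected superfile:

```shell
# Recreate the three sample files from the thread and run the in-memory merge.
tmp=$(mktemp -d)
printf 'A,B,C\nd1,d2,d3\nd4,d5,d6\n'                  > "$tmp/f1.csv"
printf 'A,C,D\nd12,d22,d32\nd42,d52,d62\n'            > "$tmp/f2.csv"
printf 'B,C,D,E\nd22,d122,d132,\nd32,d152,d262,142\n' > "$tmp/f3.csv"

result=$(awk -F, '
FNR==1 {
  # header line: extend the column superset, then map this file''s columns
  for (i=1; i<=NF; i++) {
    if (!($i in H)) { ++nh; H[$i]=nh; O[nh]=$i }
    map[i]=H[$i]
  }
  next
}
{ ++nr; for (i=1; i<=NF; i++) F[nr,map[i]]=$i }
END {
  for (i=1; i<=nh; i++) { printf "%s", sep O[i]; sep=FS }
  printf RS
  for (n=1; n<=nr; n++) {
    sep=""
    for (i=1; i<=nh; i++) { printf "%s", sep F[n,i]; sep=FS }
    printf RS
  }
}' "$tmp/f1.csv" "$tmp/f2.csv" "$tmp/f3.csv")

printf '%s\n' "$result"
```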


# 4  
Old 03-05-2015
Quote:
Originally Posted by MadeInGermany
The following solution reads the input file once, storing all in memory, and prints in the END section.

... ... ...
Hi MadeInGermany,
Good point. With about a thousand input files (and no indication of their sizes), I thought that storing all of the file data could consume a lot of memory and chose to read the files twice to reduce the memory footprint. I should have explicitly stated that when I posted my suggestion.

Depending on file naming conventions and file sizes, reducing the size of the argument list may well be more important than reducing the memory footprint.

- Don
# 5  
Old 03-05-2015
Thanks a lot to both of you.. this will help not only me, but several others in the future.