Merge CSV files

# 1  
Old 06-01-2015

I have a lot of CSV files collected from a script, like below.

Names of the files (a few examples; there are thousands of them):
Code:
192.168.0.123_251_18796_1433144473.csv
192.168.0.123_251_18796_1433144772.csv
192.168.0.123_251_18796_1433145073.csv
192.168.0.123_251_18796_1433145372.csv

Content of each file:

192.168.0.123_251_18796_1433144473.csv:
Code:
"Date","Inbound","Outbound"
"2015-06-01 13:45:00","102061994.58","30970432.36"
"2015-06-01 13:50:00","101806973.26","31345909.05"
"2015-06-01 13:55:00","104203841.45","31944063.26"
"2015-06-01 14:00:00","87749793.77","31403776.19"
"2015-06-01 14:05:00","74034336.32","31371706.30"
"2015-06-01 14:10:00","58456052.80","30244690.38"
"2015-06-01 14:15:00","68841405.45","31038223.80"
"2015-06-01 14:20:00","91816301.42","31397426.60"
"2015-06-01 14:25:00","109179816.57","31436652.92"
"2015-06-01 14:30:00","113248200.09","31885767.39"
"2015-06-01 14:35:00","94289765.97","30544861.40"
"2015-06-01 14:40:00","102619582.08","31742486.29"
"2015-06-01 14:45:00","",""

192.168.0.123_251_18796_1433144772.csv
Code:
"Date","Inbound","Outbound"
"2015-06-01 13:50:00","101806973.26","31345909.05"
"2015-06-01 13:55:00","104203841.45","31944063.26"
"2015-06-01 14:00:00","87749793.77","31403776.19"
"2015-06-01 14:05:00","74034336.32","31371706.30"
"2015-06-01 14:10:00","58456052.80","30244690.38"
"2015-06-01 14:15:00","68841405.45","31038223.80"
"2015-06-01 14:20:00","91816301.42","31397426.60"
"2015-06-01 14:25:00","109179816.57","31436652.92"
"2015-06-01 14:30:00","113248200.09","31885767.39"
"2015-06-01 14:35:00","94289765.97","30544861.40"
"2015-06-01 14:40:00","102619582.08","31742486.29"
"2015-06-01 14:45:00","84352207.67","30581944.65"
"2015-06-01 14:50:00","",""

192.168.0.123_251_18796_1433145073.csv
Code:
"Date","Inbound","Outbound"
"2015-06-01 13:55:00","104203841.45","31944063.26"
"2015-06-01 14:00:00","87749793.77","31403776.19"
"2015-06-01 14:05:00","74034336.32","31371706.30"
"2015-06-01 14:10:00","58456052.80","30244690.38"
"2015-06-01 14:15:00","68841405.45","31038223.80"
"2015-06-01 14:20:00","91816301.42","31397426.60"
"2015-06-01 14:25:00","109179816.57","31436652.92"
"2015-06-01 14:30:00","113248200.09","31885767.39"
"2015-06-01 14:35:00","94289765.97","30544861.40"
"2015-06-01 14:40:00","102619582.08","31742486.29"
"2015-06-01 14:45:00","84352207.67","30581944.65"
"2015-06-01 14:50:00","93056971.42","30470065.37"
"2015-06-01 14:55:00","",""

192.168.0.123_251_18796_1433145372.csv
Code:
"Date","Inbound","Outbound"
"2015-06-01 14:00:00","87749793.77","31403776.19"
"2015-06-01 14:05:00","74034336.32","31371706.30"
"2015-06-01 14:10:00","58456052.80","30244690.38"
"2015-06-01 14:15:00","68841405.45","31038223.80"
"2015-06-01 14:20:00","91816301.42","31397426.60"
"2015-06-01 14:25:00","109179816.57","31436652.92"
"2015-06-01 14:30:00","113248200.09","31885767.39"
"2015-06-01 14:35:00","94289765.97","30544861.40"
"2015-06-01 14:40:00","102619582.08","31742486.29"
"2015-06-01 14:45:00","84352207.67","30581944.65"
"2015-06-01 14:50:00","93056971.42","30470065.37"
"2015-06-01 14:55:00","78773346.56","29866197.88"
"2015-06-01 15:00:00","",""

Each file contains one hour of CSV data, and consecutive files are offset from each other by 5 minutes, based on the timestamp.

I want to merge all of the files into one big file, with the rows sorted by timestamp.
I think some Perl script could do this, but I have no idea how.

Expected result :
Code:
"Date","Inbound","Outbound"
"2015-06-01 13:45:00","102061994.58","30970432.36"
"2015-06-01 13:50:00","101806973.26","31345909.05"
"2015-06-01 13:55:00","104203841.45","31944063.26"
"2015-06-01 14:00:00","87749793.77","31403776.19"
"2015-06-01 14:05:00","74034336.32","31371706.30"
"2015-06-01 14:10:00","58456052.80","30244690.38"
"2015-06-01 14:15:00","68841405.45","31038223.80"
"2015-06-01 14:20:00","91816301.42","31397426.60"
"2015-06-01 14:25:00","109179816.57","31436652.92"
"2015-06-01 14:30:00","113248200.09","31885767.39"
"2015-06-01 14:35:00","94289765.97","30544861.40"
"2015-06-01 14:40:00","102619582.08","31742486.29"
"2015-06-01 14:45:00","84352207.67","30581944.65"
"2015-06-01 14:50:00","93056971.42","30470065.37"
"2015-06-01 14:55:00","78773346.56","29866197.88"
"2015-06-01 15:00:00","",""

Regards and Thanks

---------- Post updated at 03:42 PM ---------- Previous update was at 03:17 PM ----------

Answering my own question.
I used this shell one-liner:

Code:
cat *.csv |  grep -v '"Date","Inbound","Outbound"' | sort -u -k1 | uniq | sed '/,\"\",\"\"/d'

It works.

But I'm open to better solutions from other folks.

TIA.
# 2  
Old 06-01-2015
In cases like these, when I merge files, I throw in an extra field carrying the name of the file each record came from, in case I later want to get rid of the separate files. So here I would add a fourth field, "source", and put the filename in it. You don't have to include this extra field in your sort: you could sort by the date, inbound, and outbound fields (excluding source), using "," as the delimiter for the sort keys.
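A minimal sketch of that idea (hypothetical sample data; assumes POSIX awk and sort, not the OP's actual files):

```shell
# Scratch directory with two small sample files (hypothetical data)
mkdir -p /tmp/merge_demo && cd /tmp/merge_demo && rm -f *.csv
printf '%s\n' '"Date","Inbound","Outbound"' \
    '"2015-06-01 13:45:00","1.0","2.0"' > a.csv
printf '%s\n' '"Date","Inbound","Outbound"' \
    '"2015-06-01 13:50:00","3.0","4.0"' > b.csv

# FNR > 1 skips each file's header; FILENAME becomes a fourth "source"
# column.  Sorting on fields 1-3 only (-k1,3) keeps the source column
# out of the comparison, and -u drops duplicate records.
awk -v OFS=',' 'FNR > 1 { print $0, "\"" FILENAME "\"" }' *.csv |
    sort -t ',' -k1,3 -u
```

Each output row then ends with the quoted filename it came from, so you can trace any record back after deleting the originals.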
# 3  
Old 06-01-2015
If your grep has the -h option (it is not standard grep), your solution could be reduced a bit further, to something like this:
Code:
grep -Fhve '"Date","Inbound","Outbound"' -e '","",""' *.csv | sort -u

If there are too many .csv files you may run into argument list length limitations. Then this is something you could try:
Code:
find . -name "*.csv" -exec grep -Fhve '"Date","Inbound","Outbound"' -e '","",""' {} + | sort -u

--
Without the -h option you would need cat:
Code:
cat *.csv | grep -Fve '"Date","Inbound","Outbound"' -e '","",""' | sort -u

or
Code:
find . -name "*.csv" -exec cat {} + | grep -Fve '"Date","Inbound","Outbound"' -e '","",""' | sort -u


Last edited by Scrutinizer; 06-01-2015 at 03:09 PM..
# 4  
Old 06-02-2015
rk4k,
The output produced by your script and the scripts suggested by Scrutinizer do not match the output you said you wanted. The heading line and the last line that you showed in your expected result are not included in the output they produce.

I haven't tried a performance comparison, but the following uses fewer processes and pushes less data through the pipes so it could be faster. With your sample input files, the following code:
Code:
for file in *.csv
do	printf '%s\n' "$file"
done | awk '
{	file = $0
	while((getline < file) == 1) {
		if(/Date/ && ! count[$0]++) {
			print
			continue
		}
		if(/"",""/ && $0 > trailer) {
			trailer = $0
			continue
		}
		if(! count[$0]++)
			print | "sort"
	}
	close(file)
}
END {	close("sort")
	print trailer
}'

produces the output:
Code:
"Date","Inbound","Outbound"
"2015-06-01 13:45:00","102061994.58","30970432.36"
"2015-06-01 13:50:00","101806973.26","31345909.05"
"2015-06-01 13:55:00","104203841.45","31944063.26"
"2015-06-01 14:00:00","87749793.77","31403776.19"
"2015-06-01 14:05:00","74034336.32","31371706.30"
"2015-06-01 14:10:00","58456052.80","30244690.38"
"2015-06-01 14:15:00","68841405.45","31038223.80"
"2015-06-01 14:20:00","91816301.42","31397426.60"
"2015-06-01 14:25:00","109179816.57","31436652.92"
"2015-06-01 14:30:00","113248200.09","31885767.39"
"2015-06-01 14:35:00","94289765.97","30544861.40"
"2015-06-01 14:40:00","102619582.08","31742486.29"
"2015-06-01 14:45:00","84352207.67","30581944.65"
"2015-06-01 14:50:00","93056971.42","30470065.37"
"2015-06-01 14:55:00","78773346.56","29866197.88"
"2015-06-01 15:00:00","",""

which seems to be what you said you wanted.

And, even with thousands of .csv files in the current directory, it shouldn't have any problems with E2BIG errors that you might get expanding the argument list for cat *.csv or grep ... *.csv.
# 5  
Old 06-02-2015
Hello all,

I've found Don's code much faster at sorting the CSV files.
Sorry about the heading and last lines in my first request. They're not mandatory; all I need is the CSV data.

Thanks for all the responses.

Much obliged.

Best Regards.
# 6  
Old 06-04-2015
Quote:
Originally Posted by Don Cragun
[..]
I haven't tried a performance comparison, but the following uses fewer processes and pushes less data through the pipes so it could be faster. [..]
The number of processes does not differ much between the various non-find solutions, but I never realized that the amount of data pushed through the pipes could matter that much. Thank you for that insight.

I did some tests with 10,000 input files, and the speedup was up to a factor of 5 when the selection is done before the pipe (obviously depending on how much data the selection saves).

--
My system does not have tight argument list length limits, so I was able to use this script (since the header and footer do not matter):
Code:
awk '/[0-9]"$/ && !A[$0]++' *.csv

Which is a little bit faster still, since it doesn't even use a pipe.
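For anyone reading along: the `/[0-9]"$/` pattern works because only real data rows end in a digit followed by a closing quote; the header row ends in a letter before the quote, and the trailing padding rows end in `""`. A self-contained demo with hypothetical sample files (not the OP's data):

```shell
# Scratch directory with two overlapping sample files (hypothetical data)
mkdir -p /tmp/csv_demo && cd /tmp/csv_demo && rm -f *.csv
cat > a.csv <<'EOF'
"Date","Inbound","Outbound"
"2015-06-01 13:45:00","1.00","2.00"
"2015-06-01 13:50:00","3.00","4.00"
"2015-06-01 13:55:00","",""
EOF
cat > b.csv <<'EOF'
"Date","Inbound","Outbound"
"2015-06-01 13:50:00","3.00","4.00"
"2015-06-01 13:55:00","5.00","6.00"
"2015-06-01 14:00:00","",""
EOF
# /[0-9]"$/ keeps only rows ending in digit+quote (drops the header and
# the empty trailer rows); !A[$0]++ prints each remaining row only once.
awk '/[0-9]"$/ && !A[$0]++' *.csv
```

Note the output comes out in order only because each input file is already sorted and the glob expands the files chronologically (the epoch timestamp in each filename sorts that way); appending `| sort` would make the ordering robust.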