Concatenate column values when header is Matching from multiple files


 
Thread Tools Search this Thread
Top Forums UNIX for Beginners Questions & Answers Concatenate column values when header is Matching from multiple files
# 1  
Old 09-24-2016
Concatenate column values when header is Matching from multiple files

there can be n number of columns but the number of columns and header name will remain same in all 3 files. Files are tab Delimited.
a.txt

Code:
Name 9/1 9/2
X 1 7
y 2 8
z 3 9
a 4 10
b 5 11
c 6 12

b.xt
Code:
Name  9/1 9/2
X 13 19
y 14 20
z 15 21
a 16 22
b 17 23
c 18 24

c.txt
Code:
Name  9/1 9/2
X 25 31
y 26 32
z 27 33
a 28 34
b 29 35
c 30 36

required output

Code:
Name  9/1 9/2
X 1/13/25 7/19/31
y 2/14/26 8/20/32
z 3/15/27 9/21/33
a 4/16/28 10/22/34
b 5/17/29 11/23/35
c 6/16/30 12/24/36

I Want to merge all three file on the basis of matching header and if values are same in the row then print one value and if different then concatenate values from all three files under that column. like in file a.txt,b.txt,c.txt, row 2 has same value under column name "NAME" so print only X but column "9/1" has different values so print one header " 9/1" and under it print all values from three files e.g. 1/13/25.

Code:
join <(sort a.txt) <(sort b.txt) <(sort c.txt) >out.txt

Code:
 awk 'FNR==NR{a[$1]=$2;next} ($1 in a){ print $0, a[$1]}' a.txt b.txt c.txt

the above commands are not concatenating the values of same headers but it is printing all the rows from file a then all rows from file b then c

Last edited by Nina2910; 09-24-2016 at 12:35 AM..
# 2  
Old 09-24-2016
Hello Nina2910,

Welcome to forums, hope you will enjoy learning/sharing knowledge here. Could you please try following and let me know how it goes.
Code:
awk 'function get(a,b,array1){array1[a]=a in array1?array1[a]"/"b:b} FNR==NR{A[$1]=$2;B[$1]=$3;S[++i]=$1;next} FNR!=NR && FNR>1{get($1,$2,A);get($1,$3,B)} END{for(j=1;j<=i;j++){print S[j], A[S[j]],B[S[j]]}}' *.txt

If you have more txt files than a.txt,b.txtandc.txt then you could use following.
Code:
awk 'function get(a,b,array1){array1[a]=a in array1?array1[a]"/"b:b} FNR==NR{A[$1]=$2;B[$1]=$3;S[++i]=$1;next} FNR!=NR && FNR>1{get($1,$2,A);get($1,$3,B)} END{for(j=1;j<=i;j++){print S[j], A[S[j]],B[S[j]]}}'  a.txt  b.txt  c.txt

Output will be as follows.
Code:
Name 9/1 9/2
X 1/13/25 7/19/31
y 2/14/26 8/20/32
z 3/15/27 9/21/33
a 4/16/28 10/22/34
b 5/17/29 11/23/35
c 6/18/30 12/24/36

EDIT: Adding a non-one liner form of solution.
Code:
awk 'function get(a,b,array1){
				array1[a]=a in array1?array1[a]"/"b:b
		             } 
     FNR==NR                 {
				A[$1]=$2;
				B[$1]=$3;
				S[++i]=$1;
				next
                             } 
     FNR!=NR && FNR>1        {
				get($1,$2,A);
				get($1,$3,B)
			     } 
     END                     {
				for(j=1;j<=i;j++){
							print S[j], A[S[j]],B[S[j]]
						 }
                             }
    ' *.txt

Thanks,
R. Singh

Last edited by RavinderSingh13; 09-24-2016 at 01:17 AM.. Reason: Adding a non-one liner form of solution successfully too now.
This User Gave Thanks to RavinderSingh13 For This Post:
# 3  
Old 09-24-2016
Thank you so much Ravinder for welcoming me and giving the solution. It worked as expected

Two requests
1) could you please let me know if each file has n number of columns for example 32 columns in files a 32 columns in file b and 32 columns in file c. column header and column number will remain same in each file.how can I get the desired result.

FIles are Tab delimited
a.txt

Code:
Name 9/1 9/2 9/3.................9/30

b.txt
Code:
 Name 9/1 9/2 9/3....................9/30

c.txt
Code:
Name 9/1 9/2 9/3......................9/30

2) can you please explain the code sorry still learning

Last edited by Nina2910; 09-24-2016 at 01:35 AM..
# 4  
Old 09-24-2016
Hello Nina,

So number of columns were not mentioned before so solution was according to it, so following may help you with dynamic number of columns but condition is Input_files should have same number of columns(based on row vice), if this condition is TRUE(which seems to be as per your post) then following may help you in same. Let's say following are the Input_files.
Code:
cat a.txt
Name 9/1 9/2
X 1 7 7 8 9 10 12 12 12 12 12 13
y 2 8 3 4 5 6 7 8 8 9 1 212 12 12
z 3 9 1 1 2
a 4 10 1
b 5 11
c 6 12 12 13 14

cat b.txt
Name 9/1 9/2
X 1 7 7 8 9 10 12 12 12 12 12 13
y 22 4 13 2 53 26 17 18 82 19 111 12 13 4
z 3 9 1 1 2
a 4 10 1
b 5 11
c 6 12 12 13 14

cat c.txt
Name 9/1 9/2
X 1 7 7 8 9 10 12 12 12 12 12 13
y 2 8 3 4 5 6 7 8 8 9 1 212 12 12
z 3 9 1 1 2
a 4 10 1
b 5 11
c 6 12 12 13 14

Then following is the code for same.
Code:
paste *.txt | awk 'function get(field){q=NF/3;for(i=field;i<=NF;i+=q){$field=i>field?$field "/" $i:$field;}} NR>1{for(j=2;j<=NF/3;j++){get(j)};for(j=NF/3+1;j<=NF;j++){$j=""};sub(/[[:space:]]+$/,X,$0);print}'

Output will be as follows.
Code:
X 1/1/1 7/7/7 7/7/7 8/8/8 9/9/9 10/10/10 12/12/12 12/12/12 12/12/12 12/12/12 12/12/12 13/13/13                          
y 2/22/2 8/4/8 3/13/3 4/2/4 5/53/5 6/26/6 7/17/7 8/18/8 8/82/8 9/19/9 1/111/1 212/12/212 12/13/12 12/4/12                              
z 3/3/3 9/9/9 1/1/1 1/1/1 2/2/2            
a 4/4/4 10/10/10 1/1/1        
b 5/5/5 11/11/11      
c 6/6/6 12/12/12 12/12/12 13/13/13 14/14/14

Also in above code wherever I have used NF/3 so it is because of 3 Input_files are there if they vary in numbers then you could change according to their number. Kindly do let me know if this helps you or if you have any issues on same.
EDIT: Adding a non-one liner form of solution too here.
Code:
paste *.txt | awk 'function get(field){
					q=NF/3;
					for(i=field;i<=NF;i+=q){
								$field=i>field?$field "/" $i:$field;
							       }
                                      } 
                   NR>1               {
					for(j=2;j<=NF/3;j++)   {
								get(j)
                                                               };
					for(j=NF/3+1;j<=NF;j++){
								$j=""
							       };
                                        sub(/[[:space:]]+$/,X,$0);
					print
 				      }
                  '

Thanks,
R. Singh

Last edited by RavinderSingh13; 09-24-2016 at 03:06 AM.. Reason: Adding a non-one liner form of solution on same too succcessfully now.
This User Gave Thanks to RavinderSingh13 For This Post:
# 5  
Old 09-24-2016
You could also try the following...

Note that you say that your files are <tab> delimited, but all of the sample data you have shown us uses one or two <space> characters to separate fields; not a <tab> character.

The following will work with input files with fields separated by one or more blanks (where a blank is a <space> or a <tab>). This will not work if you have input files that really do use a <tab> as a field separator and some of your field data contains a <space>. It will work with any number of input files. It will work with any number of fields in a line (as long as all files have the same number of fields). The output header is taken from the 1st line of the 1st input file. The first line of all other input files are ignored. If the name on a data line is the same as the name on the header line in the 1st input file, that data will be merged into the output header line (i.e., it is assumed that the name used in the header in the 1st input file is not used as a name on any non-header line in any of the input files). Output fields will be separated by a <tab> character. The names of the files to be processed are not built into this script, they must be supplied as command-line arguments to the following script:
Code:
#!/bin/ksh
awk '	# Use the awk utility to interpret the following script...
BEGIN {	# Set output field separator.
	OFS = "\t"
}
NR == 1 || FNR > 1 {
	# Gather data from the 1st line in the 1st file (the header is supposed
	# to be the same in all input files) and from the 2nd line on in every
	# input file...
	# If we have not seen the name found in the first field before...
	if(!($1 in name)) {
		# Add the 1st first to the list of known names, increment the
		# number of names we have seen, and note the output line number
		# where this name should appear in the output....
		name[order[++nc] = $1]
		# and initialize the data for each output field for this name
		# from the corresponding input fields on this line.
		for(i = 2; i <= NF; i++)
			d[$1, i] = $i
	} else	# And if we have seen this name before, add data to be output
		# for this name to the accumalated data we have seen before for
		# this name.
		for(i = 2; i <= NF; i++)
			d[$1, i] = d[$1, i] "/" $i
}
END {	# Now that we have hit EOF on the last input file, print the accumulated
	# output.  For each name seen...
	for(i = 1; i <= nc; i++) {
		# Print the name...
		printf("%s", order[i])
		# and for the remaining fields...
		for(j = 2; j <= NF; j++)
			# print the output field separator followed by the
			# accumulated data for this name and field number.
			printf("%s%s", OFS, d[order[i], j])
		# and after the last field has been printed, add an aoutput
		# record separator.
		print ""
	}
}' "$@"	# Terminate the awk script and use the command line arguments as the
	# list of files to be processed.

This was written and tested using a Korn shell, but will work with any shell that uses Bourne shell syntax. If you save this script in a file named merger and make it executable:
Code:
chmod +x merger

and execute it with the pathnames of your sample input files:
Code:
./merger a.txt b.txt c.txt

it produces the output:
Code:
Name	9/1	9/2
X	1/13/25	7/19/31
y	2/14/26	8/20/32
z	3/15/27	9/21/33
a	4/16/28	10/22/34
b	5/17/29	11/23/35
c	6/18/30	12/24/36

Note that the output you said you wanted on the last line of the output was:
Code:
c 6/16/30 12/24/36

which, in addition to using <space> as a field separator instead of <tab>, also has 16 as the data from the 2nd column of the last line in b.txt instead of the value 18 that was contained in that field in your sample input file.
# 6  
Old 09-24-2016
Hello Nina2910,

So in previous code of mine(POST#4)we need to manually provide number of .txt file into code(where we are using NF/3). So converting code into a script and skipping pain of manually putting number of .txt files into following script, which is successfully run and tested in BASH. One more thing my code not checking about hearders of any of the Input_files.
Code:
cat script.ksh
COUNT=$(ls *.txt | wc -l)
paste *.txt | awk -vcount="$COUNT" 'function get(field){
							q=NF/count;
							for(i=field;i<=NF;i+=q){
										$field=i>field?$field "/" $i:$field;
							       		       }
                                                       } 
                   NR>1               {
					for(j=2;j<=NF/count;j++)   {
								get(j)
                                                               };
					for(j=NF/count+1;j<=NF;j++){
								$j=""
							       };
                                        sub(/[[:space:]]+$/,X,$0);
					print
 				      }
                  '

Thanks,
R. Singh

Last edited by RavinderSingh13; 09-24-2016 at 03:45 AM..
This User Gave Thanks to RavinderSingh13 For This Post:
# 7  
Old 09-24-2016
For any number or order of fields, any order of lines, missing fields across files, try
Code:
awk '
FNR == 1        {for (n=split($0, T); n>1; n--) if (!(T[n] in HDK)) HDK[T[n]] = ++HDC
                 F1 = T[1]
                 next
                }

!($1 in NM)     {NM[$1]
                 NMS[++SQ] = $1
                }

                {for (i=2; i<=NF; i++) NC[$1,T[i]] = NC[$1,T[i]] (NC[$1,T[i]]?"/":"") $i
                }

END             {printf "%s", F1
                 for (h in HDK) printf "\t%s", h
                 printf RS

                 for (i=1; i<=SQ; i++)  {printf "%s", NMS[i]
                                         for (h in HDK) printf "\t%s", NC[NMS[i],h]
                                         printf RS
                                        } 
                }
'  file[123]
Name	9/1	9/2	9/3
X	1/25	7/19/31	13
y	2/26	8/20/32	14
z	3/27	9/33	
a	4/28	10/22/34	16
b	5/29	11/23/35	17
c	6/30	12/24/36	18
Z		21	15

The second input file has an upper case "Z" as a name, and "9/3" as a new and unique column header. Output fields are separated by <TAB> chars, the misleading output picture is due to the varying length of Col 2...
This User Gave Thanks to RudiC For This Post:
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Comparing same column from two files, printing whole row with matching values

First I'd like to apologize if I opened a thread which is already open somewhere. I did a bit of searching but could quite find what I was looking for, so I will try to explaing what I need. I'm writing a script on our server, got to a point where I have two files with results. Example: File1... (6 Replies)
Discussion started by: mitabrev83
6 Replies

2. Shell Programming and Scripting

Concatenate values in the first column based on the second column.

I have a file (myfile.txt) with contents like this: 1.txt apple is 3.txt apple is 5.txt apple is 2.txt apple is a 7.txt apple is a 8.txt apple is a fruit 4.txt orange not a fruit 6.txt zero isThe above file is already sorted using this command: sort -k2 myfile.txtMy objective is to get... (3 Replies)
Discussion started by: shoaibjameel123
3 Replies

3. Shell Programming and Scripting

Extracting values based on line-column numbers from multiple text files

Dear All, I have to solve the following problems with multiple tab-separated text file but I don't know how. Any help would be greatly appreciated. I have access to Linux mint (but not as a professional). I have multiple tab-delimited files with the following structure: file1: 1 44 2 ... (5 Replies)
Discussion started by: Bastami
5 Replies

4. Shell Programming and Scripting

Sum column values matching other field

this is part of a KT i am going thru. i am writing a script in bash shell, linux where i have 2 columns where 1st signifies the nth hour like 00, 01, 02...23 and 2nd the file size. sample data attached. Desired output is 3 columns which will give the nth hour, number of entries in nth hour and... (3 Replies)
Discussion started by: alpha_1
3 Replies

5. Shell Programming and Scripting

Sum values of specific column in multiple files, considering ranges defined in another file

I have a file (let say file B) like this: File B: A1 3 5 A1 7 9 A2 2 5 A3 1 3 The first column defines a filename and the other two define a range in that specific file. In the same directory, I have also three more files (File A1, A2 and A3). Here is 10 sample lines... (3 Replies)
Discussion started by: Bastami
3 Replies

6. Shell Programming and Scripting

Compare values in two files. For matching rows print corresponding values from File 1 in File2.

- I have two files (File 1 and File 2) and the contents of the files are mentioned below. - I am trying to compare the values of Column1 of File1 with Column1 of File2. If a match is found, print the corresponding value from Column2 of File1 in Column5 of File2. - I tried to modify and use... (10 Replies)
Discussion started by: Santoshbn
10 Replies

7. UNIX for Dummies Questions & Answers

shift values in one column as header for values in another column

Hi Gurus, I have a tab separated text file with two columns. I would like to make the first column values as headings for the second column values. Ex. >value1 subjects >value2 priorities >value3 requirements ...etc and I want to have a file >value1 subjects >value2 priorities... (4 Replies)
Discussion started by: Unilearn
4 Replies

8. UNIX for Dummies Questions & Answers

Rename a header column by adding another column entry to the header column name

Hi All, I have a file example.csv which looks like this GrpID,TargetID,Signal,Avg_Num CSCH74_1_1,2007,61,256 CSCH74_1_1,212007,647,679 CSCH74_1_1,12007,3,32 CSCH74_1_1,207,299,777 I want the output as GrpID,TragetID,Signal-CSCH74_1_1,Avg_Num CSCH74_1_1,2007,61,256... (1 Reply)
Discussion started by: Vavad
1 Replies

9. Shell Programming and Scripting

Rename a header column by adding another column entry to the header column name URGENT!!

Hi All, I have a file example.csv which looks like this GrpID,TargetID,Signal,Avg_Num CSCH74_1_1,2007,61,256 CSCH74_1_1,212007,647,679 CSCH74_1_1,12007,3,32 CSCH74_1_1,207,299,777 I want the output as GrpID,TragetID,Signal-CSCH74_1_1,Avg_Num CSCH74_1_1,2007,61,256... (4 Replies)
Discussion started by: Vavad
4 Replies

10. Shell Programming and Scripting

Joining multiple files based on one column with different and similar values (shell or perl)

Hi, I have nine files looking similar to file1 & file2 below. File1: 1 ABCA1 1 ABCC8 1 ABR:N 1 ACACB 1 ACAP2 1 ACOT1 1 ACSBG 1 ACTR1 1 ACTRT 1 ADAMT 1 AEN:N 1 AKAP1File2: 1 A4GAL 1 ACTBL 1 ACTL7 (4 Replies)
Discussion started by: seqbiologist
4 Replies
Login or Register to Ask a Question