I had a similar data-sorting task and found the program above extremely useful as a starting point and learning tool. Here is the same program, annotated with what I worked out while studying it, in case it is useful to anyone else.
#This regex quietly ignores all lines in the datafile starting with #.
#If for some reason you don't want to do that, just start with the {
!/^[#]/ {
#This checks that the requested column c (supplied from outside the
#program, for example with -v c=2) is between 1 and NF, the number of
#fields on the current line:
if ( c > 0 && c <= NF ) {
# idx is a string made of $1, which is the data index, and c, which
# is the column number of the data we want to extract. They are
# separated by the separator SUBSEP, which can be set if you want
# in a BEGIN{} statement. See for example the documentation on
# awk arrays and SUBSEP.
idx = $1 SUBSEP c
# a is a 1-dimensional array, whose index is the string idx. While
# scanning through the first file, the (idx in a) test will return false,
# so a[idx] = $c. In subsequent files, (idx in a) will pass, so
# a[idx] will then equal a[idx] OFS $c. OFS is the output field
# separator which I set to " ", $c is the data column. So each a[idx]
# is a string holding one row of output, which grows by an OFS and a
# data value for each file scanned.
a[idx] = ( idx in a ) ? a[idx] OFS $c : $c
}
}
END {
# idx is as above, except that it is now being recalled as the index
# of a. It is still in the form of a string. I found it more clear
# to call it idx again instead of rec.
for( idx in a ) {
# this creates the array idxA by splitting idx at every occurrence of
# the subscript separator SUBSEP
split(idx, idxA, SUBSEP)
#idxA[1] is the row index, idxA[2] would be the column number
#a[idx] is the string of data values for the same row collected from each datafile.
#"%d%s%s\n" says to format the printed line as a decimal integer followed by
#two strings then a newline. See for example the
#printf section of the gawk manual.
printf("%d%s%s\n", idxA[1], OFS, a[idx])
}
}