Paste columns based on common column: multiple files
Hi all,
I've multiple files. In this case 5. Space separated columns. Each file has 12 columns. Each file has 300-400K lines.
I want to get the output such that if a value in column 2 is present in all the files then get all the columns of that value and print it side by side.
Desired output must have 60 columns.
File 1
head HGWAS1/merged_info_CHR1.info
Code:
snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0
--- 1:10235:T:TA 10235 T TA 0.001 0.157 0.998 0 -1 -1 -1
--- rs145072688:10352:T:TA 10352 T TA 0.436 0.435 0.646 0 -1 -1 -1
--- 1:10505:A:T 10505 A T 0.000 0.095 1.000 0 -1 -1 -1
--- 1:10506:C:G 10506 C G 0.000 0.095 1.000 0 -1 -1 -1
--- 1:10511:G:A 10511 G A 0.000 0.001 1.000 0 -1 -1 -1
--- 1:10539:C:A 10539 C A 0.000 0.017 1.000 0 -1 -1 -1
--- 1:10542:C:T 10542 C T 0.000 0.211 1.000 0 -1 -1 -1
--- 1:10579:C:A 10579 C A 0.000 0.155 1.000 0 -1 -1 -1
File 2
head HGWAS2/merged_info_CHR1.info
Code:
snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0
--- 1:10177:A:AC 10177 A AC 0.414 0.473 0.670 0 -1 -1 -1
--- 1:10235:T:TA 10235 T TA 0.001 0.141 0.999 0 -1 -1 -1
--- rs145072688:10352:T:TA 10352 T TA 0.427 0.488 0.673 0 -1 -1 -1
--- 1:10505:A:T 10505 A T 0.000 0.045 1.000 0 -1 -1 -1
--- 1:10506:C:G 10506 C G 0.000 0.045 1.000 0 -1 -1 -1
--- 1:10511:G:A 10511 G A 0.000 0.020 1.000 0 -1 -1 -1
--- 1:10539:C:A 10539 C A 0.001 0.426 0.999 0 -1 -1 -1
--- 1:10579:C:A 10579 C A 0.000 0.003 1.000 0 -1 -1 -1
File 3
head HGWAS3/merged_info_CHR1.info
Code:
snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0
--- 1:10177:A:AC 10177 A AC 0.434 0.522 0.691 0 -1 -1 -1
--- 1:10235:T:TA 10235 T TA 0.000 0.122 0.999 0 -1 -1 -1
--- rs145072688:10352:T:TA 10352 T TA 0.421 0.526 0.693 0 -1 -1 -1
--- 1:10505:A:T 10505 A T 0.000 0.132 0.999 0 -1 -1 -1
--- 1:10506:C:G 10506 C G 0.000 0.132 0.999 0 -1 -1 -1
--- 1:10539:C:A 10539 C A 0.001 0.294 0.999 0 -1 -1 -1
--- 1:10542:C:T 10542 C T 0.000 0.001 1.000 0 -1 -1 -1
--- 1:10579:C:A 10579 C A 0.000 0.081 1.000 0 -1 -1 -1
File 4
head HGWAS4/merged_info_CHR1.info
Code:
snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0
--- 1:10177:A:AC 10177 A AC 0.418 0.539 0.700 0 -1 -1 -1
--- 1:10235:T:TA 10235 T TA 0.001 0.180 0.998 0 -1 -1 -1
--- rs145072688:10352:T:TA 10352 T TA 0.406 0.528 0.695 0 -1 -1 -1
--- 1:10505:A:T 10505 A T 0.000 0.063 1.000 0 -1 -1 -1
--- 1:10506:C:G 10506 C G 0.000 0.063 1.000 0 -1 -1 -1
--- 1:10511:G:A 10511 G A 0.000 0.015 1.000 0 -1 -1 -1
--- 1:10539:C:A 10539 C A 0.001 0.079 0.999 0 -1 -1 -1
--- 1:10542:C:T 10542 C T 0.000 0.022 1.000 0 -1 -1 -1
--- 1:10579:C:A 10579 C A 0.000 0.007 1.000 0 -1 -1 -1
File 5
head HGWAS6/merged_info_CHR1.info
Code:
snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0
--- 1:10177:A:AC 10177 A AC 0.406 0.512 0.695 0 -1 -1 -1
--- 1:10235:T:TA 10235 T TA 0.000 0.115 0.999 0 -1 -1 -1
--- rs145072688:10352:T:TA 10352 T TA 0.407 0.522 0.695 0 -1 -1 -1
--- 1:10505:A:T 10505 A T 0.000 0.029 1.000 0 -1 -1 -1
--- 1:10506:C:G 10506 C G 0.000 0.029 1.000 0 -1 -1 -1
--- 1:10511:G:A 10511 G A 0.000 0.759 1.000 0 -1 -1 -1
--- 1:10539:C:A 10539 C A 0.001 0.205 0.999 0 -1 -1 -1
--- 1:10542:C:T 10542 C T 0.000 0.012 1.000 0 -1 -1 -1
Desired output: Order of the column is important from each file.
Code:
snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0 snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0 snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0 snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0 snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0
--- 1:10235:T:TA 10235 T TA 0.001 0.157 0.998 0 -1 -1 -1 --- 1:10235:T:TA 10235 T TA 0.001 0.141 0.999 0 -1 -1 -1 --- 1:10235:T:TA 10235 T TA 0.000 0.122 0.999 0 -1 -1 -1 --- 1:10235:T:TA 10235 T TA 0.001 0.180 0.998 0 -1 -1 -1 --- 1:10235:T:TA 10235 T TA 0.000 0.115 0.999 0 -1 -1 -1
--- rs145072688:10352:T:TA 10352 T TA 0.436 0.435 0.646 0 -1 -1 -1 --- rs145072688:10352:T:TA 10352 T TA 0.427 0.488 0.673 0 -1 -1 -1 --- rs145072688:10352:T:TA 10352 T TA 0.421 0.526 0.693 0 -1 -1 -1 --- rs145072688:10352:T:TA 10352 T TA 0.406 0.528 0.695 0 -1 -1 -1 --- rs145072688:10352:T:TA 10352 T TA 0.407 0.522 0.695 0 -1 -1 -1
--- 1:10505:A:T 10505 A T ---similarly
--- 1:10506:C:G 10506 C G --similary
--- 1:10539:C:A 10539 C A ---similarly
What I've done:
I used join, then I end up with 24 columns. But Column two has values present in all the files. Next, I can grep it in all files and am losing out there. I am failing to put the grepped output in columnar format. Apparently not a good way to aim for this problem
Moderator's Comments:
Please use CODE tags correctly as required by forum rules!
Last edited by RudiC; 12-15-2017 at 04:07 AM..
Reason: Changed QUOTE to CODE tags.
for i in {1..22}
do
#--iterate over chromosomes
saveTemp=""
files_info="$(find $input_dir -name "*_CHR$i.info" | sort )"
files_list=""
#---split by new lines and make it array---
SAVEIFS=$IFS
IFS=$'\n'
files_info=($files_info)
IFS=$SAVEIFS
join -j 2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10,1.11,1.12,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8,2.9,2.10,2.11,2.12 ${files_info[0]} ${files_info[1]} > $output_dir/"tempCHR_"$i".info"
SAVEtemp=$output_dir/"tempCHR_"$i".info"
printf "$i joined for first two files\n"
for (( x=2;x<${#files_info[@]};x++ ))
do
join -j 2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10,1.11,1.12,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8,2.9,2.10,2.11,2.12 $SAVEtemp ${files_info[$x]} > $output_dir/"tempchr"$i"_"$x".info"
SAVEtemp=$output_dir/"tempchr"$i"_"$x".info"
done
mv $SAVEtemp $output_dir/"joined_CHR""$i"".info"
SAVEtemp=$output_dir/"joined_CHR""$i"".info"
printf "CHR $i is done for joining\n"
for w in ` awk '{print $2}' $SAVEtemp | grep -v "rs_id" `
do
st="" #start null string to concatenate
for (( x=0;x<${#files_info[@]};x++ ))
do
#--loop through files to grep the string
temp_st=$(grep -w $w ${files_info[$x]})
st=$st" "$temp_st
done
echo "$st" >> $output_dir/"cols_joined_CHR"$i".info"
done
printf "Proceseed files for $i chromosome!\n"
done
I left it running last evening and script has not finished working with chromosome 1. Terrible.
gawk '
FNR > 1 {
R[$2]
V[ARGIND FS $2] = $0
}
END {
for ( k in R )
{
f = 0
for ( i = 1; i <= ARGIND; i++ )
{
if ( ! ( ( i FS k ) in V ) )
f = 1
}
if ( f == 0 )
{
for ( i = 1; i <= ARGIND; i++ )
printf "%s ", V[i FS k]
printf "\n"
}
}
}
' HGWAS{1..5}/merged_info_CHR1.info
I have data of an excel files as given below,
file1
org1_1 1 1 2.5 100
org1_2 1 2 5.5 98
org1_3 1 3 7.2 88
file2
org2_1 1 1 2.5 100
org2_2 1 2 5.5 56
org2_3 1 3 7.2 70
I have multiple excel files as above shown.
I have to copy column 1, column 4 and paste into a new excel file as... (26 Replies)
Hello,
I have a file with 2 columns ( tableName , ColumnName) delimited by a Pipe like below . File is sorted by ColumnName.
Table1|Column1
Table2|Column1
Table5|Column1
Table3|Column2
Table2|Column2
Table4|Column3
Table2|Column3
Table2|Column4
Table5|Column4
Table2|Column5
From... (6 Replies)
Hi, I am trying to selectively merge two files based on keys reported in the 1st column.
File1:
#file1-header1
file1-header2
111 qwe rtz uio
198 asd fgh jkl
165 yxc
789 poi uzt rew
89 lkj
File2:
#file2-header2
file2-header2
165 ghz nko2 ... (2 Replies)
Hi all,
I'm trying to join two .txt file tab delimitated based on a common column.
File 1
transcript_id gene_id length effective_length expected_count TPM FPKM IsoPct
comp1000201_c0_seq1 comp1000201_c0 337 183.51 0.00 0.00 0.00 0.00
comp1000297_c0_seq1 ... (1 Reply)
Hi,
I am trying to get the common entries from 2 files based on 1st field.. However when I try to do in perl I am getting blank output.. How can I do this in awk?
open(BUFF1, "my_genes");
open(BUFF3, "rawcounts");
#open(WRBUFF,">result_rawcounts");
while($line =<BUFF1>)
{
... (3 Replies)
I have 100 data files labelled 250.1.txt through 250.100.txt. The second column of the data files partially match (there is about %90 overlap). Each data file has 4 columns.
I want the merge all these text files by the matching values in the second column. In the output, the first column should... (1 Reply)
I have n files (for ex:64 files) with one similar column. Is it possible to combine them all based on that column ?
file1
ax100 20 30 40
ax200 22 33 44
file2
ax100 10 20 40
ax200 12 13 44
file2
ax100 0 0 4
ax200 2 3 4 (9 Replies)
Hi All,
I do have 2 files
file 1 has 4 tab delimited columns
234 a c dfgyu
294 b g fih
302 c h jzh
328 z c san
597 f g son
File 2 has 2 tab delimted columns
234 23
302 24
597 24
I want to merge file 2 with file 1 based on the data common in both files which is the first column so... (6 Replies)
i have a file - it will be in sorted order on column 1
abc 0 1
abc 2 3
abc 3 5
def 1 7
def 0 1
--------
i'd like (awk maybe?) to get the results (any ideas)???
abc 5 9
def 1 8 (2 Replies)
Hi Friends
I have the following input data in 2 columns.
SNo 1
I1 Value
I2 Value
I3 Value
SNo 2
I4 Value
I5 Value
I6 Value
I7 Value
SNo 3
I8 Value
I9 Value
...............
................
SNo N (1 Reply)