Paste columns based on common column: multiple files


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Paste columns based on common column: multiple files
# 1  
Old 12-14-2017
Paste columns based on common column: multiple files

Hi all,

I've multiple files. In this case 5. Space separated columns. Each file has 12 columns. Each file has 300-400K lines.
I want to get the output such that if a value in column 2 is present in all the files then get all the columns of that value and print it side by side.

Desired output must have 60 columns.

File 1
head HGWAS1/merged_info_CHR1.info

Code:
snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0
--- 1:10235:T:TA 10235 T TA 0.001 0.157 0.998 0 -1 -1 -1
--- rs145072688:10352:T:TA 10352 T TA 0.436 0.435 0.646 0 -1 -1 -1
--- 1:10505:A:T 10505 A T 0.000 0.095 1.000 0 -1 -1 -1
--- 1:10506:C:G 10506 C G 0.000 0.095 1.000 0 -1 -1 -1
--- 1:10511:G:A 10511 G A 0.000 0.001 1.000 0 -1 -1 -1
--- 1:10539:C:A 10539 C A 0.000 0.017 1.000 0 -1 -1 -1
--- 1:10542:C:T 10542 C T 0.000 0.211 1.000 0 -1 -1 -1
--- 1:10579:C:A 10579 C A 0.000 0.155 1.000 0 -1 -1 -1

File 2
head HGWAS2/merged_info_CHR1.info
Code:
snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0
--- 1:10177:A:AC 10177 A AC 0.414 0.473 0.670 0 -1 -1 -1
--- 1:10235:T:TA 10235 T TA 0.001 0.141 0.999 0 -1 -1 -1
--- rs145072688:10352:T:TA 10352 T TA 0.427 0.488 0.673 0 -1 -1 -1
--- 1:10505:A:T 10505 A T 0.000 0.045 1.000 0 -1 -1 -1
--- 1:10506:C:G 10506 C G 0.000 0.045 1.000 0 -1 -1 -1
--- 1:10511:G:A 10511 G A 0.000 0.020 1.000 0 -1 -1 -1
--- 1:10539:C:A 10539 C A 0.001 0.426 0.999 0 -1 -1 -1
--- 1:10579:C:A 10579 C A 0.000 0.003 1.000 0 -1 -1 -1

File 3
head HGWAS3/merged_info_CHR1.info
Code:
snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0
--- 1:10177:A:AC 10177 A AC 0.434 0.522 0.691 0 -1 -1 -1
--- 1:10235:T:TA 10235 T TA 0.000 0.122 0.999 0 -1 -1 -1
--- rs145072688:10352:T:TA 10352 T TA 0.421 0.526 0.693 0 -1 -1 -1
--- 1:10505:A:T 10505 A T 0.000 0.132 0.999 0 -1 -1 -1
--- 1:10506:C:G 10506 C G 0.000 0.132 0.999 0 -1 -1 -1
--- 1:10539:C:A 10539 C A 0.001 0.294 0.999 0 -1 -1 -1
--- 1:10542:C:T 10542 C T 0.000 0.001 1.000 0 -1 -1 -1
--- 1:10579:C:A 10579 C A 0.000 0.081 1.000 0 -1 -1 -1

File 4
head HGWAS4/merged_info_CHR1.info
Code:
snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0
--- 1:10177:A:AC 10177 A AC 0.418 0.539 0.700 0 -1 -1 -1
--- 1:10235:T:TA 10235 T TA 0.001 0.180 0.998 0 -1 -1 -1
--- rs145072688:10352:T:TA 10352 T TA 0.406 0.528 0.695 0 -1 -1 -1
--- 1:10505:A:T 10505 A T 0.000 0.063 1.000 0 -1 -1 -1
--- 1:10506:C:G 10506 C G 0.000 0.063 1.000 0 -1 -1 -1
--- 1:10511:G:A 10511 G A 0.000 0.015 1.000 0 -1 -1 -1
--- 1:10539:C:A 10539 C A 0.001 0.079 0.999 0 -1 -1 -1
--- 1:10542:C:T 10542 C T 0.000 0.022 1.000 0 -1 -1 -1
--- 1:10579:C:A 10579 C A 0.000 0.007 1.000 0 -1 -1 -1

File 5
head HGWAS6/merged_info_CHR1.info
Code:
snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0
--- 1:10177:A:AC 10177 A AC 0.406 0.512 0.695 0 -1 -1 -1
--- 1:10235:T:TA 10235 T TA 0.000 0.115 0.999 0 -1 -1 -1
--- rs145072688:10352:T:TA 10352 T TA 0.407 0.522 0.695 0 -1 -1 -1
--- 1:10505:A:T 10505 A T 0.000 0.029 1.000 0 -1 -1 -1
--- 1:10506:C:G 10506 C G 0.000 0.029 1.000 0 -1 -1 -1
--- 1:10511:G:A 10511 G A 0.000 0.759 1.000 0 -1 -1 -1
--- 1:10539:C:A 10539 C A 0.001 0.205 0.999 0 -1 -1 -1
--- 1:10542:C:T 10542 C T 0.000 0.012 1.000 0 -1 -1 -1

Desired output: Order of the column is important from each file.
Code:
snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0 snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0 snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0 snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0 snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0 
--- 1:10235:T:TA 10235 T TA 0.001 0.157 0.998 0 -1 -1 -1 --- 1:10235:T:TA 10235 T TA 0.001 0.141 0.999 0 -1 -1 -1 --- 1:10235:T:TA 10235 T TA 0.000 0.122 0.999 0 -1 -1 -1 --- 1:10235:T:TA 10235 T TA 0.001 0.180 0.998 0 -1 -1 -1 --- 1:10235:T:TA 10235 T TA 0.000 0.115 0.999 0 -1 -1 -1
--- rs145072688:10352:T:TA 10352 T TA 0.436 0.435 0.646 0 -1 -1 -1 --- rs145072688:10352:T:TA 10352 T TA 0.427 0.488 0.673 0 -1 -1 -1 --- rs145072688:10352:T:TA 10352 T TA 0.421 0.526 0.693 0 -1 -1 -1 --- rs145072688:10352:T:TA 10352 T TA 0.406 0.528 0.695 0 -1 -1 -1 --- rs145072688:10352:T:TA 10352 T TA 0.407 0.522 0.695 0 -1 -1 -1
--- 1:10505:A:T 10505 A T ---similarly
--- 1:10506:C:G 10506 C G  --similary
--- 1:10539:C:A 10539 C A ---similarly

What I've done:
I used join, then I end up with 24 columns. But Column two has values present in all the files. Next, I can grep it in all files and am losing out there. I am failing to put the grepped output in columnar format. Apparently not a good way to aim for this problem Smilie


Moderator's Comments:
Mod Comment Please use CODE tags correctly as required by forum rules!

Last edited by RudiC; 12-15-2017 at 04:07 AM.. Reason: Changed QUOTE to CODE tags.
# 2  
Old 12-14-2017
You should post your full script and show your work.

Thanks.
# 3  
Old 12-15-2017
Something along this line:
Code:
awk '
BEGIN           {TF = ARGC
                }

FNR == 1        {FC++
                 if (FC < TF) ARGV[ARGC++] = FILENAME
                }

FC < TF         {CNT[$2]++
                 next
                }

CNT[$2] == 5    {if (!LINE[$2]) SEQ[++SN] = $2
                 LINE[$2] = LINE[$2] $0 " "
                }

END             {for (s=1; s<=SN; s++) print LINE[SEQ[s]]
                }
'  HGWAS?/merged_info_CHR1.info

# 4  
Old 12-15-2017
Code:
for i in {1..22}
do
    #--iterate over chromosomes
    saveTemp=""
    files_info="$(find $input_dir -name "*_CHR$i.info"  | sort )"
    files_list=""
    
    #---split by new lines and make it array---
    SAVEIFS=$IFS
    IFS=$'\n'
    files_info=($files_info)
    IFS=$SAVEIFS 
    
    join -j 2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10,1.11,1.12,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8,2.9,2.10,2.11,2.12  ${files_info[0]}  ${files_info[1]}  > $output_dir/"tempCHR_"$i".info" 

    SAVEtemp=$output_dir/"tempCHR_"$i".info"
    printf "$i joined for first two files\n"

    for (( x=2;x<${#files_info[@]};x++ ))
    do
	join -j 2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10,1.11,1.12,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8,2.9,2.10,2.11,2.12 $SAVEtemp  ${files_info[$x]}  > $output_dir/"tempchr"$i"_"$x".info" 

	SAVEtemp=$output_dir/"tempchr"$i"_"$x".info"
    done
    mv $SAVEtemp $output_dir/"joined_CHR""$i"".info"
    SAVEtemp=$output_dir/"joined_CHR""$i"".info"
    printf "CHR $i is done for joining\n"
    
    for w in ` awk '{print $2}' $SAVEtemp | grep -v "rs_id" `
    do
	st="" #start null string to concatenate
	
	for (( x=0;x<${#files_info[@]};x++ ))
	do
	    #--loop through files to grep the string
	    
	    temp_st=$(grep -w $w ${files_info[$x]}) 
	    st=$st" "$temp_st

	done
	echo "$st" >> $output_dir/"cols_joined_CHR"$i".info" 
    done

    printf "Proceseed files for $i chromosome!\n"
done

I left it running last evening and script has not finished working with chromosome 1. Smilie Terrible.
# 5  
Old 12-15-2017
Here is an approach in gawk:-
Code:
gawk '
        FNR > 1 {
                R[$2]
                V[ARGIND FS $2] = $0
        }
        END {
                for ( k in R )
                {
                        f = 0
                        for ( i = 1; i <= ARGIND; i++ )
                        {
                                if ( ! ( ( i FS k ) in V ) )
                                        f = 1
                        }
                        if ( f == 0 )
                        {
                                for ( i = 1; i <= ARGIND; i++ )
                                        printf "%s ", V[i FS k]
                                printf "\n"
                        }
                }
        }
' HGWAS{1..5}/merged_info_CHR1.info

# 6  
Old 12-15-2017
Do you mind explaining what and how it does what?
# 7  
Old 12-15-2017
See also post#3 which I had to hide until you showed your work as requested by Neo.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

How to copy a column of multiple files and paste into new excel file (next to column)?

I have data of an excel files as given below, file1 org1_1 1 1 2.5 100 org1_2 1 2 5.5 98 org1_3 1 3 7.2 88 file2 org2_1 1 1 2.5 100 org2_2 1 2 5.5 56 org2_3 1 3 7.2 70 I have multiple excel files as above shown. I have to copy column 1, column 4 and paste into a new excel file as... (26 Replies)
Discussion started by: dineshkumarsrk
26 Replies

2. Shell Programming and Scripting

Join columns across multiple lines in a Text based on common column using BASH

Hello, I have a file with 2 columns ( tableName , ColumnName) delimited by a Pipe like below . File is sorted by ColumnName. Table1|Column1 Table2|Column1 Table5|Column1 Table3|Column2 Table2|Column2 Table4|Column3 Table2|Column3 Table2|Column4 Table5|Column4 Table2|Column5 From... (6 Replies)
Discussion started by: nv186000
6 Replies

3. UNIX for Dummies Questions & Answers

Merge selective columns from files based on common key

Hi, I am trying to selectively merge two files based on keys reported in the 1st column. File1: #file1-header1 file1-header2 111 qwe rtz uio 198 asd fgh jkl 165 yxc 789 poi uzt rew 89 lkj File2: #file2-header2 file2-header2 165 ghz nko2 ... (2 Replies)
Discussion started by: dovah
2 Replies

4. UNIX for Dummies Questions & Answers

How to join 2 .txt files based on a common column?

Hi all, I'm trying to join two .txt file tab delimitated based on a common column. File 1 transcript_id gene_id length effective_length expected_count TPM FPKM IsoPct comp1000201_c0_seq1 comp1000201_c0 337 183.51 0.00 0.00 0.00 0.00 comp1000297_c0_seq1 ... (1 Reply)
Discussion started by: alisrpp
1 Replies

5. Shell Programming and Scripting

common entries between files based on 1st column

Hi, I am trying to get the common entries from 2 files based on 1st field.. However when I try to do in perl I am getting blank output.. How can I do this in awk? open(BUFF1, "my_genes"); open(BUFF3, "rawcounts"); #open(WRBUFF,">result_rawcounts"); while($line =<BUFF1>) { ... (3 Replies)
Discussion started by: Diya123
3 Replies

6. UNIX for Dummies Questions & Answers

Writing a loop to merge multiple files by common column

I have 100 data files labelled 250.1.txt through 250.100.txt. The second column of the data files partially match (there is about %90 overlap). Each data file has 4 columns. I want the merge all these text files by the matching values in the second column. In the output, the first column should... (1 Reply)
Discussion started by: evelibertine
1 Replies

7. Shell Programming and Scripting

Join multiple files based on 1 common column

I have n files (for ex:64 files) with one similar column. Is it possible to combine them all based on that column ? file1 ax100 20 30 40 ax200 22 33 44 file2 ax100 10 20 40 ax200 12 13 44 file2 ax100 0 0 4 ax200 2 3 4 (9 Replies)
Discussion started by: quincyjones
9 Replies

8. Shell Programming and Scripting

Merging 2 files based on a common column

Hi All, I do have 2 files file 1 has 4 tab delimited columns 234 a c dfgyu 294 b g fih 302 c h jzh 328 z c san 597 f g son File 2 has 2 tab delimted columns 234 23 302 24 597 24 I want to merge file 2 with file 1 based on the data common in both files which is the first column so... (6 Replies)
Discussion started by: Lucky Ali
6 Replies

9. Shell Programming and Scripting

sum multiple columns based on column value

i have a file - it will be in sorted order on column 1 abc 0 1 abc 2 3 abc 3 5 def 1 7 def 0 1 -------- i'd like (awk maybe?) to get the results (any ideas)??? abc 5 9 def 1 8 (2 Replies)
Discussion started by: jjoe
2 Replies

10. Shell Programming and Scripting

How to convert 2 column data into multiple columns based on a keyword in a row??

Hi Friends I have the following input data in 2 columns. SNo 1 I1 Value I2 Value I3 Value SNo 2 I4 Value I5 Value I6 Value I7 Value SNo 3 I8 Value I9 Value ............... ................ SNo N (1 Reply)
Discussion started by: ks_reddy
1 Replies
Login or Register to Ask a Question