Paste columns based on common column: multiple files

12-14-2017

Hi all,

I have multiple files (5 in this case). Each file has 12 space-separated columns and 300-400K lines.
For every value in column 2 that is present in all the files, I want to print the complete rows for that value from each file side by side.

The desired output must therefore have 60 columns (5 files x 12 columns).

File 1
head HGWAS1/merged_info_CHR1.info

Code:

snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0
--- 1:10235:T:TA 10235 T TA 0.001 0.157 0.998 0 -1 -1 -1
--- rs145072688:10352:T:TA 10352 T TA 0.436 0.435 0.646 0 -1 -1 -1
--- 1:10505:A:T 10505 A T 0.000 0.095 1.000 0 -1 -1 -1
--- 1:10506:C:G 10506 C G 0.000 0.095 1.000 0 -1 -1 -1
--- 1:10511:G:A 10511 G A 0.000 0.001 1.000 0 -1 -1 -1
--- 1:10539:C:A 10539 C A 0.000 0.017 1.000 0 -1 -1 -1
--- 1:10542:C:T 10542 C T 0.000 0.211 1.000 0 -1 -1 -1
--- 1:10579:C:A 10579 C A 0.000 0.155 1.000 0 -1 -1 -1

File 2
head HGWAS2/merged_info_CHR1.info

Code:

snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0
--- 1:10177:A:AC 10177 A AC 0.414 0.473 0.670 0 -1 -1 -1
--- 1:10235:T:TA 10235 T TA 0.001 0.141 0.999 0 -1 -1 -1
--- rs145072688:10352:T:TA 10352 T TA 0.427 0.488 0.673 0 -1 -1 -1
--- 1:10505:A:T 10505 A T 0.000 0.045 1.000 0 -1 -1 -1
--- 1:10506:C:G 10506 C G 0.000 0.045 1.000 0 -1 -1 -1
--- 1:10511:G:A 10511 G A 0.000 0.020 1.000 0 -1 -1 -1
--- 1:10539:C:A 10539 C A 0.001 0.426 0.999 0 -1 -1 -1
--- 1:10579:C:A 10579 C A 0.000 0.003 1.000 0 -1 -1 -1

File 3
head HGWAS3/merged_info_CHR1.info

Code:

snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0
--- 1:10177:A:AC 10177 A AC 0.434 0.522 0.691 0 -1 -1 -1
--- 1:10235:T:TA 10235 T TA 0.000 0.122 0.999 0 -1 -1 -1
--- rs145072688:10352:T:TA 10352 T TA 0.421 0.526 0.693 0 -1 -1 -1
--- 1:10505:A:T 10505 A T 0.000 0.132 0.999 0 -1 -1 -1
--- 1:10506:C:G 10506 C G 0.000 0.132 0.999 0 -1 -1 -1
--- 1:10539:C:A 10539 C A 0.001 0.294 0.999 0 -1 -1 -1
--- 1:10542:C:T 10542 C T 0.000 0.001 1.000 0 -1 -1 -1
--- 1:10579:C:A 10579 C A 0.000 0.081 1.000 0 -1 -1 -1

File 4
head HGWAS4/merged_info_CHR1.info

Code:

snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0
--- 1:10177:A:AC 10177 A AC 0.418 0.539 0.700 0 -1 -1 -1
--- 1:10235:T:TA 10235 T TA 0.001 0.180 0.998 0 -1 -1 -1
--- rs145072688:10352:T:TA 10352 T TA 0.406 0.528 0.695 0 -1 -1 -1
--- 1:10505:A:T 10505 A T 0.000 0.063 1.000 0 -1 -1 -1
--- 1:10506:C:G 10506 C G 0.000 0.063 1.000 0 -1 -1 -1
--- 1:10511:G:A 10511 G A 0.000 0.015 1.000 0 -1 -1 -1
--- 1:10539:C:A 10539 C A 0.001 0.079 0.999 0 -1 -1 -1
--- 1:10542:C:T 10542 C T 0.000 0.022 1.000 0 -1 -1 -1
--- 1:10579:C:A 10579 C A 0.000 0.007 1.000 0 -1 -1 -1

File 5
head HGWAS6/merged_info_CHR1.info

Code:

snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0
--- 1:10177:A:AC 10177 A AC 0.406 0.512 0.695 0 -1 -1 -1
--- 1:10235:T:TA 10235 T TA 0.000 0.115 0.999 0 -1 -1 -1
--- rs145072688:10352:T:TA 10352 T TA 0.407 0.522 0.695 0 -1 -1 -1
--- 1:10505:A:T 10505 A T 0.000 0.029 1.000 0 -1 -1 -1
--- 1:10506:C:G 10506 C G 0.000 0.029 1.000 0 -1 -1 -1
--- 1:10511:G:A 10511 G A 0.000 0.759 1.000 0 -1 -1 -1
--- 1:10539:C:A 10539 C A 0.001 0.205 0.999 0 -1 -1 -1
--- 1:10542:C:T 10542 C T 0.000 0.012 1.000 0 -1 -1 -1

Desired output (the order of the columns from each file must be preserved):

Code:

snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0 snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0 snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0 snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0 snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0 
--- 1:10235:T:TA 10235 T TA 0.001 0.157 0.998 0 -1 -1 -1 --- 1:10235:T:TA 10235 T TA 0.001 0.141 0.999 0 -1 -1 -1 --- 1:10235:T:TA 10235 T TA 0.000 0.122 0.999 0 -1 -1 -1 --- 1:10235:T:TA 10235 T TA 0.001 0.180 0.998 0 -1 -1 -1 --- 1:10235:T:TA 10235 T TA 0.000 0.115 0.999 0 -1 -1 -1
--- rs145072688:10352:T:TA 10352 T TA 0.436 0.435 0.646 0 -1 -1 -1 --- rs145072688:10352:T:TA 10352 T TA 0.427 0.488 0.673 0 -1 -1 -1 --- rs145072688:10352:T:TA 10352 T TA 0.421 0.526 0.693 0 -1 -1 -1 --- rs145072688:10352:T:TA 10352 T TA 0.406 0.528 0.695 0 -1 -1 -1 --- rs145072688:10352:T:TA 10352 T TA 0.407 0.522 0.695 0 -1 -1 -1
--- 1:10505:A:T 10505 A T ... (similarly)
--- 1:10506:C:G 10506 C G ... (similarly)
--- 1:10539:C:A 10539 C A ... (similarly)

What I've done:
I used join, which leaves me with 24 columns, but column 2 of that result does hold the values present in all the files. Next, I grep each of those values in all the files, and that is where I get stuck: I fail to put the grepped output into columnar format. Apparently this is not a good way to approach the problem.

Moderator's Comments:

Please use CODE tags correctly as required by forum rules!

Last edited by RudiC; 12-15-2017 at 04:07 AM.. Reason: Changed QUOTE to CODE tags.

genome

12-14-2017

You should post your full script and show your work.

Thanks.

Neo

12-15-2017

Something along this line:

Code:

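awk '
BEGIN           {TF = ARGC                              # number of input files + 1 (ARGV[0] is awk itself)
                }

FNR == 1        {FC++                                   # count the files opened so far
                 if (FC < TF) ARGV[ARGC++] = FILENAME   # first pass: queue the file for a second read
                }

FC < TF         {CNT[$2]++                              # first pass: count the files containing each key
                 next                                   # (assumes a key occurs at most once per file)
                }

CNT[$2] == TF-1 {if (!LINE[$2]) SEQ[++SN] = $2          # second pass: remember the order of first appearance
                 LINE[$2] = LINE[$2] $0 " "             # append this row to the joined line for the key
                }

END             {for (s=1; s<=SN; s++) print LINE[SEQ[s]]
                }
'  HGWAS?/merged_info_CHR1.info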
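
To see what the two-pass script above does, here is a minimal sketch on three tiny made-up files. The file names (a.info, b.info, c.info), their contents and the variable names are invented for illustration; the sketch derives the number of files from ARGC and assumes a key occurs at most once per file.

```shell
#!/bin/sh
# Hypothetical demo data: three tiny files in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'snp_id rs_id position' '--- k1 10' '--- k2 20' '--- k3 30' > "$tmp/a.info"
printf '%s\n' 'snp_id rs_id position' '--- k0 5'  '--- k1 11' '--- k3 31' > "$tmp/b.info"
printf '%s\n' 'snp_id rs_id position' '--- k1 12' '--- k3 32' '--- k4 40' > "$tmp/c.info"

# Every file is appended to ARGV once more, so awk reads the whole list twice.
# Pass 1 counts how many files contain each column-2 key.
# Pass 2 glues together the rows whose key was counted in every file.
result=$(awk '
BEGIN    { nfiles = ARGC - 1 }
FNR == 1 { if (++seen <= nfiles) ARGV[ARGC++] = FILENAME }   # re-queue during pass 1
seen <= nfiles { cnt[$2]++; next }                           # pass 1: count only
cnt[$2] == nfiles {                                          # pass 2: key is in all files
    if ($2 in line) line[$2] = line[$2] " " $0
    else { order[++n] = $2; line[$2] = $0 }                  # keep first-seen order
}
END { for (i = 1; i <= n; i++) print line[order[i]] }
' "$tmp"/a.info "$tmp"/b.info "$tmp"/c.info)

printf '%s\n' "$result"
rm -rf "$tmp"
```

The header line is joined as well because rs_id appears in column 2 of every file; k2, k0 and k4 are dropped because they are missing from at least one file.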

RudiC

12-15-2017

Code:

for i in {1..22}
do
    #--iterate over chromosomes
    saveTemp=""
    files_info="$(find $input_dir -name "*_CHR$i.info"  | sort )"
    files_list=""
    
    #---split by new lines and make it array---
    SAVEIFS=$IFS
    IFS=$'\n'
    files_info=($files_info)
    IFS=$SAVEIFS 
    
    join -j 2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10,1.11,1.12,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8,2.9,2.10,2.11,2.12  ${files_info[0]}  ${files_info[1]}  > $output_dir/"tempCHR_"$i".info" 

    SAVEtemp=$output_dir/"tempCHR_"$i".info"
    printf "$i joined for first two files\n"

    for (( x=2;x<${#files_info[@]};x++ ))
    do
	join -j 2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10,1.11,1.12,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8,2.9,2.10,2.11,2.12 $SAVEtemp  ${files_info[$x]}  > $output_dir/"tempchr"$i"_"$x".info" 

	SAVEtemp=$output_dir/"tempchr"$i"_"$x".info"
    done
    mv $SAVEtemp $output_dir/"joined_CHR""$i"".info"
    SAVEtemp=$output_dir/"joined_CHR""$i"".info"
    printf "CHR $i is done for joining\n"
    
    for w in ` awk '{print $2}' $SAVEtemp | grep -v "rs_id" `
    do
	st="" #start null string to concatenate
	
	for (( x=0;x<${#files_info[@]};x++ ))
	do
	    #--loop through files to grep the string
	    
	    temp_st=$(grep -w $w ${files_info[$x]}) 
	    st=$st" "$temp_st

	done
	echo "$st" >> $output_dir/"cols_joined_CHR"$i".info" 
    done

    printf "Processed files for $i chromosome!\n"
done

I left it running last evening, and the script has still not finished working on chromosome 1.

Terrible.
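
The inner loop above starts one grep per key per input file, so every file is rescanned for each of several hundred thousand keys, which is very likely where the time goes. If you want to stay with join, one possible sketch is to prefix each row with its column-2 key, sort on that key (join expects sorted input), and chain plain joins on the first field. The file names and values below are made up for illustration; note that this sketch drops the header lines and returns the rows sorted by key rather than in the original order.

```shell
#!/bin/sh
# Hypothetical demo data (file names and values invented for illustration).
tmp=$(mktemp -d)
printf '%s\n' 'snp_id rs_id position' '--- k3 30' '--- k1 10' > "$tmp/p.info"
printf '%s\n' 'snp_id rs_id position' '--- k1 11' '--- k3 31' '--- k7 70' > "$tmp/q.info"
printf '%s\n' 'snp_id rs_id position' '--- k1 12' '--- k5 50' '--- k3 32' > "$tmp/r.info"

acc="$tmp/acc"
first=1
for f in "$tmp"/p.info "$tmp"/q.info "$tmp"/r.info
do
    # Prefix each data row with its column-2 key and sort on that key.
    awk 'FNR > 1 { print $2, $0 }' "$f" | LC_ALL=C sort -k1,1 > "$f.keyed"
    if [ "$first" -eq 1 ]
    then
        cp "$f.keyed" "$acc"          # the first file seeds the accumulator
        first=0
    else
        # Inner join on field 1: keeps only keys present in both inputs and
        # appends the complete row of the new file to the accumulated line.
        LC_ALL=C join "$acc" "$f.keyed" > "$acc.new" && mv "$acc.new" "$acc"
    fi
done

# Remove the helper key column again.
merged=$(cut -d' ' -f2- "$acc")
printf '%s\n' "$merged"
rm -rf "$tmp"
```

Because the whole row travels behind the helper key, no -o field list is needed and the column order of each file is kept as it is.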

genome

12-15-2017

Here is an approach in gawk:

Code:

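gawk '
        FNR > 1 {                               # skip the header line of every file
                R[$2]                           # remember every key seen in any file
                V[ARGIND FS $2] = $0            # store the row under (file index, key)
        }
        END {
                for ( k in R )                  # note: the traversal order of for-in is unspecified
                {
                        f = 0
                        for ( i = 1; i <= ARGIND; i++ )
                        {
                                if ( ! ( ( i FS k ) in V ) )
                                        f = 1   # key is missing from file i
                        }
                        if ( f == 0 )           # key is present in every file
                        {
                                for ( i = 1; i <= ARGIND; i++ )
                                        printf "%s ", V[i FS k]
                                printf "\n"
                        }
                }
        }
' HGWAS{1..5}/merged_info_CHR1.info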
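
The gawk version above skips the header lines and prints the keys in whatever order the for-in loop yields. If the row order of the first file and the header matter, a variant along the following lines might help. It is only a sketch: the file names, sample values and variable names are made up, a plain counter stands in for gawk's ARGIND so that it should also run in other awks, and it assumes a key occurs at most once per file.

```shell
#!/bin/sh
# Hypothetical demo data (file names and values invented for illustration).
tmp=$(mktemp -d)
printf '%s\n' 'snp_id rs_id position' '--- k2 20' '--- k1 10' > "$tmp/x.info"
printf '%s\n' 'snp_id rs_id position' '--- k1 11' '--- k2 21' '--- k9 90' > "$tmp/y.info"

joined=$(awk '
FNR == 1  { fidx++ }                            # file counter, stand-in for ARGIND
fidx == 1 { order[++n] = $2 }                   # remember the key order of the first file
          { row[fidx, $2] = $0 }                # store the row under (file index, key)
END {
    for (i = 1; i <= n; i++) {
        k = order[i]; out = ""; ok = 1
        for (f = 1; f <= fidx; f++) {
            if (!((f, k) in row)) { ok = 0; break }   # key missing from file f
            out = out (f > 1 ? " " : "") row[f, k]
        }
        if (ok) print out                       # only keys found in every file
    }
}
' "$tmp"/x.info "$tmp"/y.info)

printf '%s\n' "$joined"
rm -rf "$tmp"
```

Like the gawk version, this keeps every row in memory until END, which should be acceptable for five files of 300-400K lines but is worth keeping in mind for larger inputs.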

Yoda

12-15-2017

Would you mind explaining what it does and how it does it?

genome

12-15-2017

See also post #3, which I had to hide until you showed your work, as requested by Neo.

RudiC
