Merging multiple files from multiple columns


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Merging multiple files from multiple columns
# 1  
Old 10-16-2012
Merging multiple files from multiple columns

Hi guys,
I have very basic linux experience so I need some help with a problem.
I have 3 files from which I want to extract columns based on common fields between them.

File1:
Code:
--- rs74078040 NA 51288690 T G 461652 0.99223 0.53611 3
--- rs77209296 NA 51303525 T G 461843 0.98973 0.60837 3
--- rs7985170 NA 51304203 C T 461852 0.99215 0.541 3

File2:
Code:
id rsid chromosome pos allele_A allele_B index average_call info
--- rs74078040 NA 51288690 T G 461652 0.99852 0.52725 2
--- rs77209296 NA 51303525 T G 461843 0.99992 0.90562 2
--- rs7985170 NA 51304203 C T 461852 0.99994 0.93283 2
--- rs145780716 NA 19032912 G A 2054321 0.9988 0.026123 2

File3:
id rsid chromosome pos allele_A allele_B index average_call info
Code:
--- rs74078040 NA 51288690 T G 461652 0.956 0.8558 1
--- rs77209296 NA 51303525 T G 461843 0.9843 0.8989 1
--- rs7985170 NA 51304203 C T 461852 0.9323 0.6754 1
--- rs145780716 NA 19032912 G A 2054321 0.9098 0.7632 1

I want to take all the rows from File1 and say columns 1-7 (which are the same in all the files) and then add column 8 from File 2 and columns 9 and 10 from File3.
Note that File1 doesn't have a header but the other files do. Also, File1 has less lines than the other files which have the same number of lines.

The output would look like:
Code:
--- rs74078040 NA 51288690 T G 461652 0.99852 0.8558 1
--- rs77209296 NA 51303525 T G 461843 0.99992 0.8989 1
--- rs7985170 NA 51304203 C T 461852 0.99994 0.6754 1

Any help would be greatly appreciated.
Thanks!

Last edited by Franklin52; 10-17-2012 at 06:12 AM.. Reason: Please use code tags for data and code samples
# 2  
Old 10-16-2012
Hi

Assuming your files are a1, a2 and a3:

Code:
$ join -1 2 -2 2 -o  1.1,1.2,1.3,1.4,1.5,1.6,1.7,2.8 a1 a2  | join -1 2 -2 2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,2.9,2.10 - a3
--- rs74078040 NA 51288690 T G 461652 0.99852 0.8558 1
--- rs77209296 NA 51303525 T G 461843 0.99992 0.8989 1
--- rs7985170 NA 51304203 C T 461852 0.99994 0.6754 1

Guru.
# 3  
Old 10-17-2012
Hi guruprasadpr,
Many thanks for your reply. I have tried the command you suggest but it tells me: "join: File 2 is not in sorted order" which I think probably comes from the header line because column 4 is in sorted order.
Any suggestions how to get around that problem? I don't want to make a copy of the files without headers because they are quite big and our disk space is limited.
# 4  
Old 10-17-2012
If you don't mind using two awk..

try

Code:
awk 'FNR==NR{$NF="";$(NF-1)="";$(NF-2)="";a[$2]=$0;next}{if(a[$2]){print a[$2],$8}}' file1 file2 > temp

awk 'FNR==NR{a[$2]=$0;next}{if(a[$2]){print a[$2],$9,$10}}' temp file3

# 5  
Old 10-17-2012
Hi pamu, I wouldn't mind using more than one step even though in reality I have 5 files instead of 3.
Could you broadly explain the awk syntax please? My awk knowledge is almost zero.
Thanks for your reply!
# 6  
Old 10-17-2012
Quote:
Originally Posted by bartman2099
Could you broadly explain the awk syntax please? My awk knowledge is almost zero.
It's Better that you should read some basic of awk then Smilie

Read this first..

then read.. thisSmilie

Code:
awk 'FNR==NR{$NF="";$(NF-1)="";$(NF-2)="";a[$2]=$0;next}   # Read first file, make $NF,$(NF-1) and $(NF-2) NULL as we don't need those. Create an array a with $2 is index and which stores $0 ( whole line).                                                            

{if(a[$2]){print a[$2],$8}}' file1 file2 > temp   # Here we are reading second file. Check if $2 present in this file or not. If it is present then file1 line matches with file2 line then print line from file1 and append $8 as you have requested. And write this to a new temp file.

awk 'FNR==NR{a[$2]=$0;next}  # Read temp file create an array a, which stores $0(whole line)

{if(a[$2]){print a[$2],$9,$10}}' temp file3 # read second file. If there is match between files then print temp file line and append $9 and $10 from file3 as per your required output.

# 7  
Old 10-17-2012
One awk:

Code:
awk '
  f==2{
    F2[$2]=$8
  } 
  f==3{
    F3a[$2]=$9
    F3b[$2]=$10
  }
  f==1{
    $8=F2[$2]
    $9=F3a[$2]
    $10=F3b[$2]
    print
  }
' f=2 file2 f=3 file3 f=1 file1

This User Gave Thanks to Scrutinizer For This Post:
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Removing carriage returns from multiple lines in multiple files of different number of columns

Hello Gurus, I have a multiple pipe separated files which have records going over multiple Lines. End of line separator is \n and records going over multiple lines have <CR> as separator. below is example from one file. 1|ABC DEF|100|10 2|PQ RS T|200|20 3| UVWXYZ|300|30 4| GHIJKL|400|40... (7 Replies)
Discussion started by: dJHa
7 Replies

2. Shell Programming and Scripting

Merging multiple files into one

Hi guys, could you please help me with this? I have multiple files with this structure: file1 xxx1 1.0 xxx2 3.5 xxx3 2.4 xxx4 3.0 … xxx1890 5.7 file2 xxx1 8.0 xxx3 7.5 xxx4 5.5 …. (4 Replies)
Discussion started by: coppuca
4 Replies

3. Shell Programming and Scripting

Merging Multiple Columns between two files

Hello guys, I have 2 CSV files which goes like this: CSV1: Breaking.csv: UTF-8 "Name","Description","Occupation","Email" "Walter White","","Chemistry Teacher","w.w@bb.com" "Jessie Pinkman","","Junkie","j.p@bb.com" "Hank Schrader","","DEA Agent","h.s@bb.com" CSV2: Bad.csv... (7 Replies)
Discussion started by: jeffreybsu
7 Replies

4. Shell Programming and Scripting

Compare multiple files with multiple number of columns

Hi, input file1 abcd 123 198 xyz1:0909090-0909091 ghij 234 999 xyz2:987654:987655 kilo 7890 7990 xyz3:12345-12357 prem 9 112 xyz5:97-1134 input file2 abcd 123 198 xyz1:0909090-0909091 -9.122 0 abed 88 98 xyz1:98989-090808 -1.234 1.345 ghij 234 999 xyz2:987654:987655 -10.87090909 5... (5 Replies)
Discussion started by: jacobs.smith
5 Replies

5. UNIX for Dummies Questions & Answers

cutting multiple columns into multiple files

Hypothetically, suppose that file1 id v1 v2 v3 v4 v5 v6 v7..........v100 1 1 1 1 1 1 2 2 .....50 2 1 1 1 1 1 2 2 .....50 3 1 1 1 1 1 2 2 .....50 4 1 1 1 1 1 2 2 .....50 5 1 1 1 1 1 2 2 .....50 I want to write a loop such that I take the id# and the first 5 columns (v1-v5) into the... (3 Replies)
Discussion started by: johnkim0806
3 Replies

6. Shell Programming and Scripting

Merging columns from multiple files

Hello, I have a number of tab delimited data files consists of two columns. Like that: File1 800.000000 0.002744 799.000000 0.002517 798.000000 0.002836 797.000000 0.002553 FIle2 800.000000 0.000261 799.000000 0.000001 798.000000 0.000551 797.000000 0.000275 File3... (19 Replies)
Discussion started by: erden
19 Replies

7. Shell Programming and Scripting

need help with post:extract multiple columns from multiple files

hello, I will would be grateful if anyone can help me reply to my post extract multiple cloumns from multiple files; skip rows and include filenames; awk Please see this thread. Thanks manishabh (0 Replies)
Discussion started by: manishabh
0 Replies

8. Shell Programming and Scripting

Combine multiple columns from multiple files

Hi there, I was wondering if someone can help me with this. I am trying the combine multiple columns from multiple files into one file. Example file 1: c0t0d0 c0t2d0 # hostname vgname c0t0d1 c0t2d1 # hostname vgname c0t0d2 c0t2d2 # hostname vgname c0t1d0 c0t3d0 # hostname vgname1... (5 Replies)
Discussion started by: martva
5 Replies

9. UNIX for Advanced & Expert Users

Merging multiple .so files

Hi All, How to merge independent .so files into an executable. Thanks in Advance, Regards, Kusu (2 Replies)
Discussion started by: Kusu
2 Replies

10. Shell Programming and Scripting

Merging columns from multiple files in one file

Hi, I want to select columns from multiple files and combine them in one file. The files are simulation-data-files with 23 columns each and about 50 rows. I now use: cut -f 11 Sweep?wing-30?scale=0.?0?fan2?.txt | pr -3 | awk '{printf("\n%s\t%s\t%s",$1,$2,$3)}' > ../Data_Processed/output.txtI... (1 Reply)
Discussion started by: isgoed
1 Replies
Login or Register to Ask a Question