Join two files


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Join two files
# 1  
Old 05-22-2013
Ubuntu Join two files

Hi,

I have two files Files, FileA and FileB which are attached.Each row in the files have 8 tab delimited columns. The two files have to be compared and joined based on first two columns. The resulting file FileC should have:

1. if the data in the first two columns is same in both the files, only data from one of them should be present and print BOTH in the 9th column.

2. Unique rows from fileA and print FileA in the 9th column.

3. Unique rows from fileB and print FileB in the 9th column.

Please find the attached sample input and output files.

Any help using simple linux commands, awk to do this?
# 2  
Old 05-22-2013
I don't really know how to achieve this in one-shot script, I am novice to Unix.
I can't decipher your attached files properly, so I created my own input files and worked on them... these files contain two columns each...and I wrote below two scripts to create third file FileC which contains duplicate rows initially but they are taken care of using sort and uniq commands...

Code:
nawk 'BEGIN{f1="FileB"}NR==FNR{A[$1];next}($1 in A){print $0" "FILENAME" "f1}!($1 in A){print $0" "FILENAME}' FileB FileA > FileC
 nawk 'BEGIN{f1="FileA"}NR==FNR{A[$1];next}($1 in A){print $0" "f1" "FILENAME}!($1 in A){print $0" "FILENAME}' FileA FileB >> FileC

Then...
sort FileC | uniq

Code:
FileA:
XXX abc
XXY bcd
XYY def
YYY efg

FileB:
XXX abc
XYX jkl
XYY def

FileC:
XXX abc FileA FileB
XXY bcd FileA
XYX jkl FileB
XYY def FileA FileB
YYY efg FileA

I am not sure if this solves your problem...
# 3  
Old 05-22-2013
Join files based on column

Hi,

Thanks!! I found an alternate solution.

Code:
 
awk 'FNR==NR{a[$1]++;next}a[$1]' FileA FileB | awk -F '\t' 'BEGIN {OFS=FS} {print $0,"FIleA&FileB"}' > one.txt
 
awk 'FNR==NR{a[$1]++;next}!a[$1]' FileA FileB | awk -F '\t' 'BEGIN {OFS=FS} {print $0,"FileB"}' > two.txt
 
awk 'FNR==NR{a[$1]++;next}!a[$1]' FileB FileA | awk -F '\t' 'BEGIN {OFS=FS} {print $0,"FileA"}' > three.txt
 
cat one.txt two.txt three.txt > FileC.txt


Last edited by Scrutinizer; 05-22-2013 at 05:27 PM.. Reason: code tags
# 4  
Old 05-27-2013
Quote:
Originally Posted by mehar
Hi,

I have two files Files, FileA and FileB which are attached.Each row in the files have 8 tab delimited columns. The two files have to be compared and joined based on first two columns. The resulting file FileC should have:

1. if the data in the first two columns is same in both the files, only data from one of them should be present and print BOTH in the 9th column.

2. Unique rows from fileA and print FileA in the 9th column.

3. Unique rows from fileB and print FileB in the 9th column.

Please find the attached sample input and output files.

Any help using simple linux commands, awk to do this?
When I downloaded FileA.txt, FileB.txt, and FileC.txt, I found a lot of formatting information. I didn't find any tabs. And, after changing groups of spaces to tabs I found that FileA.txt and FileB.txt each had 10 fields; not 8. And, FileC.txt had "Both" instead of "BOTH" in the 11th field.

The following script seems to do what you requested (with input files modified to meet the input format you specified). Note, however that it uses the names of the actual input files rather than FileA and FileB. And this script makes no attempt to sort the input or output. But it only needs to call awk once and only reads and writes the data once:
Code:
awk '
BEGIN { FS = OFS = "\t"}
FNR == 1 {
        f[FNR == NR] = FILENAME
}
FNR == NR {
        f1[$1, $2] = $0
        next
}
{       if(($1, $2) in f1) {
                print $0, "BOTH"
                delete f1[$1, $2]
        } else  print $0, f[0]
}
END {   for(i in f1) print f1[i], f[1]}' FileB.txt FileA.txt

As always, if you want to use this on a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of awk.

If FileA.txt contains:
Code:
chr1    403111  .       G       A       42      .       AC1=1;AF1=0.5;DP=34;DP4=3,2,0,5;FQ=45;MQ=53;PV4=0.17,0.095,0.2,1;SNPEFF_EFFECT=INTERGENIC;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_IMPACT=MODIFIER;VDB=0.0298        GT:GQ:PL        0/1:75:72,0,118
chr1    412573  .       A       C       222     .       AC1=2;AF1=1;DP=71;DP4=0,0,57,14;FQ=-241;MQ=58;SNPEFF_EFFECT=INTERGENIC;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_IMPACT=MODIFIER;VDB=0.0399   GT:GQ:PL        1/1:99:255,214,0
chr1    602567  rs21953190      A       G       222     .       AC1=2;AF1=1;DB;DP=152;DP4=0,0,67,78;FQ=-282;MQ=59;SNPEFF_AMINO_ACID_CHANGE=D1034;SNPEFF_CODON_CHANGE=gaTgaC;SNPEFF_EFFECT=SYNONYMOUS_CODING;SNPEFF_EXON_ID=5;SNPEFF_FUNCTIONAL_CLASS=SILENT;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=ADNP2;SNPEFF_IMPACT=LOW;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000000008;VDB=0.0392      GT:GQ:PL        1/1:99:255,255,0
chr1    604894  rs21953191      A       G       222     .       AC1=2;AF1=1;DB;DP=119;DP4=0,0,69,44;FQ=-282;MQ=59;SNPEFF_AMINO_ACID_CHANGE=F259L;SNPEFF_CODON_CHANGE=TttCtt;SNPEFF_EFFECT=NON_SYNONYMOUS_CODING;SNPEFF_EXON_ID=5;SNPEFF_FUNCTIONAL_CLASS=MISSENSE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=ADNP2;SNPEFF_IMPACT=MODERATE;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000000008;VDB=0.0392   GT:GQ:PL        1/1:99:255,255,0
chr1    699873  rs21944866      G       A       222     .       AC1=2;AF1=1;DB;DP=59;DP4=0,0,39,18;FQ=-199;MQ=60;SNPEFF_AMINO_ACID_CHANGE=A124V;SNPEFF_CODON_CHANGE=gCg/gTg;SNPEFF_EFFECT=NON_SYNONYMOUS_CODING;SNPEFF_EXON_ID=5;SNPEFF_FUNCTIONAL_CLASS=MISSENSE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=RBFA;SNPEFF_IMPACT=MODERATE;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000037436;VDB=0.0380    GT:GQ:PL        1/1:99:255,172,0

and FileB.txt contains:
Code:
chr1    604894  rs21953191      A       G       4298.77 .       AC=2;AF=1.00;AN=2;DB;DP=119;Dels=0.00;FS=0.000;HaplotypeScore=2.5439;MLEAC=2;MLEAF=1.00;MQ=58.87;MQ0=0;QD=36.12;SNPEFF_AMINO_ACID_CHANGE=F259L;SNPEFF_CODON_CHANGE=TttCtt;SNPEFF_EFFECT=NON_SYNONYMOUS_CODING;SNPEFF_EXON_ID=5;SNPEFF_FUNCTIONAL_CLASS=MISSENSE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=ADNP2;SNPEFF_IMPACT=MODERATE;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000000008        GT:AD:DP:GQ:PL  1/1:2,116:116:99:4327,340,0
chr1    699873  rs21944866      G       A       2074.77 .       AC=2;AF=1.00;AN=2;DB;DP=59;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=59.22;MQ0=0;QD=35.17;SNPEFF_AMINO_ACID_CHANGE=A124V;SNPEFF_CODON_CHANGE=gCggTg;SNPEFF_EFFECT=NON_SYNONYMOUS_CODING;SNPEFF_EXON_ID=5;SNPEFF_FUNCTIONAL_CLASS=MISSENSE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=RBFA;SNPEFF_IMPACT=MODERATE;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000037436  GT:AD:DP:GQ:PL  1/1:0,59:59:99:2103,165,0
chr1    758630  .       T       TC      2221.73 .       AC=2;AF=1.00;AN=2;DP=61;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=51.14;MQ0=0;QD=36.42;RPA=1,2;RU=C;SNPEFF_EFFECT=INTRON;SNPEFF_EXON_ID=2;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=PQLC1;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000000011;STR        GT:AD:DP:GQ:PL  1/1:0,55:61:99:2259,165,0
chr1    767646  .       T       C       4812.77 .       AC=2;AF=1.00;AN=2;DP=130;Dels=0.00;FS=0.000;HaplotypeScore=0.7340;MLEAC=2;MLEAF=1.00;MQ=59.36;MQ0=0;QD=37.02;SNPEFF_EFFECT=INTRON;SNPEFF_EXON_ID=3;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=PQLC1;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000000011        GT:AD:DP:GQ:PL  1/1:0,130:130:99:4841,370,0     
chr1    800715  .       C       CT      794.73  .       AC=2;AF=1.00;AN=2;DP=22;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=61.46;MQ0=0;QD=36.12;RPA=1,2;RU=T;SNPEFF_EFFECT=SPLICE_SITE_ACCEPTOR;SNPEFF_EXON_ID=7;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=PQLC1;SNPEFF_IMPACT=HIGH;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000000011;STR      GT:AD:DP:GQ:PL  1/1:0,20:22:54:832,54,0

the result on OS X is:
Code:
chr1	403111	.	G	A	42	.	AC1=1;AF1=0.5;DP=34;DP4=3,2,0,5;FQ=45;MQ=53;PV4=0.17,0.095,0.2,1;SNPEFF_EFFECT=INTERGENIC;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_IMPACT=MODIFIER;VDB=0.0298	GT:GQ:PL	0/1:75:72,0,118	FileA.txt
chr1	412573	.	A	C	222	.	AC1=2;AF1=1;DP=71;DP4=0,0,57,14;FQ=-241;MQ=58;SNPEFF_EFFECT=INTERGENIC;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_IMPACT=MODIFIER;VDB=0.0399	GT:GQ:PL	1/1:99:255,214,0	FileA.txt
chr1	602567	rs21953190	A	G	222	.	AC1=2;AF1=1;DB;DP=152;DP4=0,0,67,78;FQ=-282;MQ=59;SNPEFF_AMINO_ACID_CHANGE=D1034;SNPEFF_CODON_CHANGE=gaTgaC;SNPEFF_EFFECT=SYNONYMOUS_CODING;SNPEFF_EXON_ID=5;SNPEFF_FUNCTIONAL_CLASS=SILENT;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=ADNP2;SNPEFF_IMPACT=LOW;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000000008;VDB=0.0392	GT:GQ:PL	1/1:99:255,255,0	FileA.txt
chr1	604894	rs21953191	A	G	222	.	AC1=2;AF1=1;DB;DP=119;DP4=0,0,69,44;FQ=-282;MQ=59;SNPEFF_AMINO_ACID_CHANGE=F259L;SNPEFF_CODON_CHANGE=TttCtt;SNPEFF_EFFECT=NON_SYNONYMOUS_CODING;SNPEFF_EXON_ID=5;SNPEFF_FUNCTIONAL_CLASS=MISSENSE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=ADNP2;SNPEFF_IMPACT=MODERATE;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000000008;VDB=0.0392	GT:GQ:P1/1:99:255,255,0	BOTH
chr1	699873	rs21944866	G	A	222	.	AC1=2;AF1=1;DB;DP=59;DP4=0,0,39,18;FQ=-199;MQ=60;SNPEFF_AMINO_ACID_CHANGE=A124V;SNPEFF_CODON_CHANGE=gCg/gTg;SNPEFF_EFFECT=NON_SYNONYMOUS_CODING;SNPEFF_EXON_ID=5;SNPEFF_FUNCTIONAL_CLASS=MISSENSE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=RBFA;SNPEFF_IMPACT=MODERATE;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000037436;VDB=0.0380	GT:GQ:P1/1:99:255,172,0	BOTH
chr1	758630	.	T	TC	2221.73	.	AC=2;AF=1.00;AN=2;DP=61;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=51.14;MQ0=0;QD=36.42;RPA=1,2;RU=C;SNPEFF_EFFECT=INTRON;SNPEFF_EXON_ID=2;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=PQLC1;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000000011;STR	GT:AD:DP:GQ:PL	1/1:0,55:61:99:2259,165,0	FileB.txt
chr1	767646	.	T	C	4812.77	.	AC=2;AF=1.00;AN=2;DP=130;Dels=0.00;FS=0.000;HaplotypeScore=0.7340;MLEAC=2;MLEAF=1.00;MQ=59.36;MQ0=0;QD=37.02;SNPEFF_EFFECT=INTRON;SNPEFF_EXON_ID=3;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=PQLC1;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000000011	GT:AD:DP:GQ:PL	1/1:0,130:130:99:4841,370,0	FileB.txt
chr1	800715	.	C	CT	794.73	.	AC=2;AF=1.00;AN=2;DP=22;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=61.46;MQ0=0;QD=36.12;RPA=1,2;RU=T;SNPEFF_EFFECT=SPLICE_SITE_ACCEPTOR;SNPEFF_EXON_ID=7;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=PQLC1;SNPEFF_IMPACT=HIGH;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000000011;STR	GT:AD:DP:GQ:PL	1/1:0,20:22:54:832,54,0	FileB.txt

The lines that appear in both files or that appear only in the 2nd file given to this script will appear in the order in which they appeared in the 2nd file. The lines that appear only in the 1st file will appear in an unspecified order at the end of the output.

Last edited by Don Cragun; 05-27-2013 at 09:23 PM.. Reason: Fix newlines lost in copy/paste.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Join, merge, fill NULL the void columns of multiples files like sql "LEFT JOIN" by using awk

Hello, This post is already here but want to do this with another way Merge multiples files with multiples duplicates keys by filling "NULL" the void columns for anothers joinning files file1.csv: 1|abc 1|def 2|ghi 2|jkl 3|mno 3|pqr file2.csv: 1|123|jojo 1|NULL|bibi... (2 Replies)
Discussion started by: yjacknewton
2 Replies

2. Shell Programming and Scripting

Join two files

I have 2 files: fileA AAA1:AAA2:AAA3:AAA_4:AAA5:AAA_6:AAA7:AAA_8 BBB1:BBB2:BBB3:BBB_4:BBB5:BBB-6 CCC1:CCC2:CCC3:CCC_4fileB AAA_4:XXX1:YYY1 BBB_4:XXX2:YYY2 CCC_4:XXX3:YYY3:ZZZ3 AAA_6:XXX4:YYY4 AAA_8:XXX5:YYY5Result: AAA1:AAA2:AAA3:AAA_4:XXX1:YYY1:AAA5:AAA_6:XXX4:YYY4:AAA7:AAA_8:XXX5:YYY5... (8 Replies)
Discussion started by: vikus
8 Replies

3. Shell Programming and Scripting

Join 2 files

I have file1.txt BGE179W1 BGE179W2 BGE179W3 BGE187W1 BGE187W2 BGE187W3 BGE194W1 BGE194W2 BGE194W3 BGE227W1 BGE227W2 BGE227W3 BGE288W1 BGE288W2 BGE288W3 BGE650W1 ---------- Post updated at 12:41 AM ---------- Previous update was at 12:39 AM ---------- (5 Replies)
Discussion started by: radius
5 Replies

4. UNIX for Dummies Questions & Answers

How to use the the join command to join multiple files by a common column

Hi, I have 20 tab delimited text files that have a common column (column 1). The files are named GSM1.txt through GSM20.txt. Each file has 3 columns (2 other columns in addition to the first common column). I want to write a script to join the files by the first common column so that in the... (5 Replies)
Discussion started by: evelibertine
5 Replies

5. UNIX for Dummies Questions & Answers

how to join two files using "Join" command with one common field in this problem?

file1: Toronto:12439755:1076359:July 1, 1867:6 Quebec City:7560592:1542056:July 1, 1867:5 Halifax:938134:55284:July 1, 1867:4 Fredericton:751400:72908:July 1, 1867:3 Winnipeg:1170300:647797:July 15, 1870:7 Victoria:4168123:944735:July 20, 1871:10 Charlottetown:137900:5660:July 1, 1873:2... (2 Replies)
Discussion started by: mindfreak
2 Replies

6. Shell Programming and Scripting

Join two files

i have two files and i want to join the contents like: file a has content my name is i am i work at and file b has John sims 43 years old maximu ltd and i want to join the two files to get a third file with content reading my name is John sims i am 43 years old i work at... (2 Replies)
Discussion started by: tomjones
2 Replies

7. UNIX for Dummies Questions & Answers

Join 2 files with multiple columns: awk/grep/join?

Hello, My apologies if this has been posted elsewhere, I have had a look at several threads but I am still confused how to use these functions. I have two files, each with 5 columns: File A: (tab-delimited) PDB CHAIN Start End Fragment 1avq A 171 176 awyfan 1avq A 172 177 wyfany 1c7k A 2 7... (3 Replies)
Discussion started by: InfoSeeker
3 Replies

8. Shell Programming and Scripting

join two files

Hi i have file f1 like: xx yy| 123 xx1 yy1| 1234 xx2 yy2|12345 f2 file like: xx yy| aaa xx1 yy1| bbb xx2 yy2|cccc i would like output is: xx yy| 123|aaa xx1 yy1| 1234|bbbb xx2 yy2|12345|cccc please help me on this........... (5 Replies)
Discussion started by: koti_rama
5 Replies

9. Shell Programming and Scripting

join files

Hi , I want to join 2 files based on 2 column join condition. a11 john 2230 5000 a12 XXX 2230 A B 200 345 Expected O/P John 2230 5000 A B 200 I have tried this awk 'NR==FNR{a=$1;next}a&&sub($1,a)' a11 a12 > a13 (3 Replies)
Discussion started by: mohan705
3 Replies

10. Shell Programming and Scripting

Join Files

Hi Gurus, I have to join two flat files based on two key field columns. I concatenated two key fields and i tried the join command. It is working fine. But, without using temporary files can't i use like this: join -t ':' `awk -F ":" '{ printf("%s%s:%s\n", $1,$2, $0) }' file1` `awk -F ":" '{... (2 Replies)
Discussion started by: choppas
2 Replies
Login or Register to Ask a Question