I have two files Files, FileA and FileB which are attached.Each row in the files have 8 tab delimited columns. The two files have to be compared and joined based on first two columns. The resulting file FileC should have:
1. if the data in the first two columns is same in both the files, only data from one of them should be present and print BOTH in the 9th column.
2. Unique rows from fileA and print FileA in the 9th column.
3. Unique rows from fileB and print FileB in the 9th column.
Please find the attached sample input and output files.
Any help using simple linux commands, awk to do this?
When I downloaded FileA.txt, FileB.txt, and FileC.txt, I found a lot of formatting information. I didn't find any tabs. And, after changing groups of spaces to tabs I found that FileA.txt and FileB.txt each had 10 fields; not 8. And, FileC.txt had "Both" instead of "BOTH" in the 11th field.
The following script seems to do what you requested (with input files modified to meet the input format you specified). Note, however that it uses the names of the actual input files rather than FileA and FileB. And this script makes no attempt to sort the input or output. But it only needs to call awk once and only reads and writes the data once:
Code:
awk '
BEGIN { FS = OFS = "\t"}
FNR == 1 {
f[FNR == NR] = FILENAME
}
FNR == NR {
f1[$1, $2] = $0
next
}
{ if(($1, $2) in f1) {
print $0, "BOTH"
delete f1[$1, $2]
} else print $0, f[0]
}
END { for(i in f1) print f1[i], f[1]}' FileB.txt FileA.txt
As always, if you want to use this on a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of awk.
If FileA.txt contains:
Code:
chr1 403111 . G A 42 . AC1=1;AF1=0.5;DP=34;DP4=3,2,0,5;FQ=45;MQ=53;PV4=0.17,0.095,0.2,1;SNPEFF_EFFECT=INTERGENIC;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_IMPACT=MODIFIER;VDB=0.0298 GT:GQ:PL 0/1:75:72,0,118
chr1 412573 . A C 222 . AC1=2;AF1=1;DP=71;DP4=0,0,57,14;FQ=-241;MQ=58;SNPEFF_EFFECT=INTERGENIC;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_IMPACT=MODIFIER;VDB=0.0399 GT:GQ:PL 1/1:99:255,214,0
chr1 602567 rs21953190 A G 222 . AC1=2;AF1=1;DB;DP=152;DP4=0,0,67,78;FQ=-282;MQ=59;SNPEFF_AMINO_ACID_CHANGE=D1034;SNPEFF_CODON_CHANGE=gaTgaC;SNPEFF_EFFECT=SYNONYMOUS_CODING;SNPEFF_EXON_ID=5;SNPEFF_FUNCTIONAL_CLASS=SILENT;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=ADNP2;SNPEFF_IMPACT=LOW;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000000008;VDB=0.0392 GT:GQ:PL 1/1:99:255,255,0
chr1 604894 rs21953191 A G 222 . AC1=2;AF1=1;DB;DP=119;DP4=0,0,69,44;FQ=-282;MQ=59;SNPEFF_AMINO_ACID_CHANGE=F259L;SNPEFF_CODON_CHANGE=TttCtt;SNPEFF_EFFECT=NON_SYNONYMOUS_CODING;SNPEFF_EXON_ID=5;SNPEFF_FUNCTIONAL_CLASS=MISSENSE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=ADNP2;SNPEFF_IMPACT=MODERATE;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000000008;VDB=0.0392 GT:GQ:PL 1/1:99:255,255,0
chr1 699873 rs21944866 G A 222 . AC1=2;AF1=1;DB;DP=59;DP4=0,0,39,18;FQ=-199;MQ=60;SNPEFF_AMINO_ACID_CHANGE=A124V;SNPEFF_CODON_CHANGE=gCg/gTg;SNPEFF_EFFECT=NON_SYNONYMOUS_CODING;SNPEFF_EXON_ID=5;SNPEFF_FUNCTIONAL_CLASS=MISSENSE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=RBFA;SNPEFF_IMPACT=MODERATE;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000037436;VDB=0.0380 GT:GQ:PL 1/1:99:255,172,0
and FileB.txt contains:
Code:
chr1 604894 rs21953191 A G 4298.77 . AC=2;AF=1.00;AN=2;DB;DP=119;Dels=0.00;FS=0.000;HaplotypeScore=2.5439;MLEAC=2;MLEAF=1.00;MQ=58.87;MQ0=0;QD=36.12;SNPEFF_AMINO_ACID_CHANGE=F259L;SNPEFF_CODON_CHANGE=TttCtt;SNPEFF_EFFECT=NON_SYNONYMOUS_CODING;SNPEFF_EXON_ID=5;SNPEFF_FUNCTIONAL_CLASS=MISSENSE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=ADNP2;SNPEFF_IMPACT=MODERATE;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000000008 GT:AD:DP:GQ:PL 1/1:2,116:116:99:4327,340,0
chr1 699873 rs21944866 G A 2074.77 . AC=2;AF=1.00;AN=2;DB;DP=59;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=59.22;MQ0=0;QD=35.17;SNPEFF_AMINO_ACID_CHANGE=A124V;SNPEFF_CODON_CHANGE=gCggTg;SNPEFF_EFFECT=NON_SYNONYMOUS_CODING;SNPEFF_EXON_ID=5;SNPEFF_FUNCTIONAL_CLASS=MISSENSE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=RBFA;SNPEFF_IMPACT=MODERATE;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000037436 GT:AD:DP:GQ:PL 1/1:0,59:59:99:2103,165,0
chr1 758630 . T TC 2221.73 . AC=2;AF=1.00;AN=2;DP=61;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=51.14;MQ0=0;QD=36.42;RPA=1,2;RU=C;SNPEFF_EFFECT=INTRON;SNPEFF_EXON_ID=2;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=PQLC1;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000000011;STR GT:AD:DP:GQ:PL 1/1:0,55:61:99:2259,165,0
chr1 767646 . T C 4812.77 . AC=2;AF=1.00;AN=2;DP=130;Dels=0.00;FS=0.000;HaplotypeScore=0.7340;MLEAC=2;MLEAF=1.00;MQ=59.36;MQ0=0;QD=37.02;SNPEFF_EFFECT=INTRON;SNPEFF_EXON_ID=3;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=PQLC1;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000000011 GT:AD:DP:GQ:PL 1/1:0,130:130:99:4841,370,0
chr1 800715 . C CT 794.73 . AC=2;AF=1.00;AN=2;DP=22;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=61.46;MQ0=0;QD=36.12;RPA=1,2;RU=T;SNPEFF_EFFECT=SPLICE_SITE_ACCEPTOR;SNPEFF_EXON_ID=7;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=PQLC1;SNPEFF_IMPACT=HIGH;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000000011;STR GT:AD:DP:GQ:PL 1/1:0,20:22:54:832,54,0
the result on OS X is:
Code:
chr1 403111 . G A 42 . AC1=1;AF1=0.5;DP=34;DP4=3,2,0,5;FQ=45;MQ=53;PV4=0.17,0.095,0.2,1;SNPEFF_EFFECT=INTERGENIC;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_IMPACT=MODIFIER;VDB=0.0298 GT:GQ:PL 0/1:75:72,0,118 FileA.txt
chr1 412573 . A C 222 . AC1=2;AF1=1;DP=71;DP4=0,0,57,14;FQ=-241;MQ=58;SNPEFF_EFFECT=INTERGENIC;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_IMPACT=MODIFIER;VDB=0.0399 GT:GQ:PL 1/1:99:255,214,0 FileA.txt
chr1 602567 rs21953190 A G 222 . AC1=2;AF1=1;DB;DP=152;DP4=0,0,67,78;FQ=-282;MQ=59;SNPEFF_AMINO_ACID_CHANGE=D1034;SNPEFF_CODON_CHANGE=gaTgaC;SNPEFF_EFFECT=SYNONYMOUS_CODING;SNPEFF_EXON_ID=5;SNPEFF_FUNCTIONAL_CLASS=SILENT;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=ADNP2;SNPEFF_IMPACT=LOW;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000000008;VDB=0.0392 GT:GQ:PL 1/1:99:255,255,0 FileA.txt
chr1 604894 rs21953191 A G 222 . AC1=2;AF1=1;DB;DP=119;DP4=0,0,69,44;FQ=-282;MQ=59;SNPEFF_AMINO_ACID_CHANGE=F259L;SNPEFF_CODON_CHANGE=TttCtt;SNPEFF_EFFECT=NON_SYNONYMOUS_CODING;SNPEFF_EXON_ID=5;SNPEFF_FUNCTIONAL_CLASS=MISSENSE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=ADNP2;SNPEFF_IMPACT=MODERATE;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000000008;VDB=0.0392 GT:GQ:P1/1:99:255,255,0 BOTH
chr1 699873 rs21944866 G A 222 . AC1=2;AF1=1;DB;DP=59;DP4=0,0,39,18;FQ=-199;MQ=60;SNPEFF_AMINO_ACID_CHANGE=A124V;SNPEFF_CODON_CHANGE=gCg/gTg;SNPEFF_EFFECT=NON_SYNONYMOUS_CODING;SNPEFF_EXON_ID=5;SNPEFF_FUNCTIONAL_CLASS=MISSENSE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=RBFA;SNPEFF_IMPACT=MODERATE;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000037436;VDB=0.0380 GT:GQ:P1/1:99:255,172,0 BOTH
chr1 758630 . T TC 2221.73 . AC=2;AF=1.00;AN=2;DP=61;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=51.14;MQ0=0;QD=36.42;RPA=1,2;RU=C;SNPEFF_EFFECT=INTRON;SNPEFF_EXON_ID=2;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=PQLC1;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000000011;STR GT:AD:DP:GQ:PL 1/1:0,55:61:99:2259,165,0 FileB.txt
chr1 767646 . T C 4812.77 . AC=2;AF=1.00;AN=2;DP=130;Dels=0.00;FS=0.000;HaplotypeScore=0.7340;MLEAC=2;MLEAF=1.00;MQ=59.36;MQ0=0;QD=37.02;SNPEFF_EFFECT=INTRON;SNPEFF_EXON_ID=3;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=PQLC1;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000000011 GT:AD:DP:GQ:PL 1/1:0,130:130:99:4841,370,0 FileB.txt
chr1 800715 . C CT 794.73 . AC=2;AF=1.00;AN=2;DP=22;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=61.46;MQ0=0;QD=36.12;RPA=1,2;RU=T;SNPEFF_EFFECT=SPLICE_SITE_ACCEPTOR;SNPEFF_EXON_ID=7;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=PQLC1;SNPEFF_IMPACT=HIGH;SNPEFF_TRANSCRIPT_ID=ENSCAFT00000000011;STR GT:AD:DP:GQ:PL 1/1:0,20:22:54:832,54,0 FileB.txt
The lines that appear in both files or that appear only in the 2nd file given to this script will appear in the order in which they appeared in the 2nd file. The lines that appear only in the 1st file will appear in an unspecified order at the end of the output.
Last edited by Don Cragun; 05-27-2013 at 09:23 PM..
Reason: Fix newlines lost in copy/paste.
Hi Gurus,
I have to join two flat files based on two key field columns. I concatenated two key fields and i tried the join command. It is working
fine. But, without using temporary files can't i use like this:
join -t ':' `awk -F ":" '{ printf("%s%s:%s\n", $1,$2, $0) }' file1` `awk -F ":" '{... (2 Replies)
Hi ,
I want to join 2 files based on 2 column join condition.
a11
john 2230 5000
a12
XXX 2230 A B 200 345
Expected O/P
John 2230 5000 A B 200
I have tried this
awk 'NR==FNR{a=$1;next}a&&sub($1,a)' a11 a12 > a13 (3 Replies)
Hi
i have file f1 like:
xx yy| 123
xx1 yy1| 1234
xx2 yy2|12345
f2 file like:
xx yy| aaa
xx1 yy1| bbb
xx2 yy2|cccc
i would like output is:
xx yy| 123|aaa
xx1 yy1| 1234|bbbb
xx2 yy2|12345|cccc
please help me on this........... (5 Replies)
Hello,
My apologies if this has been posted elsewhere, I have had a look at several threads but I am still confused how to use these functions. I have two files, each with 5 columns:
File A: (tab-delimited)
PDB CHAIN Start End Fragment
1avq A 171 176 awyfan
1avq A 172 177 wyfany
1c7k A 2 7... (3 Replies)
i have two files and i want to join the contents like:
file a has content
my name is
i am
i work at
and file b has
John sims
43 years old
maximu ltd
and i want to join the two files to get a third file with content reading
my name is John sims
i am 43 years old
i work at... (2 Replies)
Hi,
I have 20 tab delimited text files that have a common column (column 1). The files are named GSM1.txt through GSM20.txt. Each file has 3 columns (2 other columns in addition to the first common column).
I want to write a script to join the files by the first common column so that in the... (5 Replies)
I have file1.txt
BGE179W1
BGE179W2
BGE179W3
BGE187W1
BGE187W2
BGE187W3
BGE194W1
BGE194W2
BGE194W3
BGE227W1
BGE227W2
BGE227W3
BGE288W1
BGE288W2
BGE288W3
BGE650W1
---------- Post updated at 12:41 AM ---------- Previous update was at 12:39 AM ---------- (5 Replies)
Hello,
This post is already here but want to do this with another way
Merge multiples files with multiples duplicates keys by filling "NULL" the void columns for anothers joinning files
file1.csv:
1|abc
1|def
2|ghi
2|jkl
3|mno
3|pqr
file2.csv:
1|123|jojo
1|NULL|bibi... (2 Replies)