Using awk to output matches between two files to one file and mismatches to two others


 
# 1  
Old 08-26-2016

I am trying to output the values in $1 of file1 that match $3 of file2 to a new file, match.

I also want to output the mismatches between those same two fields to two separate new files, missing from file1 and missing from file2. The input files are tab-delimited, but the output can be space-delimited. The awk below is hopefully a good start. Thank you :).

file1
Code:
1    1
2    2
3    3
4    4
     5
6    6

file2
Code:
1    1    1
2    2    2
3    3    
4    4    
5    5    5
6    6    6

desired output
match
Code:
1 2 6

missing from file1
Code:
5

missing from file2
Code:
3 4

awk tried
Code:
awk -F'\t' 'NR==FNR{a[$1]=$3;next}{if (a[$1])print a[$1],$0;else print "Not Found", $0;}' file1 file2


Last edited by cmccabe; 08-26-2016 at 05:45 PM.. Reason: fixed format
# 2  
Old 08-26-2016
I noticed file1 has just 2 columns, so this statement doesn't make any sense!
Code:
NR==FNR{a[$1]=$3;next}

# 3  
Old 08-26-2016
That should be $2, but the output is still only one new file. Thank you :).
# 4  
Old 08-26-2016
Here is an approach:-
Code:
awk -F'\t' '
        NR == FNR {
                A[$1]
                next
        }
        {
                B[$3]
        }
        END {
                print "Match"
                for ( k in A )
                {
                        if ( k && k in B )
                                print k
                }

                print "Missing from file1"
                for ( k in B )
                {
                        if ( ! ( k in A ) )
                                print k
                }

                print "Missing from file2"
                for ( k in A )
                {
                        if ( ! ( k in B ) )
                                print k
                }
        }
' file1 file2
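
The script above prints all three lists to standard output. Since the original request was for three separate files, a minimal sketch of that variation follows. It is not part of the original post: the output names match, missing_from_file1 and missing_from_file2, the out variable, and the scratch directory are assumptions for illustration, and empty keys are skipped on the assumption that blank fields should be ignored.

```shell
# Recreate the sample data from post #1 in a scratch directory (tab-delimited).
dir=$(mktemp -d)
printf '1\t1\n2\t2\n3\t3\n4\t4\n\t5\n6\t6\n' > "$dir/file1"
printf '1\t1\t1\n2\t2\t2\n3\t3\t\n4\t4\t\n5\t5\t5\n6\t6\t6\n' > "$dir/file2"

awk -F'\t' -v out="$dir" '
    NR == FNR { if ($1 != "") A[$1]; next }   # non-empty keys from $1 of file1
    $3 != ""  { B[$3] }                       # non-empty keys from $3 of file2
    END {
        # Each file1 key goes to exactly one of two files.
        for (k in A) print k > (out "/" ((k in B) ? "match" : "missing_from_file2"))
        # file2 keys that never appeared in file1.
        for (k in B) if (!(k in A)) print k > (out "/missing_from_file1")
    }
' "$dir/file1" "$dir/file2"

# Show the match list on one line, space-delimited.
sort -n "$dir/match" | paste -sd' ' -
```

Keys are written one per line in unspecified order, hence the sort. A file is only created if at least one key lands in it.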

This User Gave Thanks to Yoda For This Post:
# 5  
Old 08-26-2016
Hi cmccabe,
I don't see any indication in the match output saying that the empty 1st field in the 5th line of file1 matches the empty 3rd field of the 3rd and 4th lines of file2... Are empty fields supposed to be ignored? If not, how are empty fields supposed to be displayed in the <space> or <tab> separated output?

If a single value appears more than once in file1 or in file2, and the number of times that value appears in one file is not the same as the number of times it appears in the other file, should there be just a single entry in the match output, or should there be one entry for each matched pair plus entries in one of the other files for the unpaired occurrences?
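
To make the second question concrete, here is a rough sketch of the per-occurrence interpretation, where occurrences pair off one-to-one and the leftovers are reported per file. The data, the single-column layout, and the names c1, c2 and seen are all hypothetical; this only illustrates one possible answer, not what the original poster asked for.

```shell
dir=$(mktemp -d)
printf 'x\nx\nx\ny\n' > "$dir/file1"   # hypothetical: x three times, y once
printf 'x\ny\ny\n'    > "$dir/file2"   # hypothetical: x once, y twice

result=$(awk '
    NR == FNR { c1[$1]++; next }       # occurrences per key in file1
              { c2[$1]++ }             # occurrences per key in file2
    END {
        for (k in c1) seen[k]
        for (k in c2) seen[k]
        for (k in seen) {
            n1 = c1[k] + 0; n2 = c2[k] + 0
            m = (n1 < n2) ? n1 : n2    # number of one-to-one pairs
            printf "%s matched=%d only_file1=%d only_file2=%d\n", k, m, n1 - m, n2 - m
        }
    }
' "$dir/file1" "$dir/file2" | LC_ALL=C sort)
printf '%s\n' "$result"
```

Under the other interpretation (a single entry per distinct value), the counts are unnecessary and a plain membership test on the keys is enough.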
This User Gave Thanks to Don Cragun For This Post:
# 6  
Old 08-27-2016
@Yoda and @Don Cragun: I modified the awk because in my real data the same position can appear in one of the files with a different Ref/Alt pairing. Also, the nulls can remain blank.

For example, in the two files below 48719928 AT - is in both files, but 48719928 A G is missing from file1. So the awk uses the combination of the three fields as the array key and then looks for that.

I am using $19, $21 and $22 of file1 to search $3, $5 and $6 of file2. The header row is skipped, and the awk then writes a new file listing which entries match and, for those that do not, which file the entry is missing from. The awk does run, but the output seems to be incorrect and I am not able to fix it. Thank you :).

file1
Code:
Index    Chromosomal Position    Gene    Inheritance    mRNA    Chromosome    Coverage    Score    A(#F,#R)    C(#F,#R)    G(#F,#R)    T(#F,#R)    Ins(#F,#R)    Del(#F,#R)    SNP    Mutation    Frequency    Chr    Start    End    Ref    Alt    Func.refGene    Gene.refGene    GeneDetail.refGene    ExonicFunc.refGene    AAChange.refGene    PopFreqMax    1000G2012APR_ALL    1000G2012APR_AFR    1000G2012APR_AMR    1000G2012APR_ASN    1000G2012APR_EUR    ESP6500si_ALL    ESP6500si_AA    ESP6500si_EA    CG46    common    clinvar    clinvarsubmit    clinvarreference    Homopolymer    Splice    Pseudogene    Classification    HGMD    Disease    Sanger    References
98    48719928    FBN1    AD    NM_000138.4    15    6786    30.3    1184;2152    0;0    25;23    0;1    0;5    1195;2206        c.7039_7040delAT    50.12    15    48719928    48719929    AT    -    exonic    FBN1        frameshift deletion    FBN1:NM_000138.4:exon58:c.7039_7040del:p.M2347fs                                                                        pathogenic    CD020234    Marfan syndrome        1. Korkko (2002) J Med Genet 39: 34 PubMed: 11826022
101    48807637    FBN1    AD    NM_000138.4    15    3792    27.7    0;0    0;4    0;0    1227;2561    0;7    0;0    rs4775765    c.[1415G>A]+[1415G>A]    99.89    15    48807637    48807637    C    T    exonic    FBN1        nonsynonymous SNV    FBN1:NM_000138.4:exon12:c.G1415A:p.C472Y    1    1    1    1    1    1    .    .    .    1                                likely benign    n

file2
Code:
R_Index    Chr    Start    End    Ref    Alt    Func.IDP.refGene    Gene.IDP.refGene    GeneDetail.IDP.refGene    Inheritence    ExonicFunc.IDP.refGene    AAChange.IDP.refGene    avsnp147    PopFreqMax    1000G_ALL    1000G_AFR    1000G_AMR    1000G_EAS    1000G_EUR    1000G_SAS    ExAC_ALL    ExAC_AFR    ExAC_AMR    ExAC_EAS    ExAC_FIN    ExAC_NFE    ExAC_OTH    ExAC_SAS    ESP6500siv2_ALL    ESP6500siv2_AA    ESP6500siv2_EA    CG46    dpsi_max_tissue    dpsi_zscore    SIFT_score    SIFT_pred    Polyphen2_HDIV_score    Polyphen2_HDIV_pred    Polyphen2_HVAR_score    Polyphen2_HVAR_pred    LRT_score    LRT_pred    MutationTaster_score    MutationTaster_pred    MutationAssessor_score    MutationAssessor_pred    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID    Quality    Reads    Zygosity    Phred    Classification    HGMD    Sanger
36    chr15    48719928    48719929    AT    -    exonic    FBN1    0    0    frameshift deletion    FBN1:NM_000138.4:exon58:c.7039_7040del:p.M2347fs    rs794728319    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    Pathogenic|Pathogenic    Thoracic_aortic_aneurysm_and_aortic_dissection|Marfan_syndrome    RCV000181674.1|RCV000208062.1    MedGen:Orphanet|MedGen:OMIM:Orphanet:SNOMED_CT    CN118826:ORPHA91387|C0024796:154700:ORPHA558:19346006    0    0    0    0    0    0    0
37    chr15    48719928    48719928    A    G    exonic    FBN1    0    0    nonsynonymous SNV    FBN1:NM_000138.4:exon58:c.7040T>C:p.M2347T    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    -0.2791    -0.822    0.56    T    0.369    B    0.222    B    0    D    1    D    1.43    L    0    0    0    0    0    0    0    0    0    0    0    0
38    chr15    48807637    48807637    C    T    exonic    FBN1    0    0    nonsynonymous SNV    FBN1:NM_000138.4:exon12:c.1415G>A:p.C472Y    rs4775765    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    0    0    0    1    2.0758    1.99    1    T    0    B    0    B    0    N    0    P    -4.395    N    0    0    0    0    0    GOOD    308    hom    87    0    0    0

current output
Code:
Match:
48807637 C T
Missing in file1:
48719928 A G
48719928 AT -
  
Missing in file2:
48719929 - exonic

desired output
Code:
Match 48719928 AT -, 48807637 C T 
Missing from file1 48719928 A G 
Missing from file2

awk
Code:
awk 'FNR==1 { next }
      FNR == NR { file1[$19,$21,$22] = $19 " " $21 " " $22 }
      FNR != NR { file2[$3,$5,$6] = $3 " " $5 " " $6 }
      END { print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k]
            print "Missing in file1:"; for (k in file2) if (!(k in file1)) print file2[k]
            print "Missing in file2:"; for (k in file1) if (!(k in file2)) print file1[k]
      }' file1 file2 > list


Last edited by cmccabe; 08-27-2016 at 02:57 PM.. Reason: fixed format
# 7  
Old 08-27-2016
As usual: garbage in, garbage out. In file1, field 15 (which holds rs4775765 in line 3) is missing from line 2. With a dummy value in field 15 of line 2, the result of the awk given is:
Code:
Match:
48719928 AT -
48807637 C T
Missing in file1:
48719928 A G
Missing in file2:

Why don't you exercise a bit more care when describing your problems and supplying sample data?
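
An alternative to padding the empty field, assuming the real files are tab-delimited as stated in post #1, would be to set the field separator explicitly. With awk's default separator, runs of whitespace collapse and an empty column shifts every later field to the left; with -F'\t' an empty column is kept as an empty field. The sketch below uses a hypothetical cut-down layout (4 columns in file1, 3 in file2) rather than the full data, so the field numbers differ from the $19/$21/$22 and $3/$5/$6 used in the thread, and the per-line labels are an assumption chosen so the output can be sorted.

```shell
dir=$(mktemp -d)
# Hypothetical cut-down file1: SNP, Start, Ref, Alt (SNP is empty in the first data row).
printf 'SNP\tStart\tRef\tAlt\n\t48719928\tAT\t-\nrs4775765\t48807637\tC\tT\n' > "$dir/file1"
# Hypothetical cut-down file2: Start, Ref, Alt.
printf 'Start\tRef\tAlt\n48719928\tAT\t-\n48719928\tA\tG\n48807637\tC\tT\n' > "$dir/file2"

result=$(awk -F'\t' '
    FNR == 1  { next }                                    # skip each header row
    FNR == NR { f1[$2,$3,$4] = $2 " " $3 " " $4; next }   # file1 key: Start, Ref, Alt
              { f2[$1,$2,$3] = $1 " " $2 " " $3 }         # file2 key: Start, Ref, Alt
    END {
        for (k in f1) if (k in f2) print "Match: " f1[k]
        for (k in f2) if (!(k in f1)) print "Missing from file1: " f2[k]
        for (k in f1) if (!(k in f2)) print "Missing from file2: " f1[k]
    }
' "$dir/file1" "$dir/file2" | LC_ALL=C sort)
printf '%s\n' "$result"
```

On this cut-down data the two shared entries are reported as matches and 48719928 A G as missing from file1, with nothing missing from file2. Dropping -F'\t' makes the first data row of file1 lose its empty leading field, which reproduces the kind of shifted key seen in the "current output" of post #6.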
This User Gave Thanks to RudiC For This Post: