Using awk to output matches between two files to one file and mismatches to two others

08-26-2016

I am trying to output the matches between $1 of file1 and $3 of file2 to a new file called match.

I would also like to output the mismatches between those same two files and fields to two separate new files called missing from file1 and missing from file2. The input files are tab-delimited, but the output can be space-delimited. The awk below is hopefully a good start. Thank you.

file1

Code:

file2

Code:

1    1    1
2    2    2
3    3    
4    4    
5    5    5
6    6    6

desired output
match

Code:

1 2 6

missing from file1

Code:

missing from file2

Code:

3 4

awk tried

Code:

awk -F'\t' 'NR==FNR{a[$1]=$3;next}{if (a[$1])print a[$1],$0;else print "Not Found", $0;}' file1 file2

cmccabe

08-26-2016

I noticed file1 has just 2 columns, so this statement doesn't make any sense!

Code:

NR==FNR{a[$1]=$3;next}

Yoda

08-26-2016

That should be $2, but the output still goes to only one new file. Thank you.

cmccabe

08-26-2016

Here is an approach:

Code:

awk -F'\t' '
        NR == FNR {
                A[$1]
                next
        }
        {
                B[$3]
        }
        END {
                print "Match"
                for ( k in A )
                {
                        if ( k && k in B )
                                print k
                }

                print "Missing from file1"
                for ( k in B )
                {
                        if ( ! ( k in A ) )
                                print k
                }

                print "Missing from file2"
                for ( k in A )
                {
                        if ( ! ( k in B ) )
                                print k
                }
        }
' file1 file2
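
The approach above prints all three lists to standard output, while the original request was for three separate files. A minimal sketch of that variation follows, assuming the same key columns ($1 of file1, $3 of file2) and assuming empty keys are ignored. The sample data and the output file names match, missing_from_file1, and missing_from_file2 are hypothetical, since the thread's file1 contents were not preserved.

```shell
# Hypothetical sample data in a scratch directory (not the thread's real files).
dir=$(mktemp -d)
printf '1\tx\n2\tx\n3\tx\n4\tx\n6\tx\n' > "$dir/file1"
printf '1\t1\t1\n2\t2\t2\n3\t3\t\n5\t5\t5\n6\t6\t6\n' > "$dir/file2"
cd "$dir" || exit 1

awk -F'\t' '
    NR == FNR { if ($1 != "") A[$1]; next }   # non-empty keys from file1, column 1
    $3 != ""  { B[$3] }                       # non-empty keys from file2, column 3
    END {
        # Each file1 key goes to "match" or "missing_from_file2".
        for (k in A)
            print k > ((k in B) ? "match" : "missing_from_file2")
        # Each file2 key that file1 lacks goes to "missing_from_file1".
        for (k in B)
            if (!(k in A))
                print k > "missing_from_file1"
    }
' file1 file2

# for (k in ...) visits keys in an unspecified order, so sort before displaying.
for f in match missing_from_file1 missing_from_file2; do
    printf '%s: ' "$f"
    sort -n "$f" | paste -sd' ' -
done
```

Note that awk only creates an output file when something is printed to it, so a list with no entries leaves no file behind. With this sample data all three files are created.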

Yoda

08-26-2016

Hi cmccabe,
I don't see any indication in the match output that the empty 1st field in the 5th line of file1 matches the 3rd field of the 3rd and 4th lines of file2. Are empty fields supposed to be ignored? If not, how are empty fields supposed to be displayed in the <space>- or <tab>-separated output?

If a single value appears more than once in file1 or in file2, and the number of times it appears in one file differs from the number of times it appears in the other, should there be just a single entry in the match output, or should there be one entry for each matched pair plus entries in one of the other files for the unpaired occurrences?
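
To make the second alternative concrete, here is a rough sketch that counts occurrences per key and reports matched pairs and unpaired leftovers. It is only an illustration of the question, not a statement of what the poster wants. The files f1 and f2, their contents, and the report labels are all hypothetical.

```shell
# Hypothetical data: "a" occurs twice in f1 but once in f2; "b" and "c" are unpaired.
dir=$(mktemp -d)
printf 'a\na\nb\n' > "$dir/f1"
printf 'x\ty\ta\nx\ty\tc\n' > "$dir/f2"

result=$(awk -F'\t' '
    NR == FNR { n1[$1]++; next }    # occurrences of each key in f1, column 1
              { n2[$3]++ }          # occurrences of each key in f2, column 3
    END {
        for (k in n1) seen[k]
        for (k in n2) seen[k]
        for (k in seen) {
            c1 = (k in n1) ? n1[k] : 0
            c2 = (k in n2) ? n2[k] : 0
            pairs = (c1 < c2) ? c1 : c2    # matched pairs = the smaller count
            print k, "pairs=" pairs, "extra_file1=" (c1 - pairs), "extra_file2=" (c2 - pairs)
        }
    }
' "$dir/f1" "$dir/f2" | LC_ALL=C sort)

printf '%s\n' "$result"
```

Under the first alternative (a single entry per distinct value), the counts are not needed and plain membership tests on the arrays are enough.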

Don Cragun

08-27-2016

@Yoda and @Don Cragun: I modified the awk because in my real data the same entry can appear in one of the files with a different pairing. Also, the nulls can remain blank.

For example, in the two files below, 48719928 AT - is in both files, but 48719928 A G is missing from file1. So the awk uses the combination of fields as the array key and then looks for that.

I am using $19, $21, and $22 of file1 to search $3, $5, and $6 of file2. The header row is skipped, and the awk then writes a new file listing which lines match and, for those that do not, which file the entry is missing from. The awk runs, but the output it produces seems to be incorrect and I am not able to fix it. Thank you.

file1

Code:

Index    Chromosomal Position    Gene    Inheritance    mRNA    Chromosome    Coverage    Score    A(#F,#R)    C(#F,#R)    G(#F,#R)    T(#F,#R)    Ins(#F,#R)    Del(#F,#R)    SNP    Mutation    Frequency    Chr    Start    End    Ref    Alt    Func.refGene    Gene.refGene    GeneDetail.refGene    ExonicFunc.refGene    AAChange.refGene    PopFreqMax    1000G2012APR_ALL    1000G2012APR_AFR    1000G2012APR_AMR    1000G2012APR_ASN    1000G2012APR_EUR    ESP6500si_ALL    ESP6500si_AA    ESP6500si_EA    CG46    common    clinvar    clinvarsubmit    clinvarreference    Homopolymer    Splice    Pseudogene    Classification    HGMD    Disease    Sanger    References
98    48719928    FBN1    AD    NM_000138.4    15    6786    30.3    1184;2152    0;0    25;23    0;1    0;5    1195;2206        c.7039_7040delAT    50.12    15    48719928    48719929    AT    -    exonic    FBN1        frameshift deletion    FBN1:NM_000138.4:exon58:c.7039_7040del:p.M2347fs                                                                        pathogenic    CD020234    Marfan syndrome        1. Korkko (2002) J Med Genet 39: 34 PubMed: 11826022
101    48807637    FBN1    AD    NM_000138.4    15    3792    27.7    0;0    0;4    0;0    1227;2561    0;7    0;0    rs4775765    c.[1415G>A]+[1415G>A]    99.89    15    48807637    48807637    C    T    exonic    FBN1        nonsynonymous SNV    FBN1:NM_000138.4:exon12:c.G1415A:p.C472Y    1    1    1    1    1    1    .    .    .    1                                likely benign    n

file2

Code:

R_Index    Chr    Start    End    Ref    Alt    Func.IDP.refGene    Gene.IDP.refGene    GeneDetail.IDP.refGene    Inheritence    ExonicFunc.IDP.refGene    AAChange.IDP.refGene    avsnp147    PopFreqMax    1000G_ALL    1000G_AFR    1000G_AMR    1000G_EAS    1000G_EUR    1000G_SAS    ExAC_ALL    ExAC_AFR    ExAC_AMR    ExAC_EAS    ExAC_FIN    ExAC_NFE    ExAC_OTH    ExAC_SAS    ESP6500siv2_ALL    ESP6500siv2_AA    ESP6500siv2_EA    CG46    dpsi_max_tissue    dpsi_zscore    SIFT_score    SIFT_pred    Polyphen2_HDIV_score    Polyphen2_HDIV_pred    Polyphen2_HVAR_score    Polyphen2_HVAR_pred    LRT_score    LRT_pred    MutationTaster_score    MutationTaster_pred    MutationAssessor_score    MutationAssessor_pred    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID    Quality    Reads    Zygosity    Phred    Classification    HGMD    Sanger
36    chr15    48719928    48719929    AT    -    exonic    FBN1    0    0    frameshift deletion    FBN1:NM_000138.4:exon58:c.7039_7040del:p.M2347fs    rs794728319    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    Pathogenic|Pathogenic    Thoracic_aortic_aneurysm_and_aortic_dissection|Marfan_syndrome    RCV000181674.1|RCV000208062.1    MedGen:Orphanet|MedGen:OMIM:Orphanet:SNOMED_CT    CN118826:ORPHA91387|C0024796:154700:ORPHA558:19346006    0    0    0    0    0    0    0
37    chr15    48719928    48719928    A    G    exonic    FBN1    0    0    nonsynonymous SNV    FBN1:NM_000138.4:exon58:c.7040T>C:p.M2347T    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    -0.2791    -0.822    0.56    T    0.369    B    0.222    B    0    D    1    D    1.43    L    0    0    0    0    0    0    0    0    0    0    0    0
38    chr15    48807637    48807637    C    T    exonic    FBN1    0    0    nonsynonymous SNV    FBN1:NM_000138.4:exon12:c.1415G>A:p.C472Y    rs4775765    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    0    0    0    1    2.0758    1.99    1    T    0    B    0    B    0    N    0    P    -4.395    N    0    0    0    0    0    GOOD    308    hom    87    0    0    0

current output

Code:

Match:
48807637 C T
Missing in file1:
48719928 A G
48719928 AT -
  
Missing in file2:
48719929 - exonic

desired output

Code:

Match 48719928 AT -, 48807637 C T 
Missing from file1 48719928 A G 
Missing from file2

awk

Code:

awk 'FNR==1 { next }
      FNR == NR { file1[$19,$21,$22] = $19 " " $21 " " $22 }
      FNR != NR { file2[$3,$5,$6] = $3 " " $5 " " $6 }
      END { print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k]
            print "Missing in file1:"; for (k in file2) if (!(k in file1)) print file2[k]
            print "Missing in file2:"; for (k in file1) if (!(k in file2)) print file1[k]
      }' file1 file2 > list

cmccabe

08-27-2016

As usual: garbage in, garbage out. In file1, field 15 (which holds rs4775765 in line 3) is missing from line 2. With a dummy value in field 15 of line 2, the awk given produces:

Code:

Match:
48719928 AT -
48807637 C T
Missing in file1:
48719928 A G
Missing in file2:

Why don't you exercise a bit more care when describing your problems and supplying sample data?
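
The awk in the previous post is invoked without -F'\t', so it uses awk's default field splitting, which treats any run of blanks and tabs as a single separator. If the data really are tab-delimited, as the first post states, an empty field then disappears and every later field shifts left. The small sketch below shows that effect on a hypothetical three-column row. It is an illustration of standard awk behavior, not a test against the real files.

```shell
# Hypothetical three-column, tab-separated row whose middle field is empty.
row=$(printf 'a\t\tc')

# Default FS: a run of blanks/tabs is one separator, so the empty field vanishes.
default_nf=$(printf '%s\n' "$row" | awk '{ print NF }')
default_f2=$(printf '%s\n' "$row" | awk '{ print $2 }')

# FS set to a single tab: every tab separates, so the empty field is kept.
tab_nf=$(printf '%s\n' "$row" | awk -F'\t' '{ print NF }')
tab_f3=$(printf '%s\n' "$row" | awk -F'\t' '{ print $3 }')

echo "default FS: NF=$default_nf, \$2=$default_f2"
echo "tab FS:     NF=$tab_nf, \$3=$tab_f3"
```

Under that assumption, adding -F'\t' to the awk given may be an alternative to inserting a dummy value: it would keep $19, $21, and $22 aligned when the SNP column is empty, and it would also keep values that contain a space, such as frameshift deletion, in a single field.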

RudiC
