awk to output match and mismatch with count using specific fields


 
# 1  
Old 12-09-2016

In the awk below I am trying to output to one file the lines whose $2,$3,$4 match between file1 and file2, with the count in (). I am also trying to output the lines whose $2,$3,$4 are present in one file but missing from the other, again with the count of each in (). Both input files are tab-delimited, but the output is not. I am not sure where to put the counter to get the desired output. Thank you.

file1
Code:
1    955597    G    G
1    9773306    T    C
1    981931    A    G
1    982994    T    C
1    984302    T    C

file2
Code:
1    955597    G    G
1    9773306    T    C
1    981939    A    G
1    982978    T    C
1    984302    T    C

desired output
Code:
Match: (3)
1    955597    G    G
1    9773306    T    C
1    984302    T    C
Missing from file1: (2)
1    981939    A    G
1    982978    T    C
Missing from file2: (2)
1    981931    A    G
1    982994    T    C

awk
Code:
awk -F'\t' 'FNR==1 { next }   # skips line 1 of each file; drop this rule if the files have no header, as in the samples above
        FNR == NR { file1[$2,$3,$4] = $2 " " $3 " " $4 }
        FNR != NR { file2[$2,$3,$4] = $2 " " $3 " " $4 }
        END { print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k]
              print "Missing in file1:"; for (k in file2) if (!(k in file1)) print file2[k]
              print "Missing in file2:"; for (k in file1) if (!(k in file2)) print file1[k]
}' file1 file2  > NA12878_match

# 2  
Old 12-09-2016
Hello cmccabe,

Could you please try the following and let me know if it helps.
Code:
awk 'FNR==NR{A[$2 FS $3 FS $4]=$0;next} {Q=$2 FS $3 FS $4} !(Q in A){;NON_MATCH2=NON_MATCH2?NON_MATCH2 ORS $0:$0} (Q in A){MATCH=MATCH?MATCH ORS A[Q]:A[Q];delete A[Q];} END{for(i in A){;NON_MATCH1=NON_MATCH1?NON_MATCH1 ORS A[i]:A[i]};print "Match:" ORS MATCH ORS "Missing from file1:" ORS NON_MATCH2 ORS "Missing from file2:" ORS NON_MATCH1}'   Input_file1  Input_file2

Output will be as follows.
Code:
Match:
1    955597    G    G
1    9773306    T    C
1    984302    T    C
Missing from file1:
1    981939    A    G
1    982978    T    C
Missing from file2:
1    982994    T    C
1    981931    A    G

Thanks,
R. Singh
# 3  
Old 12-09-2016
Can the (count) of each be added in? Thank you.

Code:
Match: (3)
1    955597    G    G
1    9773306    T    C
1    984302    T    C
Missing from file1: (2)
1    981939    A    G
1    982978    T    C
Missing from file2: (2)
1    982994    T    C
1    981931    A    G

# 4  
Old 12-09-2016
Quote:
Originally Posted by cmccabe
Can the (count) of each be added in? Thank you.
Code:
Match: (3)
1    955597    G    G
1    9773306    T    C
1    984302    T    C
Missing from file1: (2)
1    981939    A    G
1    982978    T    C
Missing from file2: (2)
1    982994    T    C
1    981931    A    G

Hello cmccabe,

Could you please try the following and let me know if it helps.
Code:
awk 'FNR==NR{A[$2 FS $3 FS $4]=$0;next} {Q=$2 FS $3 FS $4} !(Q in A){;NON_MATCH2=NON_MATCH2?NON_MATCH2 ORS $0:$0;p++} (Q in A){MATCH=MATCH?MATCH ORS A[Q]:A[Q];delete A[Q];q++} END{for(i in A){;NON_MATCH1=NON_MATCH1?NON_MATCH1 ORS A[i]:A[i];r++};print "Match: (" q ")" ORS MATCH ORS "Missing from file1: (" p ")" ORS NON_MATCH2 ORS "Missing from file2: (" r ")"ORS NON_MATCH1}'  Input_file1   Input_file2

Output will be as follows.
Code:
Match: (3)
1    955597    G    G
1    9773306    T    C
1    984302    T    C
Missing from file1: (2)
1    981939    A    G
1    982978    T    C
Missing from file2: (2)
1    982994    T    C
1    981931    A    G

Adding a non-one-liner form of the solution too.
Code:
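awk 'FNR==NR{                                   # first file: index each line by fields 2-4
                A[$2 FS $3 FS $4]=$0;
                next
            }
            {
                Q=$2 FS $3 FS $4                # key of the current line in the second file
            }
            !(Q in A){                          # key not seen in the first file
                        NON_MATCH2=NON_MATCH2?NON_MATCH2 ORS $0:$0;
                        p++
                     }
             (Q in A){                          # key present in both files
                        MATCH=MATCH?MATCH ORS A[Q]:A[Q];
                        delete A[Q];            # whatever is left in A is missing from the second file
                        q++
                     }
      END   {
                for(i in A){                    # leftovers: lines found only in the first file
                                NON_MATCH1=NON_MATCH1?NON_MATCH1 ORS A[i]:A[i];
                                r++
                           };
                print "Match: (" q ")" ORS MATCH ORS "Missing from file1: (" p ")" ORS NON_MATCH2 ORS "Missing from file2: (" r ")" ORS NON_MATCH1
            }
    '   Input_file1   Input_file2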

Thanks,
R. Singh
# 5  
Old 12-10-2016
What is the role of $1 in all of this? What is the point of not using it for the matching, but still using it for the output? Suppose matching lines in file1 and file2 do have a different $1: how do you decide which $1 to print, and why would you print it at all?

If, on the other hand, the value of column 1 is always equal for file1 and file2, then we might just as well use $0 for matching, which would simplify the script.
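
A minimal sketch of that simplification, assuming column 1 really is identical in both files so that the whole line ($0) can serve as the key. The `demo_*` file names are made up for illustration and simply recreate the sample data from post #1; this is one possible way to write it, not necessarily what was meant above.

```shell
# Recreate the tab-delimited sample inputs from post #1 (hypothetical file names).
printf '1\t955597\tG\tG\n1\t9773306\tT\tC\n1\t981931\tA\tG\n1\t982994\tT\tC\n1\t984302\tT\tC\n' > demo_file1
printf '1\t955597\tG\tG\n1\t9773306\tT\tC\n1\t981939\tA\tG\n1\t982978\tT\tC\n1\t984302\tT\tC\n' > demo_file2

# Whole-line matching: $0 is the key, so no -F and no field handling is needed.
awk '
FNR == NR   { in1[$0] = 1; next }                             # first file: remember every line
($0 in in1) { both = both ORS $0; m++; delete in1[$0]; next } # line occurs in both files
            { miss1 = miss1 ORS $0; f1++ }                    # line occurs only in the second file
END {
    for (line in in1) { miss2 = miss2 ORS line; f2++ }        # leftovers occur only in the first file
    printf "Match: (%d)%s\n", m, both
    printf "Missing from file1: (%d)%s\n", f1, miss1
    printf "Missing from file2: (%d)%s\n", f2, miss2
}' demo_file1 demo_file2 > demo_whole_line
```

Using `%d` means an empty section is reported as (0) rather than as empty parentheses.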
# 6  
Old 12-12-2016
Thank you very much.

$1 is not used in the match because its format is highly variable: sometimes it is chr1 and sometimes it is just 1.
It is very unlikely that matching lines will have different $1 values, so it just seemed easier not to match on it but to print it instead. Thank you.
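
If the only variation in $1 is an optional leading "chr", one option is to normalize it and then include it in the key. The following is only a sketch under that assumption: the `demo_chr*` files and their contents are invented for illustration, `key()` is a hypothetical helper, and only the second file's lines are classified to keep the example short.

```shell
# Invented sample data: the first file uses "chr1", the second uses bare numbers.
printf 'chr1\t955597\tG\tG\nchr1\t981931\tA\tG\n' > demo_chr1
printf '1\t955597\tG\tG\n2\t981931\tA\tG\n' > demo_chr2

awk -F'\t' '
function key(   c) {            # c is a local variable
    c = $1
    sub(/^chr/, "", c)          # strip a leading "chr" so "chr1" and "1" compare equal
    return c SUBSEP $2 SUBSEP $3 SUBSEP $4
}
FNR == NR      { in1[key()] = $0; next }        # first file: index by normalized key
(key() in in1) { print "Match:", $0; next }     # same chromosome, position and alleles
               { print "Missing from file1:", $0 }
' demo_chr1 demo_chr2 > demo_chr_out
```

Here the second line of `demo_chr2` is reported as missing because its chromosome differs, even though $2,$3,$4 are the same as in `demo_chr1`.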
# 7  
Old 12-12-2016
This is a bit closer to the approach you used, cmccabe.

All I did was build a string in each for loop, count the lines as they are added, and then printf the result.

Code:
awk -F'\t' '
{
  if(FNR == NR) file1[$2" "$3" "$4]=$0
  else file2[$2" "$3" "$4]=$0
}
END {
  for (k in file1)
     if (k in file2) { X=X"\n"file1[k]; m++}
  for (k in file2) 
     if (!(k in file1)) {Y=Y"\n"file2[k]; f1++}
  for (k in file1) 
     if (!(k in file2)) {Z=Z"\n"file1[k]; f2++}

  printf "Match: (%d)%s\n", m, X
  printf "Missing in file1: (%d)%s\n", f1, Y
  printf "Missing in file2: (%d)%s\n", f2, Z
}' file1 file2 > NA12878_match
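
One thing to keep in mind with `for (k in array)`: awk does not specify the order in which keys are visited, which is why the "Missing from file2" lines in post #4 come out in a different order than in the input. If input order matters, a possible variant is to record the keys in numbered arrays and loop over those instead. This is only a sketch; the `demo_*` file names and the `keys1`/`keys2` arrays are made up for illustration, and it assumes each $2,$3,$4 combination occurs at most once per file.

```shell
# Recreate the tab-delimited sample inputs from post #1 (hypothetical file names).
printf '1\t955597\tG\tG\n1\t9773306\tT\tC\n1\t981931\tA\tG\n1\t982994\tT\tC\n1\t984302\tT\tC\n' > demo_file1
printf '1\t955597\tG\tG\n1\t9773306\tT\tC\n1\t981939\tA\tG\n1\t982978\tT\tC\n1\t984302\tT\tC\n' > demo_file2

awk -F'\t' '
FNR == NR { k = $2 SUBSEP $3 SUBSEP $4; file1[k] = $0; keys1[++n1] = k; next }
          { k = $2 SUBSEP $3 SUBSEP $4; file2[k] = $0; keys2[++n2] = k }
END {
    for (i = 1; i <= n1; i++) {                  # walk file1 keys in input order
        k = keys1[i]
        if (k in file2) { X = X "\n" file1[k]; m++ }
        else            { Z = Z "\n" file1[k]; f2++ }
    }
    for (i = 1; i <= n2; i++) {                  # walk file2 keys in input order
        k = keys2[i]
        if (!(k in file1)) { Y = Y "\n" file2[k]; f1++ }
    }
    printf "Match: (%d)%s\n", m, X
    printf "Missing in file1: (%d)%s\n", f1, Y
    printf "Missing in file2: (%d)%s\n", f2, Z
}' demo_file1 demo_file2 > demo_ordered
```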
