awk to match value to a field within +/- value

09-01-2016



In the awk below I use $2 of filet to search filea for a match. If the values in $2 are an exact match, this works great. However, that is not always the case, so I need to perform the search using a range of + or - 2. That is, if the value of $2 in filea is within + or - 2 of $2 in filet, then it is matched. Thank you.

filet

Code:

Chrom    Position    Gene Sym
chr11    1776024    CTSD
chr11    6637518    TPP1
chr11    6638506    TPP1

filea

Code:

Index    Start    Gene
115    1776025    CTSD
116    6637518    TPP1
117    6638506    TPP1

awk

Code:

awk 'FNR==1 { next }
      FNR == NR { file1[$2,$3] = $2 " " $3 } # filea (first file)
      FNR != NR { file2[$2,$3] = $2 " " $3 } # filet (second file)
      END { print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k]
            print "Missing in filea but found in filet:"; for (k in file2) if (!(k in file1)) print file2[k]
            print "Missing in filet but found in filea:"; for (k in file1) if (!(k in file2)) print file1[k]
      }' filea filet > match

current output

Code:

Match:
6637518 TPP1
6638506 TPP1
Missing in filea but found in filet:
1776024 CTSD
Missing in filet but found in filea:
1776025 CTSD

desired output

Code:

Match:
6637518 TPP1
6638506 TPP1
1776024 CTSD --> within 2 margin
Missing in filea but found in filet:
Missing in filet but found in filea:

awk tried to get the desired output

Code:

awk 'FNR==1 { next }
      FNR == NR { file1[$2,$3] = $2 " " $3 } # filea search
      FNR != NR { file2[$2,$3] = $2 " " $3 } # in filet
      END { print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k]
            print "Missing in filea but found in filet:"; for (k in file2) if (!(k in file1)) print file2[k]
            print "Missing in filet but found in filea:"; for (k in file1) if (!(k in file2)) print file1[k]
            if((k-$2)^2<=2^2) {print $0, " --> within 2 margin"; next}
      }' filea filet > match
awk: cmd. line:7: error: `next' used in END action

Last edited by cmccabe; 09-01-2016 at 02:40 PM.. Reason: fixed format

cmccabe


09-01-2016


Hello cmccabe,

Could you please try the following and let me know if it helps? Let's say the following are the 2 Input_files.

Code:

cat filet
Chrom    Position    Gene Sym
chr11    1776024    CTSD
chr11    6637518    TPP1
chr11    6638506    TPP1
chr12    1241414    WRTW
cat filea
Index    Start    Gene
115    1776025    CTSD
116    6637518    TPP1
117    6638506    TPP1

The following code may help you with this.

Code:

awk 'FNR==NR && NR>1{A[$2]=$2 OFS $3;next} FNR!=NR && NR>1{for(i in A){if($2==i){W=W?W ORS $0:$0;delete A[$2];next};if((($2-i)<=2 && ($2-i)>0)||((i-$2)<=2 && (i-$2)>0)){Q=Q?Q ORS $0:$0;split(A[i], B," ");delete A[B[1]];$0=""}}} !($2 in A) && NF && (FNR!=NR && FNR>1){P=P?P ORS $0:$0}  END{for(j in A){E=E?E ORS A[j]:A[j]};print "Common in both files:" ORS W ORS "having +/- 2 values are:" ORS Q ORS "Present in filet and not in filea are:" ORS P ORS "Present in filea and not in filet are:" ORS E}' filea filet

Output will be as follows.

Code:

Common in both files:
chr11    6637518    TPP1
chr11    6638506    TPP1
having +/- 2 values are:
chr11    1776024    CTSD
Present in filet and not in filea are:
chr12    1241414    WRTW
Present in filea and not in filet are:

If your requirements change, please post more samples of the Input_file and the expected output, with complete details of the rules you want applied.
EDIT: Adding a non-one-liner form of the same solution.

Code:

awk 'FNR==NR && NR>1{               # first file (filea), header skipped
        A[$2]=$2 OFS $3
        next
     }
     FNR!=NR && NR>1{               # second file (filet)
        for(i in A){
                if($2==i){          # exact match on field 2
                        W=W?W ORS $0:$0
                        delete A[$2]
                        next
                }
                if((($2-i)<=2 && ($2-i)>0)||((i-$2)<=2 && (i-$2)>0)){   # within +/- 2
                        Q=Q?Q ORS $0:$0
                        split(A[i], B," ")
                        delete A[B[1]]
                        $0=""       # clear the record so the next rule skips it
                }
        }
     }
     !($2 in A) && NF && (FNR!=NR && FNR>1){    # filet line with no match in filea
        P=P?P ORS $0:$0
     }
     END{
        for(j in A){                # entries left in A were never matched
                E=E?E ORS A[j]:A[j]
        }
        print "Common in both files:" ORS W ORS "having +/- 2 values are:" ORS Q ORS "Present in filet and not in filea are:" ORS P ORS "Present in filea and not in filet are:" ORS E
     }
     ' filea filet
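
The two-sided range test in the near-match branch could probably be written more compactly with an absolute-value helper. This is only a sketch, not part of the solution above: awk has no built-in abs(), so the function is user-defined, and the names within and abs are made up for illustration.

```shell
# Hypothetical helper: classify the distance between two positions.
within() {
    awk -v a="$1" -v b="$2" -v margin="${3:-2}" '
    function abs(x) { return x < 0 ? -x : x }   # user-defined, awk has no abs()
    BEGIN {
        d = abs(a - b)
        if (d == 0) {
            print "exact"       # handled by the exact-match branch above
        } else if (d <= margin) {
            print "near"        # same as the +/- 2 test when margin is 2
        } else {
            print "far"
        }
    }'
}
within 1776025 1776024      # prints "near"
within 1776030 1776025 2    # prints "far"
```

The default margin of 2 mirrors the value hard-coded in the code above; passing a third argument overrides it.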

Thanks,
R. Singh

Last edited by RavinderSingh13; 09-02-2016 at 05:49 AM.. Reason: Adding a non-one liner form of solution on same too succcessfully now. change OFS to ORS in E variable to get lines as \n.


RavinderSingh13


09-01-2016


Quote:

awk: cmd. line:7: error: `next' used in END action

next goes to the next input cycle, so it can only be used in the main loop.
In the END section use exit!
(Note: an exit in the main loop goes to the END section.)

Last edited by MadeInGermany; 09-01-2016 at 05:36 PM..


MadeInGermany


09-01-2016


I think the next in the loop in the END section was intended to start the next iteration of the for loop. In a loop, whether it is in an action in the main section or in a BEGIN or END section, you do that with continue; not next and not exit.
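
A minimal sketch of that point, using made-up sample values rather than the files from post #1: inside a for loop in END, continue moves on to the next iteration, whereas exit would end the whole END action at the first entry that is out of range. The function name show_continue is hypothetical.

```shell
# Hypothetical demo: skip array entries inside an END loop with continue.
show_continue() {
    printf '%s\n' 1776024 6637518 6638506 |
    awk -v target=1776025 -v margin=2 '
    { pos[NR] = $1 }                    # main section: store one value per line
    END {
        for (i = 1; i <= NR; i++) {
            d = pos[i] - target
            if (d * d > margin * margin)
                continue                # out of range: go on to the next entry
            print pos[i], "--> within", margin, "margin"
        }
    }'
}
show_continue    # prints "1776024 --> within 2 margin"
```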


Don Cragun


09-02-2016


Thank you all very much

cmccabe


09-04-2016


If I am reading Ravinder's code correctly, I believe it is only looking for a match (or near match) on field 2 values without regard to field 3 values. If I am reading cmccabe's code correctly, it is matching on field 3 and then looking for entries with matching (or near matching) field 2 values among entries that have the same field 3 values. With the limited sample data provided, both forms of matching lead to the same results. (Ravinder's code also sometimes outputs field 1 values in the results, which were not included in the desired output in post #1.)

If the intent is to only look for matching and near matching field 2 values for input lines with the same field 3 value, you might want to consider this alternative solution:

Code:

#!/bin/ksh
filea=${1:-filea}
filet=${2:-filet}
margin=${3:-2}	# Default margin.

awk -v margin="$margin" '
FNR == 1 {
	if(NR == 1) {
		fn1 = FILENAME
		range = margin ^ 2
		print "Match:"
		next
	}
	fn2 = FILENAME
	next
}
FNR == NR {
	# Gather data from file1.
	file1[$2 OFS $3] = $3
	next
}
{	# We are reading the 2nd input file...
	# Look for exact.
	k = $2 OFS $3
	if(k in file1) {
		# Exact match found.
		print k
		delete file1[k]
		next
	}
	# If there was no match, gather data from the second input file.
	file2[k] = $3
}
END {	# Look for near matches...
	for(key1 in file1)
		for(key2 in file2)
			if(file1[key1] == file2[key2] &&
			    (key1 - key2)^2 <= range) {
				# Near match found.
				print key2, "--> within", margin, "margin"
				delete file1[key1]
				delete file2[key2]
				break
			}
	# Look for unmatched entries...
	print "Missing in " fn1  " but found in " fn2 ":"
	for(k in file2)
		print k
	print "Missing in " fn2 " but found in " fn1 ":"
	for(k in file1)
		print k
}' "$filea" "$filet"

Note that in addition to accepting alternative input file pathnames as script operands, the margin can also be specified as a third operand in case you want to experiment with values other than 2.

With the sample input files provided in post #1 in this thread, the above script produces the output:

Code:

Match:
6637518 TPP1
6638506 TPP1
1776024 CTSD --> within 2 margin
Missing in filea but found in filet:
Missing in filet but found in filea:

when invoked without operands. On my system, the above script is named tester. With two other input files:
filea_extended:

Code:

Index    Start    Gene
115    1776025    CTSD
116    6637518    TPP1
117    6638506    TPP1
118    1776025    EXTRA

and filet_extended:

Code:

Chrom    Position    Gene Sym
chr11    1776024    CTSD
chr11    6637518    TPP1
chr11    6638506    TPP1
chr11    1776030    EXTRA

The command:

Code:

./tester filea_extended filet_extended

produces the output:

Code:

Match:
6637518 TPP1
6638506 TPP1
1776024 CTSD --> within 2 margin
Missing in filea_extended but found in filet_extended:
1776030 EXTRA
Missing in filet_extended but found in filea_extended:
1776025 EXTRA

and the command:

Code:

./tester filea_extended filet_extended 5

produces the output:

Code:

Match:
6637518 TPP1
6638506 TPP1
1776024 CTSD --> within 5 margin
1776030 EXTRA --> within 5 margin
Missing in filea_extended but found in filet_extended:
Missing in filet_extended but found in filea_extended:
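
One detail of the script is easy to miss: the array keys are strings such as "1776025 CTSD", yet (key1 - key2)^2 still works because awk converts a string used in arithmetic to the number formed by its leading numeric prefix. The sketch below isolates that test under the same assumption about the key layout; the function name key_distance is made up for illustration and is not part of the script.

```shell
# Hypothetical demo of the near-match test on "position gene" keys.
key_distance() {
    awk -v k1="$1" -v k2="$2" -v margin="${3:-2}" '
    BEGIN {
        d = k1 - k2     # only the leading number of each key takes part
        # Squaring removes the sign, so one comparison covers both directions.
        print d, ((d ^ 2 <= margin ^ 2) ? "near" : "far")
    }'
}
key_distance "1776025 CTSD" "1776024 CTSD"        # prints "1 near"
key_distance "1776025 EXTRA" "1776030 EXTRA"      # prints "-5 far"
key_distance "1776025 EXTRA" "1776030 EXTRA" 5    # prints "-5 near"
```

Note that this test alone ignores the gene name; the script only applies it after checking that the field 3 values stored in file1 and file2 are equal.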


Don Cragun


09-06-2016


Yes @Don Cragun, you are correct that $2 isn't always unique, so I used a combination of $3 and $2 to perform the lookup. Thank you all for your help.

cmccabe
