awk to match value to a field within +/- value


 
# 1  
Old 09-01-2016

In the awk below I use $2 of filet to search filea for a match. When the values in $2 match exactly, this works well. However, that is not always the case, so I need to perform the search using a range of + or - 2. That is, if the value in filea $2 is within + or - 2 of filet $2, it counts as a match. Thank you.

filet
Code:
Chrom    Position    Gene Sym
chr11    1776024    CTSD
chr11    6637518    TPP1
chr11    6638506    TPP1

filea
Code:
Index    Start    Gene
115    1776025    CTSD
116    6637518    TPP1
117    6638506    TPP1

awk
Code:
awk 'FNR==1 { next }   # skip the header line of each file
      FNR == NR { file1[$2,$3] = $2 " " $3 } # filea search
      FNR != NR { file2[$2,$3] = $2 " " $3 } # in filet
      END { print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k]
            print "Missing in filea but found in filet:"; for (k in file2) if (!(k in file1)) print file2[k]
            print "Missing in filet but found in filea:"; for (k in file1) if (!(k in file2)) print file1[k]
      }' filea filet > match

current output
Code:
Match:
6637518 TPP1
6638506 TPP1
Missing in filea but found in filet:
1776024 CTSD
Missing in filet but found in filea:
1776025 CTSD

desired output
Code:
Match:
6637518 TPP1
6638506 TPP1
1776024 CTSD --> within 2 margin
Missing in filea but found in filet:
Missing in filet but found in filea:

awk tried to get the desired output
Code:
awk 'FNR==1 { next }
      FNR == NR { file1[$2,$3] = $2 " " $3 } # filea search
      FNR != NR { file2[$2,$3] = $2 " " $3 } # in filet
      END { print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k]
            print "Missing in filea but found in filet:"; for (k in file2) if (!(k in file1)) print file2[k]
            print "Missing in filet but found in filea:"; for (k in file1) if (!(k in file2)) print file1[k]
            if((k-$2)^2<=2^2) {print $0, " --> within 2 margin"; next}
      }' filea filet > match
awk: cmd. line:7: error: `next' used in END action


Last edited by cmccabe; 09-01-2016 at 02:40 PM.. Reason: fixed format
# 2  
Old 09-01-2016
Hello cmccabe,

Could you please try the following and let me know if it helps? Let's say these are the two input files.
Code:
cat filet
Chrom    Position    Gene Sym
chr11    1776024    CTSD
chr11    6637518    TPP1
chr11    6638506    TPP1
chr12    1241414    WRTW
cat filea
Index    Start    Gene
115    1776025    CTSD
116    6637518    TPP1
117    6638506    TPP1

The following code may help.
Code:
awk 'FNR==NR && NR>1{A[$2]=$2 OFS $3;next} FNR!=NR && NR>1{for(i in A){if($2==i){W=W?W ORS $0:$0;delete A[$2];next};if((($2-i)<=2 && ($2-i)>0)||((i-$2)<=2 && (i-$2)>0)){Q=Q?Q ORS $0:$0;split(A[i], B," ");delete A[B[1]];$0=""}}} !($2 in A) && NF && (FNR!=NR && FNR>1){P=P?P ORS $0:$0}  END{for(j in A){E=E?E ORS A[j]:A[j]};print "Common in both files:" ORS W ORS "having +/- 2 values are:" ORS Q ORS "Present in filet and not in filea are:" ORS P ORS "Present in filea and not in filet are:" ORS E}' filea filet

Output will be as follows.
Code:
Common in both files:
chr11    6637518    TPP1
chr11    6638506    TPP1
having +/- 2 values are:
chr11    1776024    CTSD
Present in filet and not in filea are:
chr12    1241414    WRTW
Present in filea and not in filet are:

If your requirements change, please post more sample input and the expected output, with complete details of the rules you want applied.
EDIT: Adding a multi-line form of the same solution.
Code:
awk 'FNR==NR && NR>1{
        A[$2]=$2 OFS $3;
        next
     }
     FNR!=NR && NR>1{
        for(i in A){
                if($2==i){
                        W=W?W ORS $0:$0;
                        delete A[$2];
                        next
                };
                if((($2-i)<=2 && ($2-i)>0)||((i-$2)<=2 && (i-$2)>0)){
                        Q=Q?Q ORS $0:$0;
                        split(A[i], B," ");
                        delete A[B[1]];
                        $0=""
                }
        }
     }
     !($2 in A) && NF && (FNR!=NR && FNR>1){
        P=P?P ORS $0:$0
     }
     END{
        for(j in A){
                E=E?E ORS A[j]:A[j]
        };
        print "Common in both files:" ORS W ORS "having +/- 2 values are:" ORS Q ORS "Present in filet and not in filea are:" ORS P ORS "Present in filea and not in filet are:" ORS E
     }
     ' filea filet
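
The two-sided range test in the second if could also be written with an absolute difference. This is only a minimal sketch, not part of the solution above: the names within_margin (shell) and abs (awk) are made up for illustration, and unlike the original condition this version also accepts a difference of 0.

```shell
# Hypothetical helper: prints "near" when two positions differ by at most
# the given margin, otherwise "far".
within_margin() {
    awk -v a="$1" -v b="$2" -v m="$3" '
    function abs(x) { return x < 0 ? -x : x }
    BEGIN {
        r = (abs(a - b) <= m) ? "near" : "far"
        print r
    }'
}

within_margin 1776024 1776025 2
within_margin 1776030 1776025 2
```

With the sample positions from this thread, the first call should report near and the second far.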

Thanks,
R. Singh

Last edited by RavinderSingh13; 09-02-2016 at 05:49 AM.. Reason: Added a multi-line form of the solution; changed OFS to ORS in the E variable so entries are separated by newlines.
# 3  
Old 09-01-2016
Quote:
awk: cmd. line:7: error: `next' used in END action
next moves on to the next input record, so it can only be used in the main loop.
In the END section use exit!
(Note: an exit in the main loop goes to the END section.)
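
A small sketch of the note above, using made-up three-line input and a hypothetical function name exit_demo: exit in a main-loop action stops reading input, but the END action still runs.

```shell
# exit in the main loop skips the remaining input and jumps to END.
exit_demo() {
    printf 'a\nb\nc\n' | awk '
    NR == 2 { exit }              # stop reading at the second record
    { print "read:", $0 }
    END { print "END reached after", NR, "records" }'
}

exit_demo
```

Only the first record is printed by the main action; the END line is still printed afterwards.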

Last edited by MadeInGermany; 09-01-2016 at 05:36 PM..
# 4  
Old 09-01-2016
I think the next inside the loop in the END section is intended to start the next iteration of the for loop. In a loop, whether it appears in a main-section action or in a BEGIN or END section, you do that with continue; not next and not exit.
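
As an illustration with made-up numbers (the function name continue_demo and the array val are hypothetical), continue inside a for loop in END skips one entry and keeps looping:

```shell
# continue skips to the next iteration of the enclosing for loop;
# it is valid in loops in BEGIN, main, and END actions alike.
continue_demo() {
    printf '5\n12\n7\n' | awk '
    { val[NR] = $1 }              # remember every input value
    END {
        for (i = 1; i <= NR; i++) {
            if (val[i] > 10)
                continue          # skip this entry, keep looping
            print "kept:", val[i]
        }
    }'
}

continue_demo
```

Here 12 is skipped, and the loop goes on to print the remaining values.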
# 5  
Old 09-02-2016
Thank you all very much.
# 6  
Old 09-04-2016
If I am reading Ravinder's code correctly, I believe it only looks for a match (or near match) on field 2 values, without regard to field 3 values. If I am reading cmccabe's code correctly, it matches on field 3 and then looks for entries with matching (or near matching) field 2 values among entries that have the same field 3 value. With the limited sample data provided, both forms of matching lead to the same results. (Ravinder's code also sometimes outputs field 1 values in the results, which were not included in the desired output in post #1.)
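
To see why the choice of key matters, here is a minimal sketch with made-up records (GENEA, GENEB, and the names key_demo, by_pos, and by_both are invented for illustration). Two lines share a position but differ in gene name; an array keyed on field 2 alone keeps only one of them.

```shell
# Compare the number of distinct keys when indexing on $2 alone
# versus on $2 and $3 together.
key_demo() {
    printf '1 100 GENEA\n2 100 GENEB\n' | awk '
    {
        by_pos[$2] = $3            # second record overwrites the first
        by_both[$2 OFS $3] = $3    # both records are kept
    }
    END {
        for (k in by_pos)  n1++
        for (k in by_both) n2++
        print "keys by position:", n1
        print "keys by position and gene:", n2
    }'
}

key_demo
```

The counts are obtained with for loops rather than length(array), since not every awk supports the latter.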

If the intent is to only look for matching and near matching field 2 values for input lines with the same field 3 value, you might want to consider this alternative solution:
Code:
#!/bin/ksh
filea=${1:-filea}
filet=${2:-filet}
margin=${3:-2}	# Default margin.

awk -v margin="$margin" '
FNR == 1 {
	if(NR == 1) {
		fn1 = FILENAME
		range = margin ^ 2
		print "Match:"
		next
	}
	fn2 = FILENAME
	next
}
FNR == NR {
	# Gather data from file1.
	file1[$2 OFS $3] = $3
	next
}
{	# We are reading the 2nd input file...
	# Look for exact.
	k = $2 OFS $3
	if(k in file1) {
		# Exact match found.
		print k
		delete file1[k]
		next
	}
	# If there was no match, gather data from the second input file.
	file2[k] = $3
}
END {	# Look for near matches...
	for(key1 in file1)
		for(key2 in file2)
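			# Keys look like "position gene"; in arithmetic, awk
			# converts each key to its leading number (the position).
			if(file1[key1] == file2[key2] &&
			    (key1 - key2)^2 <= range) {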
				# Near match found.
				print key2, "--> within", margin, "margin"
				delete file1[key1]
				delete file2[key2]
				break
			}
	# Look for unmatched entries...
	print "Missing in " fn1  " but found in " fn2 ":"
	for(k in file2)
		print k
	print "Missing in " fn2 " but found in " fn1 ":"
	for(k in file1)
		print k
}' "$filea" "$filet"

Note that in addition to accepting alternative input file pathnames as script operands, the margin can also be specified as a third operand in case you want to experiment with values other than 2.
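
The default-with-override pattern used for the margin can be tried on its own. This is a stripped-down sketch, with show_margin as a hypothetical function name: ${1:-2} falls back to 2 when no argument is given, and -v hands the value to awk, where it is squared the same way the script computes range.

```shell
# ${1:-2} expands to the first argument, or to 2 when it is unset or empty.
show_margin() {
    margin=${1:-2}
    awk -v margin="$margin" 'BEGIN { print "margin:", margin, "range:", margin ^ 2 }'
}

show_margin
show_margin 5
```

Comparing squared differences against range avoids the need for an absolute-value function, which awk does not provide as a built-in.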

With the sample input files provided in post #1 in this thread, the above script produces the output:
Code:
Match:
6637518 TPP1
6638506 TPP1
1776024 CTSD --> within 2 margin
Missing in filea but found in filet:
Missing in filet but found in filea:

when invoked without operands. On my system, the above script is named tester. With two other input files:
filea_extended:
Code:
Index    Start    Gene
115    1776025    CTSD
116    6637518    TPP1
117    6638506    TPP1
118    1776025    EXTRA

and filet_extended:
Code:
Chrom    Position    Gene Sym
chr11    1776024    CTSD
chr11    6637518    TPP1
chr11    6638506    TPP1
chr11    1776030    EXTRA

The command:
Code:
./tester filea_extended filet_extended

produces the output:
Code:
Match:
6637518 TPP1
6638506 TPP1
1776024 CTSD --> within 2 margin
Missing in filea_extended but found in filet_extended:
1776030 EXTRA
Missing in filet_extended but found in filea_extended:
1776025 EXTRA

and the command:
Code:
./tester filea_extended filet_extended 5

produces the output:
Code:
Match:
6637518 TPP1
6638506 TPP1
1776024 CTSD --> within 5 margin
1776030 EXTRA --> within 5 margin
Missing in filea_extended but found in filet_extended:
Missing in filet_extended but found in filea_extended:

# 7  
Old 09-06-2016
Yes @Don Cragun, you are correct that $2 isn't always unique, so I used a combination of $3 and $2 to perform the lookup. Thank you all for your help.