In the awk below I use $2 of filet to search filea for a match. If the values in $2 match exactly, this works great. However, that is not always the case, so I need to perform the search using a range of plus or minus 2. That is, if the value in filea $2 is within 2 of the value in filet $2, then it counts as a match. Thank you.
awk 'FNR==1 { next }                    # skip the header line of each file
FNR == NR { file1[$2,$3] = $2 " " $3 }  # first file (filea)
FNR != NR { file2[$2,$3] = $2 " " $3 }  # second file (filet)
END { print "Match:"; for (k in file1) if (k in file2) print file1[k]  # Or file2[k]
print "Missing in filea but found in filet:"; for (k in file2) if (!(k in file1)) print file2[k]
print "Missing in filet but found in filea:"; for (k in file1) if (!(k in file2)) print file1[k]
}' filea filet > match
current output
Code:
Match:
6637518 TPP1
6638506 TPP1
Missing in filea but found in filet:
1776024 CTSD
Missing in filet but found in filea:
1776025 CTSD
desired output
Code:
Match:
6637518 TPP1
6638506 TPP1
1776024 CTSD --> within 2 margin
Missing in filea but found in filet:
Missing in filet but found in filea:
awk tried to get the desired output
Code:
awk 'FNR==1 { next }
FNR == NR { file1[$2,$3] = $2 " " $3 } # filea search
FNR != NR { file2[$2,$3] = $2 " " $3 } # in filet
END { print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k]
print "Missing in filea but found in filet:"; for (k in file2) if (!(k in file1)) print file2[k]
print "Missing in filet but found in filea:"; for (k in file1) if (!(k in file2)) print file1[k]
if((k-$2)^2<=2^2) {print $0, " --> within 2 margin"; next}
}' filea filet > match
awk: cmd. line:7: error: `next' used in END action
Last edited by cmccabe; 09-01-2016 at 02:40 PM..
Reason: fixed format
awk 'FNR==NR && NR>1{A[$2]=$2 OFS $3;next} FNR!=NR && NR>1{for(i in A){if($2==i){W=W?W ORS $0:$0;delete A[$2];next};if((($2-i)<=2 && ($2-i)>0)||((i-$2)<=2 && (i-$2)>0)){Q=Q?Q ORS $0:$0;split(A[i], B," ");delete A[B[1]];$0=""}}} !($2 in A) && NF && (FNR!=NR && FNR>1){P=P?P ORS $0:$0} END{for(j in A){E=E?E ORS A[j]:A[j]};print "Common in both files:" ORS W ORS "having +/- 2 values are:" ORS Q ORS "Present in filet and not in filea are:" ORS P ORS "Present in filea and not in filet are:" ORS E}' filea filet
Output will be as follows.
Code:
Common in both files:
chr11 6637518 TPP1
chr11 6638506 TPP1
having +/- 2 values are:
chr11 1776024 CTSD
Present in filet and not in filea are:
chr12 1241414 WRTW
Present in filea and not in filet are:
If your requirements change, please post more samples of the input files and the expected output, with complete details of the rules you want applied. EDIT: Adding a multi-line form of the same solution.
Code:
awk 'FNR==NR && NR>1{
    A[$2]=$2 OFS $3;
    next
}
FNR!=NR && NR>1{
    for(i in A){
        if($2==i){
            W=W?W ORS $0:$0;
            delete A[$2];next
        };
        if((($2-i)<=2 && ($2-i)>0)||((i-$2)<=2 && (i-$2)>0)){
            Q=Q?Q ORS $0:$0;
            split(A[i], B," ");
            delete A[B[1]];
            $0=""
        }
    }
}
!($2 in A) && NF && (FNR!=NR && FNR>1){
    P=P?P ORS $0:$0
}
END{
    for(j in A){
        E=E?E ORS A[j]:A[j]
    };
    print "Common in both files:" ORS W ORS "having +/- 2 values are:" ORS Q ORS "Present in filet and not in filea are:" ORS P ORS "Present in filea and not in filet are:" ORS E
}
' filea filet
Thanks,
R. Singh
Last edited by RavinderSingh13; 09-02-2016 at 05:49 AM..
Reason: Added a multi-line form of the solution. Changed OFS to ORS in the E variable so its entries are separated by \n.
awk: cmd. line:7: error: `next' used in END action
next starts the next input cycle, so it can only be used in the main loop.
In the END section use exit!
(Note: an exit in the main loop goes to the END section.)
Last edited by MadeInGermany; 09-01-2016 at 05:36 PM..
I think the next in the loop in the END section is intended to start the next cycle in the for loop. You do that in loops in actions in the main section and in BEGIN and END sections with continue; not next and not exit.
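As a minimal sketch of that point (the array name pos, the reference value ref, and the three sample records are made up for illustration and are not the thread's input files), a for loop in END can skip a key with continue and carry on with the remaining keys:

```shell
# Hypothetical data: three positions compared against a reference of 1776024.
result=$(awk 'BEGIN { ref = 1776024; margin = 2 }
{ pos[$1] = $2 }                      # store position -> gene
END {
    for (k in pos) {
        if ((k - ref)^2 > margin^2)
            continue                  # skip this key and keep looping (not next, not exit)
        print k, pos[k], "--> within", margin, "margin"
    }
}' <<'EOF'
1776025 CTSD
6637518 TPP1
1776030 EXTRA
EOF
)
printf '%s\n' "$result"
```

Only 1776025 lies within 2 of the reference, so that is the only line printed. Replacing continue with exit would stop at the first key that is out of range, and the order in which for (k in pos) visits keys is unspecified.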
If I am reading Ravinder's code correctly, I believe it is only looking for a match (or near match) on field 2 values without regard to field 3 values. If I am reading cmccabe's code correctly, it is matching on field 3 and then looking for entries with matching (or near matching) field 2 values among entries that have the same field 3 values. With the limited sample data provided, both forms of matching lead to the same results. (Ravinder's code also sometimes outputs field 1 values in the results which was not included in the desired output in post #1.)
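The two readings only diverge when a nearby position belongs to a different gene. The following sketch uses made-up records (the gene names GENEA and GENEB, the positions, the two-column layout without a header, and the temporary file names are all assumptions for illustration) and counts near matches both ways:

```shell
tmp=$(mktemp -d)
printf '%s\n' '100 GENEA' '200 GENEB' > "$tmp/first"
printf '%s\n' '101 GENEA' '201 GENEA' > "$tmp/second"
result=$(awk -v margin=2 '
NR == FNR { pos[NR] = $1; gene[NR] = $2; n = NR; next }   # remember the first file
{
    for (i = 1; i <= n; i++) {
        d = $1 - pos[i]
        if (d * d <= margin * margin) {
            by_pos++                               # position is close enough
            if ($2 == gene[i]) by_pos_gene++       # ...and the gene is the same
        }
    }
}
END {
    print "position only:", by_pos + 0
    print "position and gene:", by_pos_gene + 0
}' "$tmp/first" "$tmp/second")
rm -rf "$tmp"
printf '%s\n' "$result"
```

Here 201 GENEA is within 2 of 200 GENEB, so matching on position alone reports two near matches while matching on position and gene reports one.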
If the intent is to only look for matching and near matching field 2 values for input lines with the same field 3 value, you might want to consider this alternative solution:
Code:
#!/bin/ksh
filea=${1:-filea}
filet=${2:-filet}
margin=${3:-2} # Default margin.
awk -v margin="$margin" '
FNR == 1 {
    if(NR == 1) {
        fn1 = FILENAME
        range = margin ^ 2
        print "Match:"
        next
    }
    fn2 = FILENAME
    next
}
FNR == NR {
    # Gather data from file1.
    file1[$2 OFS $3] = $3
    next
}
{   # We are reading the 2nd input file...
    # Look for exact.
    k = $2 OFS $3
    if(k in file1) {
        # Exact match found.
        print k
        delete file1[k]
        next
    }
    # If there was no match, gather data from the second input file.
    file2[k] = $3
}
END {   # Look for near matches...
    for(key1 in file1)
        for(key2 in file2)
            if(file1[key1] == file2[key2] &&
                (key1 - key2)^2 <= range) {
                # Near match found.
                print key2, "--> within", margin, "margin"
                delete file1[key1]
                delete file2[key2]
                break
            }
    # Look for unmatched entries...
    print "Missing in " fn1 " but found in " fn2 ":"
    for(k in file2)
        print k
    print "Missing in " fn2 " but found in " fn1 ":"
    for(k in file1)
        print k
}' "$filea" "$filet"
Note that in addition to accepting alternative input file pathnames as script operands, the margin can also be specified as a third operand in case you want to experiment with values other than 2.
With the sample input files provided in post #1 in this thread, the above script produces the output:
Code:
Match:
6637518 TPP1
6638506 TPP1
1776024 CTSD --> within 2 margin
Missing in filea but found in filet:
Missing in filet but found in filea:
when invoked without operands. On my system, the above script is named tester. With two other input files: filea_extended:
Code:
Index Start Gene
115 1776025 CTSD
116 6637518 TPP1
117 6638506 TPP1
118 1776025 EXTRA
and filet_extended:
Code:
Chrom Position Gene Sym
chr11 1776024 CTSD
chr11 6637518 TPP1
chr11 6638506 TPP1
chr11 1776030 EXTRA
The command:
Code:
./tester filea_extended filet_extended
produces the output:
Code:
Match:
6637518 TPP1
6638506 TPP1
1776024 CTSD --> within 2 margin
Missing in filea_extended but found in filet_extended:
1776030 EXTRA
Missing in filet_extended but found in filea_extended:
1776025 EXTRA
and the command:
Code:
./tester filea_extended filet_extended 5
produces the output:
Code:
Match:
6637518 TPP1
6638506 TPP1
1776024 CTSD --> within 5 margin
1776030 EXTRA --> within 5 margin
Missing in filea_extended but found in filet_extended:
Missing in filet_extended but found in filea_extended:
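A side note on the near-match test in the ksh script: its array keys have the form "position gene", and the subtraction key1 - key2 still works because awk converts a string to a number from its leading numeric prefix. A small check, using two keys built from the CTSD positions in the sample data (the variable names here are only for illustration):

```shell
result=$(awk 'BEGIN {
    key1 = "1776025 CTSD"      # key as stored from the first input file
    key2 = "1776024 CTSD"      # key as stored from the second input file
    margin = 2
    diff = key1 - key2         # strings are converted using their leading digits
    print diff, ((diff ^ 2 <= margin ^ 2) ? "near match" : "no match")
}')
printf '%s\n' "$result"
```

Squaring the difference avoids a separate absolute-value step, since awk has no built-in abs function.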
Yes @Don Cragun, you are correct that $2 isn't always unique, so I used a combination of $3 and $2 to perform the lookup. Thank you all for your help.