awk to assign points to variables based on conditions and update specific field
I have been reading old posts and trying to come up with a solution for the below: Use a tab-delimited input file to assign
point to variables that are used to update a specific field, Rank. I really couldn't find too much in the way of assigning points
to variable, but made an attempt at an awk to try and accomplish this. I added comments to the code and a description of
what the desired result should be. I apologize for the lengthy post but tried to come up with a solution to my question.
I know that I have made many but I am really trying to learn more and use what I have learned. This question uses the information in the
input file to assign points values to variables, then sum them to update a field. Each condition (there are 6), has a negative or condition that
assign 0 points (or in the case of condition 1, variable points... starting from 5 going down). In my real data there are several hundreds or
thousands of rows but they all follow the same format as below. Thank you .
Description
Code:
Line 1:
condition1 = 0 as FuncrefGene in $7 is not exonic (it is .) and GeneDetailrefGene in $9 has a value > 1 or 2 (it is 84), if $7 was not exonic but $9 was <= 1 or 2 than rank would get 2 points
condition2 = 0 as ExonicFuncrefGene is .
condition3 = 0 as PopFreqMax in $14 is > 0.011
condition4 = 0 as ClinSig in $47 is not Pathogenic or Likely pathogenic (it is a .)
condition5 = 0 as Hgmd in $58 is n or .
condition6 = 1 as Zygosity in $54 is het
0+0+0+0+1 = Rank of 1 in $57
Line 2:
condition1 = 0 as FuncrefGene in $7 is exonic, so this line is skipped as it does not meet this condition and the next condition is processed
condition2 = 5 as ExonicFuncrefGene is stopgain
condition3 = 1 as PopFreqMax in $14 is <= 0.011
condition4 = 2 as ClinSig in $47 is Pathogenic or Likely pathogenic
condition5 = 2 as Hgmd in $58 is not n or .
condition6 = 1 as Zygosity in $54 is het
0+5+1+2+2+1 = Rank of 11 in $57
awk
Code:
awk '
BEGIN { # Set input and output field separators:
FS = OFS = "\t"
# Create list of needed field headers:
nfh["FuncrefGene"]
nfh["GeneDetailrefGene"]
nfh["ExonicFuncrefGene"]
nfh["PopFreqMax"]
nfh["CLINSIG"]
nfh["Hgmd"]
nfh["Zygosity"]
nfh["Rank"]
}
NR == 1 {
# Create array to tranlate needed field headers to field numbers
for(i = 1; i <= NF; i++)
if($i in nfh)
f[$i] = i
# Verify that all of the needed field headers were found
for(i in nfh)
if(!(i in f)) {
missing++
printf("Needed field missing: %s\n", i)
}
# If one or more needed fields were not found, give up
if(missing)
exit 1
}
# Condition 1
NR > 1 {
if(index$f["FuncrefGene"] == "." || $f["FuncrefGene"] != "exonic" && ($9 ~ /:NM_/ && match($9,/c..+||-/)) { # search for :NM_ and c..+||-
# Get the substring after +/-
VAL=substr($9,RSTART+1,RLENGTH-2);
for(i=1;i<=VAL;i++){ # start loop
$f["GeneDetailrefGene"] <= 2) # if GeneDetailRefgene <= 2
var1=2 # var1 gets 2 points
}
# Condition 1a
else {
if($f["FuncrefGene"] == "." || $f["FuncrefGene"] != "exonic" && ($9 ~ /:NM_/ && match($9,/c..+||-/)) { # search for :NM_ and c..+||-
# Get the substring after +/-
VAL=substr($9,RSTART+1,RLENGTH-2);
for(i=1;i<=VAL;i++){ # start loop
$f["GeneDetailrefGene"] > 2) # if GeneDetailRefgene > 2
var1= 0 # var1 gets 0 points
}
# Condition 1b
else {
if($f["FuncrefGene"] == "exonic") # if FuncrefGene is exonic
var1=0 # var1 gets 0 points
next # skip and process next condition
}
else { #Condition 2
if($f["ExonicFuncrefGene"] ~ /^stopgain$/ || $f["ExonicFuncrefGene"] ~ /^stoploss$/) # if ExonicFuncrefGene has stopgain or stoploss in it
var2=5 # var2 gets 5 points
}
else { #Condition 2a
if($f["ExonicFuncrefGene"] ~ /^frameshift$/) # if ExonicFuncrefGene has frameshift in it (could be frameshift insertion/deletion/block substitution)
var2=4 # var2 gets 4 points
}
else { #Condition 2b
if($f["ExonicFuncrefGene"] ~ /^nonframeshift$/) # if ExonicFuncrefGene has nonframeshift in it (could be nonframeshift insertion/deletion/block substitution)
var2=3 # var2 gets 3 points
}
else { #Condition 2c
if($f["ExonicFuncrefGene"] ~ /^nonsynonymous$/) # if ExonicFuncrefGene has nonsynonymous in it (could be nonsynonymous SNV/MNV/other)
var2=2 # var2 gets 2 points
}
else { #Condition 2d
if($f["ExonicFuncrefGene"] ~ /^synonymous$/) # if ExonicFuncrefGene has synonymous in it (could be nonsynymous SNV/MNV/other)
var2 =1 # var2 gets 1 point
}
else { #Condition 2e
if($f["ExonicFuncrefGene"] == ".") # if ExonicFuncrefGene is . (dot)
var2=0 # var2 gets 0 points
}
else { #Condition 3
if($f["$f["PopFreqMax"] <= 0.01) # if PopFreqMax <= 0.01
var3=1 # var3 gets 1 point
}
else { #Test #3a#
if($f["$f["PopFreqMax"] > 0.01) # if PopFreqMax > 0.01
var3=0 # var3 gets 0 points
}
else { #Condition 4
if($f["Clinsig"] ~ /^Pathogenic$/ || $f["Clinsig"] ~ /^Likely pathogenic$/) # if ClinSig has Pathogenic or Likely pathogenic in it
var4=2 # var3 gets 2 points
}
else { #Condition 4a
if($f["Clinsig"] !~ /^Pathogenic$/ || $f["Clinsig"] !~ /^Likely pathogenic$/) # if ClinSig does not have Pathogenic or Likely pathogenic in it
var4=0 # var4 gets 0 points
}
else { #Condition 5
if($f["Hgmd"] !~ /^n$/ || $f["Hgmd"] !~ /^.$/) # if Hgmd does not have n or . (dot) in it
var5=2 # var5 gets 2 points
else { #Condition 5a
if($f["Hgmd"] ~ /^n$/ || $f["Hgmd"] !~ /^.$/) # if Hgmd does have n or . (dot) in it
var5=0 # var5 gets 0 points
}
else { #Condition 6
if($f["Zygosity"] == "hom") # if Zygosity is hom
var6=2 # var6 gets 2 points
}
else { #Condition 6a
if($f["Zygosity"] ~ /^het$/ || $f["Zygosity"] ~ /^.$/) # if Zygosity has het or . (dot) in it
var6=1 # var6 gets 1 point
}
{ # Update Rank
printf($f["Rank"] = var1+var2+var3+var4+var5+var6) # add all variables and store value in Rank
}1' input.txt > update.txt # update and define input and output
Did you test your script so far ? have you performed some "unity testing" to validate one by one that your conditions work as you expect ?
In condition 1 , where does this (see in if(index$f["FuncrefGene"]... ) "index$f" come from ?
Why using the dollar sign in variable settings ? when assigning the value 3 to the variable "var2" in awk, shouldn't you use var2=3 rather than $var2=3 ??? same in testing : you are in awk, not in shell so use : if ( var2 == 3 ) { ... } rather than if ( $var2 == 3 ) { ... }
I have an input file with
A=xyz
B=pqr
I would want the value in Second Field (xyz or pqr) updated with a value present in Shell Variable based on the value passed in the first field. (A or B )
while read line
do
NEW_VALUE = `some functionality done on $line`
If $line=First Field-... (1 Reply)
In the perl below, which does execute, I am having trouble with the else in Rule 3. The digit in f{8} is extracted and used to update f accordinly along with the value in f.
There can be either - * or + before the number that is extracted but the same logic applies, that is if the value is greater... (5 Replies)
I am trying to use awk to update the below tab-delimited file based on 5 different rules/conditions. The final output is also
tab-delimited and each line in the file will meet one of the conditions. My attemp is below as well though I am not very confident in it. Thank you :).
Condition 1: The... (10 Replies)
In the tab-delimeted input file below I am trying to use awk to update the value in $2 if TYPE=ins in bold, by adding the value of
HRUN= in italics. In the below since in line 1 TYPE=ins the 117282541 value in $2 has 6 added because that is the value of HRUN=.
Hopefully the awk is a start but I... (2 Replies)
I am trying to use awk to match two files that are tab-delimited. When a match is found between file1 $1 and file2 $4, $4 in file2 is updated using the $2 value in file1. If no match is found then the next line is processed. Thank you :).
file1
uc001bwr.3 ADC
uc001bws.3 ADC... (4 Replies)
If $1 in file1 matches $2 in file2. Then the value in $2 of file2 is updated to $1"."$2 of file2. The awk seems to only match the two files but not update. Thank you :).
awk
awk 'NR==FNR{A ; next} $1 in A { $2 = a }1' file1 file2
file1
name version
NM_000593 5
NM_001257406... (3 Replies)
Please help me to write a script
Match with ACNO & NAME if it matched calculate the total val1 val2 val3 and val4 and GT is total of ACNO wise.please check the output
Table
-----------------
1005|ANDP|ACN|20|50|10|30
1005|ANDP|ACN|20|10|30|40
1001|AND|NAC|40|50|40|50... (22 Replies)
Hello,
I'm trying the solve the following problem.
I have a file which I intend to use as a csv called master.csv
The columns are separated by commas.
I want to change the text on a specific row in either column 3,4,5 or 6 from xxx to yyy depending upon if column 1 matches a specified pattern.... (3 Replies)
Hello Friends,
I have a file(InputFile.csv) with the following columns(the columns are pipe-delimited):
ColA|ColB|ColC|ColD|ColE|ColF
Now for this file, I have to get those records which fulfil the following condition:
If "ColB" is NOT NULL and "ColD" has values one of the following... (9 Replies)