awk to assign points to variables based on conditions and update specific field


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting awk to assign points to variables based on conditions and update specific field
# 1  
Old 09-13-2017
awk to assign points to variables based on conditions and update specific field

I have been reading old posts and trying to come up with a solution for the below: Use a tab-delimited input file to assign
point to variables that are used to update a specific field, Rank. I really couldn't find too much in the way of assigning points
to variable, but made an attempt at an awk to try and accomplish this. I added comments to the code and a description of
what the desired result should be. I apologize for the lengthy post but tried to come up with a solution to my question.
I know that I have made many but I am really trying to learn more and use what I have learned. This question uses the information in the
input file to assign points values to variables, then sum them to update a field. Each condition (there are 6), has a negative or condition that
assign 0 points (or in the case of condition 1, variable points... starting from 5 going down). In my real data there are several hundreds or
thousands of rows but they all follow the same format as below. Thank you Smilie.


Description
Code:
Line 1:
condition1 = 0 as FuncrefGene in $7 is not exonic (it is .) and GeneDetailrefGene in $9 has a value > 1 or 2 (it is 84), if $7 was not exonic but $9 was <= 1 or 2 than rank would get 2 points
condition2 = 0 as ExonicFuncrefGene is .
condition3 = 0 as PopFreqMax in $14 is > 0.011
condition4 = 0 as ClinSig in $47 is not Pathogenic or Likely pathogenic (it is a .)
condition5 = 0 as Hgmd in $58 is n or .
condition6 = 1 as Zygosity in $54 is het
0+0+0+0+1 = Rank of 1 in $57

Line 2:
condition1 = 0 as FuncrefGene in $7 is exonic, so this line is skipped as it does not meet this condition and the next condition is processed
condition2 = 5 as ExonicFuncrefGene is stopgain
condition3 = 1 as PopFreqMax in $14 is <= 0.011
condition4 = 2 as ClinSig in $47 is Pathogenic or Likely pathogenic
condition5 = 2 as Hgmd in $58 is not n or .
condition6 = 1 as Zygosity in $54 is het
0+5+1+2+2+1 = Rank of 11 in $57

awk
Code:
awk '
BEGIN {    # Set input and output field separators:
    FS = OFS = "\t"
    # Create list of needed field headers:
        nfh["FuncrefGene"]
    nfh["GeneDetailrefGene"]
        nfh["ExonicFuncrefGene"]
    nfh["PopFreqMax"]
    nfh["CLINSIG"]
    nfh["Hgmd"]
    nfh["Zygosity"]
    nfh["Rank"]
}
NR == 1 {
    # Create array to tranlate needed field headers to field numbers
    for(i = 1; i <= NF; i++)
        if($i in nfh)
            f[$i] = i
    # Verify that all of the needed field headers were found
    for(i in nfh)
        if(!(i in f)) {
            missing++
            printf("Needed field missing: %s\n", i)
        }
    # If one or more needed fields were not found, give up
    if(missing)
        exit 1
}
        # Condition 1
NR > 1 {
    if(index$f["FuncrefGene"] == "."  || $f["FuncrefGene"] != "exonic" && ($9 ~ /:NM_/ && match($9,/c..+||-/)) {  # search for :NM_ and c..+||-
    # Get the substring after +/-
    VAL=substr($9,RSTART+1,RLENGTH-2);
                        for(i=1;i<=VAL;i++){  # start loop
            $f["GeneDetailrefGene"] <= 2)  # if GeneDetailRefgene <= 2
                var1=2                                # var1 gets 2 points
        }
        # Condition 1a
    else {
        if($f["FuncrefGene"] == "."  || $f["FuncrefGene"] != "exonic" && ($9 ~ /:NM_/ && match($9,/c..+||-/)) {  # search for :NM_ and c..+||-
    # Get the substring after +/-
    VAL=substr($9,RSTART+1,RLENGTH-2);
                        for(i=1;i<=VAL;i++){  # start loop
            $f["GeneDetailrefGene"] > 2)  # if GeneDetailRefgene > 2
                var1= 0                            # var1 gets 0 points
             }
        # Condition 1b
        else {
        if($f["FuncrefGene"] == "exonic")  # if FuncrefGene is exonic
                var1=0                                # var1 gets 0 points 
                        next                   # skip and process next condition
             }
        else {    #Condition 2
        if($f["ExonicFuncrefGene"] ~ /^stopgain$/ || $f["ExonicFuncrefGene"] ~ /^stoploss$/)  # if ExonicFuncrefGene has stopgain or stoploss in it
                var2=5                                                                                                             # var2 gets 5 points
             }
        else {    #Condition 2a
        if($f["ExonicFuncrefGene"] ~ /^frameshift$/)  # if ExonicFuncrefGene has frameshift in it (could be frameshift insertion/deletion/block substitution)
                var2=4                                                # var2 gets 4 points
             }
        else {    #Condition 2b
        if($f["ExonicFuncrefGene"] ~ /^nonframeshift$/)  # if ExonicFuncrefGene has nonframeshift in it (could be nonframeshift insertion/deletion/block substitution)
                var2=3                                                      # var2 gets 3 points
             }
        else {    #Condition 2c
        if($f["ExonicFuncrefGene"] ~ /^nonsynonymous$/)  # if ExonicFuncrefGene has nonsynonymous in it (could be nonsynonymous SNV/MNV/other)
                var2=2                                                         # var2 gets 2 points
             }
        else {    #Condition 2d
        if($f["ExonicFuncrefGene"] ~ /^synonymous$/)  # if ExonicFuncrefGene has synonymous in it (could be nonsynymous SNV/MNV/other)
                var2 =1                                                   # var2 gets 1 point
             }
        else {    #Condition 2e
        if($f["ExonicFuncrefGene"] == ".")  # if ExonicFuncrefGene is . (dot) 
                var2=0                                 # var2 gets 0 points
             }
    else {    #Condition 3
        if($f["$f["PopFreqMax"] <= 0.01)       # if PopFreqMax <= 0.01
                var3=1                                    # var3 gets 1 point
             }
    else {    #Test #3a#
        if($f["$f["PopFreqMax"] > 0.01)       # if PopFreqMax > 0.01
                var3=0                                  # var3 gets 0 points
             }
    else {    #Condition 4
        if($f["Clinsig"] ~ /^Pathogenic$/ || $f["Clinsig"] ~ /^Likely pathogenic$/)  # if ClinSig has Pathogenic or Likely pathogenic in it
                var4=2                                                                                         # var3 gets 2 points
             }
    else {    #Condition 4a
        if($f["Clinsig"] !~ /^Pathogenic$/ || $f["Clinsig"] !~ /^Likely pathogenic$/)  # if ClinSig does not have Pathogenic or Likely pathogenic in it
                var4=0                                                                                            # var4 gets 0 points
             }
    else {    #Condition 5
        if($f["Hgmd"] !~ /^n$/ || $f["Hgmd"] !~ /^.$/) # if Hgmd does not have n or . (dot) in it
                var5=2                                                   # var5 gets 2 points
    else {    #Condition 5a
        if($f["Hgmd"] ~ /^n$/ || $f["Hgmd"] !~ /^.$/)  # if Hgmd does have n or . (dot) in it
                var5=0                                                   # var5 gets 0 points
             }
    else {    #Condition 6
        if($f["Zygosity"] == "hom")  # if Zygosity is hom
                var6=2                       # var6 gets 2 points
             }
    else {    #Condition 6a
        if($f["Zygosity"] ~ /^het$/ || $f["Zygosity"] ~ /^.$/)  # if Zygosity has het or . (dot) in it
                var6=1                                                            # var6 gets 1 point
             }
     { # Update Rank
    printf($f["Rank"] = var1+var2+var3+var4+var5+var6)  # add all variables and store value in Rank
}1' input.txt > update.txt  # update and define input and output

input.txt tab-delimited
Code:
R_Index    Chr    Start    End    Ref    Alt    Func.refGene    Gene.refGene    GeneDetail.refGene    Inheritence    ExonicFunc.refGene    AAChange.refGene    avsnp147    PopFreqMax    1000G_ALL    1000G_AFR    1000G_AMR    1000G_EAS    1000G_EUR    1000G_SAS    ExAC_ALL    ExAC_AFR    ExAC_AMR    ExAC_EAS    ExAC_FIN    ExAC_NFE    ExAC_OTH    ExAC_SAS    ESP6500siv2_ALL    ESP6500siv2_AA    ESP6500siv2_EA    CG46    SIFT_score    SIFT_pred    Polyphen2_HDIV_score    Polyphen2_HDIV_pred    Polyphen2_HVAR_score    Polyphen2_HVAR_pred    LRT_score    LRT_pred    MutationTaster_score    MutationTaster_pred    MutationAssessor_score    MutationAssessor_pred    dpsi_max_tissue    dpsi_zscore    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID    Quality    Reads    Zygosity    Score    Classification    Rank    HGMD    Sanger
2    chr1    948870    948870    C    G    UTR5    ISG15    NM_005101:c.-84C>G    .    .    .    rs4615788    1    0.9    0.72    0.93    1.    0.96    0.98    .    .    .    .    .    .    .    .    0.89    0.74    0.96    0.79    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    GOOD    206    het    12    Likely Benign    .    .    NM_005101:c.-84C>G
1990    chr2    178879181    178879181    G    A    exonic    PDE11A    .    .    stopgain    PDE11A:NM_016953:exon2:c.919C>T:p.R307X,PDE11A:NM_001077197:exon3:c.169C>T:p.R57X    rs76308115    0.0051    0.0018    .    0.0014    .    0.005    0.0031    0.003    0.0009    0.0013    .    0.0043    0.0045    .    0.0011    0.0038    0.0014    0.0051    .    .    .    .    .    .    .    .    D    1    A    .    .    -50.4062    -3.532    Pathogenic    Pigmented_nodular_adrenocortical_disease,_primary,_2    RCV000005604.3    MedGen:OMIM    C1864851:610475    GOOD    86    het    8    VUS    .    .    NM_016953:exon2:c.919C>T:p.R307X

desired update.txt in $57 tab-delimited
Code:
R_Index    Chr    Start    End    Ref    Alt    Func.refGene    Gene.refGene    GeneDetail.refGene    Inheritence    ExonicFunc.refGene    AAChange.refGene    avsnp147    PopFreqMax    1000G_ALL    1000G_AFR    1000G_AMR    1000G_EAS    1000G_EUR    1000G_SAS    ExAC_ALL    ExAC_AFR    ExAC_AMR    ExAC_EAS    ExAC_FIN    ExAC_NFE    ExAC_OTH    ExAC_SAS    ESP6500siv2_ALL    ESP6500siv2_AA    ESP6500siv2_EA    CG46    SIFT_score    SIFT_pred    Polyphen2_HDIV_score    Polyphen2_HDIV_pred    Polyphen2_HVAR_score    Polyphen2_HVAR_pred    LRT_score    LRT_pred    MutationTaster_score    MutationTaster_pred    MutationAssessor_score    MutationAssessor_pred    dpsi_max_tissue    dpsi_zscore    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID    Quality    Reads    Zygosity    Score    Classification    Rank    HGMD    Sanger
2    chr1    948870    948870    C    G    UTR5    ISG15    NM_005101:c.-84C>G    .    .    .    rs4615788    1    0.9    0.72    0.93    1.    0.96    0.98    .    .    .    .    .    .    .    .    0.89    0.74    0.96    0.79    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    GOOD    206    het    12    Likely Benign    1    .    NM_005101:c.-84C>G
1990    chr2    178879181    178879181    G    A    exonic    PDE11A    .    .    stopgain    PDE11A:NM_016953:exon2:c.919C>T:p.R307X,PDE11A:NM_001077197:exon3:c.169C>T:p.R57X    rs76308115    0.0051    0.0018    .    0.0014    .    0.005    0.0031    0.003    0.0009    0.0013    .    0.0043    0.0045    .    0.0011    0.0038    0.0014    0.0051    .    .    .    .    .    .    .    .    D    1    A    .    .    -50.4062    -3.532    Pathogenic    Pigmented_nodular_adrenocortical_disease,_primary,_2    RCV000005604.3    MedGen:OMIM    C1864851:610475    GOOD    86    het    8    VUS    11    CM125    NM_016953:exon2:c.919C>T:p.R307X


Last edited by cmccabe; 09-22-2017 at 02:22 PM.. Reason: fixed format and added details
# 2  
Old 09-22-2017
Did you test your script so far ? have you performed some "unity testing" to validate one by one that your conditions work as you expect ?

In condition 1 , where does this (see in if(index$f["FuncrefGene"]... ) "index$f" come from ?

Why using the dollar sign in variable settings ? when assigning the value 3 to the variable "var2" in awk, shouldn't you use var2=3 rather than $var2=3 ??? same in testing : you are in awk, not in shell so use : if ( var2 == 3 ) { ... } rather than if ( $var2 == 3 ) { ... }

Last edited by ctsgnb; 09-22-2017 at 09:41 AM..
This User Gave Thanks to ctsgnb For This Post:
# 3  
Old 09-22-2017
if(index$f["FuncrefGene"]

This is $7 or Func.refGene in the tab-delimited input.txt. Thank you Smilie.

I will make the change in the variables, I didn't even catch that. Thank you Smilie.

Last edited by cmccabe; 09-22-2017 at 09:39 AM.. Reason: added details
# 4  
Old 09-22-2017
In condition 2c, what are you trying to do : an assignation ( = ) or a comparison ( == ) ?

Also review your whole code : you stilll have a bunch of dollars that shouldn't be there ( f[... rather than $f[... , f[i] rather than f[$i] ...).

Last edited by ctsgnb; 09-22-2017 at 10:43 AM..
This User Gave Thanks to ctsgnb For This Post:
# 5  
Old 09-22-2017
Sorry, that's a typo in 2c, should be an assignation... I will cleanup the code. Thank you Smilie.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Update a specific field in file with Variable value based on other Key Word

I have an input file with A=xyz B=pqr I would want the value in Second Field (xyz or pqr) updated with a value present in Shell Variable based on the value passed in the first field. (A or B ) while read line do NEW_VALUE = `some functionality done on $line` If $line=First Field-... (1 Reply)
Discussion started by: infernalhell
1 Replies

2. Shell Programming and Scripting

Perl to update field based on a specific set of rules

In the perl below, which does execute, I am having trouble with the else in Rule 3. The digit in f{8} is extracted and used to update f accordinly along with the value in f. There can be either - * or + before the number that is extracted but the same logic applies, that is if the value is greater... (5 Replies)
Discussion started by: cmccabe
5 Replies

3. Shell Programming and Scripting

awk to update file based on 5 conditions

I am trying to use awk to update the below tab-delimited file based on 5 different rules/conditions. The final output is also tab-delimited and each line in the file will meet one of the conditions. My attemp is below as well though I am not very confident in it. Thank you :). Condition 1: The... (10 Replies)
Discussion started by: cmccabe
10 Replies

4. Shell Programming and Scripting

awk to update value in field based on another field

In the tab-delimeted input file below I am trying to use awk to update the value in $2 if TYPE=ins in bold, by adding the value of HRUN= in italics. In the below since in line 1 TYPE=ins the 117282541 value in $2 has 6 added because that is the value of HRUN=. Hopefully the awk is a start but I... (2 Replies)
Discussion started by: cmccabe
2 Replies

5. Shell Programming and Scripting

awk to update field in file based of match in another

I am trying to use awk to match two files that are tab-delimited. When a match is found between file1 $1 and file2 $4, $4 in file2 is updated using the $2 value in file1. If no match is found then the next line is processed. Thank you :). file1 uc001bwr.3 ADC uc001bws.3 ADC... (4 Replies)
Discussion started by: cmccabe
4 Replies

6. Shell Programming and Scripting

awk to update field file based on match

If $1 in file1 matches $2 in file2. Then the value in $2 of file2 is updated to $1"."$2 of file2. The awk seems to only match the two files but not update. Thank you :). awk awk 'NR==FNR{A ; next} $1 in A { $2 = a }1' file1 file2 file1 name version NM_000593 5 NM_001257406... (3 Replies)
Discussion started by: cmccabe
3 Replies

7. Shell Programming and Scripting

Calculate the total 4 field based on the conditions

Please help me to write a script Match with ACNO & NAME if it matched calculate the total val1 val2 val3 and val4 and GT is total of ACNO wise.please check the output Table ----------------- 1005|ANDP|ACN|20|50|10|30 1005|ANDP|ACN|20|10|30|40 1001|AND|NAC|40|50|40|50... (22 Replies)
Discussion started by: kalia4u
22 Replies

8. Shell Programming and Scripting

awk search/replace specific field, using variables for regexp & subsitution then overwrite file

Hello, I'm trying the solve the following problem. I have a file which I intend to use as a csv called master.csv The columns are separated by commas. I want to change the text on a specific row in either column 3,4,5 or 6 from xxx to yyy depending upon if column 1 matches a specified pattern.... (3 Replies)
Discussion started by: cyphex
3 Replies

9. Shell Programming and Scripting

awk assign output of array to specific field-number

With this script i want to print the output to a specific field-number . Can anybody help? awk 'NR=FNR{split(FILENAME,fn,"_");nr=$2;f = $1} END{for (i=1;i<=f;i++) print i,$fn=nr}' input_5.csv input_6.csvinput_5.csv 4 135 5 185 6 85 11 30input_6.csv 1 90 3 58 4 135 7 60 8 55 10... (1 Reply)
Discussion started by: sdf
1 Replies

10. Shell Programming and Scripting

Extract file records based on some field conditions

Hello Friends, I have a file(InputFile.csv) with the following columns(the columns are pipe-delimited): ColA|ColB|ColC|ColD|ColE|ColF Now for this file, I have to get those records which fulfil the following condition: If "ColB" is NOT NULL and "ColD" has values one of the following... (9 Replies)
Discussion started by: mehimadri
9 Replies
Login or Register to Ask a Question