awk to update file based on 5 conditions

01-23-2017

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

As I said before, there is no such thing as a field not having a value. If there are no characters in a field, the value of that field is an empty string. And, yes, test 2 explicitly fails if the CLINSIG field is an empty string. And since an empty string does not contain | and an empty string is not the same as the string VUS, tests 1, 3, 4, and 5 also fail, leaving test 6 to set the Classification field to VUS.
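A quick way to convince yourself of this is the minimal sketch below. The three-field input line is made up for illustration; it is not taken from your file. It prints the field count, whether the middle field equals the empty string, its length, and the position of | in it:

```shell
# Three tab-separated fields; the middle one has no characters at all.
result=$(printf 'a\t\tc\n' | awk -F'\t' '{
	# An empty field still exists: it compares equal to "" and has length 0,
	# and index() finds no "|" in it.
	is_empty = ($2 == "")
	print NF, is_empty, length($2), index($2, "|")
}')
echo "$result"
```

The empty field is still counted in NF; it simply holds a zero-length string.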

Now that you have a clearer specification for your script, can you write the awk script to perform those tests and produce your desired output? Try to write it and let us see what you come up with. If you get stuck, we'll try to help you fix your problems.

Don Cragun


01-23-2017

Registered User

1,393, 20

Join Date: Nov 2013

Last Activity: 1 May 2020, 2:35 PM EDT

Location: Chicago

Posts: 1,393

Thanks Given: 901

Thanked 20 Times in 19 Posts

I will attempt to write an awk script that meets these requirements. It may be similar in structure to the script in post one, but that is only because I am not yet proficient enough. Only the final classified file is needed. Thank you.

---------- Post updated at 12:53 PM ---------- Previous update was at 05:07 AM ----------

Below is my awk for the first 3 conditions as well as the sixth. I know that it needs work, but hopefully I got the basic concepts. Thank you.

Code:

awk -F'\t' -v OFS='\t' 'NR>1{ if ($(NF-3 == "|"));$(NF-1)="Conflicted"} 1' file > conflicted # condition 1
awk -F'\t' -v OFS='\t' 'NR>1{ if ($(NF-3 ~ /^(.|VUS)$/ );$(NF-1)=($(NF-3))} 1' conflicted > clinsig # condition 2
awk -F'\t' -v OFS='\t' 'NR>1{ if ($(NF-3 == "VUS"));$(NF-7)="UTR";$(NF-1)="Likely Benign"} 1' clinsig > utr # condition 3
awk -F'\t' -v OFS='\t' 'NR>1{$(NF-1)="VUS"} 1' #condition 6

cmccabe


01-24-2017


You still seem to be having trouble with awk syntax...

Code:

if ($(NF-3 == "|"))

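does not look for a pipe symbol in the field three positions before the last one; it tests whether field 0 (the entire input line) is non-empty and non-zero, which with your sample input is ALWAYS true. The comparison NF-3 == "|" is evaluated first: NF-3 is always a number and will never be the character |, so the value in parentheses is always false (i.e., 0). The test therefore becomes $0, and since $0 is not an empty string and is not a string that just contains (or evaluates to) the value 0, the expression evaluates to true. In addition, the ; immediately after the closing parenthesis ends the if statement, so the assignment that follows it is performed on every line no matter what the test returns. I used: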

Code:

if(index($f["CLINSIG"], "|"))

for this test, but you could also just use:

Code:

if($f["CLINSIG"] ~ /[|]/)

All of your other if statement expressions have similar problems.
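To make the difference concrete, here is a small sketch. The five-field input lines are hypothetical (chosen so that $(NF-3) is field 2), and the variable names wrong, right, and regex exist only for this demonstration:

```shell
# Hypothetical 5-field records, so $(NF-3) is field 2.
out=$(printf 'x\tBenign\tc\td\te\nx\tA|B\tc\td\te\n' | awk -F'\t' '{
	# Misplaced parenthesis: (NF-3 == "|") is 0, so this really tests $0.
	wrong = ($(NF-3 == "|")) ? "true" : "false"
	# Intended test: does field NF-3 contain a "|" character?
	right = index($(NF-3), "|") ? "true" : "false"
	# Regex form: bracket (or escape) the pipe, since a bare | means alternation.
	regex = ($(NF-3) ~ /[|]/) ? "true" : "false"
	print wrong, right, regex
}')
echo "$out"

# A ";" directly after if (...) is an empty statement,
# so the statement that follows it always runs.
semi=$(awk 'BEGIN { if (0); print "always runs" }')
echo "$semi"
```

The first column is true for both lines even though only the second line contains a |, while the index() and bracketed-regex forms are true only for the second line.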

Something more like:

Code:

awk '
BEGIN {	# Set input and output field separators:
	FS = OFS = "\t"
	# Create list of needed field headers:
	nfh["Classification"]
	nfh["CLINSIG"]
	nfh["PopFreqMax"]
	nfh["Func.IDP.refGene"]
}
NR == 1 {
	# Create array to translate needed field headers to field numbers:
	for(i = 1; i <= NF; i++)
		if($i in nfh)
			f[$i] = i
	# Verify that all of the needed field headers were found:
	for(i in nfh)
		if(!(i in f)) {
			missing++
			printf("Needed field missing: %s\n", i)
		}
	# If one or more needed fields were not found, give up:
	if(missing)
		exit 1
}
NR > 1 {# Test #1:
	#for(i in nfh) printf("NR=%d: f[\"%s\",%d]=\"%s\"\n",
	#    NR, i, f[i], $f[i])
	if(index($f["CLINSIG"], "|"))
		$f["Classification"] = "Conflicting"
	else {	#Test #2:
		if($f["CLINSIG"] != "" && $f["CLINSIG"] != "." &&
		    $f["CLINSIG"] != "VUS")
		    	$f["Classification"] = $f["CLINSIG"]
		else	# Tests 3, 4, & 5:
			if($f["CLINSIG"] == "VUS" && ( \
			    ($f["Func.IDP.refGene"] == "UTR") || \
			    ($f["PopFreqMax"] > .01) || \
			    ($f["Func.IDP.refGene"] ~ /^spl?icing$/) \
			))	$f["Classification"] = "Likely Benign"
			else	# Test #6:
				$f["Classification"] = "VUS"
	}
	#printf(" out: f[\"%s\",%d]=\"%s\"\n", "Classification",
	#    f["Classification"], $f["Classification"])
}
1' file > final

with your sample input file (with each sequence of four <space> characters changed to a <tab> character) produces the output:

Code:

R_Index	Chr	Start	End	Ref	Alt	Func.IDP.refGene	GeneDetail.IDP.refGene	AAChange.IDP.refGene	PopFreqMax	CLINSIG	CLNDBN	Classification	Quality
1	chr1	40562993	40562993	T	C	UTR5	NM_000310.3:c.-83A>G	.	0.9	.	.	VUS	15
2	chr5	125887685	125887685	C	T	splicing	NM_001201377.1:exon14:c.1233+28G>A	.	0.82	.	.	VUS	10
3	chr16	2105400	2105400	C	T	splicing	NM_000548.4:exon6:c.482-3C>T	.	0.21	not provided|not provided|not provided|not provided|other|Benign	TSC	Conflicting	25
4	chr16	2110805	2110805	G	A	exonic	.	TSC2:NM_000548.4:exon11:c.1110G>A:p.Q370Q	.004	Pathogenic	TSC	Pathogenic	40

or, with the debugging statements uncommented:

Code:

R_Index	Chr	Start	End	Ref	Alt	Func.IDP.refGene	GeneDetail.IDP.refGene	AAChange.IDP.refGene	PopFreqMax	CLINSIG	CLNDBN	Classification	Quality
NR=2: f["Classification",13]="."
NR=2: f["PopFreqMax",10]="0.9"
NR=2: f["Func.IDP.refGene",7]="UTR5"
NR=2: f["CLINSIG",11]="."
 out: f["Classification",13]="VUS"
1	chr1	40562993	40562993	T	C	UTR5	NM_000310.3:c.-83A>G	.	0.9	.	.	VUS	15
NR=3: f["Classification",13]="."
NR=3: f["PopFreqMax",10]="0.82"
NR=3: f["Func.IDP.refGene",7]="splicing"
NR=3: f["CLINSIG",11]="."
 out: f["Classification",13]="VUS"
2	chr5	125887685	125887685	C	T	splicing	NM_001201377.1:exon14:c.1233+28G>A	.	0.82	.	.	VUS	10
NR=4: f["Classification",13]="."
NR=4: f["PopFreqMax",10]="0.21"
NR=4: f["Func.IDP.refGene",7]="splicing"
NR=4: f["CLINSIG",11]="not provided|not provided|not provided|not provided|other|Benign"
 out: f["Classification",13]="Conflicting"
3	chr16	2105400	2105400	C	T	splicing	NM_000548.4:exon6:c.482-3C>T	.	0.21	not provided|not provided|not provided|not provided|other|Benign	TSC	Conflicting	25
NR=5: f["Classification",13]="."
NR=5: f["PopFreqMax",10]=".004"
NR=5: f["Func.IDP.refGene",7]="exonic"
NR=5: f["CLINSIG",11]="Pathogenic"
 out: f["Classification",13]="Pathogenic"
4	chr16	2110805	2110805	G	A	exonic	.	TSC2:NM_000548.4:exon11:c.1110G>A:p.Q370Q	.004	Pathogenic	TSC	Pathogenic	40

Note that I had a typo in Test 1 in post #6:

If the value of the CLINSIG field contains a | character, set the Classification to Conflicted.

should have been:

If the value of the CLINSIG field contains a | character, set the Classification to Conflicting.

The code above implements this corrected Test 1.

Note that the output for input lines 2 and 3 has the Classification field set to VUS because none of Tests 1 through 5 are met by the data in those lines. Your desired output requested that that field be set to Likely Benign for both of those lines, but I don't see how either of those lines meets your Conditions 2 through 5 (my Tests 3 through 5), which are the tests that would set the Classification field to Likely Benign.
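As a rough check, the sketch below evaluates the individual tests against the three relevant values from input line 2 (UTR5, 0.9, and .). The three-column layout is a cut-down, hypothetical stand-in for the real file, and the variable names are only for illustration:

```shell
# Hypothetical 3-column layout: Func.IDP.refGene, PopFreqMax, CLINSIG.
checks=$(printf 'UTR5\t0.9\t.\n' | awk -F'\t' '{
	has_pipe = (index($3, "|") > 0)   # Test 1: CLINSIG contains "|"
	is_vus   = ($3 == "VUS")          # required by Tests 3, 4, and 5
	is_utr   = ($1 == "UTR")          # exact match, so "UTR5" does not qualify
	freq_hi  = ($2 > .01)             # PopFreqMax above the threshold
	print has_pipe, is_vus, is_utr, freq_hi
}')
echo "$checks"
```

Only the frequency comparison is true, and it is never reached because CLINSIG is . rather than VUS; that is why the line falls through to Test 6.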

Note also that my code processes the header line to locate the names of the fields that are used in the tests and works using field names instead of trying to decipher what is supposed to be in $(NF - whatever).
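Stripped down to its core, the header lookup might look like the following sketch. The two-line input is invented for the example, and unlike the full script it maps every header name rather than only the needed ones:

```shell
# Made-up two-line file: one header line and one data line.
value=$(printf 'Chr\tCLINSIG\tQuality\nchr1\tBenign\t15\n' | awk -F'\t' '
# Header line: remember which field number each header name has.
NR == 1 { for (i = 1; i <= NF; i++) f[$i] = i; next }
# Data lines: f["CLINSIG"] is the field number, $f["CLINSIG"] its value.
{ print f["CLINSIG"], $f["CLINSIG"] }')
echo "$value"
```

Because the field number is looked up by name, the tests keep working if columns are added or reordered, as long as the header names stay the same.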

Don Cragun


01-26-2017


Thank you very much for your help

cmccabe
