awk to update file based on 5 conditions


 
# 8  
Old 01-23-2017
As I said before, there is no such thing as a field not having a value. If there are no characters in a field, the value of that field is an empty string. And, yes, test 2 explicitly fails if the CLINSIG field is an empty string. And since an empty string does not contain | and an empty string is not the same as the string VUS, tests 1, 3, 4, and 5 also fail, leaving test 6 to set the Classification field to VUS.
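
To see this for yourself, here is a minimal sketch (assuming a POSIX shell and any POSIX awk; the three-field sample line is made up for illustration and is not taken from your file). The middle field contains no characters, and awk reports it as an empty string:

```shell
# One line with three tab-separated fields; the middle field has no characters.
result=$(printf 'a\t\tc\n' | awk -F'\t' '{
	# The empty field still counts toward NF, equals "", and has length 0.
	printf("NF=%d empty=%d length=%d\n", NF, ($2 == ""), length($2))
}')
echo "$result"
```

This should print NF=3 empty=1 length=0, which is why a test such as $f["CLINSIG"] != "" is the way to ask whether a field is "missing".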

Now that you have a clearer specification for your script, can you write the awk script to perform those tests and produce your desired output? Try to write it and let us see what you come up with. If you get stuck, we'll try to help you fix your problems.
This User Gave Thanks to Don Cragun For This Post:
# 9  
Old 01-23-2017
I will attempt to write an awk script that meets these requirements. It may be similar in structure to the one in post #1, but that is only because I am not yet proficient enough; only the final classified file is needed. Thank you :).

---------- Post updated at 12:53 PM ---------- Previous update was at 05:07 AM ----------

Below is my awk for the first 3 conditions as well as the sixth. I know that it needs work, but hopefully I got the basic concepts. Thank you :).

Code:
awk -F'\t' -v OFS='\t' 'NR>1{ if ($(NF-3 == "|"));$(NF-1)="Conflicted"} 1' file > conflicted # condition 1
awk -F'\t' -v OFS='\t' 'NR>1{ if ($(NF-3 ~ /^(.|VUS)$/ );$(NF-1)=($(NF-3))} 1' conflicted > clinsig # condition 2
awk -F'\t' -v OFS='\t' 'NR>1{ if ($(NF-3 == "VUS"));$(NF-7)="UTR";$(NF-1)="Likely Benign"} 1' clinsig > utr # condition 3
awk -F'\t' -v OFS='\t' 'NR>1{$(NF-1)="VUS"} 1' #condition 6

# 10  
Old 01-24-2017
You still seem to be having trouble with awk syntax...
Code:
if ($(NF-3 == "|"))

does not look for a pipe symbol somewhere in the 3rd from the last field; it looks for a non-empty, non-zero value in field 0 (which with your sample input is ALWAYS true). This is because NF-3 is always a number and is never the character |, so the comparison in parentheses is always false (i.e., 0), which turns the expression into $0. Since $0 is not an empty string and is not a string that just contains (or evaluates to) the value 0, the expression evaluates to true. Note also that the ; immediately after the closing parenthesis ends the if statement, so the assignments that follow it are performed on every line no matter what the test yields. I used:
Code:
if(index($f["CLINSIG"], "|"))

for this test, but you could also just use:
Code:
if($f["CLINSIG"] ~ /[|]/)
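
The difference between these forms can be checked with a small sketch (assuming a POSIX shell and awk; the two five-field sample lines are invented for illustration, so NF-3 is field 2 here). It compares the misplaced-parenthesis form with index() and with a regular expression. In the regular expression the pipe is written inside a bracket expression, since a bare | is the alternation operator and does not reliably match a literal pipe:

```shell
# Two sample lines: the first has a pipe in field NF-3, the second does not.
out=$(printf 'x\tA|B\t.\ty\tz\nx\tVUS\t.\ty\tz\n' | awk -F'\t' '{
	# (NF-3 == "|") is a comparison that yields 0, so this is just $0.
	whole = ($(NF-3 == "|") == $0)
	# With the parentheses around NF-3 only, the field itself is examined.
	hit = (index($(NF-3), "|") > 0)
	# Inside a bracket expression the pipe is an ordinary character.
	re = ($(NF-3) ~ /[|]/)
	printf("whole=%d index=%d regex=%d\n", whole, hit, re)
}')
echo "$out"
```

Both lines should report whole=1, showing that the original expression always refers to the entire line, while index and regex are 1 only for the line that actually contains a pipe.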

All of your other if statement expressions have similar problems.

Something more like:
Code:
awk '
BEGIN {	# Set input and output field separators:
	FS = OFS = "\t"
	# Create list of needed field headers:
	nfh["Classification"]
	nfh["CLINSIG"]
	nfh["PopFreqMax"]
	nfh["Func.IDP.refGene"]
}
NR == 1 {
	# Create array to translate needed field headers to field numbers:
	for(i = 1; i <= NF; i++)
		if($i in nfh)
			f[$i] = i
	# Verify that all of the needed field headers were found:
	for(i in nfh)
		if(!(i in f)) {
			missing++
			printf("Needed field missing: %s\n", i)
		}
	# If one or more needed fields were not found, give up:
	if(missing)
		exit 1
}
NR > 1 {# Test #1:
	#for(i in nfh) printf("NR=%d: f[\"%s\",%d]=\"%s\"\n",
	#    NR, i, f[i], $f[i])
	if(index($f["CLINSIG"], "|"))
		$f["Classification"] = "Conflicting"
	else {	#Test #2:
		if($f["CLINSIG"] != "" && $f["CLINSIG"] != "." &&
		    $f["CLINSIG"] != "VUS")
		    	$f["Classification"] = $f["CLINSIG"]
		else	# Tests 3, 4, & 5:
			if($f["CLINSIG"] == "VUS" && ( \
			    ($f["Func.IDP.refGene"] == "UTR") || \
			    ($f["PopFreqMax"] > .01) || \
			    ($f["Func.IDP.refGene"] ~ /^spl?icing$/) \
			))	$f["Classification"] = "Likely Benign"
			else	# Test #6:
				$f["Classification"] = "VUS"
	}
	#printf(" out: f[\"%s\",%d]=\"%s\"\n", "Classification",
	#    f["Classification"], $f["Classification"])
}
1' file > final

with your sample input file (with each sequence of four <space> characters changed to a <tab> character) produces the output:
Code:
R_Index	Chr	Start	End	Ref	Alt	Func.IDP.refGene	GeneDetail.IDP.refGene	AAChange.IDP.refGene	PopFreqMax	CLINSIG	CLNDBN	Classification	Quality
1	chr1	40562993	40562993	T	C	UTR5	NM_000310.3:c.-83A>G	.	0.9	.	.	VUS	15
2	chr5	125887685	125887685	C	T	splicing	NM_001201377.1:exon14:c.1233+28G>A	.	0.82	.	.	VUS	10
3	chr16	2105400	2105400	C	T	splicing	NM_000548.4:exon6:c.482-3C>T	.	0.21	not provided|not provided|not provided|not provided|other|Benign	TSC	Conflicting	25
4	chr16	2110805	2110805	G	A	exonic	.	TSC2:NM_000548.4:exon11:c.1110G>A:p.Q370Q	.004	Pathogenic	TSC	Pathogenic	40

or, with the debugging statements uncommented:
Code:
R_Index	Chr	Start	End	Ref	Alt	Func.IDP.refGene	GeneDetail.IDP.refGene	AAChange.IDP.refGene	PopFreqMax	CLINSIG	CLNDBN	Classification	Quality
NR=2: f["Classification",13]="."
NR=2: f["PopFreqMax",10]="0.9"
NR=2: f["Func.IDP.refGene",7]="UTR5"
NR=2: f["CLINSIG",11]="."
 out: f["Classification",13]="VUS"
1	chr1	40562993	40562993	T	C	UTR5	NM_000310.3:c.-83A>G	.	0.9	.	.	VUS	15
NR=3: f["Classification",13]="."
NR=3: f["PopFreqMax",10]="0.82"
NR=3: f["Func.IDP.refGene",7]="splicing"
NR=3: f["CLINSIG",11]="."
 out: f["Classification",13]="VUS"
2	chr5	125887685	125887685	C	T	splicing	NM_001201377.1:exon14:c.1233+28G>A	.	0.82	.	.	VUS	10
NR=4: f["Classification",13]="."
NR=4: f["PopFreqMax",10]="0.21"
NR=4: f["Func.IDP.refGene",7]="splicing"
NR=4: f["CLINSIG",11]="not provided|not provided|not provided|not provided|other|Benign"
 out: f["Classification",13]="Conflicting"
3	chr16	2105400	2105400	C	T	splicing	NM_000548.4:exon6:c.482-3C>T	.	0.21	not provided|not provided|not provided|not provided|other|Benign	TSC	Conflicting	25
NR=5: f["Classification",13]="."
NR=5: f["PopFreqMax",10]=".004"
NR=5: f["Func.IDP.refGene",7]="exonic"
NR=5: f["CLINSIG",11]="Pathogenic"
 out: f["Classification",13]="Pathogenic"
4	chr16	2110805	2110805	G	A	exonic	.	TSC2:NM_000548.4:exon11:c.1110G>A:p.Q370Q	.004	Pathogenic	TSC	Pathogenic	40

Note that I had a typo in Test 1 in post #6:
  1. If the value of the CLINSIG field contains a | character, set the Classification to Conflicted.
should have been:
  1. If the value of the CLINSIG field contains a | character, set the Classification to Conflicting.
The code above implements this corrected Test 1.

Note that the output for input lines 2 and 3 has the Classification field set to VUS because none of Tests 1 through 5 are met by the data in those lines. Your desired output requested that that field be set to Likely Benign for both of those lines, but I don't see how either of those lines meet your Conditions 2 through 5 (my Tests 3 through 5) which are the tests that would set the Classification field to Likely Benign.

Note also that my code processes the header line to locate the names of the fields that are used in the tests and works using field names instead of trying to decipher what is supposed to be in $(NF - whatever).
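
The header lookup can be tried in isolation. This is only a stripped-down sketch of the idea (assuming a POSIX shell and awk, with a made-up three-column input and without the check for missing headers that the full script performs):

```shell
# A header line followed by one data line, both tab-separated.
out=$(printf 'Chr\tCLINSIG\tClassification\nchr1\tVUS\t.\n' | awk '
BEGIN { FS = OFS = "\t" }
NR == 1 {
	# Remember the column number of every header name.
	for(i = 1; i <= NF; i++)
		f[$i] = i
	next
}
{
	# Address the fields by name instead of by position.
	print f["CLINSIG"], $f["CLINSIG"], $f["Classification"]
}')
echo "$out"
```

This prints the column number found for CLINSIG followed by the CLINSIG and Classification values of the data line, so the tests keep working even if columns are added or reordered.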
# 11  
Old 01-26-2017
Thank you very much for your help :)