awk to update file with numerical difference if condition is met


 
# 8  
Old 04-01-2017
Hi cmccabe, you're welcome. In the meantime I improved the script somewhat and put comments everywhere.

--
As you have noticed, I put 50 in variable m but still used the literal 50 everywhere else. Those could all have been m, so that the script can be reused for cases where a different value is needed. I'll leave it up to you to try that out and correct the code so that m is used everywhere instead of 50.
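As a rough illustration of that exercise (this is not the script from this thread, just a sketch), the threshold can be passed in once with -v and then referenced by name. The function name classify and the two-column input (a position and a range boundary) are made up for this example:

```shell
# Hypothetical helper: column 1 is a position, column 2 is a range boundary.
# The threshold is the first argument and defaults to 50.
classify() {
    awk -v m="${1:-50}" '{
        d = $1 - $2
        if (d < 0) d = -d                 # absolute difference
        out = (d > m) ? ">" m : d         # use m, never a hard-coded 50
        print out
    }'
}

printf '948846 948846\n949925 949919\n100 500\n' | classify 50
```

With the threshold in one place, calling classify 10 instead of classify 50 changes both the comparison and the ">10" label without touching the awk code.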

You may also want to change the quick-and-dirty one-letter variable names to more mnemonic names once the script is to your liking, and add error handling, for example for the case where no corresponding value for $8 is found in file 2...

Last edited by Scrutinizer; 04-01-2017 at 11:40 PM..
# 9  
Old 04-02-2017
Here is a slightly different approach to handling your problem...

Your specification talked about left and right values in the ranges separated by a minus sign in file2; my code treats them as low and high values (since the left value always seems to be the low end of a range and the right value always seems to be the high end of a range in your sample data). Instead of using the awk split() function to split out the low and high values, my code uses <space> and <hyphen> as the characters in the field separator used when reading file2, so the low and high values in each range are already split apart when awk reads lines from file2.
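To see what that field separator does on its own, here is a small sketch. The sample line is invented (a count, a key, an ignored field, then low-high pairs) and split_ranges is a hypothetical name; only the FS value and the field arithmetic mirror the approach described above:

```shell
# Hypothetical file2 line layout: <count> <key> <ignored> <low>-<high> ...
# With FS set to '[- ]', each low-high pair arrives as two separate fields.
split_ranges() {
    awk -F'[- ]' '{
        for (i = 1; i <= $1; i++)
            print $2, $(2 + 2 * i), $(3 + 2 * i)   # key, low, high
    }'
}

printf '2 GENE1 x 100-200 300-450\n' | split_ranges
```

One thing to keep in mind with this approach: a key that itself contains a hyphen would also be split, which would shift all the following field numbers.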

Your specification said that if field 4 was closer to the low end of a range, the difference between the values should be printed with a leading minus sign. My code does that. Scrutinizer's code does that as long as $4 is not the low end of a range (in which case it prints 0 instead of -0). Your sample desired output also omitted this minus sign. Without a leading sign, you can't tell if the value in field 4 matched the low end of a range or matched the high end of a range.

Your specification said that if field 4 was closer to the high end of a range, the difference should be printed with a leading plus sign. My code does that. Scrutinizer's code never prints a leading +.

Your specification doesn't say what should happen if the value given is exactly halfway between the low and high values. My code treats this case as if the midpoint is closer to the low end of the range. If this isn't acceptable, you can modify the code to do whatever you want in this case.
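The sign and midpoint rules from the last few paragraphs can be tried in isolation with something like the following sketch. The function name signed_diff and its three arguments are made up for illustration; it only shows the tie-breaking choice, not the full script:

```shell
# Hypothetical helper, usage: signed_diff VALUE LOW HIGH
# Prints the distance to the nearest end of the range, with "-" for the
# low end and "+" for the high end.  An exact midpoint counts as low.
signed_diff() {
    awk -v v="$1" -v lo="$2" -v hi="$3" 'BEGIN {
        mid = (lo + hi) / 2
        if (v > mid) { sign = "+"; edge = hi }
        else         { sign = "-"; edge = lo }
        d = (edge > v) ? edge - v : v - edge
        print sign d
    }'
}

signed_diff 150 100 200    # midpoint of the range: treated as the low side
```

Changing v > mid to v >= mid would send the exact midpoint to the high side instead.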

Your description said that if field 9 is not a period or field 12 is not a period, the input line should be unchanged in the output. My code and Scrutinizer's code both do that. But that behavior doesn't match the output that you said should be printed for the 3rd line in file1. The input line is:
Code:
2	chr1	948870	948870	C	G	UTR5	ISG15	NM_005101.3:c.-84C>G	.	.

but you say the desired output for that line should be:
Code:
2	chr1	948870	948870	C	G	UTR5	ISG15	NM_005101.3:c.-84C>G	.	.   .

which has four spaces and a period added to the end of the line. I assume this was a typo in the desired sample output you posted.

The code I came up with is:
Code:
awk -v extra=50 -v OFS='\t' '
NR == FNR {
	# Process 1st input file (file2).  Note that we have set FS to split
	# fields on <space> and <hyphen> so the ranges have been split into
	# pairs of fields before we get here.
	#	Field 1 tells us how many ranges there are.
	#	Field 2 is the key field to match against field 8 in file1.
	#	Field 3 is ignored.
	#	Even fields 4 through NF-1 are low ends of ranges.
	#	Odd fields 5 through NF are high ends of ranges.
	# Gather the count of ranges and the low and high ends of the ranges
	# for each key.
	count[$2] = $1
	for(i = 1; i <= $1; i++) {
		low[$2, i] = $(2 + 2 * i)
		high[$2, i] = $(3 + 2 * i)
		# Also calculate the midpoint of each range.
		mid[$2, i] = (low[$2, i] + high[$2, i]) / 2
	}
	next	# Skip to the next input record.
}
FNR != 1 && $9 == "." && $12 == "." && $8 in count {
	# Look for a range for the key in this record that contains field 4.
	# Note that the range is extended on both ends by the value specified
	# in extra.
	for(i = 1; i <= count[$8]; i++)
		if($4 >= (low[$8, i] - extra) && $4 <= (high[$8, i] + extra)) {
			# A matching range was found; update field 9...
			if($4 > mid[$8, i]) {
				# The value in this record is closer to the high
				# end of the range...
				sign = "+"
				value = high[$8, i]
			} else {
				# The value in this record is closer to the low
				# end of the range (or exactly at the midpoint)...
				sign = "-"
				value = low[$8, i]
			}
			# Calculate the absolute value of the difference between
			# field 4 and the closest end of the range.  If the
			# value is closer to the low end of the range and the
			# difference is less than or equal to extra, print it
			# with a leading minus sign; if the value is closer to
			# the high end of the range and the difference is less
			# than or equal to extra, print it with a leading plus
			# sign; otherwise (if the difference is greater than
			# extra), print ">" followed by extra (">50" here).
			diff = (value > $4) ? value - $4 : $4 - value
			$9 = (diff > extra) ? ">" extra : (sign diff)
			break
		}
	if(i > count[$8]) {
		# No matching range was found.
		$9 = ">" extra	# Or do something else to indicate this case???
	}
}
1	# Print the possibly updated record.
' FS='[- ]' file2 FS='\t' file1

In cases where a line contains periods in fields 9 and 12 but its field 8 value is not found in file2, this code leaves the line unchanged without issuing any warning that the field 8 value was not found. If the field 8 value is found but field 4 is not in any of the specified ranges, this code changes field 9 to >50. The spot where this is done is commented so you can change it to some other string or issue a diagnostic if this is not what you want to happen in this case.
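If a diagnostic is wanted when a field 8 value has no entry in file2, one possibility is sketched below. The function name warn_missing_keys and the wording of the message are made up; the sketch only checks for missing keys and passes file1 through untouched, so it would need to be merged into the real script rather than used as a replacement:

```shell
# Hypothetical helper, usage: warn_missing_keys FILE2 FILE1
# Copies FILE1 to standard output unchanged and writes one warning to
# standard error for each data line whose field 8 is not a key in FILE2.
warn_missing_keys() {
    awk '
    NR == FNR { seen[$2]; next }     # remember every key found in file2
    FNR != 1 && !($8 in seen) {      # skip the header line of file1
        printf "warning: line %d: key %s not in file2\n", FNR, $8 | "cat 1>&2"
    }
    1                                # pass every file1 line through
    ' FS='[- ]' "$1" FS='\t' "$2"
}
```

The warnings are piped through cat 1>&2 instead of being written to /dev/stderr because that special file name is not guaranteed to be understood by every awk. As with any NR == FNR script, an empty file2 would make awk treat file1 as the first file.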

As always, if someone wants to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
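For a wrapper script that has to run on both kinds of systems, the choice of awk could be made at run time along these lines. This is only a sketch; pick_awk and the AWK variable are names invented for the example:

```shell
# Hypothetical helper: prefer the XPG4 awk where it exists (Solaris/SunOS),
# then nawk, and fall back to whatever "awk" is on the PATH.
pick_awk() {
    if [ -x /usr/xpg4/bin/awk ]; then
        echo /usr/xpg4/bin/awk
    elif command -v nawk >/dev/null 2>&1; then
        echo nawk
    else
        echo awk
    fi
}

AWK=$(pick_awk)
"$AWK" 'BEGIN { print "awk selected" }'
```

The script above would then be started with "$AWK" in place of the literal awk.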

With your sample input, the above code produces the output:
Code:
R_Index	Chr	Start	End	Ref	Alt	Func.IDP.refGene	Gene.IDP.refGene	GeneDetail.IDP.refGene	Inheritence	ExonicFunc.IDP.refGene	AAChange.IDP.refGene
1	chr1	948846	948846	-	A	upstream	ISG15	-0	.
2	chr1	948870	948870	C	G	UTR5	ISG15	NM_005101.3:c.-84C>G	.	.
3	chr1	949608	949608	G	A	exonic	ISG15	.	.	nonsynonymous SNV	ISG15:NM_005101.3:exon2:c.248G>A:p.S83N
4	chr1	949925	949925	C	T	downstream	ISG15	+6	.
5	chr1	207646923	207646923	G	A	intronic	CR2	>50	.	.	.
6	chr2	3653844	3653844	T	C	intronic	COLEC11	>50	.
7	chr1	154562623	154562625	CCG	-	intronic	ADAR	>50	.	.	.
8	chr1	948840	948840	-	C	upstream	ISG15	-6	.

The differences between what this code produces and what Scrutinizer's code produces are the leading - in the 2nd output line (-0) and the leading + in the 5th output line (+6); these were marked in red in the original post. Scrutinizer's code also copies the contents of file2 to the output; the code above does not.

I hope this helps.

Last edited by Don Cragun; 04-02-2017 at 10:18 AM.. Reason: Correct the comments describing the modifications to field 9.
# 10  
Old 04-03-2017
Thank you both very much for the help and explanations... I have a lot to learn to fully understand, but these help a lot :).