awk to update file with numerical difference if condition is met
In the file1 below if $9 and $12 are . (dot) then the value in $8 of file1 is used as a key (exact match) to lookup in each $2 of file2, when a match is found then the value of $4
in file1 is used to look for a range match within +/- 50 using the values in $4 and after in file2. The number of fields can be variable but will always start at $4.
For example, ISG15 has 2 fields in it with coordinates starting at $4 and ending at $5. CR2 has 19 coordinates in it starting at $4 ending at $22. The value in $1 of file2
tells you how many coordinates there are but the start or first will always be in $4.
There will only be one range match but if the number is closer to the first value before the - (hyphen) in it the $9 of file1 is updated from a . to the numerical difference between the two numbers with a - (minus) in front. If the number is closer to the second value after the - (hyphen) in it the $9 of file1 is updated from a . (dot) to the numerical difference between the two numbers with a + (plus) in front. However is the calculated difference is greater than 50, then >50 is printed in $9 of file1.
If $9 or $12 of file1 have a value other then . (dot) in them then that line is skipped (nothing happens) and the next line is processed. In file1 lines 2 and 3 are skipped. The awk below will identify these lines and print them, but I am not sure how to do the rest and need some expert help. Thank you .
file1 tab-delimited
Code:
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene Inheritence ExonicFunc.IDP.refGene AAChange.IDP.refGene
1 chr1 948846 948846 - A upstream ISG15 . . . .
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101.3:c.-84C>G . .
3 chr1 949608 949608 G A exonic ISG15 . . nonsynonymous SNV ISG15:NM_005101.3:exon2:c.248G>A:p.S83N
4 chr1 949925 949925 C T downstream ISG15 . . . .
5 chr1 207646923 207646923 G A intronic CR2 . . . .
6 chr2 3653844 3653844 T C intronic COLEC11 . . . .
7 chr1 154562623 154562625 CCG - intronic ADAR . . . .
8 chr1 948840 948840 - C upstream ISG15 . . . .
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene Inheritence ExonicFunc.IDP.refGene AAChange.IDP.refGene
1 chr1 948846 948846 - A upstream ISG15 0 . . .
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101.3:c.-84C>G . . .
3 chr1 949608 949608 G A exonic ISG15 . . nonsynonymous SNV ISG15:NM_005101.3:exon2:c.248G>A:p.S83N
4 chr1 949925 949925 C T downstream ISG15 +6 . . .
5 chr1 207646923 207646923 G A intronic CR2 >50 . . .
6 chr2 3653844 3653844 T C intronic COLEC11 >50 . . .
7 chr1 154562623 154562625 CCG - intronic ADAR >50 . . .
8 chr1 948840 948840 - C upstream ISG15 -6 . . .
Description of updated file1
Code:
line1:file1 $9 updated to 0 because ISG15 is matched to line 1, $2 of file2 and the value in $4 of file1, 948846 is a exact match to the first cordinate in $4 before the -
line2:not updated, skipped because $9 or $12 in file1 have a value other then . in them
line3:not updated, skipped because $9 or $12 in file1 have a value other then . in them
line4:file1 $9 updated to +6 because ISG15 is matched to line 1, $2 of file2 and the value in $4 of file1, 949925 is a range match to the second coordinate in $5 after the -
line5:file1 $9 updated to >50 because CR2 is matched to line 2, $2 of file2 and the value in $4 of file1, 207646923 is a range match to the first coordinate in $14 before the - but the difference of 222 is > 50
line6:file1 $9 updated to >50 because COLEC11 is matched to line 3, $2 of file2 and the value in $4 of file1, 3653844 is a range match to the second coordinate in $2 after the - but the difference of 1784 is > 50
line7:file1 $9 updated to >50 because ADAR is matched to line 4, $2 of file2 and the value in $4 of file1, 154562625 is a range match to the second coordinate in $12 after the - but the difference of112 is > 50
line8: file1 $9 updated to -6 because ISG15 is matched to line 1, $2 of file2 and the value in $4 of file1, 948840 is a range match to the first coordinate in $4 before the -
There doesn't seem to be anything required in this thread that hasn't been done for you in one or more of the other 340 threads you have submitted in this forum.
Please put a little more effort into showing us that you have learned something from the suggestions we have provided during your membership here of more than two years.
This User Gave Thanks to Don Cragun For This Post:
I apologize, but have not done anything like this before. I do always check previous threads and look to try something but thos seems different. I will read through them again, but this seemed more complex than others, Thank you .
You have seen examples that showed you how to read one file and gather information from that file to be used while processing a second file. You have seen examples that showed you how to split a field into subfields based on a subfield delimiter (in this case the minus sign). You have seen examples that showed you how to create arrays from fields in a line (or two arrays with one array containing data and a related array containing counts of elements in the first array). With more than two years of learning from the examples we have provided in 340 threads, we would like to think that you could put those pieces together to solve this problem on your own (or at least get close to it).
Please give it a try and, if you get stuck, show us what you have accomplished and explain what you can't get to work. We want to help you learn how to write your own awk scripts; not to act as your unpaid programming staff.
These 2 Users Gave Thanks to Don Cragun For This Post:
I have learned quite a lot in the past couple of years. I guess that am just not too confident, but will try. In writing the detail of what I was trying to do, it seemed very complex. I will try to come up with a workable solution that will hopefully be a start. Thank you and others for all your help in making this scientist better able to handle large data sets and complex issues. I really appreciate it .
---------- Post updated at 08:25 PM ---------- Previous update was at 04:41 AM ----------
The below awk is what I was able to come up with. I included comments on each line as well. The section in bold is where each condition is checked (that is where the distance is calculated)... I am also not sure if the split is done correctly and accounts for the possibility the coordinates are in multiple fields. I tried to follow the description to get the desired output. Thank you very much.
Code:
awk ' # Run awk to process the following awk script...
BEGIN { OFS = "\t" # Set output field separator.
}
FNR == NR { # For each line in the 1st input file (file2)...
A[$1] = $2 # Read each 2nd field in file2 into array A
next # process next line in file2
}
{
for ( k in A ) # Read each line in file1 into Array k
{
if ( $9=="." && $12=="." == A[k] } # only store file1 lines where field 9 and 12 are a . in them and match them to Array A store the results in updated Array k
F[k] = $2 # match Array k from file1 with 2nd field of file2 and store in Array k
}
split($4, a, /[-]/) # Split the 4th field of file1 into b[] with hyphen and as subfield separator
s[++c] = a[1] # Save the symbol found at the start of this line.
# c is the current line number in the 1st file.
m[c] = b[2] # Save the minimum value found on this line. (left of hyphen)
M[c] = b[3] # Save the maximum value found on this line. (right of hyphen)
t[c] = $4 # Save the tag found on this line.
next # Skip the remaining steps in this script.
}
}
---- if then checks here?
END {
for ( k in F ) # read each K in variable F to update
print k, F[k] # update each k value
break # break out of this for loop.
}
}1 # Print the updated
' file2 file1 # End the awk script and name the input files to be processed.
These 2 Users Gave Thanks to cmccabe For This Post:
I think this case may be a bit more complex than earlier cases as it involves multi-dimensional arrays and looping over them. Try this and see if it works for you and compare it to your own code..
Code:
awk -v m=50 '
NR==FNR { # For each line in the 1st input file (file2)...
for(i=4; i<=NF; i++) { # for all ranges in $4 and further
split($i,F,/-/) # split the range in a left field and a right field
L[$2,i-3]=F[1] # Assign them to multi-dimensional associative array L(eft)
R[$2,i-3]=F[2] # And R(ight), which will be needed in the second part
N[$2]=NF-3 # Record the number of ranges in associative array N for index $2
}
}
$9=="." && $12=="." { # For each line in the 2nd input file (file1) if both field 9 and 12 equal "."
d=">50" # set the default difference to >50
for(i=1; i<=N[$8]; i++) { # loop over the number of ranges for key $8
if((d1=$4-L[$8,i])<0) # set the left difference for a range
d1=-d1
if((d2=$4-R[$8,i])<0) # set the right difference for a range
d2=-d2
if(d1<=50 || d2<=50) { # if either of those difference is less than or equal to 50
if(d1<=50 && d2<=50) { # if they are both less than or equal to 50
if(d1<d2) # calculated the smallest difference.
d=-d1 # If it is the left one then the difference is minus that value
else #
d=d2 # If it is the right one then the difference is plus that value
}
else { # if only one of them is less than or equal to 50
if(d1<=50) # if it is the left one
d=-d1
else # if it is the right one
d=d2
}
break # found a match, no need to loop further (assuming there are no overlapping ranges)
}
}
$9=d # set field 9 to the difference that was found
}
1 # print the record
' file2 FS='\t' OFS='\t' file1
Last edited by Scrutinizer; 04-01-2017 at 11:24 PM..
This User Gave Thanks to Scrutinizer For This Post:
Using the original code I put comments by what I think I understand (not gonna lie most I don't, but maybe eventually I will).... thank you very much for all the great help and continued learning .
Code:
awk -v m=50 ' # set maximum distance for difference and store as variable m
NR==FNR { # For each line in the 1st input file (file2)...
for(i=4; i<=NF; i++) { # start at field 4 and store in i as a loop (capture variable fields lenght of each line)
split($i,F,/-/) # split each i on the - (hyphen)
L[$2,i-3]=F[1] # store the left of the - as F[1]
R[$2,i-3]=F[2] # store the right of the - asF[2}
} # end file2 processing
if(NF>n)
n=NF
next
}
$9=="." && $12=="." { # only look at lines where fields 9 and 12 have a . (dot) in file1
d=">50"
for(i=1; i<=n; i++) { # create loop for file1 i variable
if( ($8,i) in L) { # loop hrough field 8 of file1 and store in variable L
if((d1=$4-L[$8,i])<0)
d1=-d1
if((d2=$4-R[$8,i])<0)
d2=-d2
if(d1<=50 || d2<=50) {
if(d1<=50 && d2<=50) {
if(d1<d2)
d=-d1
else
d=d2
}
else {
if(d1<=50)
d=-d1
else
d=d2
}
break # break out of this for loop.
}
}
}
$9=d
}
1
' file2 FS='\t' OFS='\t' file1 # End the awk script and name the input files to be processed with field seperators set as tab.
Last edited by Scrutinizer; 04-02-2017 at 05:08 AM..
I have a file.txt containing the following:
Query= HWI-ST863:386:C5Y8UACXX:3:2302:16454:89688 1:N:0:ACACGAAT
Length=100
Score E
Sequences producing significant alignments: (Bits) Value
... (2 Replies)
In the below I can not seem to add a line that will add Not low if the statement in bold is not true or meet. I guess when the first if statement is true/meet then print low, otherwise print Not low in $(NF + 1). I am not sure how to correctly add this. Thank you :).
if(low <= $2 && $2 <=... (5 Replies)
Hi All,
Seeking for your assistance on how to append the specific string when $3 condion met.
ex. file1.txt
ar0050046b16,5,888,0,0,0,0.00,0.00,0.00,0.00,25689.55
ar0050046b16,5,0,0,0,0,0.00,0.00,0.00,0.00,25689.55
ar0050046b16,5,0,0,0,0,0.00,0.00,0.00,0.00,25689.55
expected output:... (5 Replies)
Hi All,
Seeking for your assistance to get the records once the $2 met the condition.
Ex. file 1.txt
123455,10-Aug-2020 07:33:37 AM,2335235,1323534,12343
123232,11-Aug-2015 08:33:37 PM,4234324,1321432,34364
Output:
123455,10-Aug-2020 07:33:37 AM,2335235,1323534,12343
What i did... (5 Replies)
I want to abort script if input variable matched first field in any line of a file.
#!/bin/sh
read INPUTVAR1
awk "{if(\$INPUTVAR1 == $1) x = 1} END {if(x==1) print \"I want to abort script here\"; else print \"OK\"}" /etc/some.conf
I tried "exit" and system("exit") but no luck. (1 Reply)
i have a table like this:
id, senderNumber, blacklist
-----------------------------
1 0835636326 Y
2 0373562343 Y
3 0273646833 Y
and I want to delete automatically if a new inserted row on another table consist anything on senderNumber column above using a BASH Script
I... (9 Replies)
All,
I wrote the following section of code (which logically in PHP would of worked):
tmpPATH=${1}
tmpTAG=${2}
if
then
while read tmpTAG tmpPATH
do
fi
echo $tmpTAG
echo $tmpPATH
if
then
done < ./config.cfg
fi (4 Replies)
Hello all, I created the below script....and it seemed to be working fine. My problem is i want the script to ignore rest of the things if my condition is not met but do not exit....
#!/bin/ksh
###########################
###########################
# Set name of the listener, this... (2 Replies)
Hi,
I am trying to update the 1st column of a file but only if it contains a char
here is an example of my file
1111aaa 9999 textaaa
22222bbb 9999 textbbb
3333 9999 textccc
444ddd 9999 textddd
i would like the output to remove any characters () from... (5 Replies)
I am having trouble figuring this code
I want to grep a text from a file and if it match certain text it break out of the loop or it should continue searching for the text
Here is what I have written but it isn't working
while true
f=`grep 'END OF STATUS REPORT' filename`
do
if ... (9 Replies)