Help getting a code in awk - Want to know how much of the data is covered by entries


 
# 1  
Old 08-14-2019

Here is my data structure.
Code:
# id1    id2    len   start    end
# 9     16792   5475   4181     4232
# 11    16792   2317   1086     1137
# 11    32879   2317      8       60
# 11    32858   2317     10       52
# 11    30670   2317     17       63
# 14    12645    532      3       67
# 14    12645    532    158      222
# 14    11879    532      3      223
# 18    23847    644     64      285
# 18    30160    644     98      285
# 18    30160    644    345      477
# 18    30160    644    516      644

I want to get the coverage of each id1 relative to its length (column len), taking the start and end values of all its entries into account. The problem is that several entries can have overlapping ranges, so simply adding up every range would overestimate the coverage. Taking only the smallest start value and the biggest end value does not work either, because there can be gaps between ranges where part of the length is not covered.
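
To make the two pitfalls concrete, here is a small sketch using only the four start/end pairs for id1 18 from the data above. The function name `naive_estimates` is made up for this illustration, and range length is taken as end minus start, which is an assumption about how positions are counted.

```shell
# Hypothetical helper: reads "start end" pairs on stdin and prints the two
# naive estimates described above (neither one is the real coverage).
naive_estimates() {
    awk '
    {
        sum += $2 - $1                      # naive: add up every range length
        if (NR == 1 || $1 < min) min = $1   # smallest start seen so far
        if (NR == 1 || $2 > max) max = $2   # biggest end seen so far
    }
    END {
        printf "sum of ranges: %d\n", sum
        printf "min to max: %d\n", max - min
    }'
}

# The start and end columns of the id1 18 rows.
naive_estimates <<'EOF'
64 285
98 285
345 477
516 644
EOF
```

The sum of ranges comes out as 668, which is more than the length of 644 because 98-285 lies inside 64-285. The min-to-max span comes out as 580, which includes the gaps 285-345 and 477-516. Neither matches the 481 wanted for id1 18.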

My expected result should be something like this
Code:
 9 --- 50 / 5475  = 0.009
11 --- 106 / 2317 = 0.046
14 --- 220 / 532  = 0.414
18 --- 481 / 644  = 0.75


Last edited by Scrutinizer; 08-14-2019 at 02:02 PM.. Reason: code tags
# 2  
Old 08-14-2019
If you don't want the smallest range, and don't want the biggest range, then what do you want? The average?
# 3  
Old 08-14-2019
Code:
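awk '
# Fields: $1 is the leading "#", $2=id1, $3=id2, $4=len, $5=start, $6=end.
NR > 1 {                                  # skip the header line
   # First time this id1 is seen: remember its input order and its length.
   if (!id1[$2]++) {ids[idc++]=$2; len[$2]=$4;}
   # Count each position from start up to (but not including) end once per id1.
   for (i=$5; i<$6; i++) if (!value[$2,i]++) coverage[$2]++;
}
END {
   # Print the ids in input order as: covered / len = fraction.
   for (i=0; i<idc; i++)
      printf "%d --- %d / %d = %.3f\n", ids[i],
             coverage[ids[i]], len[ids[i]],
             (coverage[ids[i]] / len[ids[i]]);
}
' data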

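Note: id1 9 has only one range (4181 to 4232), so there is nothing to merge for it. The script counts that range as 51 positions (end minus start), not the 50 shown in your expected output, so check that value.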
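
For long ranges, counting every position can use a lot of memory. A possible alternative, offered only as a sketch and not as part of the script above, is to sort the rows by id1 and start, then merge overlapping ranges and add up the merged lengths. It assumes the same file layout as the sample (a leading `#` column and one header line) and the same end-minus-start counting. The function name `coverage_merge` is invented for this example.

```shell
# Hypothetical helper: prints coverage per id1 by merging sorted ranges.
# Usage: coverage_merge data
coverage_merge() {
    # Drop the header, then sort numerically by id1 (field 2) and start (field 5).
    tail -n +2 "$1" | sort -k2,2n -k5,5n | awk '
    # Close the current merged range and print the line for the current id1.
    function emit() {
        total += hi - lo
        printf "%d --- %d / %d = %.3f\n", id, total, len, total / len
    }
    !seen || $2 != id {              # first row, or a new id1 starts
        if (seen) emit()
        seen = 1; id = $2; len = $4
        total = 0; lo = $5 + 0; hi = $6 + 0
        next
    }
    $5 + 0 > hi {                    # gap: bank the merged range, start a new one
        total += hi - lo
        lo = $5 + 0; hi = $6 + 0
        next
    }
    $6 + 0 > hi { hi = $6 + 0 }      # overlap or touching: extend the range
    END { if (seen) emit() }'
}
```

Because of the sort, the output is ordered by id1 rather than by first appearance in the file. For the sample data the two orders are the same.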

Last edited by rdrtx1; 08-14-2019 at 08:54 PM.. Reason: efficiency++
# 4  
Old 08-14-2019
Thanks for the help rdrtx1.
It worked great.