How to sum value of a column by range defined in another file awk?


 
# 1  
Old 03-01-2019
How to sum value of a column by range defined in another file awk?

I have two files: file1.table is the count table, and file2.range defines the ranges.
Code:
file1.table
chr start end  count
N1 0 48  1
N1 48  181  2
N1 181 193  0
N1 193 326  2
N1 326 457  0
N1 457 471  1
N1 471 590  2
N1 590 604  1
N1 604 752  1
N1 752 875  1

Code:
file2.range
chr start end
N1      0     99
N1    100    199
N1    200    299
N1    300    399
N1    400    499
N1    500    599
N1    600    699
N1    700    799
N1    800    899
N1    900    999

The value in column 4 (the count; I originally wrote column 3) of file1 needs to be summed over the ranges defined in file2 (sliding window). Each row is assigned to a range by its column 2 (start position), like this:
Code:
chr start end  count
N1      0     99    3
N1    100    199    2
N1    200    299    0
N1    300    399    0
N1    400    499    3
N1    500    599    1
N1    600    699    1
N1    700    799    1
N1    800    899    0
N1    900    999    0

There is an overlap problem with some rows, such as
Code:
N1 48  181  2 
N1 181 193  0

but for now I simply ignore it and use only the start position (48 for the first row above).
Thanks a lot!

Last edited by yifangt; 03-04-2019 at 01:39 PM..
# 2  
Old 03-01-2019
What overlapping problem? Using column 3, those do not overlap. Are we intended to count column 2 as well?

Are all of them N1?
# 3  
Old 03-01-2019
Quote:
What overlapping problem? Using column 3, those do not overlap. Are we intended to count column 2 as well?

By "overlapping" I meant overlap with the ranges. For example, the line N1 48 181 2 could overlap with two ranges:
Code:
N1   0    99   ?
N1 100   199   ?

so I just ignore column 3 (181) and assign the row to range N1 0 99.

Quote:
Are all of them N1?

No. N1 means chromosome N1, and there are 50 different names: N1, N19, Scaff01 ... Sorry, I should have provided a better sample with at least two chromosomes.
Code:
file1.table
N1    0    48    0
N1    48    181    2
N1    181    193    0
N1    193    326    2
N1    326    457    0
N1    457    471    1
N1    471    590    2
N1    590    604    1
N1    604    752    0
N1    752    875    1
N2    0    580    0
N2    580    592    1
N2    592    713    2
N2    568    627    1
N2    627    698    2
N2    698    701    3
N2    701    717    2
N2    713    724    1
N2    717    726    3

Code:
file2.range
chr start end 
N1      0     99   
N1    100    199 
N1    200    299
N1    300    399 
N1    400    499 
N1    500    599 
N1    600    699 
N1    700    799 
N1    800    899 
N1    900    999
N2      0     99 
N2    100    199 
N2    200    299 
N2    300    399 
N2    400    499 
N2    500    599 
N2    600    699 
N2    700    799 
N2    800    899 
N2    900    999

And the expected output:
Code:
chr start end  count 
N1      0     99    2
N1    100    199    2 
N1    200    299    0 
N1    300    399    0 
N1    400    499    3 
N1    500    599    1 
N1    600    699    0
N1    700    799    1
N1    800    899    0 
N1    900    999    0
N2      0     99    0 
N2    100    199    0 
N2    200    299    0 
N2    300    399    0 
N2    400    499    0 
N2    500    599    4
N2    600    699    5 
N2    700    799    6 
N2    800    899    0 
N2    900    999    0
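
For what it's worth, here is a minimal awk sketch of the rule as described so far: read the ranges first, then add each count (column 4) to the range of the same chromosome that contains the row's start position (column 2). The file names demo1.table and demo2.range and the helper name sum_by_range are made up for illustration, and the sample is cut down from the data above. It assumes whitespace-separated fields and ranges that do not overlap within a chromosome.

```shell
#!/bin/sh
# Hypothetical demo files, cut down from the samples in this thread.
cat > demo2.range <<'EOF'
chr start end
N1 0 99
N1 100 199
N2 500 599
N2 600 699
EOF

cat > demo1.table <<'EOF'
N1 0 48 0
N1 48 181 2
N1 181 193 0
N1 193 326 2
N2 580 592 1
N2 592 713 2
N2 627 698 2
EOF

# sum_by_range RANGEFILE TABLEFILE  (hypothetical helper name)
sum_by_range() {
    awk '
    $2 !~ /^[0-9]+$/ { next }            # skip header lines in either file
    NR == FNR {                          # first file: remember every range
        n++
        chr[n] = $1; lo[n] = $2 + 0; hi[n] = $3 + 0; sum[n] = 0
        next
    }
    {                                    # second file: one count row per line
        s = $2 + 0                       # only the start position picks the range
        for (i = 1; i <= n; i++)
            if ($1 == chr[i] && s >= lo[i] && s <= hi[i]) {
                sum[i] += $4
                break
            }
    }
    END {
        print "chr", "start", "end", "count"
        for (i = 1; i <= n; i++)
            print chr[i], lo[i], hi[i], sum[i]
    }' "$1" "$2"
}

sum_by_range demo2.range demo1.table
```

On the cut-down sample this prints N1 0 99 2, N1 100 199 2, N2 500 599 3 and N2 600 699 2 after the header line. The linear scan over all ranges does not depend on either file being sorted or grouped, at the cost of being slow when there are many ranges.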


Last edited by yifangt; 03-01-2019 at 04:13 PM.. Reason: typo and markdown change
# 4  
Old 03-01-2019
So we're using more than just column 3: the whole range of every row must be considered to know when to ignore it and when not to.

Does this mean there are different ranges for every N1 as well?
# 5  
Old 03-01-2019
Quote:
Originally Posted by yifangt
...

The value in column 3 of file1 needs to be summed by the range defined in the file2 (sliding window),
...
Sure you want col 3? Not the count value in col 4? And, how are the count values shared between ranges? Are they evenly distributed?
Please explain exactly how the result is computed: from what input, and with what algorithm.
# 6  
Old 03-01-2019
The ranges within N1 are all different and definitely do not overlap, as they are evenly spaced except for the last one. Say N1 is 7550 bp long and is divided into windows of 100: the last range would be N1 7500 7550.
That is, if I understand your question correctly, corona688.
Thanks RudiC, it should be column 4, the "count" number. Column 3 is the "end" coordinate.
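
Since the ranges are evenly spaced, the range of a row can also be worked out arithmetically from its start position, without reading file2 at all. The sketch below assumes a window width of 100 with windows numbered 0-99, 100-199 and so on, as in file2.range above. The helper name bin_by_start and the file demo.table are hypothetical. It prints every window up to the highest one seen per chromosome, and because it does not know the chromosome length, the last window is not shortened the way N1 7500 7550 would be.

```shell
#!/bin/sh
# Hypothetical demo input: chr, start, end, count.
cat > demo.table <<'EOF'
chr start end count
N1 0 48 0
N1 48 181 2
N1 181 193 0
N1 193 326 2
N2 120 140 5
EOF

# bin_by_start TABLEFILE [WIDTH]  (hypothetical helper name; WIDTH defaults to 100)
bin_by_start() {
    awk -v w="${2:-100}" '
    $2 !~ /^[0-9]+$/ { next }                # skip header lines
    {
        b = int($2 / w)                      # window index of the start position
        if (!($1 in max)) { order[++n] = $1; max[$1] = b }
        if (b > max[$1]) max[$1] = b
        sum[$1, b] += $4
    }
    END {
        for (k = 1; k <= n; k++) {           # chromosomes in order of first appearance
            c = order[k]
            for (b = 0; b <= max[c]; b++)    # empty windows are printed with 0
                print c, b * w, b * w + w - 1, sum[c, b] + 0
        }
    }' "$1"
}

bin_by_start demo.table 100
```

For the demo input this prints N1 0 99 2, N1 100 199 2, N2 0 99 0 and N2 100 199 5. If file2 ever contains ranges that are not multiples of the width, the lookup against file2 is the safer route.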
# 7  
Old 03-01-2019
Is everything sorted? Can we depend on N1, N2, N3 being nicely grouped and coming in the same order in both file1 and file2? The ranges themselves don't necessarily need to be sorted.