Split a file based on number sum at the second column and the third column.


 
# 1  
Old 03-01-2020

Dear all,

I have a BED file (below). I want to split it based on base length: the distance from the start position of the first line in a group to the end position of the last line in that group must not exceed 2999 bp. For example, the lines from start position 12109 to end position 14678 should go into one file, because they fall within a 2999 bp range. The line with start position 15573 and end position 15612 should go into another file, and so on.
I tried the bedtools makewindows and bedops chop options, but they didn't work.

The input file (has many lines):
Code:
Sp_chr1 12109   12149 DNA Sequences 
Sp_chr1 12348   12388 DNA Sequences 
Sp_chr1 12493   12533 DNA Sequences 
Sp_chr1 12616   12656 DNA Sequences 
Sp_chr1 12746   12786 DNA Sequences 
Sp_chr1 14486   14521 DNA Sequences 
Sp_chr1 14525   14564 DNA Sequences  
Sp_chr1 14638   14678 DNA Sequences 
Sp_chr1 15573   15612 DNA Sequences 
Sp_chr1 20498   20538 DNA Sequences 
Sp_chr1 21628   21668 DNA Sequences 
Sp_chr1 25346   25386 DNA Sequences 
Sp_chr1 26053   26093 DNA Sequences 
Sp_chr1 26129   26169 DNA Sequences 
Sp_chr1 27874   27913 DNA Sequences

The desired output files are:
The output file 1:
Code:
Sp_chr1 12109   12149 DNA Sequences 
Sp_chr1 12348   12388 DNA Sequences 
Sp_chr1 12493   12533 DNA Sequences 
Sp_chr1 12616   12656 DNA Sequences 
Sp_chr1 12746   12786 DNA Sequences 
Sp_chr1 14486   14521 DNA Sequences 
Sp_chr1 14525   14564 DNA Sequences  
Sp_chr1 14638   14678 DNA Sequences

The output file 2:
Code:
Sp_chr1 15573   15612 DNA Sequences

The output file 3:
Code:
Sp_chr1 20498   20538 DNA Sequences 
Sp_chr1 21628   21668 DNA Sequences

and so on.

Moderator's Comments:
Please use code tags for your code and data, thanks.

Last edited by Chubler_XL; 03-02-2020 at 12:31 AM.
# 2  
Old 03-02-2020
This forum is not a script writing service.

If you have an incomplete solution that you have worked on, we can help you, but you must show some effort to solve this yourself.
# 3  
Old 03-02-2020
I tried the bedtools makewindows option and the bedops chop option, but neither of them worked the way I need.
That is why I asked here. I know that this forum, like other forums, is not a script writing service. This is the best I can do, because the approaches I know did not work.
# 4  
Old 03-02-2020
I can understand that the specifics of a particular tool or language require knowledge you may not have.

How about trying to give us some pseudocode for how this file should be processed?

e.g.
Code:
set startnum=0
set fileext = 1
loop:
    read line from input
     ...
    append line to filename("file" + fileext)
end loop

Can you fill in the missing logic for "..." above?
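
One way the "..." step could be filled in, written as a minimal awk sketch rather than a definitive answer. It assumes that the input is sorted by start position, that a window opens at the start position of the first line that does not fit into the previous window, and that a line fits when its end position is at most 2999 bp past the window start. The names `span`, `winstart`, `chrom`, and the file name `input.bed` are illustrative, and starting a new file on a chromosome change is an extra assumption that the question does not state.

```shell
# Sample input copied from post #1 (input.bed is a hypothetical file name).
cat > input.bed <<'EOF'
Sp_chr1 12109   12149 DNA Sequences
Sp_chr1 12348   12388 DNA Sequences
Sp_chr1 12493   12533 DNA Sequences
Sp_chr1 12616   12656 DNA Sequences
Sp_chr1 12746   12786 DNA Sequences
Sp_chr1 14486   14521 DNA Sequences
Sp_chr1 14525   14564 DNA Sequences
Sp_chr1 14638   14678 DNA Sequences
Sp_chr1 15573   15612 DNA Sequences
Sp_chr1 20498   20538 DNA Sequences
Sp_chr1 21628   21668 DNA Sequences
Sp_chr1 25346   25386 DNA Sequences
Sp_chr1 26053   26093 DNA Sequences
Sp_chr1 26129   26169 DNA Sequences
Sp_chr1 27874   27913 DNA Sequences
EOF

awk -v span=2999 '
    # Open a new output file on the first line, on a chromosome change
    # (assumption), or when the end position leaves the current window.
    NR == 1 || $1 != chrom || $3 > winstart + span {
        if (out != "") close(out)   # release the previous file handle
        chrom    = $1
        winstart = $2               # the window is anchored at this start position
        out      = "file" (++n)     # file1, file2, ...
    }
    { print > out }                 # every line goes to the current window file
' input.bed
```

On the sample data, this reading puts the first eight lines in file1, the 15573 line alone in file2, and the 20498 and 21628 lines in file3, which agrees with the desired output in post #1. The file is read only once, so the approach should also scale to a large input.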
# 5  
Old 03-02-2020
Hi,
Maybe just like this?
Code:
awk '
/^\S+\s+12109/,/^(\S+\s+){2}14678\s/ {print > "file1"}
/^\S+\s+15573/,/^(\S+\s+){2}15612\s/ {print > "file2"}
/^\S+\s+20498/,/^(\S+\s+){2}21668\s/ {print > "file3"}
' file

# 6  
Old 03-02-2020
Code:
set startnum=0
set fileext = 1
loop:
    read line from input
     awk '{ Name= $1; 
         startposition = $2; stopposition = $3; for (startposition = stopposition + 2999); print '{Name}'
    append line to filename("file" + fileext)
end loop

--- Post updated at 08:58 AM ---

Hi,
Thanks. I have a huge file with many lines.

Code:
awk '
/^\S+\s+12109/,/^(\S+\s+){2}14678\s/ {print > "file1"}
/^\S+\s+15573/,/^(\S+\s+){2}15612\s/ {print > "file2"}
/^\S+\s+20498/,/^(\S+\s+){2}21668\s/ {print > "file3"}
' file

# 7  
Old 03-02-2020
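Maybe like this?
Code:
#!/bin/bash

step=2999
# 12109 is the first start position in the sample data
declare -i start=12109 end=start+step count=1
# Largest end position in the file
stop=$(awk '{if($3>max) max=$3} END {print max}' file)

# Loop while the window still starts at or before the last end position,
# so that the window containing the final line is not skipped
while [ $start -le $stop ]; do
        # Write the lines that lie entirely inside [start, end] to file<count>
        awk -v A=$start -v Z=$end -v f="file$count" '
                $2>=A && $3<=Z {print > f}
        ' file
        start+=step
        end+=step
        count+=1
done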
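
The script above reads the whole input once per window, which could get slow on a very large file. Below is a hedged single-pass sketch of the same fixed-window idea. It assumes that windows are laid out every 2999 bp from the first start position, keeps only lines that lie entirely inside one window (as the `$2>=A && $3<=Z` test does), and closes each output file when the window changes, which relies on the input being sorted by start position. The `win` prefix, `input.bed`, and the variable names are illustrative.

```shell
# Sample input copied from post #1 (input.bed is a hypothetical file name).
cat > input.bed <<'EOF'
Sp_chr1 12109   12149 DNA Sequences
Sp_chr1 12348   12388 DNA Sequences
Sp_chr1 12493   12533 DNA Sequences
Sp_chr1 12616   12656 DNA Sequences
Sp_chr1 12746   12786 DNA Sequences
Sp_chr1 14486   14521 DNA Sequences
Sp_chr1 14525   14564 DNA Sequences
Sp_chr1 14638   14678 DNA Sequences
Sp_chr1 15573   15612 DNA Sequences
Sp_chr1 20498   20538 DNA Sequences
Sp_chr1 21628   21668 DNA Sequences
Sp_chr1 25346   25386 DNA Sequences
Sp_chr1 26053   26093 DNA Sequences
Sp_chr1 26129   26169 DNA Sequences
Sp_chr1 27874   27913 DNA Sequences
EOF

awk -v step=2999 '
    NR == 1 { origin = $2 }                 # the grid starts at the first start position
    {
        k = int(($2 - origin) / step) + 1   # 1-based index of the window holding the start
        if (k != cur) {                     # window changed: release the previous file
            if (out != "") close(out)
            cur = k
            out = "win" k                   # win1, win2, ...
        }
        # keep the line only if it also ends inside window k
        if ($3 <= origin + k * step) print > out
    }
' input.bed
```

Note that a fixed grid does not reproduce the desired output from post #1 exactly: on the sample data the 20498 and 21628 lines land in different windows (win3 and win4), because the grid boundary at 21106 falls between them.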
