awk to adjust coordinates in field based on sequential numbers in another field


 
# 1  
Old 01-27-2017

I am trying to output a tab-delimited result that uses the data from a tab-delimited file to combine and subtract specific lines.

For lines with the same $4, the first $6 value in a sequential run is added to $2 to give the new, adjusted $2 value. The exception is when that first value is 1: then the original $2 is used (as in line 1).

The last matching sequential $6 value is added to $2 and this is the new or adjusted $3 value.

The new $2 and $3 values are combined with $1 in the format $1:$2-$3, and the $5 value is printed on the same line.

The awk command below works well as long as the $4 values are unique, but that is not always the case. I cannot work out how to add a condition that checks $6 and starts a new output line whenever the numbers stop being sequential (1 2 is sequential, but then there is a break before 92 93 94).

Maybe there is another way, but hopefully this helps. Thank you.


Code:
chrX    110956442   110956535   chrX:110956442-110956535    ALG13   1   19
chrX    110956442   110956535   chrX:110956442-110956535    ALG13   2   19
chrX    110956442   110956535   chrX:110956442-110956535    ALG13   92  18
chrX    110956442   110956535   chrX:110956442-110956535    ALG13   93  18
chrX    110956442   110956535   chrX:110956442-110956535    ALG13   94  18
chrX    110961329   110961512   chrX:110961329-110961512    ALG13   2   1
chrX    110961329   110961512   chrX:110961329-110961512    ALG13   3   1
chr15    25031028    25031925    chrX:25031028-25031925  ARX 651 3

desired output
Code:
chrX:110956442-110956444    ALG13
chrX:110956534-110956536    ALG13
chrX:110961331-110961332    ALG13
chr15:25031679-25031679  ARX
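
The rule described above can be checked by hand for each run. The following is only a sketch of that arithmetic: the function name `adjusted` and the variable `worked` are made up for illustration, and the first and last $6 of each run are typed in by hand rather than read from the file.

```shell
# Worked arithmetic for the stated rule (hypothetical helper, not from the thread):
#   new $2 = $2 when the first $6 of the run is 1, otherwise $2 + first $6
#   new $3 = $2 + last $6 of the run
worked=$(awk '
function adjusted(chrom, start, first, last) {
  return chrom ":" (first == 1 ? start : start + first) "-" (start + last)
}
BEGIN {
  print adjusted("chrX", 110956442, 1, 2)       # run 1 2
  print adjusted("chrX", 110956442, 92, 94)     # run 92 93 94
  print adjusted("chrX", 110961329, 2, 3)       # run 2 3
  print adjusted("chr15", 25031028, 651, 651)   # single value 651
}')
printf '%s\n' "$worked"
```

Under this reading of the rule, the four printed coordinates are the same as the ones listed in the desired output.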

awk
Code:
awk 'FNR==NR {S[$4]++;next} ($4 in S){if(S[$4]>1){print $1 OFS $2 OFS $2+S[$4] OFS $5;} 
else {if($6==1){print $1 OFS $2 OFS $2 OFS $5}
else {print $1 OFS $2+$6 OFS $2+$6 OFS $5}};delete S[$4]}' file file

current output
Code:
chrX 110956442 110956449 ALG13
chrX 110961329 110961334 ALG13
chr15 25031028 25031031 ARX
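
One way to see why counting lines per $4 cannot split the first region is to print the counts that the first pass collects. This is a minimal sketch, assuming the sample input is rebuilt with tabs in a shell variable; the names `sample` and `counts` are illustrative only.

```shell
# Rebuild the sample input with explicit tabs (printf reuses the format per row).
sample=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  chrX 110956442 110956535 chrX:110956442-110956535 ALG13 1 19 \
  chrX 110956442 110956535 chrX:110956442-110956535 ALG13 2 19 \
  chrX 110956442 110956535 chrX:110956442-110956535 ALG13 92 18 \
  chrX 110956442 110956535 chrX:110956442-110956535 ALG13 93 18 \
  chrX 110956442 110956535 chrX:110956442-110956535 ALG13 94 18 \
  chrX 110961329 110961512 chrX:110961329-110961512 ALG13 2 1 \
  chrX 110961329 110961512 chrX:110961329-110961512 ALG13 3 1 \
  chr15 25031028 25031925 chrX:25031028-25031925 ARX 651 3)

# Count lines per $4, as the first pass of the two-pass awk does.
# The for-in order is unspecified in awk, so the result is sorted by key.
counts=$(printf '%s\n' "$sample" |
  awk -F'\t' '{ n[$4]++ } END { for (k in n) print n[k], k }' |
  LC_ALL=C sort -k2)
printf '%s\n' "$counts"
```

The first key gets a count of 5 even though its $6 values form two separate runs (1 2 and 92 93 94), so a per-key count alone cannot tell the runs apart.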


Last edited by cmccabe; 01-27-2017 at 12:58 PM.. Reason: fixed format
# 2  
Old 01-27-2017
If I run your awk on your input then I do not get your output.
But maybe I have understood your description.
If your file is sorted by $4 and $6 (so $6 sequences are in adjacent lines),
then the following can do it:
Code:
awk '
# print from stored values
function prt(){
  print p1 ":" (p6start==1 ? p2 : p2+p6start) "-" p2+p6, p5
}
($4!=p4 || $6!=p6+1) {
# new sequence, print the previous sequence
  if (NR>1) prt()
  p6start=$6  
}
{
# store the values that we need later
  p1=$1
  p2=$2
  p4=$4
  p5=$5
  p6=$6
}
END { prt() }
' file

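The difficulty is the "late" end-of-sequence detection: a sequence is only known to have ended once the first line of the next sequence has been read. This is solved by storing the previous line's values, printing them through a function, and calling that function once more in an END section for the last sequence.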
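
Since the original question asks for tab-delimited output, the same store-and-print-late approach can be sketched with the separators set explicitly. This is a variant under stated assumptions, not the poster's exact script: `FS = OFS = "\t"`, the parentheses around `p2 + p6`, and the `NR > 0` guard are additions, and the shell variables `sample` and `result` are illustrative names.

```shell
# Rebuild the sample input with explicit tabs (printf reuses the format per row).
sample=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  chrX 110956442 110956535 chrX:110956442-110956535 ALG13 1 19 \
  chrX 110956442 110956535 chrX:110956442-110956535 ALG13 2 19 \
  chrX 110956442 110956535 chrX:110956442-110956535 ALG13 92 18 \
  chrX 110956442 110956535 chrX:110956442-110956535 ALG13 93 18 \
  chrX 110956442 110956535 chrX:110956442-110956535 ALG13 94 18 \
  chrX 110961329 110961512 chrX:110961329-110961512 ALG13 2 1 \
  chrX 110961329 110961512 chrX:110961329-110961512 ALG13 3 1 \
  chr15 25031028 25031925 chrX:25031028-25031925 ARX 651 3)

result=$(printf '%s\n' "$sample" | awk '
BEGIN { FS = OFS = "\t" }          # assumption: tab-delimited input and output
function prt() {                   # print the run stored so far
  print p1 ":" (p6start == 1 ? p2 : p2 + p6start) "-" (p2 + p6), p5
}
($4 != p4 || $6 != p6 + 1) {       # key changed, or $6 is not previous + 1
  if (NR > 1) prt()                # a previous run exists, so flush it
  p6start = $6                     # remember where the new run starts
}
{ p1 = $1; p2 = $2; p4 = $4; p5 = $5; p6 = $6 }   # keep values for later
END { if (NR > 0) prt() }          # flush the last run, skip empty input
')
printf '%s\n' "$result"
```

On the sample input this prints the four lines of the desired output from the first post, with a tab between the coordinates and the gene name. Like the script above, it relies on the file being sorted so that each $6 sequence sits on adjacent lines.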
This User Gave Thanks to MadeInGermany For This Post:
# 3  
Old 01-28-2017
Are you sure the output should not be:
Code:
chrX:110956442-110956444    ALG13
chrX:110956532-110956535    ALG13
chrX:110961330-110961332    ALG13
chr15:25031678-25031678  ARX

That would make more sense to me, but maybe I'm wrong.

Last edited by Scrutinizer; 01-28-2017 at 04:32 AM..
# 4  
Old 01-30-2017
Thank you very much for your help and for catching the output correction. This is why the computer does the math.