awk to combine matches and use a field to adjust coordinates in other fields

07-20-2016

I am trying to produce a result that combines lines from file and adjusts the coordinates. If $4 matches across several lines, the last $6 value is added to $2 and that becomes the new $3. The matching lines are combined into one line containing $1, the original $2, the new $3, and $5. Where only a single line has a given $4: if its $6 is 1, the original $2 is used as the new $3 in the result; if its $6 is anything other than 1, that value is added to the original $2 and the sum is used for both $2 and the new $3 in the result. I hope this is possible. Thanks.

file

Code:

chrX    110961329    110961512    chrX:110961329-110961512    ALG13    1    7
chrX    110961329    110961512    chrX:110961329-110961512    ALG13    2    7
chrX    110961329    110961512    chrX:110961329-110961512    ALG13    3    7
chrX    110961329    110961512    chrX:110961329-110961512    ALG13    4    5
chrX    110961329    110961512    chrX:110961329-110961512    ALG13    5    4
chr2    50573818    50574097    chr2:50573818-50574097    NRXN1    268    9
chr2    50573818    50574097    chr2:50573818-50574097    NRXN1    269    8
chr2    50573818    50574097    chr2:50573818-50574097    NRXN1    270    7
chr2    50573818    50574097    chr2:50573818-50574097    NRXN1    271    7
chrX    135080256    135080354    chrX:135080256-135080354    SLC9A6    1    16
chr18    53298518    53298629    chr18:53298518-53298629    TCF4    11    1

desired output result

Code:

chrX    110961329    110961334    ALG13
chr2    50573818    50573822    NRXN1
chrX    135080256    135080256    SLC9A6
chr18    53298529    53298529    TCF4

Currently, I use

Code:

awk 'BEGIN {OFS="\t"}; {print $1,$2,$3,$5}' file | sort -u > result

but that only collapses duplicate lines and leaves the coordinates unchanged, which gives misleading results. Thanks.
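For readers following along: in the sample data, the expected end coordinate for the multi-line groups equals $2 plus the number of lines that share $4 (5 for ALG13, 4 for NRXN1). The sketch below only counts lines per $4 to make that grouping visible. The names `sample`, `counts`, `n`, `order`, and `gene` are illustrative assumptions, not part of the original post.

```shell
# Sample input copied from the question; default awk field splitting handles the spacing.
sample=$(mktemp)
cat > "$sample" <<'EOF'
chrX    110961329    110961512    chrX:110961329-110961512    ALG13    1    7
chrX    110961329    110961512    chrX:110961329-110961512    ALG13    2    7
chrX    110961329    110961512    chrX:110961329-110961512    ALG13    3    7
chrX    110961329    110961512    chrX:110961329-110961512    ALG13    4    5
chrX    110961329    110961512    chrX:110961329-110961512    ALG13    5    4
chr2    50573818    50574097    chr2:50573818-50574097    NRXN1    268    9
chr2    50573818    50574097    chr2:50573818-50574097    NRXN1    269    8
chr2    50573818    50574097    chr2:50573818-50574097    NRXN1    270    7
chr2    50573818    50574097    chr2:50573818-50574097    NRXN1    271    7
chrX    135080256    135080354    chrX:135080256-135080354    SLC9A6    1    16
chr18    53298518    53298629    chr18:53298518-53298629    TCF4    11    1
EOF

# Count the lines per interval ($4), keeping first-seen order.
counts=$(awk '
    !($4 in n) { order[++k] = $4; gene[$4] = $5 }   # remember each new interval once
    { n[$4]++ }                                      # lines seen for this interval
    END { for (i = 1; i <= k; i++) print gene[order[i]], n[order[i]] }
' "$sample")

printf '%s\n' "$counts"
rm -f "$sample"
```

The two groups with a count of 1 (SLC9A6 and TCF4) are the single-line cases, where $6 decides the result instead of the count.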

cmccabe

07-20-2016

Hello cmccabe,

Could you please try the following and let me know if it helps?

Code:

awk 'FNR==NR{S[$4]++;next} ($4 in S){if(S[$4]>1){print $1 OFS $2 OFS $2+S[$4] OFS $5;} else {if($6==1){print $1 OFS $2 OFS $2 OFS $5} else {print $1 OFS $2+$6 OFS $2+$6 OFS $5}};delete S[$4]}'   Input_file  Input_file

Output will be as follows.

Code:

chrX 110961329 110961334 ALG13
chr2 50573818 50573822 NRXN1
chrX 135080256 135080256 SLC9A6
chr18 53298529 53298529 TCF4

Thanks,
R. Singh
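A possible single-pass variant, offered only as a sketch: it applies the same three rules as the command above but stores what it needs per $4 and prints in END, so the file is read once. It also sets OFS to a tab, on the assumption that tab-separated output is wanted as in the question's own command. The array names (`n`, `last6`, `chr`, `start`, `gene`, `order`) and the `sample` and `result` variables are illustrative.

```shell
# Sample input copied from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
chrX    110961329    110961512    chrX:110961329-110961512    ALG13    1    7
chrX    110961329    110961512    chrX:110961329-110961512    ALG13    2    7
chrX    110961329    110961512    chrX:110961329-110961512    ALG13    3    7
chrX    110961329    110961512    chrX:110961329-110961512    ALG13    4    5
chrX    110961329    110961512    chrX:110961329-110961512    ALG13    5    4
chr2    50573818    50574097    chr2:50573818-50574097    NRXN1    268    9
chr2    50573818    50574097    chr2:50573818-50574097    NRXN1    269    8
chr2    50573818    50574097    chr2:50573818-50574097    NRXN1    270    7
chr2    50573818    50574097    chr2:50573818-50574097    NRXN1    271    7
chrX    135080256    135080354    chrX:135080256-135080354    SLC9A6    1    16
chr18    53298518    53298629    chr18:53298518-53298629    TCF4    11    1
EOF

result=$(awk -v OFS='\t' '
    # First time an interval ($4) is seen: remember its order and fixed fields.
    !($4 in n) { order[++k] = $4; chr[$4] = $1; start[$4] = $2; gene[$4] = $5 }
    # Every line: bump the count and keep the most recent $6.
    { n[$4]++; last6[$4] = $6 }
    END {
        for (i = 1; i <= k; i++) {
            key = order[i]
            if (n[key] > 1) {
                # several lines: end = start + number of lines
                print chr[key], start[key], start[key] + n[key], gene[key]
            } else if (last6[key] == 1) {
                # single line with $6 == 1: start is used for both coordinates
                print chr[key], start[key], start[key], gene[key]
            } else {
                # single line with any other $6: start + $6 for both coordinates
                print chr[key], start[key] + last6[key], start[key] + last6[key], gene[key]
            }
        }
    }
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

The trade-off is memory: this version keeps one entry per distinct $4 until END, while the two-pass command prints as it goes.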

RavinderSingh13

07-20-2016

The awk works great. Do you mind explaining it a bit? Why do you need two inputs? Thanks again.

cmccabe

07-21-2016

Quote:

Originally Posted by cmccabe

The awk works great. Do you mind explaining it a bit? Why do you need two inputs? Thanks again.

Hello cmccabe,

The following explanation may help.

Code:

awk '
# FNR is the line number within the current file and resets to 1 for each new file;
# NR counts lines across all files. FNR==NR is therefore true only while the first copy of Input_file is read.
FNR==NR {
    S[$4]++       # count how many lines share this value of field 4
    next          # "next" is a statement: skip the remaining rules and read the next line
}
# From here on only the second copy of Input_file is processed. Every $4 was counted in the
# first pass, so ($4 in S) is true until the entry is deleted below.
($4 in S) {
    if (S[$4] > 1) {
        # several lines share $4: print $1, $2, $2 plus the count, and $5.
        # OFS is the built-in output field separator; its default value is a single space.
        print $1 OFS $2 OFS $2+S[$4] OFS $5
    } else {
        if ($6 == 1) {
            # single line with $6 equal to 1: $2 is printed for both coordinates
            print $1 OFS $2 OFS $2 OFS $5
        } else {
            # single line with any other $6: $2+$6 is printed for both coordinates
            print $1 OFS $2+$6 OFS $2+$6 OFS $5
        }
    }
    delete S[$4]  # remove the entry so later lines with the same $4 print nothing
}
' Input_file Input_file   # the same file is given twice: the first pass counts, the second prints

Thanks,
R. Singh
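To see the FNR==NR test in isolation, here is a minimal sketch that reads a two-line file twice and prints NR, FNR, and which pass the test selects. The temporary file and the labels "pass1" and "pass2" are made up for the illustration.

```shell
# Hypothetical two-line input, read twice by the same awk invocation.
f=$(mktemp)
printf 'a\nb\n' > "$f"

# NR keeps counting across files; FNR restarts at 1 for the second copy.
trace=$(awk '{ print NR, FNR, (FNR == NR ? "pass1" : "pass2"), $0 }' "$f" "$f")

printf '%s\n' "$trace"
# 1 1 pass1 a
# 2 2 pass1 b
# 3 1 pass2 a
# 4 2 pass2 b
rm -f "$f"
```

One caveat with this idiom: if the first file is empty, FNR==NR is also true while the second file is read, so the "first pass" rule would run on the wrong input.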

RavinderSingh13

07-21-2016

Thank you very much for your help and explanations. I really appreciate it.

cmccabe
