awk to update file with numerical difference if condition is met

03-31-2017

Registered User

1,393, 20

Join Date: Nov 2013

Last Activity: 1 May 2020, 2:35 PM EDT

Location: Chicago

Posts: 1,393

Thanks Given: 901

Thanked 20 Times in 19 Posts

awk to update file with numerical difference if condition is met

In the file1 below if $9 and $12 are . (dot) then the value in $8 of file1 is used as a key (exact match) to lookup in each $2 of file2, when a match is found then the value of $4
in file1 is used to look for a range match within +/- 50 using the values in $4 and after in file2. The number of fields can be variable but will always start at $4.
For example, ISG15 has 2 fields in it with coordinates starting at $4 and ending at $5. CR2 has 19 coordinates in it starting at $4 ending at $22. The value in $1 of file2
tells you how many coordinates there are but the start or first will always be in $4.

There will only be one range match but if the number is closer to the first value before the - (hyphen) in it the $9 of file1 is updated from a . to the numerical difference between the two numbers with a - (minus) in front. If the number is closer to the second value after the - (hyphen) in it the $9 of
file1 is updated from a . (dot) to the numerical difference between the two numbers with a + (plus) in front. However is the calculated difference is greater than 50, then >50 is printed in $9 of file1.

If $9 or $12 of file1 have a value other then . (dot) in them then that line is skipped (nothing happens) and the next line is processed. In file1 lines 2 and 3 are skipped. The awk below will identify these lines and print them, but I am not sure how to do the rest and need some expert help. Thank you

.

file1 tab-delimited

Code:

R_Index	Chr	Start	End	Ref	Alt	Func.IDP.refGene	Gene.IDP.refGene	GeneDetail.IDP.refGene	Inheritence	ExonicFunc.IDP.refGene	AAChange.IDP.refGene
1	chr1	948846	948846	-	A	upstream	ISG15	.	.	.	.
2	chr1	948870	948870	C	G	UTR5	ISG15	NM_005101.3:c.-84C>G	.	.
3	chr1	949608	949608	G	A	exonic	ISG15	.	.	nonsynonymous SNV	ISG15:NM_005101.3:exon2:c.248G>A:p.S83N
4	chr1	949925	949925	C	T	downstream	ISG15	.	.	.	.
5	chr1	207646923	207646923	G	A	intronic	CR2	.	.	.	.
6	chr2	3653844	3653844	T	C	intronic	COLEC11	.	.	.	.
7	chr1	154562623	154562625	CCG	-	intronic	ADAR	.	.	.	.
8	chr1	948840	948840	-	C	upstream	ISG15	.	.	.	.

file2 space-delimited

Code:

2 ISG15 NM_005101.3 948846-948956 949363-949919
19 CR2 NM_001006658.2 207627644-207627821 207639870-207640257 207641871-207642060 207642144-207642244 207642494-207642577 207643039-207643447 207644084-207644261 207644341-207644432 207644767-207644844 207646116-207646524 207647145-207647230 207647585-207647668 207648168-207648561 207649578-207649764 207651229-207651415 207652601-207652625 207653322-207653398 207658808-207658917 207662486-207663240
6 COLEC11 NM_024027.4 3642421-3642758 3651904-3652060 3660900-3660972 3687867-3687921 3691033-3691129 3691316-3692234
15 ADAR NM_001111.4 154554533-154557519 154557692-154557820 154558228-154558341 154558656-154558839 154560600-154560734 154561026-154561149 154561844-154561938 154562232-154562404 154562737-154562885 154569280-154569414 154569598-154569743 154570303-154570452 154570877-154571061 154573516-154575102 154580467-154580724

desired updated file1

Code:

R_Index	Chr	Start	End	Ref	Alt	Func.IDP.refGene	Gene.IDP.refGene	GeneDetail.IDP.refGene	Inheritence	ExonicFunc.IDP.refGene	AAChange.IDP.refGene
1	chr1	948846	948846	-	A	upstream	ISG15	0	.	.	.
2	chr1	948870	948870	C	G	UTR5	ISG15	NM_005101.3:c.-84C>G	.	.   .
3	chr1	949608	949608	G	A	exonic	ISG15	.	.	nonsynonymous SNV	ISG15:NM_005101.3:exon2:c.248G>A:p.S83N
4	chr1	949925	949925	C	T	downstream	ISG15	+6	.	.	.
5	chr1	207646923	207646923	G	A	intronic	CR2	>50	.	.	.
6	chr2	3653844	3653844	T	C	intronic	COLEC11	>50	.	.	.
7	chr1	154562623	154562625	CCG	-	intronic	ADAR	>50	.	.	.
8	chr1	948840	948840	-	C	upstream	ISG15	-6	.	.	.

Description of updated file1

Code:

line1:file1 $9 updated to 0 because ISG15 is matched to line 1, $2 of file2 and the value in $4 of file1, 948846 is a exact match to the first cordinate in $4 before the - 
line2:not updated, skipped because $9 or $12 in file1 have a value other then . in them
line3:not updated, skipped because $9 or $12 in file1 have a value other then . in them
line4:file1 $9 updated to +6 because ISG15 is matched to line 1, $2 of file2 and the value in $4 of file1, 949925 is a range match to the second coordinate in $5 after the - 
line5:file1 $9 updated to >50 because CR2 is matched to line 2, $2 of file2 and the value in $4 of file1, 207646923 is a range match to the first coordinate in $14 before the - but the difference of 222 is > 50
line6:file1 $9 updated to >50 because COLEC11 is matched to line 3, $2 of file2 and the value in $4 of file1, 3653844 is a range match to the second coordinate in $2 after the - but the difference of 1784 is > 50
line7:file1 $9 updated to >50 because ADAR is matched to line 4, $2 of file2 and the value in $4 of file1, 154562625 is a range match to the second coordinate in $12 after the - but the difference of112  is > 50
line8: file1 $9 updated to -6 because ISG15 is matched to line 1, $2 of file2 and the value in $4 of file1, 948840 is a range match to the first coordinate in $4 before the -

awk

Code:

 awk -F'\t' -v OFS='\t' '{if ($9=="." && $12==".") print }' file1

cmccabe

View Public Profile for cmccabe

Find all posts by cmccabe

04-01-2017

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

There doesn't seem to be anything required in this thread that hasn't been done for you in one or more of the other 340 threads you have submitted in this forum.

Please put a little more effort into showing us that you have learned something from the suggestions we have provided during your membership here of more than two years.

This User Gave Thanks to Don Cragun For This Post:

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

04-01-2017

Registered User

1,393, 20

Join Date: Nov 2013

Last Activity: 1 May 2020, 2:35 PM EDT

Location: Chicago

Posts: 1,393

Thanks Given: 901

Thanked 20 Times in 19 Posts

I apologize, but have not done anything like this before. I do always check previous threads and look to try something but thos seems different. I will read through them again, but this seemed more complex than others, Thank you

cmccabe

View Public Profile for cmccabe

Find all posts by cmccabe

04-01-2017

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

You have seen examples that showed you how to read one file and gather information from that file to be used while processing a second file. You have seen examples that showed you how to split a field into subfields based on a subfield delimiter (in this case the minus sign). You have seen examples that showed you how to create arrays from fields in a line (or two arrays with one array containing data and a related array containing counts of elements in the first array). With more than two years of learning from the examples we have provided in 340 threads, we would like to think that you could put those pieces together to solve this problem on your own (or at least get close to it).

Please give it a try and, if you get stuck, show us what you have accomplished and explain what you can't get to work. We want to help you learn how to write your own awk scripts; not to act as your unpaid programming staff.

These 2 Users Gave Thanks to Don Cragun For This Post:

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

04-01-2017

Registered User

1,393, 20

Join Date: Nov 2013

Last Activity: 1 May 2020, 2:35 PM EDT

Location: Chicago

Posts: 1,393

Thanks Given: 901

Thanked 20 Times in 19 Posts

I have learned quite a lot in the past couple of years. I guess that am just not too confident, but will try. In writing the detail of what I was trying to do, it seemed very complex. I will try to come up with a workable solution that will hopefully be a start. Thank you and others for all your help in making this scientist better able to handle large data sets and complex issues. I really appreciate it

.

---------- Post updated at 08:25 PM ---------- Previous update was at 04:41 AM ----------

The below awk is what I was able to come up with. I included comments on each line as well. The section in bold is where each condition is checked (that is where the distance is calculated)... I am also not sure if the split is done correctly and accounts for the possibility the coordinates are in multiple fields. I tried to follow the description to get the desired output. Thank you very much

Code:

awk '        # Run awk to process the following awk script...
BEGIN {    OFS = "\t"        # Set output field separator.
}
FNR == NR {                 # For each line in the 1st input file (file2)...
         A[$1] = $2     # Read each 2nd field in file2 into array A
                next    # process next line in file2
        }
        {
                for ( k in A )   # Read each line in file1 into Array k
                { 
                        if ( $9=="." && $12=="." == A[k] }  # only store file1 lines where field 9 and 12 are a . in them and match them to Array A store the results in updated Array k
                        
                F[k] = $2   # match Array k from file1 with 2nd field of file2 and store in Array k
                }
        
        split($4, a, /[-]/)    # Split the 4th field  of file1 into b[] with hyphen and as subfield separator
        s[++c] = a[1]        # Save the symbol found at the start of this line.
                # c is the current line number in the 1st file.
        m[c] = b[2]        # Save the minimum value found on this line. (left of hyphen)
        M[c] = b[3]        # Save the maximum value found on this line.  (right of hyphen)
        t[c] = $4        # Save the tag found on this line.
        next            # Skip the remaining steps in this script.
           }
        }
---- if then checks here?
        END {
                for ( k in F )     # read each K in variable F to update
                        print k, F[k]   # update each k value   
        
                 break          # break out of this for loop.
        }
}1                # Print the updated
' file2 file1    # End the awk script and name the input files to be processed.

These 2 Users Gave Thanks to cmccabe For This Post:

cmccabe

View Public Profile for cmccabe

Find all posts by cmccabe

04-01-2017

Moderator

12,296, 3,792

Join Date: Nov 2008

Last Activity: 1 January 2021, 1:47 AM EST

Location: Amsterdam

Posts: 12,296

Thanks Given: 679

Thanked 3,792 Times in 3,282 Posts

I think this case may be a bit more complex than earlier cases as it involves multi-dimensional arrays and looping over them. Try this and see if it works for you and compare it to your own code..

Code:

awk -v m=50 '
  NR==FNR {                         # For each line in the 1st input file (file2)...
    for(i=4; i<=NF; i++) {          # for all ranges in $4 and further
      split($i,F,/-/)               # split the range in a left field and a right field
      L[$2,i-3]=F[1]                # Assign them to multi-dimensional associative array L(eft)
      R[$2,i-3]=F[2]                # And R(ight), which will be needed in the second part
      N[$2]=NF-3                    # Record the number of ranges in associative array N for index $2 
    }
  } 

  $9=="." && $12=="." {             # For each line in the 2nd input file (file1) if both field 9 and 12 equal "."
    d=">50"                         # set the default difference to >50
    for(i=1; i<=N[$8]; i++) {       # loop over the number of ranges for key $8
      if((d1=$4-L[$8,i])<0)         # set the left difference for a range
        d1=-d1 
      if((d2=$4-R[$8,i])<0)         # set the right difference for a range
        d2=-d2 
      if(d1<=50 || d2<=50) {        # if either of those difference is less than or equal to 50 
        if(d1<=50 && d2<=50) {      # if they are both less than or equal to 50
          if(d1<d2)                 # calculated the smallest difference. 
            d=-d1                   # If it is the left one then the difference is minus that value
          else                      #
            d=d2                    # If it is the right one then the difference is plus that value
        }
        else {                      # if only one of them is less than or equal to 50
          if(d1<=50)                # if it is the left one
            d=-d1
          else                      # if it is the right one
            d=d2
        }
        break                       # found a match, no need to loop further (assuming there are no overlapping ranges) 
      }
    }
    $9=d                            # set field 9 to the difference that was found
  }
  1                                 # print the record    
' file2 FS='\t' OFS='\t' file1

Last edited by Scrutinizer; 04-01-2017 at 11:24 PM..

This User Gave Thanks to Scrutinizer For This Post:

Scrutinizer

View Public Profile for Scrutinizer

Find all posts by Scrutinizer

04-01-2017

Registered User

1,393, 20

Join Date: Nov 2013

Last Activity: 1 May 2020, 2:35 PM EDT

Location: Chicago

Posts: 1,393

Thanks Given: 901

Thanked 20 Times in 19 Posts

Amazing, works perfect.

Using the original code I put comments by what I think I understand (not gonna lie most I don't, but maybe eventually I will).... thank you very much for all the great help and continued learning

Code:

awk -v m=50 '    # set maximum distance for difference and store as variable m
  NR==FNR {      # For each line in the 1st input file (file2)...
    for(i=4; i<=NF; i++) {   # start at field 4 and store in i as a loop (capture variable fields lenght of each line)
      split($i,F,/-/)   # split each i on the - (hyphen)
      L[$2,i-3]=F[1]    # store the left of the - as F[1]
      R[$2,i-3]=F[2]    # store the right of the - asF[2}
    }    # end file2 processing
    if(NF>n)
      n=NF
    next
  } 
  $9=="." && $12=="." {   # only look at lines where fields 9 and 12 have a . (dot) in file1
    d=">50" 
    for(i=1; i<=n; i++) {   # create loop for file1 i variable
      if( ($8,i) in L) {    # loop hrough field 8 of file1 and store in variable L
        if((d1=$4-L[$8,i])<0)
          d1=-d1 
        if((d2=$4-R[$8,i])<0)
          d2=-d2 
        if(d1<=50 || d2<=50) {
          if(d1<=50 && d2<=50) {
            if(d1<d2)
              d=-d1
            else
              d=d2
          }
          else {
            if(d1<=50)
              d=-d1
            else
              d=d2
          }
          break     # break out of this for loop.
        }
      }
    }
    $9=d
  }
  1    
' file2 FS='\t' OFS='\t' file1   # End the awk script and name the input files to be processed with field seperators set as tab.

Last edited by Scrutinizer; 04-02-2017 at 05:08 AM..

cmccabe

View Public Profile for cmccabe

Find all posts by cmccabe

Shell Programming and Scripting

awk to update file with numerical difference if condition is met

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

awk - print when condition is met

Discussion started by: tons92

2. Shell Programming and Scripting

Add another condition to bash for when not met

Discussion started by: cmccabe

3. Shell Programming and Scripting

Need help on how to append on the filename when condition met.

Discussion started by: znesotomayor

4. Shell Programming and Scripting

Getting the records once condition met

Discussion started by: znesotomayor

5. Shell Programming and Scripting

Awk. Abort script if condition was met.

Discussion started by: urello

6. Shell Programming and Scripting

Delete if condition met in a column

Discussion started by: jazzyzha

7. UNIX for Advanced & Expert Users

While loop only if a condition is met

Discussion started by: Cranie

8. Shell Programming and Scripting

do nothing if condition is not met but not exit

Discussion started by: abdul.irfan2

9. Shell Programming and Scripting

sed/awk to update 1st column if condition met

Discussion started by: plennon

10. Shell Programming and Scripting

How to break a loop if condition is met

Discussion started by: Issemael