match range of different numbers by AWK


 
# 50  
Old 09-08-2009
input1

Code:
A	239861347 239858777	-
B	233849110 233849388	+
C	202864284 202864396	+
D	187984662 187982263	-

input2

Code:
A	239858789 239865855	-
B	233849110 233849388	+
C	202864284 202864396	+
D	187984054 187984122	+
 	187984914 187984960
 	187985046 187985179
 	187985444 187985584
 	187986365 187986534
 	187986646 187986756
 	187986984 187987128
 	187987609 187987747
 	187987977 187988067
 	187988285 187988365
 	187989607 187990379

output
Code:
zsh-4.3.10[t]% ./s input2 input1
A       239861347 239858777     -               descoutnotexact
B       233849110 233849388     +               ascinlower[itshouldbe ascinlowerexact]
C       202864284 202864396     +               ascinlower[itshouldbe ascinlowerexact]
D       187984662 187982263     -               descoutnotexact[It should be descinexact]

I played with all the combinations you used in the script, but I couldn't solve it.
The first one (A) is fine. The second (B) and third (C) have the same numbers in both input1 and input2, so they should be "exact", right? The last one (D) overlaps with the second range of input2 (D), so it should be descinexact.

Could you please suggest any ideas, or advise me how to adapt the script sensibly? I will try to do it myself.

Thanks
# 51  
Old 09-08-2009
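As far as records B and C are concerned, you only need to comment out the print and next statements in the lower/upper blocks (originally highlighted in red; they are the lines commented out with # below):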

Code:
#!/usr/bin/awk -f

BEGIN {
OFS="\t"; ORS="\n" 
  def["ascoutlower"]    = "ARANGE"   
  def["ascoutupper"]    = "BRANGE"
  def["descoutlower"]   = "CRANGE"
  def["descoutupper"]   = "DRANGE"
  def["ascinnotexact"]  = "ERANGE"
  def["descinnotexact"] = "FRANGE"
  def["ascinexact"]     = "GRANGE"
  def["descinexact"]    = "HRANGE"
  
  }

NR == FNR && NF {
  NF > 2 && k = $1
  in2[k] = in2[k] ? in2[k] RS $1 FS $2 : $2 FS $3
  next
  }
$1 in in2 {
  n = split(in2[$1], tmp, RS) 
  split(tmp[1], Tmp); min = Tmp[1]
  m = split(tmp[n], Tmp); max = Tmp[m]
  # asc - desc
  Def = $2 > $3 ? "desc" : "asc"
  # inrange - outofrange
  if (Def == "asc")
    Def = Def ($2 >= min && $3 <= max ? "in" : "out") 
  else
    Def = Def ($3 >= min && $2 <= max ? "in" : "out")
  # lower - upper
  if ((Def ~ /ascout/ ? $3 : $2) <= min) {
    Def = Def "lower"
#    print $0 "\t" def[Def], Def
#    next
    }
  if ((Def ~ /ascout/ ? $3 : $2) >= max) {
    Def = Def "upper"
#    print $0 "\t" def[Def] "\t" Def
#    next
    }    
  # exact - not exact
  for (i=1; i<=n; i++) {
    split(tmp[i], range)
    if (Def ~ /asc/) { k1 = $2; k2 = $3 }      
    else { k1 = $3; k2 = $2 }
    if (k1 >= range[1] && k2 <= range[2]) {
      Def = Def "exact"
      print $0 "\t" def[Def] "\t" Def
      next
      }
    }
      Def = Def "notexact"
    print $0 "\t" def[Def] "\t" Def
    next    
}!/^[ \t]/ { print $0 "\tUNKNOWN" }

And you'll get:

Code:
A       239861347 239858777     -               descoutnotexact
B       233849110 233849388     +               ascinlowerexact
C       202864284 202864396     +               ascinlowerexact
D       187984662 187982263     -               descoutnotexact

Now, what should be in and what should be out of range? As far as the D record is concerned, we get "out of range" because the range min value in the input2 file (187984054) is greater than the min value (187982263) in the file input1.
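
The "out of range" result for D can be reproduced in isolation. This is only a minimal sketch, assuming the containment test of the script above (low end >= min and high end <= max); the shell function name `contained` and its argument order are made up for illustration and are not part of the script.

```shell
# Hypothetical helper: contained LOW HIGH MIN MAX
# Prints "in" only if [LOW, HIGH] lies entirely inside [MIN, MAX],
# which is the test the script above applies.
contained() {
  awk -v lo="$1" -v hi="$2" -v min="$3" -v max="$4" 'BEGIN {
    print ((lo >= min && hi <= max) ? "in" : "out")
  }'
}

# D record from input1 against the first and last values of D in input2:
# the low end 187982263 is below min 187984054, so the test fails.
contained 187982263 187984662 187984054 187990379
# B record: identical numbers in both files, so it is contained.
contained 233849110 233849388 233849110 233849388
```

With this test a range that only partly overlaps the input2 span is still reported as "out", which is what happens to D.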
# 52  
Old 09-08-2009
Hey, is that it? My god, you wouldn't believe how much time I spent on this script. Thank you for still considering my long, boring queries and answering them. Anyway, coming to the point...

Quote:
Now, what should be in and what should be out of range? As far as the D record is concerned, we get "out of range" because the range min value in the input2 file (187984054) is greater than the min value (187982263) in the file input1.
But the range in input1, from 187982263 to 187984662, overlaps with one of the ranges in input2 (originally in bold): 187984054 187984122 and 187984914 187984960 ... Therefore it should be "inexact", I mean "in range".
If input1 were 187982263 to 187984053, it would be outnotexact. As soon as the upper number (originally in red bold) increases by even 1, i.e. to 187984054 or 187984055, it overlaps with input2.
input1
Code:
D	187984662 187982263	-

input2

Code:
D	187984054 187984122	+
 	187984914 187984960
 	187985046 187985179
 	187985444 187985584
 	187986365 187986534
 	187986646 187986756
 	187986984 187987128
 	187987609 187987747
 	187987977 187988067
 	187988285 187988365
 	187989607 187990379


We need to name the ranges based on the above description. I thought we had already done it that way, hadn't we?
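
The overlap rule described above can be checked with a small sketch. It assumes the usual interval-overlap condition (low end <= range end and high end >= range start); the function name `overlaps` is hypothetical and does not appear in any script in this thread.

```shell
# Hypothetical helper: overlaps LOW HIGH START END
# Prints "yes" if [LOW, HIGH] shares at least one value with [START, END].
overlaps() {
  awk -v lo="$1" -v hi="$2" -v s="$3" -v e="$4" 'BEGIN {
    print ((lo <= e && hi >= s) ? "yes" : "no")
  }'
}

# D from input1 against the first D range of input2: they overlap.
overlaps 187982263 187984662 187984054 187984122
# Upper end lowered to 187984053: one short of the range start, no overlap.
overlaps 187982263 187984053 187984054 187984122
# Upper end raised by 1 to 187984054: overlaps again.
overlaps 187982263 187984054 187984054 187984122
```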

Last edited by repinementer; 09-08-2009 at 10:30 AM..
# 53  
Old 09-08-2009
Try this:

Code:
#!/usr/bin/awk -f

BEGIN {
OFS="\t"; ORS="\n" 
  def["ascoutlower"]    = "ARANGE"   
  def["ascoutupper"]    = "BRANGE"
  def["descoutlower"]   = "CRANGE"
  def["descoutupper"]   = "DRANGE"
  def["ascinnotexact"]  = "ERANGE"
  def["descinnotexact"] = "FRANGE"
  def["ascinexact"]     = "GRANGE"
  def["descinexact"]    = "HRANGE"
  
  }

func in_range(_num, _min, _max) { 
    return _min <= _num && _num <= _max 
    }
    
NR == FNR && NF {
  NF > 2 && k = $1
  in2[k] = in2[k] ? in2[k] RS $1 FS $2 : $2 FS $3
  next
  }
$1 in in2 {
  n = split(in2[$1], tmp, RS) 
  split(tmp[1], Tmp); min = Tmp[1]
  m = split(tmp[n], Tmp); max = Tmp[m]
  # asc - desc
  Def = $2 > $3 ? "desc" : "asc"
  # inrange - outofrange
  Def = Def (in_range($2, min, max) || \
    in_range($3, min, max) ? "in" : "out") 
  # lower - upper
  if ((Def ~ /ascout/ ? $3 : $2) < min) {
    Def = Def "lower"
#    print $0 "\t" def[Def], Def
#    next
    }
  if ((Def ~ /ascout/ ? $3 : $2) > max) {
    Def = Def "upper"
#    print $0 "\t" def[Def] "\t" Def
#    next
    }    
  # exact - not exact
  for (i=1; i<=n; i++) {
    split(tmp[i], range)
    if (Def ~ /asc/) { k1 = $2; k2 = $3 }      
    else { k1 = $3; k2 = $2 }
    if (k1 >= range[1] && k2 <= range[2]) {
      Def = Def "exact"
      print $0 "\t" def[Def] "\t" Def
      next
      }
    }
      Def = Def "notexact"
    print $0 "\t" def[Def] "\t" Def
    next    
}!/^[ \t]/ { print $0 "\tUNKNOWN" }

In this case there is no need to set ORS; the newline is the default value.
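
To see what the new in/out test does on its own, here is a minimal sketch of the same endpoint check. The wrapper name `endpoint_in` is made up for illustration, and `function` is spelled out instead of the `func` shorthand used above, since not every awk accepts the short form.

```shell
# Hypothetical helper: endpoint_in A B MIN MAX
# Prints "in" if either endpoint A or B falls inside [MIN, MAX],
# mirroring the in_range() test in the script above.
endpoint_in() {
  awk -v a="$1" -v b="$2" -v min="$3" -v max="$4" '
    function in_range(num, lo, hi) { return lo <= num && num <= hi }
    BEGIN {
      print ((in_range(a, min, max) || in_range(b, min, max)) ? "in" : "out")
    }'
}

# D record: 187984662 lies inside [187984054, 187990379], so it is "in" now.
endpoint_in 187984662 187982263 187984054 187990379
# Made-up numbers: a range that encloses the whole span has neither endpoint
# inside it, so an endpoint-only test reports "out".
endpoint_in 100 900 200 300
```

The second call is only there to show a limit of an endpoint-based test; whether that case matters depends on the data.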

Last edited by radoulov; 09-08-2009 at 11:56 AM..
# 54  
Old 09-09-2009
OK, the code seems to be perfect now :-). I have updated and modified it a little bit. This script is officially finished. Thanks for your kind support.

Code:
#!/usr/bin/awk -f

BEGIN {
OFS="\t"; ORS="\n" 
  def["plus_out_left_not_overlapping"]    = "ARANGE"
  def["plus_out_right_not_overlapping"]    = "BRANGE"
  def["minus_out_left_not_overlapping"]    = "CRANGE"
  def["minus_out_right_not_overlapping"]    = "DRANGE"
  def["plus_in_not_overlapping"]    = "ERANGE"
  def["minus_in_not_overlapping"]    = "FRANGE"
  def["plus_in_overlapping"]    = "GRANGE"
  def["minus_in_overlapping"]    = "HRANGE"
  def["minus_out_right_overlapping"]    = "HRANGE"
  def["plus_in_left_overlapping"]    = "GRANGE"
  def["minus_in_right_overlapping"]    = "HRANGE"   
  }

func in_range(_num, _min, _max) { 
    return _min <= _num && _num <= _max 
    }
    
NR == FNR && NF {
  NF > 2 && k = $1
  in2[k] = in2[k] ? in2[k] RS $1 FS $2 : $2 FS $3
  next
  }
$1 in in2 {
  n = split(in2[$1], tmp, RS) 
  split(tmp[1], Tmp); min = Tmp[1]
  m = split(tmp[n], Tmp); max = Tmp[m]
  # plus - minus
  Def = $2 > $3 ? "minus" : "plus"
  # inrange - outofrange
  Def = Def (in_range($2, min, max) || \
    in_range($3, min, max) ? "_in" : "_out") 
  # left - right
  # Def now contains underscores ("plus_out"), so match /plus_out/, not /plusout/
  if ((Def ~ /plus_out/ ? $3 : $2) < min) {
    Def = Def "_left"
#    print $0 "\t" def[Def], Def
#    next
    }
  if ((Def ~ /plus_out/ ? $3 : $2) > max) {
    Def = Def "_right"
#    print $0 "\t" def[Def] "\t" Def
#    next
    }    
   # overlapping-not overlapping
  for (i=1; i<=n; i++) {
    split(tmp[i], range)
    if (Def ~ /minus/) { k1 = $2; k2 = $3 }      
    else { k1 = $3; k2 = $2 }
    if (k1 >= range[1] && k2 <= range[2]) {
      Def = Def "_overlapping"
      print $0 "\t" def[Def] "\t" Def
      next
      }
    }
      Def = Def "_not_overlapping"
    print $0 "\t" def[Def] "\t" Def
        next    
}!/^[ \t]/ { print $0 "\tUNKNOWN" }


Last edited by repinementer; 09-09-2009 at 09:57 AM..
# 55  
Old 09-14-2009
Hey, sorry for the trouble.
A small query: where do I need to change the script for the following input? In the new kind of input (the second one below), the comma-separated lists in the last two columns end with an additional trailing comma.

Old-input
Code:
H19	1874741	2075014	-	619,123,113,135,1318	0,700,903,1111,1342
AF232216	77573119	77573241	-	122	0
AJ012497	168277851	168277971	-	131	0
X74605	153317693	153317967	-	274	0
AJ609435	120809513	120809640	-	128	0
AY216680	42520523	42521690	+	15,123	0,1044
EF212256	131009249	131009385	+	136	0
AY122469	50119368	50119432	-	64	0

Old-script
Code:
#!/usr/bin/awk -f 

BEGIN {
   OFS="\t"; ORS="\n" 
}
NF {
  sec = $2; fifth = split($5, _fifth, ","); sixth = split($6, _sixth, ",")
  counter = rec = ""; key = $1; flag = $4; sub(/[^ \t*]*/, "")
  dummy = sprintf("%*s", length(key),x)
  for (i=1; i<=sixth; i++) {
    second_third = sec + _sixth[i] FS _fifth[i] + sec + _sixth[i] 
    third_second = _fifth[i] + sec + _sixth[i] FS sec + _sixth[i] 
    if (flag == "+")
      rec = rec ? rec RS key dummy OFS second_third OFS flag : key OFS second_third  OFS flag
    else if (flag == "-")
      rec = rec ? rec RS key dummy OFS third_second OFS flag : key OFS third_second  OFS flag
  }
  # both branches of the old ternary printed rec, so print it directly
  print rec >> "Exon_output1.txt"
}

Old-output
Code:
H19	1875360 1874741	-
H19   	1875564 1875441	-
H19   	1875757 1875644	-
H19   	1875987 1875852	-
H19   	1877401 1876083	-
AF232216	77573241 77573119	-
AJ012497	168277982 168277851	-
X74605	153317967 153317693	-
AJ609435	120809641 120809513	-
AY216680	42520523 42520538	+
AY216680        	42521567 42521690	+
EF212256	131009249 131009385	+
AY122469	50119432 50119368	-

The new kind of input should produce the same output as the old input above.

New-input (needs the old output)

Code:
H19	1874741	2075014	-	619,123,113,135,1318,	0,700,903,1111,1342,
AF232216	77573119	77573241	-	122,	0,
AJ012497	168277851	168277971	-	131	0
X74605	153317693	153317967	-	274,	0,
AJ609435	120809513	120809640	-	128	0
AY216680	42520523	42521690	+	15,123,	0,1044,
EF212256	131009249	131009385	+	136,	0,
AY122469	50119368	50119432	-	64,	0,


Last edited by repinementer; 09-14-2009 at 04:12 AM..
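
No reply to this question is included here. One possible approach, offered only as a sketch and not as the thread's answer, is to copy $5 and $6 into variables and strip a single trailing comma before calling split(), so that both the old and the new input yield the same number of elements. The function name `exon_ranges` and the variable names are hypothetical, and the sketch prints the key on every line instead of reproducing the padding of the old script.

```shell
# Hypothetical helper: exon_ranges [FILE...]   (reads stdin if no file is given)
exon_ranges() {
  awk '
  BEGIN { OFS = "\t" }
  NF {
    sizes = $5; starts = $6
    # drop one trailing comma, if present, so split() creates no empty element
    sub(/,$/, "", sizes); sub(/,$/, "", starts)
    n = split(starts, off, ","); split(sizes, len, ",")
    for (i = 1; i <= n; i++) {
      lo = $2 + off[i]; hi = lo + len[i]
      # "+" records print low then high, "-" records high then low
      print $1, ($4 == "+" ? lo " " hi : hi " " lo), $4
    }
  }' "$@"
}

# One record in the new format (trailing commas) and one in the old format
printf 'AY216680\t42520523\t42521690\t+\t15,123,\t0,1044,\n' | exon_ranges
printf 'AJ012497\t168277851\t168277971\t-\t131\t0\n' | exon_ranges
```

Working on copies rather than on $5 and $6 themselves leaves $0 untouched, so the rest of an existing script would see the record exactly as it was read.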