awk to print text in field if match and range is met


 
# 1  
Old 02-15-2018

In the awk below I am trying to match the value in $4 of file1 with the split value from $4 in file2. I store the value of $4 in file1 in A and the split value (using _ as the separator) in array. I then store the value in $2 as min, the value in $3 as max, and the value in $1 as chr.
If A is equal to array, then I use the values stored in min, max, and chr to check whether the $1, $2, and $3 values in file2 overlap them. If they do, overlap is printed; if they do not, missing is printed. I am trying to ensure that the lines match and that the file2 coordinates fall within the file1 range. My actual data is several thousand lines, all in the format below, and every line in file2 should get a result. I commented the awk as well and hope that helps, as I am getting multiple syntax errors. Maybe there is a better way, but I wanted to try and see. Thank you :).


file1 tab-delimited
Code:
chr19	42373737	42373856	RPS19
chr6	32790021	32790140	TAP2

file2 tab-delimited
Code:
chr19	42364844	42364915	RPS19_cds_1_0_chr19_42364845_f	0	+
chr19	42365180	42365281	RPS19_cds_2_0_chr19_42365181_f	0	+
chr19	42373100	42373284	RPS19_cds_3_0_chr19_42373101_f	0	+
chr19	42373768	42373823	RPS19_cds_4_0_chr19_42373769_f	0	+
chr19	42375418	42375445	RPS19_cds_5_0_chr19_42375419_f	0	+
chr6	32790065	32790095	TAP2_cds_0_0_chr6_32790066_r	0	-
chr6	32797176	32797313	TAP2_cds_1_0_chr6_32797177_r	0	-
chr6	32797706	32797866	TAP2_cds_2_0_chr6_32797707_r	0	-
chr6	32798043	32798217	TAP2_cds_3_0_chr6_32798044_r	0	-
chr6	32798394	32798583	TAP2_cds_4_0_chr6_32798395_r	0	-
chr6	32800109	32800238	TAP2_cds_5_0_chr6_32800110_r	0	-
chr6	32800403	32800601	TAP2_cds_6_0_chr6_32800404_r	0	-
chr6	32802930	32803136	TAP2_cds_7_0_chr6_32802931_r	0	-
chr6	32803419	32803550	TAP2_cds_8_0_chr6_32803420_r	0	-
chr6	32805313	32805428	TAP2_cds_9_0_chr6_32805314_r	0	-
chr6	32805517	32806010	TAP2_cds_10_0_chr6_32805518_r	0	-

desired output tab-delimited
Code:
chr19	42364844	42364915	RPS19_cds_1_0_chr19_42364845_f	0	+	missing
chr19	42365180	42365281	RPS19_cds_2_0_chr19_42365181_f	0	+	missing
chr19	42373100	42373284	RPS19_cds_3_0_chr19_42373101_f	0	+	missing
chr19	42373768	42373823	RPS19_cds_4_0_chr19_42373769_f	0	+	overlap
chr19	42375418	42375445	RPS19_cds_5_0_chr19_42375419_f	0	+	missing
chr6	32790065	32790095	TAP2_cds_0_0_chr6_32790066_r	0	-	overlap
chr6	32797176	32797313	TAP2_cds_1_0_chr6_32797177_r	0	-	missing
chr6	32797706	32797866	TAP2_cds_2_0_chr6_32797707_r	0	-	missing
chr6	32798043	32798217	TAP2_cds_3_0_chr6_32798044_r	0	-	missing
chr6	32798394	32798583	TAP2_cds_4_0_chr6_32798395_r	0	-	missing
chr6	32800109	32800238	TAP2_cds_5_0_chr6_32800110_r	0	-	missing
chr6	32800403	32800601	TAP2_cds_6_0_chr6_32800404_r	0	-	missing
chr6	32802930	32803136	TAP2_cds_7_0_chr6_32802931_r	0	-	missing
chr6	32803419	32803550	TAP2_cds_8_0_chr6_32803420_r	0	-	missing
chr6	32805313	32805428	TAP2_cds_9_0_chr6_32805314_r	0	-	missing
chr6	32805517	32806010	TAP2_cds_10_0_chr6_32805518_r	0	-	missing

awk
Code:
awk '
BEGIN { FS=OFS="\t" }                               # define FS and OFS as tab
              NR==FNR{                              # process same line in file1 and file2
           {
       A[$1]=$4;next}                               # store $4 value from file1 into A
       {min[NR]=$2; max[NR]=$3; chr[NR]=$1; next}   # store $1,$2,$3 values into separate arrays
{
  split($4,array,"_")                               # split $4 in file2 by the _
}
                  {                
     for (id in min)
           if([A] ~ array) && (($1==chr[NR])&&(min[id] <= $2 && $3 < max[id])) {  # match $4 in A with array split and check for overlap using min max and chr from file1
                  $7 = print "overlap" else "missing";   # print value in $7 of file 2
}' file1 file2


Last edited by cmccabe; 02-15-2018 at 12:58 PM.. Reason: added code tags
# 2  
Old 02-15-2018
Code:
if([A] ~ array)

What is [A]?
And what is the above trying to test?

FYI, split
Code:
split(s, a [, r ])
    Split the string s into the array a and return the number of fields.
    The first piece is stored in a[1], the second piece in a[2], and so forth.

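a is indexed by consecutive integers starting at 1 (not by the text of the pieces, as you might expect), so you compare against one element such as array[1], not against the array as a whole.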
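As a quick illustration of the point above, here is a minimal sketch that splits one $4 value from file2 and prints the piece count plus two pieces by position. The array name part and the shell variable result are just illustrative names, not anything from the original script.

```shell
# Split one file2 $4 value on "_" and address the pieces by position.
result=$(echo 'RPS19_cds_4_0_chr19_42373769_f' |
awk '{
    n = split($0, part, "_")     # n is the number of pieces
    print n, part[1], part[5]    # part[1] is the gene, part[5] the chromosome
}')
echo "$result"
# prints: 7 RPS19 chr19
```

So the gene name to compare with $4 of file1 would be part[1], and part[5] repeats the chromosome that is already in $1.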

Last edited by vgersh99; 02-15-2018 at 01:38 PM..
# 3  
Old 02-15-2018
Code:
if[A] ~ array)

is supposed to ensure that $4 in file1 matches the split value from $4 in file2. Using the first value, RPS19, as an example, only those lines in file2 with RPS19 are used. Thank you :).
# 4  
Old 02-15-2018
Quote:
Originally Posted by cmccabe
Code:
if[A] ~ array)

is supposed to ensure that $4 in file1 matches the split value from $4 in file2. Using the first value, RPS19, as an example, only those lines in file2 with RPS19 are used. Thank you :).
Okie dokie. [A] is an interesting "construct", but it is not valid awk; see my previous comments.
You probably meant ($1 in A), since $1 and array[5] hold the same value in file2.

$7 = print "overlap" else "missing"

What is that supposed to mean or do?
I think you are also missing a } in your last block, the one with the for loop.
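For reference, a small sketch of the two pieces mentioned above: the in operator tests whether a key exists in an array, and a conditional expression can choose the string that is assigned to $7. The lookup table A is filled by hand here and the chrX row is made up for the demonstration; neither comes from the real files.

```shell
# Two tab-separated rows: one real row from file2 and one invented chrX row.
result=$(
  printf 'chr19\t42373768\t42373823\tRPS19_cds_4_0_chr19_42373769_f\t0\t+\nchrX\t1\t2\tFOO_cds_0_0_chrX_2_f\t0\t+\n' |
  awk '
  BEGIN { FS = OFS = "\t"; A["chr19"] = "RPS19" }   # hypothetical lookup table
  {
      # ($1 in A) is true when $1 is a key of A; the ?: picks the label,
      # and assigning to $7 appends a seventh field to the record.
      $7 = ($1 in A) ? "overlap" : "missing"
      print
  }'
)
printf '%s\n' "$result"
```

The first row gets overlap in $7 and the chrX row gets missing. This only checks the chromosome key; the coordinate test would still have to be added to the condition.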

Last edited by vgersh99; 02-15-2018 at 02:16 PM..
# 5  
Old 02-15-2018
Quote:
$7 = print "overlap" else "missing"
what is that supposed to mean/do?
It is meant to put either overlap or missing in $7, depending on whether the conditions are met.

If I understand the comments correctly:

Code:
awk '
BEGIN { FS=OFS="\t" }                               # define FS and OFS as tab
              NR==FNR{                              # process same line in file1 and file2
           {
       A[$1]=$4;next}                               # store $4 value from file1 into A
       {min[NR]=$2; max[NR]=$3; chr[NR]=$1; next}   # store $1,$2,$3 values into separate arrays
{
  split($4,array,"_")                               # split $4 in file2 by the _
}
                  {                
     for (id in min)
           if($1 in A) ~ array[5])) && ($1==chr[NR]) && (min[id] <= $2 && $3 < max[id])) {  # match $4 in A with array split and check for overlap using min max and chr from file1
                  {
                    print "overlap", $7 else print "missing", $7  # print value in $7 of file 2
                  }                                          
  }
}' file1 file2

I updated the awk (specifically the print statement) but am still getting syntax errors.

Thank you very much :).

Last edited by cmccabe; 02-15-2018 at 02:54 PM.. Reason: updated awk, corrected typo
# 6  
Old 02-15-2018
Code:
if($1 in A)...

Code:
print "overlap", $7; else print "missing", $7

You'll also need to balance the parentheses in the if condition and fix the other remaining syntax errors.
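Putting the hints in this thread together, one possible working version might look like the sketch below. It is not the code anyone posted; it assumes the gene name is the first _-separated piece of $4 in file2, that gene names contain no underscore, and that "overlap" means the file2 interval lies inside a file1 interval for the same gene and chromosome (the same min <= $2 && $3 < max test used in the question). The file names sample1.tsv, sample2.tsv and labelled.tsv are hypothetical stand-ins for file1, file2 and the output, filled with a few rows from the first post.

```shell
# Hypothetical sample files built from rows shown in the first post.
printf 'chr19\t42373737\t42373856\tRPS19\nchr6\t32790021\t32790140\tTAP2\n' > sample1.tsv
{
  printf 'chr19\t42373100\t42373284\tRPS19_cds_3_0_chr19_42373101_f\t0\t+\n'
  printf 'chr19\t42373768\t42373823\tRPS19_cds_4_0_chr19_42373769_f\t0\t+\n'
  printf 'chr6\t32790065\t32790095\tTAP2_cds_0_0_chr6_32790066_r\t0\t-\n'
  printf 'chr6\t32797176\t32797313\tTAP2_cds_1_0_chr6_32797177_r\t0\t-\n'
} > sample2.tsv

awk '
BEGIN { FS = OFS = "\t" }
NR == FNR {                        # true only while reading the first file
    n++
    chr[n] = $1; min[n] = $2 + 0; max[n] = $3 + 0; gene[n] = $4
    next
}
{
    split($4, part, "_")           # part[1] is the gene name, e.g. RPS19
    status = "missing"
    for (i = 1; i <= n; i++)       # compare against every stored region
        if (part[1] == gene[i] && $1 == chr[i] && min[i] <= $2 && $3 < max[i]) {
            status = "overlap"
            break
        }
    print $0, status               # original six fields plus the label as $7
}' sample1.tsv sample2.tsv > labelled.tsv

cat labelled.tsv
```

On these sample rows the seventh column comes out as missing, overlap, overlap, missing, which agrees with the desired output in the first post. The inner loop checks every file1 region for every file2 line, which is simple but may be slow if both files are very large; keying the stored regions by gene would be one way to reduce that.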
# 7  
Old 02-19-2018
Thank you very much for your help :).