awk to update unknown value in file using range of another


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting awk to update unknown value in file using range of another
# 1  
Old 11-18-2016
awk to update unknown value in file using range of another

I am trying to use awk to update all the unknown values in $6 of file2, if the $4 value in file 2 is within the range of $1 of file1. If there is already a value in $6 other then unknown, it is skipped and the next line is processed. In my awk attempt below the final output is 6 tab-delimited fields. Thank you Smilie.

file1 (space-delimited)

Code:
chr1:4714792-4852594 AJAP1
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
chr1:15783224-15798586 CELA2A
chr1:16888814-16940057 NBPF1
chr1:16888814-16939982 NBPF1
chr1:25568728-25664704 C1orf63
chr1:25568747-25664656 C1orf63

file2 (tab-delimited)

Code:
chr1    3649533    3649653    chr1:3649533-3649653    .    TP73
chr1    4736396    4736516    chr1:4736396-4736516    .    unknown   
chr1    5923314    5923434    chr1:5923314-5923434    .    NPHP4
chr1    9161991    9162111    chr1:9161991-9162111    .    unknown
chr1    9162050    9162051    chr1:9162050-9162051    .    rs6697376

desired output
--- the second and fourth unknown values are updated based on the $4value and the range that they fall in $1 of file1
Code:
chr1    3649533    3649653    chr1:3649533-3649653    .    TP73
chr1    4736396    4736516    chr1:4736396-4736516    .    AJAP1  
chr1    5923314    5923434    chr1:5923314-5923434    .    NPHP4
chr1    9161991    9162111    chr1:9161991-9162111    .    unknown
chr1    9162050    9162051    chr1:9162050-9162051    .    rs6697376

awk with current output
Code:
awk -v OFS='\t' 'NR==FNR{
                           rstart[a[1]]=a[2]
                           rend[a[1]]=a[3]
                           value[a[1]]=$2
                           next} 
     $6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1]
                          {sub(/unknown/,value[$1],$6)}1' file1 file2 |    column -t
chr1  3649533  3649653  chr1:3649533-3649653  .  TP73
chr1  4736396  4736516  chr1:4736396-4736516  .
chr1  5923314  5923434  chr1:5923314-5923434  .  NPHP4
chr1  9161991  9162111  chr1:9161991-9162111  .
chr1  9162050  9162051  chr1:9162050-9162051  .  rs6697376

awk number 2 with output ---- sub(/unknown/,value[$1],$6)}1' hg19.txt input | column -t all one one line
Code:
awk -v OFS='\t' 'NR==FNR{
                           rstart[a[1]]=a[2]
                           rend[a[1]]=a[3]
                           value[a[1]]=$2
                           next} 
     $6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$6)}1' hg19.txt input |    column -t
chr1  3649533  3649653  chr1:3649533-3649653  .  TP73
chr1  4736396  4736516  chr1:4736396-4736516  .  unknown
chr1  5923314  5923434  chr1:5923314-5923434  .  NPHP4
chr1  9161991  9162111  chr1:9161991-9162111  .  unknown
chr1  9162050  9162051  chr1:9162050-9162051  .  rs6697376

I think I need a split split($1,a,/[:-]/) but the key is not unique, is there a better way?

The only way I can think off to make the key unique is, though I am not sure how to implement i:
----- matching $2 values in file1 are combined with the first lines rstart[a[1]]=a[2] being the start and the last lines rend[a[1]]=a[3] being the end
Code:
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A


Last edited by cmccabe; 11-18-2016 at 12:42 PM.. Reason: addes awk attempt 2
# 2  
Old 11-19-2016
Yes, clearly you need to use the split() call to define the array a[] that you are using, but (as you noted) you can't use the elements of that array as subscripts in another array because the values are not unique. Instead of you an array of minimum values and an array of maximum values indexed by the line number in your first file.

But, I don't understand the output that you say should be produced. Why do you want the output to be (with all occurrences of four spaces in your output replaced by <tab> characters):
Code:
chr1	3649533	3649653	chr1:3649533-3649653	.	TP73
chr1	4736396	4736516	chr1:4736396-4736516	.	AJAP1
chr1	5923314	5923434	chr1:5923314-5923434	.	NPHP4
chr1	9161991	9162111	chr1:9161991-9162111	.	unknown
chr1	9162050	9162051	chr1:9162050-9162051	.	rs6697376

instead of:
Code:
chr1	3649533	3649653	chr1:3649533-3649653	.	TP73
chr1	4736396	4736516	chr1:4736396-4736516	.	AJAP1
chr1	5923314	5923434	chr1:5923314-5923434	.	NPHP4
chr1	9161991	9162111	chr1:9161991-9162111	.	GPR157
chr1	9162050	9162051	chr1:9162050-9162051	.	rs6697376

?
This User Gave Thanks to Don Cragun For This Post:
# 3  
Old 11-19-2016
Your output is correct in the post but I do not know how to index correctly. Thank you Smilie.
# 4  
Old 11-19-2016
Try something more like:
Code:
awk '
BEGIN {	OFS = "\t"
}
FNR == NR {
	split($1, a, /[-:]/)
	s[++c] = a[1]
	m[c] = a[2]
	M[c] = a[3]
	t[c] = $2
	next
}
$6 == "unknown" {
	for(i = 1; i <= c; i++)
		if($1 == s[i] && $2 + 0 >= m[i] + 0 && $3 + 0 <= M[i] + 0) {
			$6 = t[i]
			break
		}
}
1' file1 file2

This User Gave Thanks to Don Cragun For This Post:
# 5  
Old 11-19-2016
Thank you very much for your help, so each line is indexed by the minimum and maximum of all matching values in file1 $2. I added what I hope is close to understanding. Thank you Smilie.

Code:
awk '
BEGIN { OFS = "\t"     --- output tab-delimeted
}
FNR == NR {   ---- process alll rows  and columns
        split($1, a, /[-:]/)     ---- split $1 on [-:]
        s[++c] = a[1]            ---- array to read each lline
        m[c] = a[2]       ---- array to store min line value
        M[c] = a[3]       ---- array to store max line value
        t[c] = $2          ---- define min and max match (all matching values used)
        next             ---- process next line
}
$6 == "unknown" {       ----- only use this alue from $6 of file2
        for(i = 1; i <= c; i++)    ---- loop od each line in file
                if($1 == s[i] && $2 + 0 >= m[i] + 0 && $3 + 0 <= M[i] + 0) {  – min and max for each match line
                        $6 = t[i]    ---- update each “unknown”
                        break      ---- break out of process
                }
}
1' file1 file2


Last edited by cmccabe; 11-19-2016 at 06:54 PM.. Reason: added details
# 6  
Old 11-19-2016
I would have commented the code this way:
Code:
awk '		# Run awk to process the following awk script...
BEGIN {	OFS = "\t"		# Set output field separator.
}
FNR == NR {			# For each line in the 1st input file...
	split($1, a, /[-:]/)	# Split the 1st field into a[] with hyphen and
				# colon as subfield separators.
	s[++c] = a[1]		# Save the symbol found at the start of this line.
				# c is the current line number in the 1st file.
	m[c] = a[2]		# Save the minimum value found on this line.
	M[c] = a[3]		# Save the maximum value found on this line.
	t[c] = $2		# Save the tag found on this line.
	next			# Skip the remaining steps in this script.
}
$6 == "unknown" {		# For lines in the 2nd input file where $6 is
				# the string "unknown"...
	for(i = 1; i <= c; i++)	# Search for an entry in the arrays set while
				# reading the 1st input file where the symbol
				# on that line is the same as $1 on this line
				# and the minimum on that line is <= $2 on this
				# line and the maximum on that line is >= $3 on
				# this line...
		if($1 == s[i] && $2 + 0 >= m[i] + 0 && $3 + 0 <= M[i] + 0) {
			$6 = t[i]	# and if all of those conditions are met,
					# replace the "unknown" with the tag from
					# that line, and
			break		# break out of this for loop.
		}
}
1				# Print the (possibly updated) line from the 2nd
				# file.
' file1 file2	# End the awk script and name the input files to be processed.

This User Gave Thanks to Don Cragun For This Post:
# 7  
Old 11-23-2016
Thank you very much Smilie.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

awk to update file with partial matching line in another file and append text

In the awk below I am trying to cp and paste each matching line in f2 to $3 in f1 if $2 of f1 is in the line in f2 somewhere. There will always be a match (usually more then 1) and my actual data is much larger (several hundreds of lines) in both f1 and f2. When the line in f2 is pasted to $3 in... (4 Replies)
Discussion started by: cmccabe
4 Replies

2. UNIX for Beginners Questions & Answers

How to sum value of a column by range defined in another file awk?

I have two files, file1.table is the count table, and the other is the range condition file2.range. file1.table chr start end count N1 0 48 1 N1 48 181 2 N1 181 193 0 N1 193 326 2 N1 326 457 0 N1 457 471 1 N1 471 590 2 N1 590 604 1 N1 604 752 1 N1 752 875 1 file2.range... (12 Replies)
Discussion started by: yifangt
12 Replies

3. Shell Programming and Scripting

awk to update file with sum of matching fields in another file

In the awk below I am trying to add a penalty to a score to each matching $1 in file2 based on the sum of $3+$4 (variable TL) from file1. Then the $4 value in file1 is divided by TL and multiplied by 100 (this valvue is variable S). Finally, $2 in file2 - S gives the updated $2 result in file2.... (2 Replies)
Discussion started by: cmccabe
2 Replies

4. Shell Programming and Scripting

awk to print out lines that do not fall between range in file

In the awk below I am trying to print out those lines in file2 that are no between $2 and $3 in file1. Both files are tab-delimeted and I think it's close but currently it is printeing out the matches. The --- are not part of the files they are just to show what lines match or fall into the range... (6 Replies)
Discussion started by: cmccabe
6 Replies

5. Shell Programming and Scripting

awk to update file based on 5 conditions

I am trying to use awk to update the below tab-delimited file based on 5 different rules/conditions. The final output is also tab-delimited and each line in the file will meet one of the conditions. My attemp is below as well though I am not very confident in it. Thank you :). Condition 1: The... (10 Replies)
Discussion started by: cmccabe
10 Replies

6. Shell Programming and Scripting

awk to filter file using range in another file

I have a very large tab-delimited, ~2GB file2 that I am trying to filter using $2 of file1. If $2 of file1 is in the range of $2 and $3 in file1 then the entire line of file2 is outputed. If the range match is not found then that line is skipped. The awk below does run but no output results. ... (3 Replies)
Discussion started by: cmccabe
3 Replies

7. Shell Programming and Scripting

awk to lookup value in one file in another range

I am trying to update the below awk, kindly provided by @RavinderSingh13, to update each line of file1 with either Low or No Low based on matching $2 of file1 to a range in $2 and $3 of file2. If the $2 value in file1 matches the range in file2 then that line is Low, otherwise it is No Low in the... (3 Replies)
Discussion started by: cmccabe
3 Replies

8. Shell Programming and Scripting

awk to lookup section of file in a range of another file

In the below, I am trying to lookup $1 and $2 from file1, in a range search using $1 $2 $3 of file2. If the search key from file1 is found in file2, then the word low is printed in the last field of that line in the updated file1. Only the last section of file1 needs to be searched, but I am not... (6 Replies)
Discussion started by: cmccabe
6 Replies

9. Shell Programming and Scripting

awk match to update contents of file

I am trying to match $1 in file1 with $2 in file2. If a match is found then $3 and $4 of file2 are copied to file1. Both files are tab-delimeted and I am getting a syntax error and would also like to update file1 in-place without creating a new file, but am not sure how. Thank you :). file1 ... (19 Replies)
Discussion started by: cmccabe
19 Replies

10. Shell Programming and Scripting

awk to update file if value within range

I have a file (sorted_unknown) with ~1400 $5 values before the - that are "unknown". What I am trying to do is use the text in $2 of (sort_targets) to update those "unknown" values in the (sorted_unknown). In $1 of (sort_targets) there are a set of numbers that can be used to update the "unknown"... (8 Replies)
Discussion started by: cmccabe
8 Replies
Login or Register to Ask a Question