awk to lookup section of file in a range of another file


 
# 1  
Old 09-10-2016

I am trying to look up $1 and $2 from file1 using a range search against $1, $2, and $3 of file2. If the search key from file1 falls within a range in file2, the word low should be appended as the last field of that line in the updated file1. Only the last section of file1 needs to be searched, but I am not sure how to do this or whether I am on the right track. The last section of file1 is Missing in IDP but found in Reference:, and only the lines in this section need to be searched (in this example there are 2 lines). Both files are tab-delimited. Thank you :).

file1
Code:
Match:
chr15    68521889    C    T    exonic    CLN6    GOOD    50    het    4
chr7    147183143    A    G    intronic    CNTNAP2    GOOD    382    het    22
Missing in Reference but found in IDP:
chr2    51666313    T    C    intergenic    NRXN1,NONE    GOOD    108    het    7
chr2    166903445    T    C    exonic    SCN1A    GOOD    400    het    28
Missing in IDP but found in Reference:
2    166210776    C    T    exonic    SCN2A    c.[2994C>T]+[=]    3095    23.1    24.56
7    148106478    -    GT    intronic    CNTNAP2    c.3716-5_3716-4insGT    4168    28.6    51.01

file2
Code:
chr2    50573818    50574097    NRXN1
chr7    148106400    148106550    CNTNAP2

desired output
Code:
Match:
chr15    68521889    C    T    exonic    CLN6    GOOD    50    het    4
chr7    147183143    A    G    intronic    CNTNAP2    GOOD    382    het    22
Missing in Reference but found in IDP:
chr2    51666313    T    C    intergenic    NRXN1,NONE    GOOD    108    het    7
chr2    166903445    T    C    exonic    SCN1A    GOOD    400    het    28
Missing in IDP but found in Reference:
2    166210776    C    T    exonic    SCN2A    c.[2994C>T]+[=]    3095    23.1    24.56
7    148106478    -    GT    intronic    CNTNAP2    c.3716-5_3716-4insGT    4168    28.6    51.01     low

awk
Code:
awk -F'\t' -v OFS='\t' '
NR==FNR{ range[$1,$2]; next }
FNR==1
{
 for(x in range) {
 split(x, check, SUBSEP);
 if($1==check[1] && $2>=check[2] && $2<=check[3]) print "low"
 }
}
' file1 file2


Last edited by Don Cragun; 09-10-2016 at 12:51 PM.. Reason: added details, fixed format
# 2  
Old 09-10-2016
Hello cmccabe,

I think your post would benefit from some more information, such as:
i- Will the value of field 2 always be less than the value of field 3? (I have handled both cases in the code below.)
ii- You mentioned the last section of Input_file1 as Missing in Reference but found in IDP:, so do you mean the last column? Sorry, I didn't follow that, so my code does not account for it.
iii- Could you please let us know if we need to update the Input_file1 itself here?

As a start, you could try the following code. If you can state your requirements more clearly, I would be glad to help further.
Code:
awk 'FNR==NR{if($0 ~ /^[[:digit:]]/ || $0 ~ /^chr/){gsub(/[[:alpha:]]/,X,$1)};A[$1]=$2;B[$1]=$0;next} {gsub(/[[:alpha:]]/,X,$1)} ($1 in A){if((A[$1]>=$2 && A[$1]<=$3) || (A[$1]<=$2 && A[$1]>=$3)){print $0 FS "low";next}} END{for(i in A){print B[i]}} '  Input_file1  Input_file2

Output will be as follows.
Code:
7 148106400 148106550 CNTNAP2 low
Missing in IDP but found in Reference:
7    148106478    -    GT    intronic    CNTNAP2    c.3716-5_3716-4insGT    4168    28.6    51.01
Match:
2    166210776    C    T    exonic    SCN2A    c.[2994C>T]+[=]    3095    23.1    24.56
15 68521889 C T exonic CLN6 GOOD 50 het 4

Thanks,
R. Singh

Last edited by RavinderSingh13; 09-10-2016 at 12:30 PM..
# 3  
Old 09-10-2016
Quote:
i- Will the value of field 2 always be less than the value of field 3? (I have handled both cases in the code below.)
- Yes, field 2 will always be less than field 3.

Quote:
ii- You mentioned the last section of Input_file1 as Missing in Reference but found in IDP:, so do you mean the last column? Sorry, I didn't follow that, so my code does not account for it.
- In file1 there are 3 sections, each with a header:
Code:
section 1 is Match:
section 2 is Missing in Reference but found in IDP:
section 3 is Missing in IDP but found in Reference:

- Only the last section needs to be searched; the other two sections can be skipped. Each section header ends with a colon (:).
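
The section rule described here (a header is any line ending in a colon, and only the third section matters) can be sketched on its own before any range matching is added. This is a minimal illustration, not the final solution: the sample file is a cut-down copy of the first two columns of file1, and the names sample and section3 are made up for the example.

```shell
# Build a small tab-delimited sample with three sections (first two
# columns of file1 only; the file and variable names are illustrative).
sample=$(mktemp)
printf 'Match:\nchr15\t68521889\nMissing in Reference but found in IDP:\nchr2\t51666313\nMissing in IDP but found in Reference:\n2\t166210776\n7\t148106478\n' > "$sample"

# Count header lines (lines ending in ':') and keep only the data lines
# that follow the third header.
section3=$(awk -F'\t' '
/:$/      { sect++; next }   # header line: bump the section counter, skip it
sect == 3 { print }          # data line inside the third section
' "$sample")

printf '%s\n' "$section3"
rm -f "$sample"
```

Counting headers avoids hard-coding the header text, at the cost of assuming that no data line ever ends in a colon.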

Quote:
iii- Could you please let us know if we need to update the Input_file1 itself here?
- Yes, if file1 can be updated in place that would be helpful. Maybe through redirection?

Thank you very much :).
# 4  
Old 09-10-2016
There is still a missing piece to the puzzle. The field 1 values in the 3rd section of file1 are 2 and 7, neither of which equals any field 1 value in file2 (chr2 and chr7).

If field 1 string equality is not required for a match, what are the rules for determining whether or not a line in file1 matches a line in file2? Is field #1 supposed to be ignored, so that we just look for field #2 values in file1 that fall between field #2 and field #3 on any line in file2?
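
If field 1 were meant to take part in the match, one possible approach would be to strip a leading chr so that chr7 and 7 compare equal, and then test the range only against entries for the same chromosome. The following is a sketch under that assumption, not something requested in the thread; the temp files ranges and variants and the variable chr_matches are hypothetical names, and the data is cut down from file2 and the third section of file1.

```shell
dir=$(mktemp -d)
# file2-style ranges: chromosome, start, end, gene
printf 'chr2\t50573818\t50574097\tNRXN1\nchr7\t148106400\t148106550\tCNTNAP2\n' > "$dir/ranges"
# third-section-style lines from file1: bare chromosome number, position
printf '2\t166210776\n7\t148106478\n' > "$dir/variants"

chr_matches=$(awk -F'\t' -v OFS='\t' '
{ key = $1; sub(/^chr/, "", key) }   # normalise "chr7" and "7" to the same key
NR == FNR {                          # first file: store every range per chromosome
    n[key]++
    lo[key, n[key]] = $2 + 0
    hi[key, n[key]] = $3 + 0
    next
}
{                                    # second file: test only ranges on the same chromosome
    for (i = 1; i <= n[key]; i++)
        if (lo[key, i] <= $2 + 0 && $2 + 0 <= hi[key, i]) {
            $(NF + 1) = "low"
            break
        }
    print
}
' "$dir/ranges" "$dir/variants")

printf '%s\n' "$chr_matches"
rm -rf "$dir"
```

With this data only the chromosome 7 line gains a low field, because position 166210776 on chromosome 2 lies outside the single chr2 range.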
# 5  
Old 09-10-2016
I didn't even realize that the $1 values were formatted differently between the two files. Since $1 will never be equal between them, the test should be whether the field #2 value in file1 falls between field #2 and field #3 on any line in file2. These values should be unique. Also, if a match (or near match) is found, "low" should be printed; if no match (or near match) is found, "not low" should be printed. Thank you :).

Last edited by cmccabe; 09-10-2016 at 03:01 PM.. Reason: added details
# 6  
Old 09-10-2016
Maybe something like:
Code:
#!/bin/ksh
TmpFile=${0##*/}.$$
awk '
BEGIN {	# Set input and output field separators...
	FS = OFS = "\t"
}
NR == FNR {
	# Grab low and high ends of ranges from the 1st input file...
	low[++c] = $2
	high[c] = $3
	next
}
sect == 3 {
	# We are in the 3rd section of the 2nd input file (after the section
	# header line)...
	# Look for a range of values from a line in the 1st file that includes
	# the 2nd field in this file...
	for(i = 1; i <= c; i++)
		if(low[i] <= $2 && $2 <= high[i]) {
			# Match found, add field and break out of loop.
			$(NF + 1) = "low"
			break
		}
}
/:$/ {	# Increment the 2nd input file section number when we find a colon at
	# the end of a line...
	sect++
}
1	# print the current contents of the 2nd file input line.
' file2 file1 > "$TmpFile" &&	# End awk script, specifying input files and
				# redirect the output to a temp file...
    cp "$TmpFile" file1 && 	# If the awk script was successful, copy the
    				# temp file back to the 2nd input file...
    rm "$TmpFile"		# and, if that was also successful, remove the
    				# temp file.
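
Post #5 also asks for "not low" when no range matches, which the script above does not add. One way to extend it might look like the sketch below. It assumes the same section rule and field layout, runs against cut-down copies of the sample data in a scratch directory rather than the real files, and uses made-up names (work, label, labelled).

```shell
work=$(mktemp -d)
# Cut-down copies of file2 and file1 from the first post (fewer columns).
printf 'chr2\t50573818\t50574097\tNRXN1\nchr7\t148106400\t148106550\tCNTNAP2\n' > "$work/file2"
printf 'Match:\nchr15\t68521889\tC\tT\nMissing in Reference but found in IDP:\nchr2\t51666313\tT\tC\nMissing in IDP but found in Reference:\n2\t166210776\tC\tT\n7\t148106478\t-\tGT\n' > "$work/file1"

awk '
BEGIN { FS = OFS = "\t" }
NR == FNR {                      # 1st input file: remember each range
    low[++c] = $2 + 0
    high[c] = $3 + 0
    next
}
sect == 3 {                      # data lines after the 3rd section header
    label = "not low"            # default when no range contains field 2
    for (i = 1; i <= c; i++)
        if (low[i] <= $2 + 0 && $2 + 0 <= high[i]) {
            label = "low"
            break
        }
    $(NF + 1) = label            # always append exactly one new field
}
/:$/ { sect++ }                  # a header line starts the next section
1                                # print every line of the 2nd input file
' "$work/file2" "$work/file1" > "$work/file1.new" &&
    cp "$work/file1.new" "$work/file1" &&
    rm "$work/file1.new"

labelled=$(cat "$work/file1")
printf '%s\n' "$labelled"
rm -rf "$work"
```

Lines in the first two sections pass through unchanged; each line in the third section ends with either low or not low.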

# 7  
Old 09-14-2016
Thank you both very much :).