awk to search field2 in file2 using range of fields file1 and using match to another field in file1


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting awk to search field2 in file2 using range of fields file1 and using match to another field in file1
# 1  
Old 12-21-2016
awk to search field2 in file2 using range of fields file1 and using match to another field in file1

I am trying to use awk to find all the $2 values in file2 which is ~30MB and tab-delimited, that are between $2 and $3 in file1 which is ~2GB and tab-delimited.

I have just found out that I need to use $1 and $2 and $3 from file1 and $1 and $2of file2 must match $1 of file1 and be in the range of $2 and $3 of file1.

If the line in file2 is not found then it is skipped and the next line is processed.

I have removed the chr from $1 file1 so it now looks like:

file1 ~2GB
Code:
1  948953  948956  chr1:948953-948956  .   ISG15
1  949363  949858  chr1:949363-949858  .   ISG15

file 2 ~30MB
Code:
1  12214   rs202068986 C   G
1  949800  rs201725126 T   G

so in order for the line to be printed in the output it must match $1 and be in the range of $2 and $3 of file2

So, since the line from file2 matches $1 in file1 and is in the $2 and $3 range it is printed in the output.

desired output tab-delimited
Code:
1   949800  rs201725126 T   G     ISG15

awk
Code:
awk -F'\t' -v OFS='\t' 'FNR==1 { next }
      FNR == NR { file1[$1,$2,$3] = $1 FS $2 FS $3 }
      FNR != NR { file2[$1,$2,] = $1 FS $2 FS $3 }
            END { print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k]
      }' file1 file2 > out

awk 2 (very slow) and doesn't use $1 as a match
Code:
awk -F'\t' -v OFS='\t' '                   
NR == FNR {min[NR]=$2; max[NR]=$3; Gene[NR]=$NF; next}
 {                
     for (id in min) 
         if (min[id] < $2 && $2 < max[id]) {
             print $0, id, Gene[id]
             break              
         }
}                                     
' file1 file2

Also tried:
Code:
awk 'NR==FNR{a[$2]=$0;next}{for(i in a) if(i=$1&&i>=$2&&i<=$3) print a[i],$6}' file2 file1

Thank you Smilie.

Last edited by cmccabe; 12-21-2016 at 12:24 PM.. Reason: Changed a CODE tag pair to be ICODE
# 2  
Old 12-21-2016
Hello cmccabe,

I am not sure if I get your requirement completely so following code assumes that you want to match the contents according to Input_file2(means whatever and how many number of lines and fields in Input_file2 will be only matched to Input_fil1), also it checks condition where it checks values from Input_file1 should be greater than second field and lesser than 3rd field not vice versa.
Code:
awk 'FNR==NR{A[$1]=$0;B[$1,++D[$1]]=$2;next} {++C[$1]}($2<B[$1,C[$1]] && $3>B[$1,C[$1]]){print A[$1] FS $NF}'  Input_file2   Input_file1

You could make it tab delimited too, also if above assumption is not ok for you, kindly do let us know all the conditions and your query with complete details.

Thanks,
R. Singh
This User Gave Thanks to RavinderSingh13 For This Post:
# 3  
Old 12-21-2016
I don't know if this is the best way to do this, but it seems close. The below uses only $2 in file2 and looks in file1 using $2 as the min or lower and $3 as the max or upper.

So using the files below both line 2 and 3 would be printed in the output because only $2 from file2 was used to lookup in file1 with $2 being the lower and $3 being the upper. Since the value from file2 is between those two, it is printed. If $1 was also used in the lookup only line 2 would be printed. However I am not sure how to incorporate that into the awk or if there is a more efficient way to do this? Thank you very much Smilie.

Does this help?

awk
Code:
awk -F'\t' -v OFS='\t' '                   
NR == FNR {min[NR]=$2; max[NR]=$3; Gene[NR]=$NF; next}
 {                
     for (id in min) 
         if (min[id] < $2 && $2 < max[id]) {
             print $0, id, Gene[id]
             break              
         }
}                                     
' file1 file2

file1 tab-delimited

Code:
1  948953  948956  chr1:948953-948956  .   ISG15
1  949363  949858  chr1:949363-949858  .   ISG15

file2 tab-delimited

Code:
1  12214   rs202068986 C   G
1  949800  rs201725126 T   G
2  949800  rs201725155 -   G

desired output tab-delimited with only the unique line from file2

Code:
1  949800  rs201725126 T   G


Last edited by cmccabe; 12-21-2016 at 01:21 PM.. Reason: added details
# 4  
Old 12-21-2016
Hello cmccabe,

Sorry still little confuse about requirement, trying the same code but with small change of code, because I could see my script's produced output was very near to you output shown in POST#3. You could try following.
Code:
awk 'FNR==NR{A[$1]=$0;B[$1,++D[$1]]=$2;next} {++C[$1]}($2<B[$1,C[$1]] && $3>B[$1,C[$1]]){print A[$1]}'  Input_file2   Input_file1

Thanks,
R. Singh
This User Gave Thanks to RavinderSingh13 For This Post:
# 5  
Old 12-21-2016
Code:
awk -F'\t' -v OFS='\t' 'FNR==NR{A[$1]=$0;B[$1,++D[$1]]=$2;next} {++C[$1]}($2<B[$1,C[$1]] && $3>B[$1,C[$1]]){print A[$1]}' NA12878_GIAB_highconf.txt IDP.bed > NA12878_GIAB_highconf_IDP.txt

The above command works great except when I use my actual datasets, the output is blank. They are the exact same file formats


NA12878_GIAB_highconf is file2 in the example
~80MB
Code:
1    13116    .    T    G
1    13118    .    A    G
1    13656    .    CAG    C

IDP.txt is file1 in the example
~3MB
Code:
1    948953    948956    chr1:948953-948956    .    ISG15
1    949363    949858    chr1:949363-949858    .    ISG15

I was able to decrease the size but they are still too large to attach. Thank you Smilie.

Last edited by cmccabe; 12-21-2016 at 03:16 PM.. Reason: fixed format
# 6  
Old 12-22-2016
Try:
Code:
awk '
BEGIN {	FS = OFS = "\t"
}
FNR == NR {
	low[$1, ++c[$1]] = $2
	high[$1, c[$1]] = $3
	data[$1, c[$1]] = $NF
	next
}
{	for(i = 1; i <= c[$1]; i++)
		if($2 >= low[$1, i] && $2 <= high[$1, i]) {
			print $0, data[$1, i]
			next
		}
}' file1 file2

This obviously won't work with the sample data you provided (since your sample data contains varying numbers of spaces as field separators, not tabs; and are not aligned on normal tab stop boundaries), but works fine with the data you provided if we 1st change each sequence of one or more spaces in your sample files to a single tab character.

As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
These 2 Users Gave Thanks to Don Cragun For This Post:
# 7  
Old 12-22-2016
Thank you both very much Smilie
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

awk to update field using matching value in file1 and substring in field in file2

In the awk below I am trying to set/update the value of $14 in file2 in bold, using the matching NM_ in $12 or $9 in file2 with the NM_ in $2 of file1. The lengths of $9 and $12 can be variable but what is consistent is the start pattern will always be NM_ and the end pattern is always ;... (2 Replies)
Discussion started by: cmccabe
2 Replies

2. Shell Programming and Scripting

awk to update field in file2 if not the same as file1

Trying to use awk to: update $2 in file2 with the $2 value in file1, if $1 in file1 matches $13 in file2, which is tab-delimeted. The $2values may already be the same so in that case nothing happens and the next line is processed. There are exactly 4,605 unique $13 values. Thank you :). ... (4 Replies)
Discussion started by: cmccabe
4 Replies

3. UNIX for Dummies Questions & Answers

Compare file1 and file2, print matching lines in same order as file1

I want to print only the lines in file2 that match file1, in the same order as they appear in file 1 file1 file2 desired output: I'm getting the lines to match awk 'FNR==NR {a++}; FNR!=NR && a' file1 file2 but they are in sorted order, which is not what I want: Can anyone... (4 Replies)
Discussion started by: pathunkathunk
4 Replies

4. Shell Programming and Scripting

Retreive the records from file2 by using the first field in file1

Hi Freinds, i have a file1 as below file1 1|ndmf|fdd|d3484|34874 2|jdehf|wru7|478|w489 3|dfkj|wej|484|49894 file2 contains lakhs of records and not in sorted order i want to retrive only the records from file2 by searcing the first field of file 1 i used grep ^1 file2... (4 Replies)
Discussion started by: i150371485
4 Replies

5. UNIX for Dummies Questions & Answers

if matching strings in file1 and file2, add column from file1 to file2

I have very limited coding skills but I'm wondering if someone could help me with this. There are many threads about matching strings in two files, but I have no idea how to add a column from one file to another based on a matching string. I'm looking to match column1 in file1 to the number... (3 Replies)
Discussion started by: pathunkathunk
3 Replies

6. Shell Programming and Scripting

Match one column of file1 with that of file2

Hi, I have file1 like this aaa ggg ddd vvv eeeand file2 aaa 2 aaa 443 xxx 76 aaa 34 ggg 33 wee 99 ggg 33 ddd 1 ddd 10 ddd 98 sds 23 (4 Replies)
Discussion started by: polsum
4 Replies

7. Shell Programming and Scripting

AWK: Pattern match between 2 files, then compare a field in file1 as > or < field in file2

First, thanks for the help in previous posts... couldn't have gotten where I am now without it! So here is what I have, I use AWK to match $1 and $2 as 1 string in file1 to $1 and $2 as 1 string in file2. Now I'm wondering if I can extend this AWK command to incorporate the following: If $1... (4 Replies)
Discussion started by: right_coaster
4 Replies

8. Shell Programming and Scripting

Search & replace fields from file1 to file2

hi, I have two xml files with the name source.xml and tobe_replaced.xml. Sample data: source.xml contains: <?xml version="1.0"?> <product description="prod1" product_info="some/info"> <product description="prod2" product_info="xyz/allinfo"> <product description="abc/partialinfo"... (2 Replies)
Discussion started by: dragon.1431
2 Replies

9. Shell Programming and Scripting

awk/sed search lines in file1 matching columns in file2

Hi All, as you can see I'm pretty new to this board. :D I'm struggling around with small script to search a few fields in another file. Basically I have file1 looking like this: 15:38:28 sz:10001 pr:14.16 15:38:28 sz:10002 pr:18.41 15:38:29 sz:10003 pr:19.28 15:38:30 sz:10004... (1 Reply)
Discussion started by: floripoint
1 Replies

10. Shell Programming and Scripting

match value from file1 in file2

Hi, i've two files (file1, file2) i want to take value (in column1) and search in file2 if the they match print the value from file2. this is what i have so far. awk 'FILENAME=="file1"{ arr=$1 } FILENAME=="file2" {print $0} ' file1 file2 (2 Replies)
Discussion started by: myguess21
2 Replies
Login or Register to Ask a Question