awk to filter file using range in another file


 
# 1  
Old 11-02-2016

I have a very large (~2GB) tab-delimited file2 that I am trying to filter using $2 of file1. If $2 of file1 is in the range of $2 and $3 in file1 then the entire line of file2 is output. If no range match is found then that line is skipped. The awk below runs but produces no output. Thank you.

File1
Code:
chr1    913116    913250    chr1:913116-913250    .    ISG15
chr1    13100    13590    chr1:13100-13590    .    ISG15
chr1    955542    955763    chr1:955542-955763    .    AGRN

File2
Code:
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    INTEGRATION
1    13016    .    T    G    50    PASS    platforms=1;platformnames=Illumina;datasets=1;datasetnames=HiSeqPE300x;callsets=2;callsetnames=HiSeqPE300xfreebayes,HiSeqPE300xGATK;datasetsmissingcall=CGnormal,IonExome,SolidPE50x50bp,SolidSE75bp;lowcov=CS_CGnormal_lowcov,CS_IonExomeTVC_lowcov,CS_SolidPE50x50GATKHC_lowcov,CS_SolidSE75GATKHC_lowcov    GT:PS:DP:GQ    1|1:.:316:160
1    13118    .    A    G    50    PASS    platforms=1;platformnames=Illumina;datasets=1;datasetnames=HiSeqPE300x;callsets=2;callsetnames=HiSeqPE300xfreebayes,HiSeqPE300xGATK;datasetsmissingcall=CGnormal,IonExome,SolidPE50x50bp,SolidSE75bp;lowcov=CS_CGnormal_lowcov,CS_IonExomeTVC_lowcov,CS_SolidPE50x50GATKHC_lowcov,CS_SolidSE75GATKHC_lowcov    GT:PS:DP:GQ    1|1:.:310:160
1    15211    .    T    G    50    PASS    platforms=1;platformnames=Illumina;datasets=1;datasetnames=HiSeqPE300x;callsets=2;callsetnames=HiSeqPE300xfreebayes,HiSeqPE300xGATK;datasetsmissingcall=CGnormal,IonExome,SolidPE50x50bp,SolidSE75bp;lowcov=CS_CGnormal_lowcov,CS_IonExomeTVC_lowcov,CS_SolidPE50x50GATKHC_lowcov,CS_SolidSE75GATKHC_lowcov;filt=CS_HiSeqPE300xfreebayes_filt    GT:PS:DP:GQ    1/1:.:627:160

desired output tab-delimited
Code:
1    13118    .    A    G    50    PASS    platforms=1;platformnames=Illumina;datasets=1;datasetnames=HiSeqPE300x;callsets=2;callsetnames=HiSeqPE300xfreebayes,HiSeqPE300xGATK;datasetsmissingcall=CGnormal,IonExome,SolidPE50x50bp,SolidSE75bp;lowcov=CS_CGnormal_lowcov,CS_IonExomeTVC_lowcov,CS_SolidPE50x50GATKHC_lowcov,CS_SolidSE75GATKHC_lowcov    GT:PS:DP:GQ    1|1:.:310:160

awk
Code:
awk -F'\t' -v OFS='\t' '                   
    NR == FNR {min[$1]=$2; max[$1]=$3;}
    {                
        for (id in min) 
            if (min[id] < $2 && $2 < max[id]) {
                print $0, id
                break              
            }
    }                                     
' file1 file2


Last edited by cmccabe; 11-02-2016 at 06:32 PM.. Reason: added details
# 2  
Old 11-02-2016
Hi,
Quote:
Originally Posted by cmccabe
If $2 of file1 is in the range of $2 and $3 in file1 then the entire line of file2 is output.
This assertion is always true, and your example does not make your requirements clear.

Could you be more precise?

Regards.
# 3  
Old 11-02-2016
An example that I hope is helpful: $2 in the first line of file2 is 13016, and that number does not fall in any $2 to $3 range of file1, so that entire line in file2 is removed. $2 in the second line of file2 is 13118, and that number does fall in the $2 to $3 range in line 2 of file1, so that entire line in file2 is printed in the output. Is this helpful? Thank you very much.
# 4  
Old 11-03-2016
Quote:
Originally Posted by cmccabe
I have a very large (~2GB) tab-delimited file2 that I am trying to filter using $2 of file1. If $2 of file1 is in the range of $2 and $3 in file1 then the entire line of file2 is output. If no range match is found then that line is skipped. The awk below runs but produces no output. Thank you.

File1
Code:
chr1    913116    913250    chr1:913116-913250    .    ISG15
chr1    13100    13590    chr1:13100-13590    .    ISG15
chr1    955542    955763    chr1:955542-955763    .    AGRN

File2
Code:
#CHROM    POS    ID    REF    ALT    ...
1    13016    .    T    G    ...
1    13118    .    A    G    ...
1    15211    .    T    G    ...

desired output tab-delimited
Code:
1    13118    .    A    G    ...

awk
Code:
awk -F'\t' -v OFS='\t' '                   
    NR == FNR {min[$1]=$2; max[$1]=$3;}
    {                
        for (id in min) 
            if (min[id] < $2 && $2 < max[id]) {
                print $0, id
                break              
            }
    }                                     
' file1 file2

Note that since $1 in file1 is always the same (i.e., chr1), there is only one element in the arrays min["chr1"] and max["chr1"] with their values being reset by each line that is read from file1.
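The overwriting can be seen with a minimal sketch. It assumes a scratch copy of the sample file1 written with real <tab> characters; the temporary directory and the `ranges` variable are illustrative names, not part of the original script.

```shell
# Build a scratch copy of the sample file1 with real <tab> separators.
tmpdir=$(mktemp -d)
printf 'chr1\t913116\t913250\tchr1:913116-913250\t.\tISG15\n'  > "$tmpdir/file1"
printf 'chr1\t13100\t13590\tchr1:13100-13590\t.\tISG15\n'     >> "$tmpdir/file1"
printf 'chr1\t955542\t955763\tchr1:955542-955763\t.\tAGRN\n'  >> "$tmpdir/file1"

# Key the arrays on $1, as the original script does, then list what survived.
ranges=$(awk -F'\t' '
    { min[$1] = $2; max[$1] = $3 }
    END {
        n = 0
        for (id in min) {
            n++
            print id, min[id], max[id]
        }
        print "elements:", n
    }
' "$tmpdir/file1")

printf '%s\n' "$ranges"
rm -rf "$tmpdir"
```

With the three sample lines this prints `chr1 955542 955763` followed by `elements: 1`: only the last range read is left, so a position such as 13118 has nothing to match against.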

If your input and output files really do have <tab> delimited fields (instead of fields separated by four <space>s as in the sample data you provided), the following slight changes to your code seem to produce the output you want:
Code:
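awk -F'\t' -v OFS='\t' '
    # file1: store each range under its line number so that no range overwrites another;
    # next keeps file1 lines out of the range test below.
    NR == FNR { min[NR] = $2; max[NR] = $3; next }
    # file2: print the line if $2 lies strictly inside any stored range.
    # Adding 0 forces a numeric rather than a string comparison.
    {
        for (id in min)
            if (min[id]+0 < $2+0 && $2+0 < max[id]+0) {
                print
                break
            }
    }
' file1 file2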
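
As a quick check, the code above can be run against scratch copies of the sample data. This is only a sketch: the files are written with real <tab> characters, file2 is cut down to its first five columns to keep the lines short, and the temporary directory and `result` variable are illustrative names.

```shell
tmpdir=$(mktemp -d)

# file1: the three ranges from the post.
printf 'chr1\t913116\t913250\tchr1:913116-913250\t.\tISG15\n'  > "$tmpdir/file1"
printf 'chr1\t13100\t13590\tchr1:13100-13590\t.\tISG15\n'     >> "$tmpdir/file1"
printf 'chr1\t955542\t955763\tchr1:955542-955763\t.\tAGRN\n'  >> "$tmpdir/file1"

# file2: header plus the three records, truncated to five columns (assumption for brevity).
printf '#CHROM\tPOS\tID\tREF\tALT\n'  > "$tmpdir/file2"
printf '1\t13016\t.\tT\tG\n'         >> "$tmpdir/file2"
printf '1\t13118\t.\tA\tG\n'         >> "$tmpdir/file2"
printf '1\t15211\t.\tT\tG\n'         >> "$tmpdir/file2"

# Same logic as the suggested fix: ranges keyed by line number, numeric comparison.
result=$(awk -F'\t' -v OFS='\t' '
    NR == FNR { min[NR] = $2; max[NR] = $3; next }
    {
        for (id in min)
            if (min[id]+0 < $2+0 && $2+0 < max[id]+0) {
                print
                break
            }
    }
' "$tmpdir/file1" "$tmpdir/file2")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Only the 13118 record is printed, since 13100 < 13118 < 13590. The header line is dropped as well, because "POS" evaluates to 0 in the numeric comparison. Note that the comparison is strict, so a position exactly equal to a range boundary would not be printed.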
