awk to filter file using another working on smaller subset


 
# 1  
Old 12-02-2016

With the awk below, if I use the attached file as the input, I get no results for TCF4. However, if I copy just that one line out of the attached file and use it as the input, I do get a result for TCF4.

Basically, the gene file is a one-column list that is used to filter $8 of the attached file. When there is a match, the entire line is printed. I am not sure why the awk works on the smaller input but not on the attached file, which is the real input. Thank you :).

The tab-delimited file is ~8,500 lines.

contents of gene
Code:
SCN1A
SCN2A
TCF4

TCF4 line as input
Code:
7722    chr18    53303101    53303101    C    G    intergenic    TCF4;ST8SIA3    dist=47241;dist=1716620    .    .    .    rs611326    1.    1.    0.99    1.    1.    1.    1.    1.    0.99    1.    1.    1.    1.    1.    1.    1.    1.    1.    0.99    .    .    1    T    .    B    .    B    .    .    1.000    P    .    .    .    .    .    .    .    GOOD    80    hom    23    .    .

result
Code:
7722    chr18    53303101    53303101    C    G    intergenic    TCF4;ST8SIA3    dist=47241;dist=1716620    .    .    .    rs611326    1.    1.    0.99    1.    1.    1.    1.    1.    0.99    1.    1.    1.    1.    1.    1.    1.    1.    1.    0.99    .    .    1    T    .    B    .    B    .    .    1.000    P    .    .    .    .    .    .    .    GOOD    80    hom    23    .    .

awk
Code:
awk -F'\t' 'NR==FNR{a[$0];next} FNR==1{print} $8 in a{$1=++c; print}' gene file


Last edited by cmccabe; 12-02-2016 at 10:49 PM.. Reason: added awk
# 2  
Old 12-02-2016
Quote:
Originally Posted by cmccabe
In the below awk ... ... ...
What "below awk"???
# 3  
Old 12-02-2016
Sorry, I added it to the post. Thank you :).
# 4  
Old 12-02-2016
I don't see why you would think that $8 (TCF4;ST8SIA3) in that line in that file would be found in the array a[] when the only values you put into that array are SCN1A, SCN2A, and TCF4.
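A small sketch of what is going on, using a cut-down copy of the TCF4 line (only its first 8 fields) and placeholder file names in a scratch directory, none of which come from the attachment. The `in` operator tests for an exact array key, so `TCF4;ST8SIA3` never matches `TCF4`. The one-line test most likely only appeared to work because `FNR==1{print}` prints the first line of the data file unconditionally, which would also explain why the posted result still has 7722 in $1 instead of a renumbered 1.

```shell
# Scratch directory; all file names here are placeholders for illustration.
tmp=$(mktemp -d)
printf 'SCN1A\nSCN2A\nTCF4\n' > "$tmp/gene"
# One-line input holding only the first 8 fields of the TCF4 line.
printf '7722\tchr18\t53303101\t53303101\tC\tG\tintergenic\tTCF4;ST8SIA3\n' > "$tmp/one_line"

# Original command: the line comes out via FNR==1{print}, not via the match.
header_only=$(awk -F'\t' 'NR==FNR{a[$0];next} FNR==1{print} $8 in a{$1=++c; print}' "$tmp/gene" "$tmp/one_line")

# Same command without the header rule: nothing is printed.
no_header=$(awk -F'\t' 'NR==FNR{a[$0];next} $8 in a{$1=++c; print}' "$tmp/gene" "$tmp/one_line")

printf 'with FNR==1 rule:    [%s]\n' "$header_only"
printf 'without FNR==1 rule: [%s]\n' "$no_header"
rm -rf "$tmp"
```

In the full file the TCF4 line is not line 1, so neither rule prints it.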
# 5  
Old 12-02-2016
What would you recommend? The awk seems to work as expected with a limited data set. There are many lines like this one, where $8 holds several names separated by ;, but the name I am looking for will be in there.

Thank you :)
# 6  
Old 12-03-2016
"the name will be in there" is quite vague, but you might try (totally untested):
Code:
awk -F'\t' 'NR==FNR{a[$0];next} FNR==1{print} {x=$8; sub(/;.*/,"",x)} x in a{$1=++c; print}' gene file
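The command above only looks at the name before the first `;`. If a gene from the list may appear anywhere in $8 (ST8SIA3 rather than TCF4, say), one possible variant, equally untested against the real file, splits $8 on `;` and checks every piece. The sample rows, the header line, and the file names below are made up for illustration, and the `$1=++c` renumbering is left out to keep the sketch short.

```shell
# Scratch directory with made-up sample data (not the attached file).
tmp=$(mktemp -d)
printf 'SCN1A\nST8SIA3\n' > "$tmp/gene"
printf 'Index\tChr\tStart\tEnd\tRef\tAlt\tFunc\tGene\n' > "$tmp/file"
printf '10\tchr18\t1\t1\tC\tG\tintergenic\tTCF4;ST8SIA3\n' >> "$tmp/file"
printf '11\tchr2\t2\t2\tA\tT\texonic\tSCN1A\n' >> "$tmp/file"
printf '12\tchr3\t3\t3\tA\tT\texonic\tOTHER\n' >> "$tmp/file"

matched=$(awk -F'\t' '
    NR==FNR { a[$0]; next }           # first file: load gene names as array keys
    FNR==1  { print; next }           # second file: pass the header line through
    {
        n = split($8, parts, ";")     # field 8 may hold several names
        for (i = 1; i <= n; i++)
            if (parts[i] in a) { print; break }   # print once on the first hit
    }' "$tmp/gene" "$tmp/file")

printf '%s\n' "$matched"
rm -rf "$tmp"
```

With this sample data the header, line 10 (matched on its second name), and line 11 are printed; line 12 is dropped.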

# 7  
Old 12-03-2016
Quote:
Originally Posted by cmccabe
What would you recommend? The awk seems to work as expected with a limited data set. There are many lines like this one, where $8 holds several names separated by ;, but the name I am looking for will be in there.
Thank you :)
Hello cmccabe,

I am not sure this is exactly the output you need, since I only had your attempt to go on, but could you please try the following and let me know if it helps?
Code:
awk -F"\t" 'FNR==NR{A[$0];next} {split($8, B,";");P=B[1]} (P in A){$1=++c;print}' gene file

The output will be as follows.
Code:
1 chr18 53303101 53303101 C G intergenic TCF4;ST8SIA3 dist=47241;dist=1716620 . . . rs611326 1. 1. 0.99 1. 1. 1. 1. 1. 0.99 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.99 . . 1 T . B . B . . 1.000 P . . . . . . . GOOD 80 hom 23 . .

You could set the output field separator to TAB if you need tab-delimited output.
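Assigning to $1 makes awk rebuild the record using OFS, which defaults to a single space; that is why the output shown is space-separated. A minimal sketch of keeping the tabs, again with a shortened 8-field sample line and placeholder file names that are not from the thread:

```shell
# Scratch directory with a made-up one-line sample (first 8 fields only).
tmp=$(mktemp -d)
printf 'TCF4\n' > "$tmp/gene"
printf '7722\tchr18\t53303101\t53303101\tC\tG\tintergenic\tTCF4;ST8SIA3\n' > "$tmp/file"

renumbered=$(awk '
    BEGIN   { FS = OFS = "\t" }                 # read and write tab-delimited records
    FNR==NR { A[$0]; next }                     # first file: gene names as keys
    { split($8, B, ";"); P = B[1] }             # P = first name in field 8
    (P in A) { $1 = ++c; print }                # renumber field 1, rebuilt with OFS
    ' "$tmp/gene" "$tmp/file")

printf '%s\n' "$renumbered"
rm -rf "$tmp"
```

Setting FS and OFS together in BEGIN replaces the `-F"\t"` option, so the record is both split and rejoined on tabs.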

NOTE: Your command did not split the 8th field of the input file named file, so the whole field (TCF4;ST8SIA3) could not be found in the array that is created while the first file is read.

Thanks,
R. Singh