awk to filter file using another working on smaller subset


 
# 8  
Old 12-03-2016
I apologize, I am on my cell and it's hard to post, but the gene name is TCF4. In the file it may appear as TCF4 or as TCF4;xxx. I will try the code. Thank you :).

---------- Post updated 12-03-16 at 09:49 AM ---------- Previous update was 12-02-16 at 10:32 PM ----------

Thank you both, they both work great :).
# 9  
Old 12-06-2016
I cannot seem to adjust the awk to capture all forms of KCNMA1, which is the line in the attached gene.txt. I have also attached data.txt, which is tab-delimited.

So in the example below, both NONE;KCNMA1 and KCNMA1 would be captured in the output. The only other possibility would be KCNMA1;NONE; it is not in the file, but it could occur.

There could also be multiple ; separators, but the name (in this case KCNMA1) will always be included. Thank you :).

awk
Code:
awk -F'\t' -v OFS='\t' 'NR==FNR{a[$0];next} FNR==1{print} {x=$8; sub(/;.*/,"",x)} x in a{$1=++c; print}' gene.txt data.txt  > out

desired out
Code:
R_Index    Chr    Start    End    Ref    Alt    Func.IDP.refGene    Gene.IDP.refGene    GeneDetail.IDP.refGene    Inheritence    ExonicFunc.IDP.refGene    AAChange.IDP.refGene    avsnp147    PopFreqMax    1000G_ALL    1000G_AFR    1000G_AMR    1000G_EAS    1000G_EUR    1000G_SAS    ExAC_ALL    ExAC_AFR    ExAC_AMR    ExAC_EAS    ExAC_FIN    ExAC_NFE    ExAC_OTH    ExAC_SAS    ESP6500siv2_ALL    ESP6500siv2_AA    ESP6500siv2_EA    CG46    dpsi_max_tissue    dpsi_zscore    SIFT_score    SIFT_pred    Polyphen2_HDIV_score    Polyphen2_HDIV_pred    Polyphen2_HVAR_score    Polyphen2_HVAR_pred    LRT_score    LRT_pred    MutationTaster_score    MutationTaster_pred    MutationAssessor_score    MutationAssessor_pred    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID    Quality    Reads    Zygosity    Phred    Classification    HGMD    Sanger
4629    chr10    78944590    78944590    G    A    intergenic    NONE;KCNMA1    dist=NONE;dist=451371    .    .    .    rs1131824    0.7    0.41    0.7    0.27    0.25    0.34    0.33    0.36    0.64    0.19    0.27    0.38    0.37    0.35    0.32    0.45    0.62    0.36    0.47    -1.6276    -1.768    .    .    .    .    .    .    .    .    .    P    .    .    other    not_specified    RCV000117331.6    MedGen    CN169374    GOOD    117    het    6    .    .    .
4630    chr10    79396463    79396463    C    T    intronic    KCNMA1    .    .    .    .    rs12217221    0.21    0.14    0.02    0.16    0.18    0.21    0.17    .    .    .    .    .    .    .    .    .    .    .    0.14    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    GOOD    160    hom    43    .    .    .

Maybe:

Code:
awk -F'\t' -v OFS='\t' 'NR==FNR{a[$0];next} FNR==1{print} {x=$8; sub(/;.*/,"",""/;.*,x)} x in a{$1=++c; print}' gene.txt data.txt  > out


Last edited by cmccabe; 12-06-2016 at 05:05 PM.
# 10  
Old 12-06-2016
You might want to try something more like:
Code:
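awk -F'\t' -v OFS='\t' '
# First file (gene.txt): remember each gene name as an array index.
NR == FNR {
	a[$0]
	next
}
# Pattern with no action: print the header line of data.txt.
FNR == 1
# Every line of data.txt: split field 8 on ";" and print the line
# if any of the names is in the gene list.
{	n = split($8, x, /;/)
	for(i = 1; i <= n; i++)
		if(x[i] in a) {
			print
			next
		}
}' gene.txt data.txt > out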

which produces the output you said you wanted with those two input files (as long as we change each occurrence of four adjacent <space> characters in the output you said you wanted to a single <tab> character).

As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
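
To see why the first-token approach from post #9 misses NONE;KCNMA1 while the split() loop catches it, here is a small sketch. The sample files, the temporary directory, and the placeholder columns are made up for illustration and are not the attached gene.txt and data.txt. Only the row index is printed to keep the output short.

```shell
# Hypothetical sample files; names and values are invented for this demo.
tmp=$(mktemp -d)
printf 'KCNMA1\n' > "$tmp/gene.txt"

# Column 8 holds the gene field; the other columns are placeholders.
{
  printf 'R_Index\tc2\tc3\tc4\tc5\tc6\tc7\tGene\n'
  printf '1\t.\t.\t.\t.\t.\t.\tNONE;KCNMA1\n'
  printf '2\t.\t.\t.\t.\t.\t.\tKCNMA1\n'
  printf '3\t.\t.\t.\t.\t.\t.\tKCNMA1;NONE\n'
  printf '4\t.\t.\t.\t.\t.\t.\tTCF4\n'
} > "$tmp/data.txt"

# First-token approach: sub() drops everything from the first ";" onward,
# so "NONE;KCNMA1" is reduced to "NONE" and is not matched.
awk -F'\t' 'NR==FNR{a[$0];next} {x=$8; sub(/;.*/,"",x)} x in a{print $1}' \
    "$tmp/gene.txt" "$tmp/data.txt" > "$tmp/first_token.txt"

# split() approach: every ";"-separated name in field 8 is tested.
awk -F'\t' 'NR==FNR{a[$0];next}
{ n = split($8, x, /;/)
  for (i = 1; i <= n; i++) if (x[i] in a) { print $1; next } }' \
    "$tmp/gene.txt" "$tmp/data.txt" > "$tmp/all_tokens.txt"

cat "$tmp/first_token.txt"   # rows 2 and 3 only
cat "$tmp/all_tokens.txt"    # rows 1, 2 and 3
```

The split() loop matches the name wherever it sits in the ;-separated list, which also covers the KCNMA1;NONE case and fields with more than one ;.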