awk to filter file using another working on smaller subset

12-03-2016

Registered User

1,393, 20

Join Date: Nov 2013

Last Activity: 1 May 2020, 2:35 PM EDT

Location: Chicago

Posts: 1,393

Thanks Given: 901

Thanked 20 Times in 19 Posts

I apologize, I am on my cell and it's hard to post, but TCF4 is the name in gene. In file it may appear as TCF4 or as TCF4;xxx. I will try the code. Thank you.


---------- Post updated 12-03-16 at 09:49 AM ---------- Previous update was 12-02-16 at 10:32 PM ----------

Thank you both, they both work great

cmccabe


12-06-2016


I cannot seem to adjust the awk to capture all conditions of KCNMA1, the line in the attached gene.txt. I have also attached data.txt, which is tab-delimited.

So in the example below, both NONE;KCNMA1 and KCNMA1 would be captured in the output. The only other possibility would be KCNMA1;NONE; that form is not in the file, but it could occur.

There could also be multiple ; separators, but the name, in this case KCNMA1, will always be included. Thank you.
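To illustrate why the sub() in the attempt below captures KCNMA1 and KCNMA1;NONE but not NONE;KCNMA1, here is a minimal sketch. The three sample values are hypothetical stand-ins for column 8, and the variable name first_parts is only for illustration. Removing everything from the first ; onward keeps just the first component, so a gene name in a later position is never compared against gene.txt.

```shell
# Hypothetical sample values for column 8 (not read from data.txt).
# sub(/;.*/, "", x) deletes the first ";" and everything after it,
# so only the first component of each value survives.
first_parts=$(printf '%s\n' 'NONE;KCNMA1' 'KCNMA1' 'KCNMA1;NONE' |
	awk '{ x = $0; sub(/;.*/, "", x); print $0 " -> " x }')
printf '%s\n' "$first_parts"
```

This should print NONE;KCNMA1 -> NONE, KCNMA1 -> KCNMA1 and KCNMA1;NONE -> KCNMA1, which is why the first form is missed when the result is looked up in the gene list.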


awk

Code:

awk -F'\t' -v OFS='\t' 'NR==FNR{a[$0];next} FNR==1{print} {x=$8; sub(/;.*/,"",x)} x in a{$1=++c; print}' gene.txt data.txt  > out

desired out

Code:

R_Index    Chr    Start    End    Ref    Alt    Func.IDP.refGene    Gene.IDP.refGene    GeneDetail.IDP.refGene    Inheritence    ExonicFunc.IDP.refGene    AAChange.IDP.refGene    avsnp147    PopFreqMax    1000G_ALL    1000G_AFR    1000G_AMR    1000G_EAS    1000G_EUR    1000G_SAS    ExAC_ALL    ExAC_AFR    ExAC_AMR    ExAC_EAS    ExAC_FIN    ExAC_NFE    ExAC_OTH    ExAC_SAS    ESP6500siv2_ALL    ESP6500siv2_AA    ESP6500siv2_EA    CG46    dpsi_max_tissue    dpsi_zscore    SIFT_score    SIFT_pred    Polyphen2_HDIV_score    Polyphen2_HDIV_pred    Polyphen2_HVAR_score    Polyphen2_HVAR_pred    LRT_score    LRT_pred    MutationTaster_score    MutationTaster_pred    MutationAssessor_score    MutationAssessor_pred    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID    Quality    Reads    Zygosity    Phred    Classification    HGMD    Sanger
4629    chr10    78944590    78944590    G    A    intergenic    NONE;KCNMA1    dist=NONE;dist=451371    .    .    .    rs1131824    0.7    0.41    0.7    0.27    0.25    0.34    0.33    0.36    0.64    0.19    0.27    0.38    0.37    0.35    0.32    0.45    0.62    0.36    0.47    -1.6276    -1.768    .    .    .    .    .    .    .    .    .    P    .    .    other    not_specified    RCV000117331.6    MedGen    CN169374    GOOD    117    het    6    .    .    .
4630    chr10    79396463    79396463    C    T    intronic    KCNMA1    .    .    .    .    rs12217221    0.21    0.14    0.02    0.16    0.18    0.21    0.17    .    .    .    .    .    .    .    .    .    .    .    0.14    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    GOOD    160    hom    43    .    .    .

Maybe:

Code:

awk -F'\t' -v OFS='\t' 'NR==FNR{a[$0];next} FNR==1{print} {x=$8; sub(/;.*/,"",""/;.*,x)} x in a{$1=++c; print}' gene.txt data.txt  > out

gene.txt (7 Bytes)

data.txt.tar.gz (622.4 KB)

Last edited by cmccabe; 12-06-2016 at 05:05 PM..

cmccabe


12-06-2016

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

You might want to try something more like:

Code:

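awk -F'\t' -v OFS='\t' '
# First file (gene.txt): store each gene name as an array key.
NR == FNR {
	a[$0]
	next
}
# Second file (data.txt): a pattern with no action prints the header line.
FNR == 1
# Every line: split column 8 on ";" and print the line if any part is a stored gene.
{	n = split($8, x, /;/)
	for(i = 1; i <= n; i++)
		if(x[i] in a) {
			print
			next
		}
}' gene.txt data.txt > out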

which produces the output you said you wanted with those two input files (as long as we change each occurrence of four adjacent <space> characters in the output you said you wanted to a single <tab> character).

As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
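For anyone without the attachments, the split-based approach can be tried on miniature inputs. This is only a sketch under stated assumptions: the two files below are hypothetical (eight tab-separated columns with placeholder names c2 to c7, and a made-up third row), and the temporary directory and the variable name matched are for illustration only.

```shell
# Hypothetical miniature inputs; the real gene.txt and data.txt are thread attachments.
tmp=$(mktemp -d)
printf 'KCNMA1\n' > "$tmp/gene.txt"
{
	printf 'R_Index\tc2\tc3\tc4\tc5\tc6\tc7\tGene.IDP.refGene\n'
	printf '1\t.\t.\t.\t.\t.\t.\tNONE;KCNMA1\n'
	printf '2\t.\t.\t.\t.\t.\t.\tKCNMA1\n'
	printf '3\t.\t.\t.\t.\t.\t.\tTCF4;NONE\n'
} > "$tmp/data.txt"

# Same logic as the script above: print the header, then any line whose
# column 8 contains a listed gene in any ";"-separated position.
matched=$(awk -F'\t' -v OFS='\t' '
NR == FNR { a[$0]; next }
FNR == 1
{
	n = split($8, x, /;/)
	for (i = 1; i <= n; i++)
		if (x[i] in a) { print; next }
}' "$tmp/gene.txt" "$tmp/data.txt")

# Show only columns 1 and 8 of the result to keep the output short.
printf '%s\n' "$matched" | cut -f1,8
rm -rf "$tmp"
```

With these sample files the header and rows 1 and 2 should be kept, and row 3 (TCF4;NONE) dropped, since TCF4 is not in the sample gene.txt.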

Don Cragun

