awk to remove multiple values from specific pattern, leaving a single value


 
# 1  
Old 05-30-2017

In the awk below I am trying to remove everything after the first ; (semicolon) or , (comma) in the ANN= pattern. I am using sub
to substitute an empty string for these, so that ANN= holds a single value (only the one right after ANN=). Thank you.
I have commented my awk and included a description of each line as well.

input (tab-delimited)
Code:
chr1	987200	.	C	T	1217.2	PASS	AF=1;AO=127;DP=127;FAO=127;FDP=127;FR=.;FRO=0;FSAF=63;FSAR=64;FSRF=0;FSRR=0;FWDB=-0.0049104;FXX=0;HRUN=1;LEN=1;MLLD=167.668;OALT=T;OID=.;OMAPALT=T;OPOS=987200;OREF=C;PB=.;PBP=.;QD=38.3369;RBI=0.0213032;REFB=0;REVB=0.0207296;RO=0;SAF=63;SAR=64;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=-5.21543e-08;STB=0.5;STBP=1;TYPE=snp;VARB=2.41961e-05;ANN=AGRN	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT	1/1:57:127:127:0:0:127:127:1:64:63:0:0:64:63:0:0:1	GOOD	127	hom	37
chr1	990280	.	C	T	2418.92	PASS	AF=1;AO=258;DP=264;FAO=260;FDP=260;FR=.;FRO=0;FSAF=120;FSAR=140;FSRF=0;FSRR=0;FWDB=0.0249502;FXX=0.0225555;HRUN=1;LEN=1;MLLD=92.2049;OALT=T;OID=.;OMAPALT=T;OPOS=990280;OREF=C;PB=.;PBP=.;QD=37.2141;RBI=0.0261262;REFB=-0.11255;REVB=-0.00775041;RO=0;SAF=118;SAR=140;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=0;STB=0.5;STBP=1;TYPE=snp;VARB=0.000526608;ANN=AGRN	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT	1/1:86:264:260:0:0:258:260:1:140:118:0:0:140:120:0:0:1	GOOD	260	hom	73
chr2	48915871	.	A	G	1624.87	PASS	AF=1;AO=170;DP=171;FAO=172;FDP=172;FR=.;FRO=0;FSAF=92;FSAR=80;FSRF=0;FSRR=0;FWDB=0.0234407;FXX=0;HRUN=1;LEN=1;MLLD=70.9343;OALT=G;OID=.;OMAPALT=G;OPOS=48915871;OREF=A;PB=.;PBP=.;QD=37.7877;RBI=0.0331357;REFB=0;REVB=0.0234202;RO=1;SAF=91;SAR=79;SRF=0;SRR=1;SSEN=0;SSEP=0;SSSB=0.00598669;STB=0.5;STBP=1;TYPE=snp;VARB=0.000172399;ANN=LHCGR;STON1-GTF2A1L,LHCGR	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT	1/1:76:171:172:1:0:170:172:1:79:91:0:1:80:92:0:0:1	GOOD	172	hom	49
chr2	48921375	.	T	C	481.192	PASS	AF=1;AO=51;DP=51;FAO=51;FDP=51;FR=.;FRO=0;FSAF=27;FSAR=24;FSRF=0;FSRR=0;FWDB=0.0171521;FXX=0;HRUN=2;LEN=1;MLLD=203.707;OALT=C;OID=.;OMAPALT=C;OPOS=48921375;OREF=T;PB=.;PBP=.;QD=37.7406;RBI=0.0241572;REFB=0;REVB=0.0170111;RO=0;SAF=27;SAR=24;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=0;STB=0.5;STBP=1;TYPE=snp;VARB=8.09379e-05;ANN=LHCGR,LHCGR;STON1-GTF2A1L	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT	1/1:23:51:51:0:0:51:51:1:24:27:0:0:24:27:0:0:1	GOOD	51	hom	15
chr2	48925746	.	C	T	1144.07	PASS	AF=1;AO=114;DP=114;FAO=119;FDP=119;FR=.,REALIGNEDx0.958;FRO=0;FSAF=54;FSAR=65;FSRF=0;FSRR=0;FWDB=0.0374429;FXX=0;HRUN=1;LEN=1;MLLD=261.838;OALT=T;OID=.;OMAPALT=T;OPOS=48925746;OREF=C;PB=.;PBP=.;QD=38.456;RBI=0.0379673;REFB=0;REVB=0.00628838;RO=0;SAF=51;SAR=63;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=3.26593e-08;STB=0.5;STBP=1;TYPE=snp;VARB=2.12074e-05;ANN=LHCGR;STON1-GTF2A1L,LHCGR	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT	1/1:54:114:119:0:0:114:119:1:63:51:0:0:65:54:0:0:1	GOOD	119	hom	35
chr2	49189921	.	C	T	570.875	PASS	AF=0.582474;AO=113;DP=193;FAO=113;FDP=194;FR=.,REALIGNEDx0.5825;FRO=81;FSAF=53;FSAR=60;FSRF=44;FSRR=37;FWDB=-0.00244613;FXX=0;HRUN=1;LEN=1;MLLD=239.763;OALT=T;OID=.;OMAPALT=T;OPOS=49189921;OREF=C;PB=.;PBP=.;QD=11.7706;RBI=0.0159301;REFB=-0.00100522;REVB=-0.0157412;RO=80;SAF=53;SAR=60;SRF=44;SRR=36;SSEN=0;SSEP=0;SSSB=-0.061858;STB=0.530968;STBP=0.315;TYPE=snp;VARB=0.000865756;ANN=FSHR	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT	0/1:280:193:194:80:81:113:113:0.582474:60:53:44:36:60:53:44:37:1	GOOD	194	het	17

desired output (tab-delimited)
Code:
chr1	987200	.	C	T	1217.2	PASS	AF=1;AO=127;DP=127;FAO=127;FDP=127;FR=.;FRO=0;FSAF=63;FSAR=64;FSRF=0;FSRR=0;FWDB=-0.0049104;FXX=0;HRUN=1;LEN=1;MLLD=167.668;OALT=T;OID=.;OMAPALT=T;OPOS=987200;OREF=C;PB=.;PBP=.;QD=38.3369;RBI=0.0213032;REFB=0;REVB=0.0207296;RO=0;SAF=63;SAR=64;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=-5.21543e-08;STB=0.5;STBP=1;TYPE=snp;VARB=2.41961e-05;ANN=AGRN	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT	1/1:57:127:127:0:0:127:127:1:64:63:0:0:64:63:0:0:1	GOOD	127	hom	37
chr1	990280	.	C	T	2418.92	PASS	AF=1;AO=258;DP=264;FAO=260;FDP=260;FR=.;FRO=0;FSAF=120;FSAR=140;FSRF=0;FSRR=0;FWDB=0.0249502;FXX=0.0225555;HRUN=1;LEN=1;MLLD=92.2049;OALT=T;OID=.;OMAPALT=T;OPOS=990280;OREF=C;PB=.;PBP=.;QD=37.2141;RBI=0.0261262;REFB=-0.11255;REVB=-0.00775041;RO=0;SAF=118;SAR=140;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=0;STB=0.5;STBP=1;TYPE=snp;VARB=0.000526608;ANN=AGRN	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT	1/1:86:264:260:0:0:258:260:1:140:118:0:0:140:120:0:0:1	GOOD	260	hom	73
chr2	48915871	.	A	G	1624.87	PASS	AF=1;AO=170;DP=171;FAO=172;FDP=172;FR=.;FRO=0;FSAF=92;FSAR=80;FSRF=0;FSRR=0;FWDB=0.0234407;FXX=0;HRUN=1;LEN=1;MLLD=70.9343;OALT=G;OID=.;OMAPALT=G;OPOS=48915871;OREF=A;PB=.;PBP=.;QD=37.7877;RBI=0.0331357;REFB=0;REVB=0.0234202;RO=1;SAF=91;SAR=79;SRF=0;SRR=1;SSEN=0;SSEP=0;SSSB=0.00598669;STB=0.5;STBP=1;TYPE=snp;VARB=0.000172399;ANN=LHCGR	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT	1/1:76:171:172:1:0:170:172:1:79:91:0:1:80:92:0:0:1	GOOD	172	hom	49
chr2	48921375	.	T	C	481.192	PASS	AF=1;AO=51;DP=51;FAO=51;FDP=51;FR=.;FRO=0;FSAF=27;FSAR=24;FSRF=0;FSRR=0;FWDB=0.0171521;FXX=0;HRUN=2;LEN=1;MLLD=203.707;OALT=C;OID=.;OMAPALT=C;OPOS=48921375;OREF=T;PB=.;PBP=.;QD=37.7406;RBI=0.0241572;REFB=0;REVB=0.0170111;RO=0;SAF=27;SAR=24;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=0;STB=0.5;STBP=1;TYPE=snp;VARB=8.09379e-05;ANN=LHCGR	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT	1/1:23:51:51:0:0:51:51:1:24:27:0:0:24:27:0:0:1	GOOD	51	hom	15
chr2	48925746	.	C	T	1144.07	PASS	AF=1;AO=114;DP=114;FAO=119;FDP=119;FR=.,REALIGNEDx0.958;FRO=0;FSAF=54;FSAR=65;FSRF=0;FSRR=0;FWDB=0.0374429;FXX=0;HRUN=1;LEN=1;MLLD=261.838;OALT=T;OID=.;OMAPALT=T;OPOS=48925746;OREF=C;PB=.;PBP=.;QD=38.456;RBI=0.0379673;REFB=0;REVB=0.00628838;RO=0;SAF=51;SAR=63;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=3.26593e-08;STB=0.5;STBP=1;TYPE=snp;VARB=2.12074e-05;ANN=LHCGR	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT	1/1:54:114:119:0:0:114:119:1:63:51:0:0:65:54:0:0:1	GOOD	119	hom	35
chr2	49189921	.	C	T	570.875	PASS	AF=0.582474;AO=113;DP=193;FAO=113;FDP=194;FR=.,REALIGNEDx0.5825;FRO=81;FSAF=53;FSAR=60;FSRF=44;FSRR=37;FWDB=-0.00244613;FXX=0;HRUN=1;LEN=1;MLLD=239.763;OALT=T;OID=.;OMAPALT=T;OPOS=49189921;OREF=C;PB=.;PBP=.;QD=11.7706;RBI=0.0159301;REFB=-0.00100522;REVB=-0.0157412;RO=80;SAF=53;SAR=60;SRF=44;SRR=36;SSEN=0;SSEP=0;SSSB=-0.061858;STB=0.530968;STBP=0.315;TYPE=snp;VARB=0.000865756;ANN=FSHR	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT	0/1:280:193:194:80:81:113:113:0.582474:60:53:44:36:60:53:44:37:1	GOOD	194	het	17

description
Code:
line1 is good as ANN= has no ; or , in it ANN=AGRN portion after the = is a single value
line2 is good as ANN= has no ; or , in it ANN=AGRN portion after the = is a single value
line3 ANN=LHCGR;STON1-GTF2A1L,LHCGR has both ; and , in it so everything after the first value is removed
line4 ANN=LHCGR,LHCGR;STON1-GTF2A1L has both ; and , in it so everything after the first value is removed
line5 ANN=LHCGR;STON1-GTF2A1L,LHCGR has both ; and , in it so everything after the first value is removed
line6 is good as ANN= has no ; or , in it ANN=FSHR portion after the = is a single value

awk
Code:
awk -F'\t' -v OFS="\t" '   # define input and output FS as tab
                      {if(/ANN=/);   # search each line for pattern ANN=
                      {sub(/;,*/,""); print}}' input   # if ANN= has a ; or , in it, substitute the values after it with an empty string (removing them)

# 2  
Old 05-30-2017
sub can't keep just part of a match like that, since awk's sub has no backreferences. Instead I use match to figure out exactly where the good part is, and keep only that.
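For reference, here is a quick check of what the sub() call from the first post does. The record below is a shortened, made-up stand-in for one input line, not real data. The regex /;,*/ matches the first semicolon anywhere in the record (plus any commas directly after it), so the substitution happens long before ANN= is reached.

```shell
# Shortened, hypothetical record standing in for one input line
line='AF=1;AO=170;ANN=LHCGR;STON1-GTF2A1L,LHCGR'

# Same sub() as in the first post: it removes only the first ";" in the record
result=$(printf '%s\n' "$line" | awk '{ sub(/;,*/, ""); print }')

printf '%s\n' "$result"
```

The semicolon after AF=1 disappears and the ANN= part is left untouched, which is why the regex has to be anchored on ANN= itself.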

Code:
awk -F"\t" -v OFS="\t" '{ for(N=1; N<=NF; N++) if(match($N, /ANN=[^;,]*/)) $N=substr($N, 1, RLENGTH+RSTART-1) ; } 1' inputfile > outputfile
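A small trace of what match() sets may make the arithmetic easier to follow. This is only a sketch: the field below is a shortened, hypothetical version of column 8, and the variable names are mine. match() stores the 1-based start of the match in RSTART and its length in RLENGTH, so RSTART+RLENGTH-1 is the position of the last character of the first ANN value.

```shell
# Hypothetical, shortened field standing in for column 8 of the input
field='TYPE=snp;VARB=0.1;ANN=LHCGR;STON1-GTF2A1L,LHCGR'

trace=$(printf '%s\n' "$field" | awk '{
    if (match($0, /ANN=[^;,]*/)) {
        # RSTART  = position of the "A" in ANN= (1-based)
        # RLENGTH = length of "ANN=" plus the first value
        last = RSTART + RLENGTH - 1   # position of the last character to keep
        print "RSTART=" RSTART, "RLENGTH=" RLENGTH, "last=" last
        print substr($0, 1, last)     # start index 1: awk strings are 1-based
    }
}')

printf '%s\n' "$trace"
```

Here the match starts at character 19 and is 9 characters long, so everything up to character 27 is kept and the rest of the field is dropped. The sketch uses 1 as the start index on the assumption that this is the portable choice; as far as I know, awk implementations do not all agree on what a start index below 1 returns.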

This User Gave Thanks to Corona688 For This Post:
# 3  
Old 05-30-2017
Hello cmccabe,

I am not sure I understood this 100%, so could you please try the following and let me know if it helps.
Code:
awk '!/ANN.*[;,]/{print;next} {match($0,/ANN[^:]*/);VAL=substr($0,RSTART,RLENGTH);if(VAL){split(VAL, A," ");sub(/[;,].*/,"",A[1]);sub(/ANN[^:]*/,A[1] "\t" A[2],$0)}} 1'  Input_file

Adding a multi-line form of the solution too.
Code:
awk '!/ANN.*[;,]/{
                        print;
                        next
                 }
                 {
                        match($0,/ANN[^:]*/);
                        VAL=substr($0,RSTART,RLENGTH);
                        if(VAL){
                                split(VAL, A," ");
                                sub(/[;,].*/,"",A[1]);
                                sub(/ANN[^:]*/,A[1] "\t" A[2],$0)
                                }
                 }
      1
    '    Input_file

Thanks,
R. Singh
This User Gave Thanks to RavinderSingh13 For This Post:
# 4  
Old 05-30-2017
Code:
perl -pe 's/(ANN=\w+)[;,][^\s]*/$1/' input

This User Gave Thanks to Aia For This Post:
# 5  
Old 05-31-2017
where [^\s] can be shortened to \S.
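The same backreference idea carries over to sed. The sketch below is a variation, not the perl one-liner itself: it assumes a sed that accepts -E for extended regular expressions, and it uses [^;,[:space:]]* in place of \w+ on the assumption that a first value containing a hyphen (such as STON1-GTF2A1L) should also be handled. The sample line is shortened and made up.

```shell
# Hypothetical, shortened line; the tab separates INFO from the FORMAT column
line=$(printf 'FR=.,REALIGNEDx0.958;ANN=LHCGR;STON1-GTF2A1L,LHCGR\tGT:GQ')

# Keep "ANN=" plus everything up to the first ; or , and drop the rest of that field
result=$(printf '%s\n' "$line" |
    sed -E 's/(ANN=[^;,[:space:]]*)[;,][^[:space:]]*/\1/')

printf '%s\n' "$result"
```

The comma inside FR=.,REALIGNEDx0.958 is left alone because the pattern only starts matching at ANN=, and a line whose ANN= value has no ; or , is passed through unchanged.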
This User Gave Thanks to MadeInGermany For This Post:
# 6  
Old 05-31-2017
Thank you all very much.

Corona688 in the portion of code below:

Code:
$N=substr($N, 1, RLENGTH+RSTART-1)

Is RLENGTH+RSTART-1 the entire string after ANN=, but the index of only the first value? So it basically captures everything, but only index 1 is printed? I am just trying to grasp this concept, and I think the earlier explanation by RavinderSingh13 helped. Thank you.

Last edited by cmccabe; 05-31-2017 at 09:33 AM.. Reason: fixed format
# 7  
Old 05-31-2017
Hello cmccabe,

You could refer to the following URL for an explanation of this function too.
https://www.unix.com/302997499-post12.html

Thanks,
R. Singh
This User Gave Thanks to RavinderSingh13 For This Post: