05-30-2017
awk to remove multiple values from specific pattern, leaving a single value
In the awk below I am trying to remove everything after the first ; (semi-colon) or , (comma) in the ANN= pattern. I am using sub to substitute an empty string for that text, so that ANN= holds a single value (only the value right after the ANN=). Thank you.
I have commented my awk and included a description of each line as well.
input (tab-delimited)
Code :
chr1 987200 . C T 1217.2 PASS AF=1;AO=127;DP=127;FAO=127;FDP=127;FR=.;FRO=0;FSAF=63;FSAR=64;FSRF=0;FSRR=0;FWDB=-0.0049104;FXX=0;HRUN=1;LEN=1;MLLD=167.668;OALT=T;OID=.;OMAPALT=T;OPOS=987200;OREF=C;PB=.;PBP=.;QD=38.3369;RBI=0.0213032;REFB=0;REVB=0.0207296;RO=0;SAF=63;SAR=64;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=-5.21543e-08;STB=0.5;STBP=1;TYPE=snp;VARB=2.41961e-05;ANN=AGRN GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT 1/1:57:127:127:0:0:127:127:1:64:63:0:0:64:63:0:0:1 GOOD 127 hom 37
chr1 990280 . C T 2418.92 PASS AF=1;AO=258;DP=264;FAO=260;FDP=260;FR=.;FRO=0;FSAF=120;FSAR=140;FSRF=0;FSRR=0;FWDB=0.0249502;FXX=0.0225555;HRUN=1;LEN=1;MLLD=92.2049;OALT=T;OID=.;OMAPALT=T;OPOS=990280;OREF=C;PB=.;PBP=.;QD=37.2141;RBI=0.0261262;REFB=-0.11255;REVB=-0.00775041;RO=0;SAF=118;SAR=140;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=0;STB=0.5;STBP=1;TYPE=snp;VARB=0.000526608;ANN=AGRN GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT 1/1:86:264:260:0:0:258:260:1:140:118:0:0:140:120:0:0:1 GOOD 260 hom 73
chr2 48915871 . A G 1624.87 PASS AF=1;AO=170;DP=171;FAO=172;FDP=172;FR=.;FRO=0;FSAF=92;FSAR=80;FSRF=0;FSRR=0;FWDB=0.0234407;FXX=0;HRUN=1;LEN=1;MLLD=70.9343;OALT=G;OID=.;OMAPALT=G;OPOS=48915871;OREF=A;PB=.;PBP=.;QD=37.7877;RBI=0.0331357;REFB=0;REVB=0.0234202;RO=1;SAF=91;SAR=79;SRF=0;SRR=1;SSEN=0;SSEP=0;SSSB=0.00598669;STB=0.5;STBP=1;TYPE=snp;VARB=0.000172399;ANN=LHCGR;STON1-GTF2A1L,LHCGR GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT 1/1:76:171:172:1:0:170:172:1:79:91:0:1:80:92:0:0:1 GOOD 172 hom 49
chr2 48921375 . T C 481.192 PASS AF=1;AO=51;DP=51;FAO=51;FDP=51;FR=.;FRO=0;FSAF=27;FSAR=24;FSRF=0;FSRR=0;FWDB=0.0171521;FXX=0;HRUN=2;LEN=1;MLLD=203.707;OALT=C;OID=.;OMAPALT=C;OPOS=48921375;OREF=T;PB=.;PBP=.;QD=37.7406;RBI=0.0241572;REFB=0;REVB=0.0170111;RO=0;SAF=27;SAR=24;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=0;STB=0.5;STBP=1;TYPE=snp;VARB=8.09379e-05;ANN=LHCGR,LHCGR;STON1-GTF2A1L GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT 1/1:23:51:51:0:0:51:51:1:24:27:0:0:24:27:0:0:1 GOOD 51 hom 15
chr2 48925746 . C T 1144.07 PASS AF=1;AO=114;DP=114;FAO=119;FDP=119;FR=.,REALIGNEDx0.958;FRO=0;FSAF=54;FSAR=65;FSRF=0;FSRR=0;FWDB=0.0374429;FXX=0;HRUN=1;LEN=1;MLLD=261.838;OALT=T;OID=.;OMAPALT=T;OPOS=48925746;OREF=C;PB=.;PBP=.;QD=38.456;RBI=0.0379673;REFB=0;REVB=0.00628838;RO=0;SAF=51;SAR=63;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=3.26593e-08;STB=0.5;STBP=1;TYPE=snp;VARB=2.12074e-05;ANN=LHCGR;STON1-GTF2A1L,LHCGR GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT 1/1:54:114:119:0:0:114:119:1:63:51:0:0:65:54:0:0:1 GOOD 119 hom 35
chr2 49189921 . C T 570.875 PASS AF=0.582474;AO=113;DP=193;FAO=113;FDP=194;FR=.,REALIGNEDx0.5825;FRO=81;FSAF=53;FSAR=60;FSRF=44;FSRR=37;FWDB=-0.00244613;FXX=0;HRUN=1;LEN=1;MLLD=239.763;OALT=T;OID=.;OMAPALT=T;OPOS=49189921;OREF=C;PB=.;PBP=.;QD=11.7706;RBI=0.0159301;REFB=-0.00100522;REVB=-0.0157412;RO=80;SAF=53;SAR=60;SRF=44;SRR=36;SSEN=0;SSEP=0;SSSB=-0.061858;STB=0.530968;STBP=0.315;TYPE=snp;VARB=0.000865756;ANN=FSHR GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT 0/1:280:193:194:80:81:113:113:0.582474:60:53:44:36:60:53:44:37:1 GOOD 194 het 17
desired output (tab-delimited)
Code :
chr1 987200 . C T 1217.2 PASS AF=1;AO=127;DP=127;FAO=127;FDP=127;FR=.;FRO=0;FSAF=63;FSAR=64;FSRF=0;FSRR=0;FWDB=-0.0049104;FXX=0;HRUN=1;LEN=1;MLLD=167.668;OALT=T;OID=.;OMAPALT=T;OPOS=987200;OREF=C;PB=.;PBP=.;QD=38.3369;RBI=0.0213032;REFB=0;REVB=0.0207296;RO=0;SAF=63;SAR=64;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=-5.21543e-08;STB=0.5;STBP=1;TYPE=snp;VARB=2.41961e-05;ANN=AGRN GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT 1/1:57:127:127:0:0:127:127:1:64:63:0:0:64:63:0:0:1 GOOD 127 hom 37
chr1 990280 . C T 2418.92 PASS AF=1;AO=258;DP=264;FAO=260;FDP=260;FR=.;FRO=0;FSAF=120;FSAR=140;FSRF=0;FSRR=0;FWDB=0.0249502;FXX=0.0225555;HRUN=1;LEN=1;MLLD=92.2049;OALT=T;OID=.;OMAPALT=T;OPOS=990280;OREF=C;PB=.;PBP=.;QD=37.2141;RBI=0.0261262;REFB=-0.11255;REVB=-0.00775041;RO=0;SAF=118;SAR=140;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=0;STB=0.5;STBP=1;TYPE=snp;VARB=0.000526608;ANN=AGRN GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT 1/1:86:264:260:0:0:258:260:1:140:118:0:0:140:120:0:0:1 GOOD 260 hom 73
chr2 48915871 . A G 1624.87 PASS AF=1;AO=170;DP=171;FAO=172;FDP=172;FR=.;FRO=0;FSAF=92;FSAR=80;FSRF=0;FSRR=0;FWDB=0.0234407;FXX=0;HRUN=1;LEN=1;MLLD=70.9343;OALT=G;OID=.;OMAPALT=G;OPOS=48915871;OREF=A;PB=.;PBP=.;QD=37.7877;RBI=0.0331357;REFB=0;REVB=0.0234202;RO=1;SAF=91;SAR=79;SRF=0;SRR=1;SSEN=0;SSEP=0;SSSB=0.00598669;STB=0.5;STBP=1;TYPE=snp;VARB=0.000172399;ANN=LHCGR GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT 1/1:76:171:172:1:0:170:172:1:79:91:0:1:80:92:0:0:1 GOOD 172 hom 49
chr2 48921375 . T C 481.192 PASS AF=1;AO=51;DP=51;FAO=51;FDP=51;FR=.;FRO=0;FSAF=27;FSAR=24;FSRF=0;FSRR=0;FWDB=0.0171521;FXX=0;HRUN=2;LEN=1;MLLD=203.707;OALT=C;OID=.;OMAPALT=C;OPOS=48921375;OREF=T;PB=.;PBP=.;QD=37.7406;RBI=0.0241572;REFB=0;REVB=0.0170111;RO=0;SAF=27;SAR=24;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=0;STB=0.5;STBP=1;TYPE=snp;VARB=8.09379e-05;ANN=LHCGR GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT 1/1:23:51:51:0:0:51:51:1:24:27:0:0:24:27:0:0:1 GOOD 51 hom 15
chr2 48925746 . C T 1144.07 PASS AF=1;AO=114;DP=114;FAO=119;FDP=119;FR=.,REALIGNEDx0.958;FRO=0;FSAF=54;FSAR=65;FSRF=0;FSRR=0;FWDB=0.0374429;FXX=0;HRUN=1;LEN=1;MLLD=261.838;OALT=T;OID=.;OMAPALT=T;OPOS=48925746;OREF=C;PB=.;PBP=.;QD=38.456;RBI=0.0379673;REFB=0;REVB=0.00628838;RO=0;SAF=51;SAR=63;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=3.26593e-08;STB=0.5;STBP=1;TYPE=snp;VARB=2.12074e-05;ANN=LHCGR GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT 1/1:54:114:119:0:0:114:119:1:63:51:0:0:65:54:0:0:1 GOOD 119 hom 35
chr2 49189921 . C T 570.875 PASS AF=0.582474;AO=113;DP=193;FAO=113;FDP=194;FR=.,REALIGNEDx0.5825;FRO=81;FSAF=53;FSAR=60;FSRF=44;FSRR=37;FWDB=-0.00244613;FXX=0;HRUN=1;LEN=1;MLLD=239.763;OALT=T;OID=.;OMAPALT=T;OPOS=49189921;OREF=C;PB=.;PBP=.;QD=11.7706;RBI=0.0159301;REFB=-0.00100522;REVB=-0.0157412;RO=80;SAF=53;SAR=60;SRF=44;SRR=36;SSEN=0;SSEP=0;SSSB=-0.061858;STB=0.530968;STBP=0.315;TYPE=snp;VARB=0.000865756;ANN=FSHR GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT 0/1:280:193:194:80:81:113:113:0.582474:60:53:44:36:60:53:44:37:1 GOOD 194 het 17
description
Code :
line1 is good: ANN=AGRN has no ; or , in it, so the portion after the = is a single value
line2 is good: ANN=AGRN has no ; or , in it, so the portion after the = is a single value
line3 ANN=LHCGR;STON1-GTF2A1L,LHCGR has both ; and , in it, so everything after the first value is removed
line4 ANN=LHCGR,LHCGR;STON1-GTF2A1L has both ; and , in it, so everything after the first value is removed
line5 ANN=LHCGR;STON1-GTF2A1L,LHCGR has both ; and , in it, so everything after the first value is removed
line6 is good: ANN=FSHR has no ; or , in it, so the portion after the = is a single value
awk
Code :
awk -F'\t' -v OFS='\t' '    # define input and output FS as tab
{if(/ANN=/)                 # search each line for pattern ANN=
{sub(/;,*/,""); print}}' input    # if ANN= has a ; or , in it, substitute the values after it with an empty string (removing them)
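One likely reason the attempt does not give the desired output: sub() with no target works on the whole line, and /;,*/ matches the first semi-colon anywhere in it (the one right after AF=...), not the text that follows ANN=. A minimal sketch of one way around that is below. It assumes the file really is tab-delimited, that the INFO string is column 8, and that ANN= is the last key in that column, as in the sample. The file names sample.txt and trimmed.txt and the shortened INFO strings are made up for illustration.

```shell
# Hypothetical two-line sample with a shortened INFO column (column 8).
printf 'chr2\t48915871\t.\tA\tG\t1624.87\tPASS\tAF=1;TYPE=snp;ANN=LHCGR;STON1-GTF2A1L,LHCGR\tGT:GQ\t1/1:76\n' >  sample.txt
printf 'chr1\t987200\t.\tC\tT\t1217.2\tPASS\tAF=1;TYPE=snp;ANN=AGRN\tGT:GQ\t1/1:57\n' >> sample.txt

awk -F'\t' -v OFS='\t' '
match($8, /ANN=[^;,]*/) {                       # find ANN= plus its first value in column 8
    $8 = substr($8, 1, RSTART + RLENGTH - 1)    # keep column 8 up to the end of that first value
}
{ print }                                       # print every line, changed or not
' sample.txt > trimmed.txt
```

match() sets RSTART and RLENGTH, so substr() can cut column 8 right after the first ANN value without touching the ; separators earlier in the column. If ANN= were not the last key in column 8, the keys after it would be dropped as well, so the cut would need to be narrower in that case.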