awk to retain header lines in output


# 1  
Old 3 Weeks Ago
The awk below runs and produces the current output, which is correct, except that I cannot get the header lines (those starting with # and ##) included in the output as well. I tried adding !/^#/, thinking that it would skip the lines starting with # and output them, but then the entire file prints as is. Thank you.

file
Code:
##bcftools_normVersion=1.9+htslib-1.9
##bcftools_normCommand=norm --do-not-normalize -m -both /path/to/xxxxx.vcf; Date=Tue Feb 26 12:59:30 2019
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	xxxx
chr1	11174372	MTOR	A	<CNV>	100	PASS	FR=.;PRECISE=FALSE;SVTYPE=CNV;END=11217311;LEN=42939;NUMTILES=7;SD=0.47;CDF_MAPD=0.01:1.480581,0.025:1.544948,0.05:1.602554,0.1:1.671659,0.2:1.759366,0.25:1.793881,0.5:1.94,0.75:2.098021,0.8:2.139179,0.9:2.251416,0.95:2.348502,0.975:2.436069,0.99:2.541976;REF_CN=2;CI=0.05:1.60255,0.95:2.3485;RAW_CN=1.94;FUNC=[{'gene':'MTOR'}]	GT:GQ:CN	./.:0:1.94
chr1	11174383	COSM1161896	A	G	264.674	PASS	AF=0;AO=0;DP=4229;FAO=0;FDP=2000;FDVR=5;FR=.;FRO=2000;FSAF=0;FSAR=0;FSRF=1166;FSRR=834;FWDB=-0.0180893;FXX=0;HRUN=1;HS_ONLY=0;LEN=1;MLLD=127.881;OALT=G;OID=COSM1161896;OMAPALT=G;OPOS=11174383;OREF=A;PB=.;PBP=.;QD=0.529347;RBI=0.0955223;REFB=9.08841e-06;REVB=0.0937938;RO=4212;SAF=0;SAR=0;SRF=2442;SRR=1770;SSEN=0;SSEP=0;SSSB=3.6281e-08;STB=0.5;STBP=1;TYPE=snp;VARB=0;HS;FUNC=[{'transcript':'NM_004958.3','gene':'MTOR','location':'exonic','exon':'53'}]	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	0/0:264:4229:2000:4212:2000:0:0:0:0:0:2442:1770:0:0:1166:834
chr1	43814978	COSM1342796;COSM86963	A	G	231.262	PASS	AF=0.0010005;AO=4;DP=3351;FAO=2;FDP=1999;FDVR=10;FR=.,.;FRO=1997;FSAF=1;FSAR=1;FSRF=944;FSRR=1053;FWDB=0.00987233;FXX=0.000499998;HRUN=1;HS_ONLY=0;LEN=1,1;MLLD=106.81;OALT=G,T;OID=COSM1342796,COSM86963;OMAPALT=G,T;OPOS=43814978,43814978;OREF=A,A;PB=.;PBP=.;QD=0.462755;RBI=0.014386;REFB=4.80559e-05;REVB=0.010464;RO=3338;SAF=1;SAR=3;SRF=1576;SRR=1762;SSEN=0;SSEP=0;SSSB=-0.0113679;STB=0.526994;STBP=0.848;TYPE=snp;VARB=-0.0370454;HS;FUNC=[{'transcript':'NM_005373.2','gene':'MPL','location':'exonic','exon':'10'}]	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	0/0:231:3351:1999:3338:1997:4:2:0.0010005:3:1:1576:1762:1:1:944:1053
chr1	43814978	COSM1342796;COSM86963	A	G	231.262	PASS	AF=0.05;AO=4;DP=3351;FAO=2;FDP=1999;FDVR=10;FR=.,.;FRO=1997;FSAF=1;FSAR=1;FSRF=944;FSRR=1053;FWDB=0.00987233;FXX=0.000499998;HRUN=1;HS_ONLY=0;LEN=1,1;MLLD=106.81;OALT=G,T;OID=COSM1342796,COSM86963;OMAPALT=G,T;OPOS=43814978,43814978;OREF=A,A;PB=.;PBP=.;QD=0.462755;RBI=0.014386;REFB=4.80559e-05;REVB=0.010464;RO=3338;SAF=1;SAR=3;SRF=1576;SRR=1762;SSEN=0;SSEP=0;SSSB=-0.0113679;STB=0.526994;STBP=0.848;TYPE=snp;VARB=-0.0370454;HS;FUNC=[{'transcript':'NM_005373.2','gene':'MPL','location':'exonic','exon':'10'}]	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	0/0:231:3351:1999:3338:1997:4:2:0.0010005:3:1:1576:1762:1:1:944:1053

current output
Code:
chr1	43814978	COSM1342796;COSM86963	A	G	231.262	PASS	AF=0.05;AO=4;DP=3351;FAO=2;FDP=1999;FDVR=10;FR=.,.;FRO=1997;FSAF=1;FSAR=1;FSRF=944;FSRR=1053;FWDB=0.00987233;FXX=0.000499998;HRUN=1;HS_ONLY=0;LEN=1,1;MLLD=106.81;OALT=G,T;OID=COSM1342796,COSM86963;OMAPALT=G,T;OPOS=43814978,43814978;OREF=A,A;PB=.;PBP=.;QD=0.462755;RBI=0.014386;REFB=4.80559e-05;REVB=0.010464;RO=3338;SAF=1;SAR=3;SRF=1576;SRR=1762;SSEN=0;SSEP=0;SSSB=-0.0113679;STB=0.526994;STBP=0.848;TYPE=snp;VARB=-0.0370454;HS;FUNC=[{'transcript':'NM_005373.2','gene':'MPL','location':'exonic','exon':'10'}]	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	0/0:231:3351:1999:3338:1997:4:2:0.0010005:3:1:1576:1762:1:1:944:1053

desired output
Code:
##bcftools_normVersion=1.9+htslib-1.9
##bcftools_normCommand=norm --do-not-normalize -m -both /path/to/xxxxx.vcf; Date=Tue Feb 26 12:59:30 2019
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	xxxx
chr1	43814978	COSM1342796;COSM86963	A	G	231.262	PASS	AF=0.05;AO=4;DP=3351;FAO=2;FDP=1999;FDVR=10;FR=.,.;FRO=1997;FSAF=1;FSAR=1;FSRF=944;FSRR=1053;FWDB=0.00987233;FXX=0.000499998;HRUN=1;HS_ONLY=0;LEN=1,1;MLLD=106.81;OALT=G,T;OID=COSM1342796,COSM86963;OMAPALT=G,T;OPOS=43814978,43814978;OREF=A,A;PB=.;PBP=.;QD=0.462755;RBI=0.014386;REFB=4.80559e-05;REVB=0.010464;RO=3338;SAF=1;SAR=3;SRF=1576;SRR=1762;SSEN=0;SSEP=0;SSSB=-0.0113679;STB=0.526994;STBP=0.848;TYPE=snp;VARB=-0.0370454;HS;FUNC=[{'transcript':'NM_005373.2','gene':'MPL','location':'exonic','exon':'10'}]	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	0/0:231:3351:1999:3338:1997:4:2:0.0010005:3:1:1576:1762:1:1:944:1053


awk
Code:
awk -F'[\t;]' '
  {
    split(x,V)
    for(i=1; i<=NF; i++) {
      split($i,F,/=/)
      V[F[1]]=F[2]
    }
  }
  (V["AF"]+0 > .03) && 
  (V["DP"]+0 > 20)
' file

# 2  
Old 3 Weeks Ago
Hi, try:
Code:
  (V["AF"]+0 > .03) && 
  (V["DP"]+0 > 20) ||
  /^#/

or
Code:
awk -F'[\t;]' '
  /^#/ {
    print
    next
  }
  {
    split(x,V)
    for(i=1; i<=NF; i++) {
      split($i,F,/=/)
      V[F[1]]=F[2]
    }
  }
  (V["AF"]+0 > .03) && 
  (V["DP"]+0 > 20)
' file

# 3  
Old 3 Weeks Ago
Code:
awk -F'[\t;]' '
/^#/ { print;next}
  {
    split(x,V)
....
}

# 4  
Old 3 Weeks Ago
Works great, thank you. I am currently learning Python (or trying to) and was going to use the awk as practice, that is, try rewriting it in Python. Could I post back comments on each line to see if my thinking is correct? Thank you.

awk

Code:
awk -F'[\t;]' ' # call awk script and define FS as pattern of tab and semi-colon
  {
    split(x,V) # split each tab and ; and read into array V
    for(i=1; i<=NF; i++) {  # start loop iterating over each line
      split($i,F,/=/)  # split on the = and store in array F
      V[F[1]]=F[2]  # each V is tag=value (example AF=0.05)
    }
  }
  (V["AF"]+0 > .03) && # check AF is greater than 3% and
  (V["DP"]+0 >= 20) || # check DP is greater than or equal to 20
  /^#/  # retain header lines (if AF and DP criteria are met, print line(s) and header)
' file  # define output file

/^#/ { print;next} # retains header as well

Last edited by cmccabe; 3 Weeks Ago at 01:18 PM.. Reason: commented awk
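For comparison, here is a sketch of the same filter with comments that describe what each line does in awk. The function name `filter_vcf` is made up for this illustration, and the thresholds are the ones from the original script (AF above 0.03, DP above 20); swap in `>=` if that is what is wanted.

```shell
# filter_vcf is a hypothetical wrapper name; it reads VCF-like text on stdin.
# Usage: filter_vcf < file
filter_vcf() {
  awk -F'[\t;]' '
    /^#/ { print; next }            # header line: print it and skip the rules below
    {
      split(x, V)                   # x is unset, so this empties V from the previous line
      for (i = 1; i <= NF; i++) {   # loop over the fields of the current line
        split($i, F, /=/)           # F[1] is the tag, F[2] the value (empty if no = in field)
        V[F[1]] = F[2]              # e.g. V["AF"] = "0.05"
      }
    }
    (V["AF"]+0 > .03) && (V["DP"]+0 > 20)   # pattern without action: matching line is printed
  '
}
```

Note that `file` at the end of the original command is the input file; awk writes its output to standard output unless it is redirected.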
# 5  
Old 3 Weeks Ago
Quote:
Originally Posted by cmccabe
Could I post back comments on each line to see if my thinking is correct?
Of course you can do that - in fact you are explicitly encouraged to do so. This forum is all about self-empowerment and learning to help yourself. But you probably knew that already, didn't you?

A major difference between awk and sed is that the latter outputs every line, changed or not, by default. For example,

Code:
sed 's/old/NEW/g' /some/file

will output not only the lines containing "old" (with "old" changed to "NEW") but also all other lines, unchanged. awk works differently and only outputs what it is explicitly told to output, either through the print command or through a pattern without an action, whose default action is to print the line. Therefore, if there is no rule that prints lines starting with a "#", these lines will not be printed.
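A small sketch of that difference, using a made-up three-line input. The function names (`demo_input`, `sed_all`, `awk_silent`, `awk_print`) are illustrative only and do not come from the thread.

```shell
# Hypothetical three-line input used only for this illustration.
demo_input() { printf '%s\n' '#header' 'old value' 'other'; }

# sed prints every line, substituted or not.
sed_all() { demo_input | sed 's/old/NEW/g'; }

# awk with an action that never prints produces no output at all.
awk_silent() { demo_input | awk '{ sub(/old/, "NEW") }'; }

# awk prints only what a rule selects: the first rule prints headers,
# the second is a bare pattern, so its default action prints the line.
awk_print() { demo_input | awk '/^#/ { print; next } /old/'; }
```

Calling `sed_all` should show all three lines, `awk_silent` nothing, and `awk_print` only the header and the line containing "old".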

Quote:
Originally Posted by cmccabe
define FS as pattern of tab and semi-colon
Not quite: FS is defined as either a tab or a semicolon. [....] is a so-called "character class" and is often used in regexps. It always means "one of the enclosed characters", e.g. d[ae]n would match either "dan" or "den" but neither "dean" nor "daen". You can also give ranges of characters instead of enumerating them, e.g. [a-z] is "any lowercase letter a-z" and [a-zA-Z] is "any letter a-z, uppercase or lowercase".

You can also negate these classes by using "^" as first character: [^0-9] is "anything but a digit".
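To try these out, here is a minimal sketch with a made-up word list; the function names are hypothetical. The last function shows the field separator from the thread at work on a short invented line.

```shell
# Hypothetical word list used only for this illustration.
words() { printf '%s\n' dan den dean daen d9n; }

# [ae] matches exactly one of the listed characters.
match_one_of() { words | awk '/^d[ae]n$/'; }

# [^0-9] matches any single character that is not a digit.
match_non_digit() { words | awk '/^d[^0-9]n$/'; }

# With FS set to [\t;], a tab or a semicolon each end a field.
count_fields() { printf 'a\tb;c\n' | awk -F'[\t;]' '{ print NF }'; }
```

Both `match_one_of` and `match_non_digit` should print "dan" and "den" only, and `count_fields` should report three fields.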

I hope this helps.

bakunin

Last edited by bakunin; 3 Weeks Ago at 01:28 PM..
# 7  
Old 3 Weeks Ago
With the separator -F'[\t;]', only the last 2 lines are compared correctly. An alternative:
Code:
awk -F "AF=|DP=" '
/^#/    {print; next}
        {split($2 $3, V, ";")}
( V[1] > 0.03 ) && ( V[3] > 20 )
' file
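To see what this separator does to a record, here is a small sketch on a shortened, made-up line (the function names are hypothetical). It assumes, as in the posted sample, that AF is followed by exactly one other tag before DP; with a different INFO layout the index `V[3]` would point at something else.

```shell
# Hypothetical one-line record, shortened from the INFO column in the thread.
record() { printf 'chr1\t43814978\tAF=0.05;AO=4;DP=3351;FAO=2\n'; }

# With FS set to "AF=|DP=", $2 starts with the AF value and $3 with the DP value.
show_fields() { record | awk -F 'AF=|DP=' '{ print $2; print $3 }'; }

# Joining $2 and $3 and splitting on ";" puts AF in V[1] and DP in V[3].
show_values() { record | awk -F 'AF=|DP=' '{ split($2 $3, V, ";"); print V[1], V[3] }'; }
```

Because the result depends on tag order, the tag=value lookup used earlier in the thread is likely the more robust choice when the INFO layout varies.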
