awk to add value and text to specific lines

11-29-2017

Registered User

1,393, 20

Join Date: Nov 2013

Last Activity: 1 May 2020, 2:35 PM EDT

Location: Chicago

Posts: 1,393

Thanks Given: 901

Thanked 20 Times in 19 Posts

awk to add value and text to specific lines

In the awk I have a very large tab-delimeted file that I am trying to extract the DP= value put it in $16 and add
specific text to $16 with . (dot) in $11-$15 and $18. Only the lines (there are several) that have the formating below in file
will have an empty $16. Other lines will be in a different format and have a value in $16 and can be skipped. Thank you

.

awk

Code:

awk -F'\t' -v OFS="\t" '{if (!$16) {print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, ".", ".", ".", ".", ".", /DP=/, "homref", "."}' file

file

Code:

##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr1	948797	.	C	.	0	PASS	DP=159;END=948845;MAX_DP=224;MIN_DP=95	GT:DP:MIN_DP:MAX_DP	0/0:159:95:224

desired output tab-delimeted

Code:

##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr1	948797	.	C	.	0	PASS	DP=159;END=948845;MAX_DP=224;MIN_DP=95	GT:DP:MIN_DP:MAX_DP	0/0:159:95:224	.	.	.	.	.	159	homref	.

Last edited by cmccabe; 11-29-2017 at 10:06 AM.. Reason: fixed format

cmccabe

View Public Profile for cmccabe

Find all posts by cmccabe

11-29-2017

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

Other than adding text to the end of lines that start with a #, what is the code you're trying to use doing wrong?

What are you hoping that /DP=/ will print. I would expect it to print 1 if the string DP= appears somewhere in the current line and print 0 otherwise. Are you looking for the value between the first = and the first ; in field #8? If so, it would seem that you could split field 8 using /[=;]/ as the ERE to use as the subfield delimiter for field 8 and then print subfield 2. But making wild suggestions like this based on a sample size of 1 with no description of that field's contents is extremely dangerous.

This User Gave Thanks to Don Cragun For This Post:

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

11-29-2017

Registered User

1,393, 20

Join Date: Nov 2013

Last Activity: 1 May 2020, 2:35 PM EDT

Location: Chicago

Posts: 1,393

Thanks Given: 901

Thanked 20 Times in 19 Posts

Code:

...
...
##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr1	948797	.	C	.	0	PASS	DP=159;END=948845;MAX_DP=224;MIN_DP=95	GT:DP:MIN_DP:MAX_DP	0/0:159:95:224

desired output

Code:

##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr1	948797	.	C	.	0	PASS	DP=159;END=948845;MAX_DP=224;MIN_DP=95	GT:DP:MIN_DP:MAX_DP	0/0:159:95:224	.	.	.	.	.	159	homref	.

other format of lines to skip

Code:

chr1	948846	.	T	TA	1185.38	PASS	AF=0.986111;AO=200;DP=227;FAO=213;FDP=216;FDVR=0;FR=.;FRO=3;FSAF=132;FSAR=81;FSRF=3;FSRR=0;FWDB=-0.038073;FXX=0.0226234;HRUN=1;LEN=1;MLLD=24.748;OALT=A;OID=.;OMAPALT=TA;OPOS=948847;OREF=-;PB=.;PBP=.;QD=21.9515;RBI=0.0905281;REFB=0.0706997;REVB=0.0821327;RO=17;SAF=122;SAR=78;SRF=15;SRR=2;SSEN=0;SSEP=0;SSSB=-0.0430195;STB=0.505617;STBP=0.201;TYPE=ins;VARB=0.000148797	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	1/1:219:227:216:17:3:200:213:0.986111:78:122:15:2:81:132:3:0	0.986	132	81	1	GOOD	216	homalt	36

If I split on $8 using DP= and print the value up to the ;, since that tag is iin both lines, wont it print for both? However since $16 has a value in it that line can be skipped (the much larger of the two lines).

awk

Code:

awk -F'\t' -v OFS="\t" '{if (!$16);{split($8,a,[=;]); {print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, ".", ".", ".", ".", ".", a[1], "homref", "."}' file

Not sure is the above is any closer and I did not really know what would print, but understand better know? Thank you

.

I also tried the perl below which executes but does not print any new output the desired output, but seems close.

perl

Code:

perl -plae ' 
 BEGIN{ %h = qw(0/0 homref) } /^[^#].*DP=(\d+);.*([0-0]\/[0-0])/ and $_ .= join "\t", ("", "."),".",".",".", ".", $1, $h{$7},"."' file > out

output of line to change --- seems to extract 224 and doesn't print homref though qw=0/0)

Code:

chr1	948797	.	C	.	0	PASS	DP=159;END=948845;MAX_DP=224;MIN_DP=95	GT:DP:MIN_DP:MAX_DP	0/0:159:95:224	.	.	.	.	.	224		.
chr1	948846	.	T	TA	1185.38	PASS	AF=0.986111;AO=200;DP=227;FAO=213;FDP=216;FDVR=0;FR=.;FRO=3;FSAF=132;FSAR=81;FSRF=3;FSRR=0;FWDB=-0.038073;FXX=0.0226234;HRUN=1;LEN=1;MLLD=24.748;OALT=A;OID=.;OMAPALT=TA;OPOS=948847;OREF=-;PB=.;PBP=.;QD=21.9515;RBI=0.0905281;REFB=0.0706997;REVB=0.0821327;RO=17;SAF=122;SAR=78;SRF=15;SRR=2;SSEN=0;SSEP=0;SSSB=-0.0430195;STB=0.505617;STBP=0.201;TYPE=ins;VARB=0.000148797	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	1/1:219:227:216:17:3:200:213:0.986111:78:122:15:2:81:132:3:0	0.986	132	81	1	GOOD	216	homalt	36  --- this line is correct ----

Last edited by cmccabe; 11-29-2017 at 03:08 PM.. Reason: fixed format and added perl

cmccabe

View Public Profile for cmccabe

Find all posts by cmccabe

11-29-2017

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

I'm very confused.

Are you just trying to print lines that need to have fields added? Or, do you want to print all lines in the file whether they change or not? Your code makes no attempt to ignore the header/comment lines at the start of your file, but you don't seem to want them to be changed even though none of them have 16 tab separated fields and you try to change every line that doesn't have a field #16.

I suggested using /[=;]/ as an ERE argument for split(); you used [=;] instead (which I would expect to give you a syntax error).

I suggested printing array[2] IF DP=value; is ALWAYS at the start of field #8. You printed array[1] and showed us an example where DP=value; is the third subfield in field #8.

I know that you like writing awk 1-liners (instead of readable code). But, until you fully understand the syntax of an if statement in akw, you would be much better off writing awk code longhand so you can actually see how the pieces are supposed to fit together. (And please show us the output you get from the code you're running including the diagnostic messages as well as the normal output.)

Those of us who are trying to help you get tired quickly when there is no clear specification of what you are trying to do and no clear specification of the format of the data you're trying to interpret and modify.

This User Gave Thanks to Don Cragun For This Post:

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

11-29-2017

Registered User

1,393, 20

Join Date: Nov 2013

Last Activity: 1 May 2020, 2:35 PM EDT

Location: Chicago

Posts: 1,393

Thanks Given: 901

Thanked 20 Times in 19 Posts

I hope the below is a better description:

The DP= is in each line of the input file, however it is always at the start of $8 in the lines that the awk is goiog to update.
$11 will also always be empty (like in line1, in line2 there is a value so nothing is done). All lines are printed that are in the input file, that is the comments and each line is printed. In the example since line2 is good it is only printed as is, but line1 needs to be updated brfore it is printed. Do the comment lines need to be skipped as they do not have more then twwo fields in them? Thank you

.

file 11 tab-delimeted fields --- the actual file is over 22 million lines in the below format ----
line1 and all lines like it are the one that the awk updates
line2 and all lines like it are skipped(nothing needs to be done to them)

file

Code:

##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr1	948797	.	C	.	0	PASS	DP=159;END=948845;MAX_DP=224;MIN_DP=95	GT:DP:MIN_DP:MAX_DP	0/0:159:95:224
chr1	948846	.	T	TA	1185.38	PASS	AF=0.986111;AO=200;DP=227;FAO=213;FDP=216;FDVR=0;FR=.;FRO=3;FSAF=132;FSAR=81;FSRF=3;FSRR=0;FWDB=-0.038073;FXX=0.0226234;HRUN=1;LEN=1;MLLD=24.748;OALT=A;OID=.;OMAPALT=TA;OPOS=948847;OREF=-;PB=.;PBP=.;QD=21.9515;RBI=0.0905281;REFB=0.0706997;REVB=0.0821327;RO=17;SAF=122;SAR=78;SRF=15;SRR=2;SSEN=0;SSEP=0;SSSB=-0.0430195;STB=0.505617;STBP=0.201;TYPE=ins;VARB=0.000148797	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	1/1:219:227:216:17:3:200:213:0.986111:78:122:15:2:81:132:3:0	0.986	132	81	1	GOOD	216	homalt	36

awk

Code:

awk '		# Run awk to process the following awk script...
BEGIN {	OFS = "\t" }            # Set output field separator.
      'NR==FNR {if (!$11) {     # For each line in the input file check if $11 is empty
        split($8,a,"/[=;]/");    # if $11 is empty then split $8 on the = and ; and store the value in a 
                next    # process next line in file
        }   # close block
                                    {   # start print block
                                        print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,".",".",".",".",".",a[2],"homref",".";  # print desired output
                                    } # close print block
                    next  # print next line
                    }1' file   # define input

I have syntax errors in the awk but added comments that I hope will help. I use one-liners cause for whatever reason I have a hard time not using them (as evident by the syntax errors).... never-the-less commented seems to help. Thank you

Last edited by cmccabe; 11-29-2017 at 07:34 PM.. Reason: fixed format

cmccabe

View Public Profile for cmccabe

Find all posts by cmccabe

11-29-2017

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

Assuming that your sample data is representative, the following seems to do what you want:

Code:

awk '
BEGIN {	FS = OFS = "\t"
}
NF == 10 {
	split($8, a, /[=;]/)
	$11 = $12 = $13 = $14 = $15 = $18 = "."
	$16 = (a[1] == "DP") ? a[2] : "DP=num_Missing"
	$17 = "homref"
}
1' file

Note that this assumes that lines that are to be modified only have 10 fields (i.e., 9 <tab>s) as in your sample. It verifies that field #8 does start with DP= and assigns a diagnostic message to field #16 if it isn't. This seemed safer to me than blindly assuming that the input file is correctly formatted. (And, I added a line to my sample input file to be sure that it worked as I expected. It didn't the first two times I tried it.)

When the contents of file is:

Code:

##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr1	948797	.	C	.	0	PASS	DP=159;END=948845;MAX_DP=224;MIN_DP=95	GT:DP:MIN_DP:MAX_DP	0/0:159:95:224
chr1	948846	.	T	TA	1185.38	PASS	AF=0.986111;AO=200;DP=227;FAO=213;FDP=216;FDVR=0;FR=.;FRO=3;FSAF=132;FSAR=81;FSRF=3;FSRR=0;FWDB=-0.038073;FXX=0.0226234;HRUN=1;LEN=1;MLLD=24.748;OALT=A;OID=.;OMAPALT=TA;OPOS=948847;OREF=-;PB=.;PBP=.;QD=21.9515;RBI=0.0905281;REFB=0.0706997;REVB=0.0821327;RO=17;SAF=122;SAR=78;SRF=15;SRR=2;SSEN=0;SSEP=0;SSSB=-0.0430195;STB=0.505617;STBP=0.201;TYPE=ins;VARB=0.000148797	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	1/1:219:227:216:17:3:200:213:0.986111:78:122:15:2:81:132:3:0	0.986	132	81	1	GOOD	216	homalt	36
ALT1	948798	.	D	.	1	PAST	END=948845;MAX_DP=224;MIN_DP=95;DP=159	GT:DP:MIN_DP:MAX_DP	1/1:160:96:225

then the output produced is:

Code:

##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr1	948797	.	C	.	0	PASS	DP=159;END=948845;MAX_DP=224;MIN_DP=95	GT:DP:MIN_DP:MAX_DP	0/0:159:95:224	.	.	.	159	homref	.
chr1	948846	.	T	TA	1185.38	PASS	AF=0.986111;AO=200;DP=227;FAO=213;FDP=216;FDVR=0;FR=.;FRO=3;FSAF=132;FSAR=81;FSRF=3;FSRR=0;FWDB=-0.038073;FXX=0.0226234;HRUN=1;LEN=1;MLLD=24.748;OALT=A;OID=.;OMAPALT=TA;OPOS=948847;OREF=-;PB=.;PBP=.;QD=21.9515;RBI=0.0905281;REFB=0.0706997;REVB=0.0821327;RO=17;SAF=122;SAR=78;SRF=15;SRR=2;SSEN=0;SSEP=0;SSSB=-0.0430195;STB=0.505617;STBP=0.201;TYPE=ins;VARB=0.000148797	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	1/1:219:227:216:17:3:200:213:0.986111:78:122:15:2:81:132:3:0	0.986	132	81	1	GOOD	216	homalt	36
ALT1	948798	.	D	.	1	PAST	END=948845;MAX_DP=224;MIN_DP=95;DP=159	GT:DP:MIN_DP:MAX_DP	1/1:160:96:225	.	.	.	DP=num_Missing	homref	.

Hopefully, this will come close to what you want.

This User Gave Thanks to Don Cragun For This Post:

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

12-04-2017

Registered User

1,393, 20

Join Date: Nov 2013

Last Activity: 1 May 2020, 2:35 PM EDT

Location: Chicago

Posts: 1,393

Thanks Given: 901

Thanked 20 Times in 19 Posts

Thank you very much, works perfect thank you

cmccabe

View Public Profile for cmccabe

Find all posts by cmccabe

Shell Programming and Scripting

awk to add value and text to specific lines

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

awk to update specific value in file with match and add +1 to specific digit

Discussion started by: cmccabe

2. Shell Programming and Scripting

awk to skip lines find text and add text based on number

Discussion started by: cmccabe

3. Shell Programming and Scripting

Delete lines above and below specific line of text

Discussion started by: bomsom

4. UNIX for Dummies Questions & Answers

Add strings from one file at the end of specific lines in text file

Discussion started by: gus74

5. Shell Programming and Scripting

Summing over specific lines and replacing the lines with the sum using sed, awk

Discussion started by: kaaliakahn

6. Shell Programming and Scripting

extracting specific text from lines

Discussion started by: krater559

7. Shell Programming and Scripting

Assigning a specific format to a specific column in a text file using awk and printf

Discussion started by: goodbenito

8. Shell Programming and Scripting

[bash help]Adding multiple lines of text into a specific spot into a text file

Discussion started by: cdn_humbucker

9. Shell Programming and Scripting

Insert a text from a specific row into a specific column using SED or AWK

Discussion started by: Issemael

10. Shell Programming and Scripting

Extracting text out of specific lines

Discussion started by: krabu