awk to add value and text to specific lines


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting awk to add value and text to specific lines
# 1  
Old 11-29-2017
awk to add value and text to specific lines

In the awk I have a very large tab-delimeted file that I am trying to extract the DP= value put it in $16 and add
specific text to $16 with . (dot) in $11-$15 and $18. Only the lines (there are several) that have the formating below in file
will have an empty $16. Other lines will be in a different format and have a value in $16 and can be skipped. Thank you Smilie.

awk
Code:
awk -F'\t' -v OFS="\t" '{if (!$16) {print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, ".", ".", ".", ".", ".", /DP=/, "homref", "."}' file

file
Code:
##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr1	948797	.	C	.	0	PASS	DP=159;END=948845;MAX_DP=224;MIN_DP=95	GT:DP:MIN_DP:MAX_DP	0/0:159:95:224

desired output tab-delimeted
Code:
##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr1	948797	.	C	.	0	PASS	DP=159;END=948845;MAX_DP=224;MIN_DP=95	GT:DP:MIN_DP:MAX_DP	0/0:159:95:224	.	.	.	.	.	159	homref	.


Last edited by cmccabe; 11-29-2017 at 10:06 AM.. Reason: fixed format
# 2  
Old 11-29-2017
Other than adding text to the end of lines that start with a #, what is the code you're trying to use doing wrong?

What are you hoping that /DP=/ will print. I would expect it to print 1 if the string DP= appears somewhere in the current line and print 0 otherwise. Are you looking for the value between the first = and the first ; in field #8? If so, it would seem that you could split field 8 using /[=;]/ as the ERE to use as the subfield delimiter for field 8 and then print subfield 2. But making wild suggestions like this based on a sample size of 1 with no description of that field's contents is extremely dangerous.
This User Gave Thanks to Don Cragun For This Post:
# 3  
Old 11-29-2017
Code:
...
...
##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr1	948797	.	C	.	0	PASS	DP=159;END=948845;MAX_DP=224;MIN_DP=95	GT:DP:MIN_DP:MAX_DP	0/0:159:95:224

desired output
Code:
##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr1	948797	.	C	.	0	PASS	DP=159;END=948845;MAX_DP=224;MIN_DP=95	GT:DP:MIN_DP:MAX_DP	0/0:159:95:224	.	.	.	.	.	159	homref	.

other format of lines to skip
Code:
chr1	948846	.	T	TA	1185.38	PASS	AF=0.986111;AO=200;DP=227;FAO=213;FDP=216;FDVR=0;FR=.;FRO=3;FSAF=132;FSAR=81;FSRF=3;FSRR=0;FWDB=-0.038073;FXX=0.0226234;HRUN=1;LEN=1;MLLD=24.748;OALT=A;OID=.;OMAPALT=TA;OPOS=948847;OREF=-;PB=.;PBP=.;QD=21.9515;RBI=0.0905281;REFB=0.0706997;REVB=0.0821327;RO=17;SAF=122;SAR=78;SRF=15;SRR=2;SSEN=0;SSEP=0;SSSB=-0.0430195;STB=0.505617;STBP=0.201;TYPE=ins;VARB=0.000148797	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	1/1:219:227:216:17:3:200:213:0.986111:78:122:15:2:81:132:3:0	0.986	132	81	1	GOOD	216	homalt	36

If I split on $8 using DP= and print the value up to the ;, since that tag is iin both lines, wont it print for both? However since $16 has a value in it that line can be skipped (the much larger of the two lines).

awk
Code:
awk -F'\t' -v OFS="\t" '{if (!$16);{split($8,a,[=;]); {print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, ".", ".", ".", ".", ".", a[1], "homref", "."}' file


Not sure is the above is any closer and I did not really know what would print, but understand better know? Thank you Smilie.

I also tried the perl below which executes but does not print any new output the desired output, but seems close.

perl
Code:
perl -plae ' 
 BEGIN{ %h = qw(0/0 homref) } /^[^#].*DP=(\d+);.*([0-0]\/[0-0])/ and $_ .= join "\t", ("", "."),".",".",".", ".", $1, $h{$7},"."' file > out

output of line to change --- seems to extract 224 and doesn't print homref though qw=0/0)
Code:
chr1	948797	.	C	.	0	PASS	DP=159;END=948845;MAX_DP=224;MIN_DP=95	GT:DP:MIN_DP:MAX_DP	0/0:159:95:224	.	.	.	.	.	224		.
chr1	948846	.	T	TA	1185.38	PASS	AF=0.986111;AO=200;DP=227;FAO=213;FDP=216;FDVR=0;FR=.;FRO=3;FSAF=132;FSAR=81;FSRF=3;FSRR=0;FWDB=-0.038073;FXX=0.0226234;HRUN=1;LEN=1;MLLD=24.748;OALT=A;OID=.;OMAPALT=TA;OPOS=948847;OREF=-;PB=.;PBP=.;QD=21.9515;RBI=0.0905281;REFB=0.0706997;REVB=0.0821327;RO=17;SAF=122;SAR=78;SRF=15;SRR=2;SSEN=0;SSEP=0;SSSB=-0.0430195;STB=0.505617;STBP=0.201;TYPE=ins;VARB=0.000148797	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	1/1:219:227:216:17:3:200:213:0.986111:78:122:15:2:81:132:3:0	0.986	132	81	1	GOOD	216	homalt	36  --- this line is correct ----


Last edited by cmccabe; 11-29-2017 at 03:08 PM.. Reason: fixed format and added perl
# 4  
Old 11-29-2017
I'm very confused.

Are you just trying to print lines that need to have fields added? Or, do you want to print all lines in the file whether they change or not? Your code makes no attempt to ignore the header/comment lines at the start of your file, but you don't seem to want them to be changed even though none of them have 16 tab separated fields and you try to change every line that doesn't have a field #16.

I suggested using /[=;]/ as an ERE argument for split(); you used [=;] instead (which I would expect to give you a syntax error).

I suggested printing array[2] IF DP=value; is ALWAYS at the start of field #8. You printed array[1] and showed us an example where DP=value; is the third subfield in field #8.

I know that you like writing awk 1-liners (instead of readable code). But, until you fully understand the syntax of an if statement in akw, you would be much better off writing awk code longhand so you can actually see how the pieces are supposed to fit together. (And please show us the output you get from the code you're running including the diagnostic messages as well as the normal output.)

Those of us who are trying to help you get tired quickly when there is no clear specification of what you are trying to do and no clear specification of the format of the data you're trying to interpret and modify.
This User Gave Thanks to Don Cragun For This Post:
# 5  
Old 11-29-2017
I hope the below is a better description:

The DP= is in each line of the input file, however it is always at the start of $8 in the lines that the awk is goiog to update.
$11 will also always be empty (like in line1, in line2 there is a value so nothing is done). All lines are printed that are in the input file, that is the comments and each line is printed. In the example since line2 is good it is only printed as is, but line1 needs to be updated brfore it is printed. Do the comment lines need to be skipped as they do not have more then twwo fields in them? Thank you Smilie.


file 11 tab-delimeted fields --- the actual file is over 22 million lines in the below format ----
line1 and all lines like it are the one that the awk updates
line2 and all lines like it are skipped(nothing needs to be done to them)

file
Code:
##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr1	948797	.	C	.	0	PASS	DP=159;END=948845;MAX_DP=224;MIN_DP=95	GT:DP:MIN_DP:MAX_DP	0/0:159:95:224
chr1	948846	.	T	TA	1185.38	PASS	AF=0.986111;AO=200;DP=227;FAO=213;FDP=216;FDVR=0;FR=.;FRO=3;FSAF=132;FSAR=81;FSRF=3;FSRR=0;FWDB=-0.038073;FXX=0.0226234;HRUN=1;LEN=1;MLLD=24.748;OALT=A;OID=.;OMAPALT=TA;OPOS=948847;OREF=-;PB=.;PBP=.;QD=21.9515;RBI=0.0905281;REFB=0.0706997;REVB=0.0821327;RO=17;SAF=122;SAR=78;SRF=15;SRR=2;SSEN=0;SSEP=0;SSSB=-0.0430195;STB=0.505617;STBP=0.201;TYPE=ins;VARB=0.000148797	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	1/1:219:227:216:17:3:200:213:0.986111:78:122:15:2:81:132:3:0	0.986	132	81	1	GOOD	216	homalt	36


awk
Code:
awk '		# Run awk to process the following awk script...
BEGIN {	OFS = "\t" }            # Set output field separator.
      'NR==FNR {if (!$11) {     # For each line in the input file check if $11 is empty
        split($8,a,"/[=;]/");    # if $11 is empty then split $8 on the = and ; and store the value in a 
                next    # process next line in file
        }   # close block
                                    {   # start print block
                                        print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,".",".",".",".",".",a[2],"homref",".";  # print desired output
                                    } # close print block
                    next  # print next line
                    }1' file   # define input

I have syntax errors in the awk but added comments that I hope will help. I use one-liners cause for whatever reason I have a hard time not using them (as evident by the syntax errors).... never-the-less commented seems to help. Thank you Smilie.

Last edited by cmccabe; 11-29-2017 at 07:34 PM.. Reason: fixed format
# 6  
Old 11-29-2017
Assuming that your sample data is representative, the following seems to do what you want:
Code:
awk '
BEGIN {	FS = OFS = "\t"
}
NF == 10 {
	split($8, a, /[=;]/)
	$11 = $12 = $13 = $14 = $15 = $18 = "."
	$16 = (a[1] == "DP") ? a[2] : "DP=num_Missing"
	$17 = "homref"
}
1' file

Note that this assumes that lines that are to be modified only have 10 fields (i.e., 9 <tab>s) as in your sample. It verifies that field #8 does start with DP= and assigns a diagnostic message to field #16 if it isn't. This seemed safer to me than blindly assuming that the input file is correctly formatted. (And, I added a line to my sample input file to be sure that it worked as I expected. It didn't the first two times I tried it.)

When the contents of file is:
Code:
##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr1	948797	.	C	.	0	PASS	DP=159;END=948845;MAX_DP=224;MIN_DP=95	GT:DP:MIN_DP:MAX_DP	0/0:159:95:224
chr1	948846	.	T	TA	1185.38	PASS	AF=0.986111;AO=200;DP=227;FAO=213;FDP=216;FDVR=0;FR=.;FRO=3;FSAF=132;FSAR=81;FSRF=3;FSRR=0;FWDB=-0.038073;FXX=0.0226234;HRUN=1;LEN=1;MLLD=24.748;OALT=A;OID=.;OMAPALT=TA;OPOS=948847;OREF=-;PB=.;PBP=.;QD=21.9515;RBI=0.0905281;REFB=0.0706997;REVB=0.0821327;RO=17;SAF=122;SAR=78;SRF=15;SRR=2;SSEN=0;SSEP=0;SSSB=-0.0430195;STB=0.505617;STBP=0.201;TYPE=ins;VARB=0.000148797	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	1/1:219:227:216:17:3:200:213:0.986111:78:122:15:2:81:132:3:0	0.986	132	81	1	GOOD	216	homalt	36
ALT1	948798	.	D	.	1	PAST	END=948845;MAX_DP=224;MIN_DP=95;DP=159	GT:DP:MIN_DP:MAX_DP	1/1:160:96:225

then the output produced is:
Code:
##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr1	948797	.	C	.	0	PASS	DP=159;END=948845;MAX_DP=224;MIN_DP=95	GT:DP:MIN_DP:MAX_DP	0/0:159:95:224	.	.	.	159	homref	.
chr1	948846	.	T	TA	1185.38	PASS	AF=0.986111;AO=200;DP=227;FAO=213;FDP=216;FDVR=0;FR=.;FRO=3;FSAF=132;FSAR=81;FSRF=3;FSRR=0;FWDB=-0.038073;FXX=0.0226234;HRUN=1;LEN=1;MLLD=24.748;OALT=A;OID=.;OMAPALT=TA;OPOS=948847;OREF=-;PB=.;PBP=.;QD=21.9515;RBI=0.0905281;REFB=0.0706997;REVB=0.0821327;RO=17;SAF=122;SAR=78;SRF=15;SRR=2;SSEN=0;SSEP=0;SSSB=-0.0430195;STB=0.505617;STBP=0.201;TYPE=ins;VARB=0.000148797	GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR	1/1:219:227:216:17:3:200:213:0.986111:78:122:15:2:81:132:3:0	0.986	132	81	1	GOOD	216	homalt	36
ALT1	948798	.	D	.	1	PAST	END=948845;MAX_DP=224;MIN_DP=95;DP=159	GT:DP:MIN_DP:MAX_DP	1/1:160:96:225	.	.	.	DP=num_Missing	homref	.

Hopefully, this will come close to what you want.
This User Gave Thanks to Don Cragun For This Post:
# 7  
Old 12-04-2017
Thank you very much, works perfect thank you Smilie.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

awk to update specific value in file with match and add +1 to specific digit

I am trying to use awk to match the NM_ in file with $1 of id which is tab-delimited. The NM_ will always be in the line of file that starts with > and be after the second _. When there is a match between each NM_ and id, then the value of $2 in id is substituted or used to update the NM_. Each NM_... (3 Replies)
Discussion started by: cmccabe
3 Replies

2. Shell Programming and Scripting

awk to skip lines find text and add text based on number

I am trying to use awk skip each line with a ## or # and check each line after for STB= and if that value in greater than or = to 0.8, then at the end of line the text "STRAND BIAS" is written in else "GOOD". So in the file of 4 entries attached. awk tried: awk NR > "##"' "#" -F"STB="... (6 Replies)
Discussion started by: cmccabe
6 Replies

3. Shell Programming and Scripting

Delete lines above and below specific line of text

I'm trying to remove a specific number of lines, above and below a specific line of text, highlighted in red: <STMTTRN> <TRNTYPE>CREDIT <DTPOSTED>20151205000001 <TRNAMT>10 <FITID>667800001 <CHECKNUM>667800001 <MEMO>BALANCE </STMTTRN> <STMTTRN> <TRNTYPE>DEBIT <DTPOSTED>20151207000001... (8 Replies)
Discussion started by: bomsom
8 Replies

4. UNIX for Dummies Questions & Answers

Add strings from one file at the end of specific lines in text file

Hello All, this is my first post so I don't know if I am doing this right. I would like to append entries from a series of strings (contained in a text file) consecutively at the end of specifically labeled lines in another file. As an example: - the file that contains the values to be... (3 Replies)
Discussion started by: gus74
3 Replies

5. Shell Programming and Scripting

Summing over specific lines and replacing the lines with the sum using sed, awk

Hi friends, This is sed & awk type question. I have a text file which has numbers spread all over the file. I want to sum the series of numbers whenever i find it and produce an output file with the sum. For example ###start of input text file #### abc def ghi 1 2 3 4 kjld random... (3 Replies)
Discussion started by: kaaliakahn
3 Replies

6. Shell Programming and Scripting

extracting specific text from lines

Hello, i've got this output text: and i need it to look something like this: which means that there won't be absolute path of each directory, just it's size and the last word after last '/' in each line, and i also don't need last line '1.7M /tmp' Looks like there is a simple... (5 Replies)
Discussion started by: krater559
5 Replies

7. Shell Programming and Scripting

Assigning a specific format to a specific column in a text file using awk and printf

Hi, I have the following text file: 8 T1mapping_flip02 ok 128 108 30 1 665000-000008-000001.dcm 9 T1mapping_flip05 ok 128 108 30 1 665000-000009-000001.dcm 10 T1mapping_flip10 ok 128 108 30 1 665000-000010-000001.dcm 11 T1mapping_flip15 ok 128 108 30... (2 Replies)
Discussion started by: goodbenito
2 Replies

8. Shell Programming and Scripting

[bash help]Adding multiple lines of text into a specific spot into a text file

I am attempting to insert multiple lines of text into a specific place in a text file based on the lines above or below it. For example, Here is a portion of a zone file. IN NS ns1.domain.tld. IN NS ns2.domain.tld. IN ... (2 Replies)
Discussion started by: cdn_humbucker
2 Replies

9. Shell Programming and Scripting

Insert a text from a specific row into a specific column using SED or AWK

Hi, I am having trouble converting a text file. I have been working for this whole day now, still i couldn't make it. Here is how the text file looks: _______________________________________________________ DEVICE STATUS INFORMATION FOR LOCATION 1: OPER STATES: Disabled E:Enabled ... (5 Replies)
Discussion started by: Issemael
5 Replies

10. Shell Programming and Scripting

Extracting text out of specific lines

Hi, I have a file like LAHORE 2009-04-16 16:04:19 THU S5830 FAULT MESSAGE SUPPRESS STATUS LOC : ASP00 STS : SUPPRESSING CONTINUE INF : F6201 TRUNK. DATA FAULT REPORT COMPLETED LAHORE 2009-04-16 16:04:20 THU S8400 ISUP SIGNALLING TRACE -... (3 Replies)
Discussion started by: krabu
3 Replies
Login or Register to Ask a Question