awk to add text to matching pattern in field

01-23-2018

Registered User

1,393, 20

Join Date: Nov 2013

Last Activity: 1 May 2020, 2:35 PM EDT

Location: Chicago

Posts: 1,393

Thanks Given: 901

Thanked 20 Times in 19 Posts

In the awk below I am trying to add :p=? to the end of each ;-separated entry in $9 when $9 matches the pattern NM_. The code executes and is close, but I cannot figure out why the :p=? repeats in the split, as shown in the current output. I have added comments as well. Thank you.

file

Code:

R_Index	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	Inheritance	ExonicFunc.refGene	AAChange.refGene
1	chr1	155870416	155870416	G	A	splicing	RIT1	NM_001256821:exon6:c.481-7C>T;NM_001256820:exon5:c.322-7C>T;NM_006912:exon6:c.430-7C>T
9	chr10	112760138	112760138	A	-	splicing	SHOC2	NM_007373:exon4:c.842-35A>-;NM_001269039:exon2:c.704-35A>-
11	chr18	53070914	53070914	G	A	exonic	TCF4	.	AD	nonsynonymous SNV	TCF4:NM_001243232:exon1:c.32C>T:p.A11V;TCF4:NM_001306208:exon1:c.32C>T:p.A11V

awk

Code:

awk '
  BEGIN { FS=OFS="\t" }  # define FS and OFS as tab and start processing
  $9 ~ /NM/ {            # look for pattern NM in $9
       # split $9 by ";" and cycle through them
          out=""   # array out is empty
      i=split($9,NM,/;/)
         for (n=1; n<=i; n++) {
          sub(/$/, ":p=", NM[i])   # add :p. to end off each NM[i] before the ;
          out = (out=="" ? "" : out";") NM[i]  # add ? to each NM[i] and store in array out
         }
      $9 = out  # update with array out
}1' file

desired output

Code:

R_Index	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	Inheritance	ExonicFunc.refGene	AAChange.refGene
1	chr1	155870416	155870416	G	A	splicing	RIT1	NM_001256821:exon6:c.481-7C>T:p=?;NM_001256820:exon5:c.322-7C>T:p=?;NM_006912:exon6:c.430-7C>T:p=?
9	chr10	112760138	112760138	A	-	splicing	SHOC2	NM_007373:exon4:c.842-35A>-:p=?;NM_001269039:exon2:c.704-35A>-:p=?
11	chr18	53070914	53070914	G	A	exonic	TCF4	.	AD	nonsynonymous SNV	TCF4:NM_001243232:exon1:c.32C>T:p.A11V;TCF4:NM_001306208:exon1:c.32C>T:p.A11V

current output

Code:

R_Index	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	Inheritance	ExonicFunc.refGene	AAChange.refGene
1	chr1	155870416	155870416	G	A	splicing	RIT1	NM_006912:exon6:c.430-7C>T:p=?;NM_006912:exon6:c.430-7C>T:p=?:p=?;NM_006912:exon6:c.430-7C>T:p=?:p=?:p=?
9	chr10	112760138	112760138	A	-	splicing	SHOC2	NM_001269039:exon2:c.704-35A>-:p=?;NM_001269039:exon2:c.704-35A>-:p=?:p=?
11	chr18	53070914	53070914	G	A	exonic	TCF4	.	AD	nonsynonymous SNV	TCF4:NM_001243232:exon1:c.32C>T:p.A11V;TCF4:NM_001306208:exon1:c.32C>T:p.A11V

cmccabe

01-23-2018

Read Only

1,278, 486

Join Date: Sep 2012

Last Activity: 27 February 2020, 8:59 PM EST

Location: Houston, Texas, USA

Posts: 1,278

Thanks Given: 0

Thanked 486 Times in 451 Posts

Code:

awk '
  BEGIN { FS=OFS="\t" }
  $9 ~ /NM/ {
       gsub(";", ":p=?;", $9);
       sub("$", ":p=?", $9);
  } 1' file

rdrtx1

01-23-2018

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

Hi cmccabe,
I agree with rdrtx1 that the code suggested in post #2 should do what you want.

What I don't understand is how the code you showed us in post #1 could produce the output that you labeled as "current output" in that post. Are you absolutely positive that the code you showed us in post #1 produced the output you showed us when file had the contents you showed us in that post?

The code you showed us seems like it would produce the desired number of additions to your field #9, but would omit the desired question marks and just append :p= to each subfield. The code to which you appended the comment:

Code:

# add ? to each NM[i] and store in array out

does not add question marks; it rebuilds field number 9 by adding back the semicolons that were removed by the split(). And note that the variable named out in your code is a string, not an array.
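To make the split-and-rejoin point concrete, here is a minimal sketch run on a single field value instead of the whole file. The shell variables field and rejoined and the awk names part and k are made up for illustration; only the split() call and the out expression come from the code in post #1.

```shell
# Hypothetical stand-in for one value of field 9 (taken from the sample file).
field='NM_007373:exon4:c.842-35A>-;NM_001269039:exon2:c.704-35A>-'

rejoined=$(printf '%s\n' "$field" | awk '{
    n = split($0, part, /;/)    # part[] is an array; n is the number of pieces
    out = ""                    # out is an ordinary string, not an array
    for (k = 1; k <= n; k++)
        out = (out == "" ? "" : out ";") part[k]   # re-insert ";" between pieces
    print n, out
}')
printf '%s\n' "$rejoined"
```

With nothing appended to the pieces, the loop simply reproduces the original field, preceded here by the piece count (2).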

Don Cragun

01-23-2018

Thank you very much rdrtx1, that works perfectly.

Don Cragun, you are correct in that:

Quote:

The code you showed us seems like it would produce the desired number of additions to your field #9, but would omit the desired question marks and just append :p= to each subfield. The code to which you appended the comment:

I forgot that I changed
sub(/$/, ":p=", NM[i])   # add :p. to end off each NM[i] before the ;
to
sub(/$/, ":p=?", NM[i])   # add :p. to end off each NM[i] before the ;

However, the :p=? seemed to repeat according to the number of splits. Maybe that is the wrong terminology, but I could not understand why, no matter what I tried. Thank you for the correction that out is a string and not an array; I was confused.

awk

Code:

awk '
   BEGIN { FS=OFS="\t" }  # define FS and OFS as tab and start processing
   $9 ~ /NM/ {            # look for pattern NM in $9
        # split $9 by ";" and cycle through them
           out=""
       i=split($9,NM,/;/)
          for (n=1; n<=i; n++) {
           sub(/$/, ":p=", NM[i])   # add :p. to end off each NM[i] before the ;
           out = (out=="" ? "" : out";") NM[i]
          }
       $9 = out
}1' file

R_Index	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	Inheritance	ExonicFunc.refGene	AAChange.refGene
1	chr1	155870416	155870416	G	A	splicing	RIT1	NM_006912:exon6:c.430-7C>T:p=;NM_006912:exon6:c.430-7C>T:p=:p=;NM_006912:exon6:c.430-7C>T:p=:p=:p=
9	chr10	112760138	112760138	A	-	splicing	SHOC2	NM_001269039:exon2:c.704-35A>-:p=;NM_001269039:exon2:c.704-35A>-:p=:p=
11	chr18	53070914	53070914	G	A	exonic	TCF4	.	AD	nonsynonymous SNV	TCF4:NM_001243232:exon1:c.32C>T:p.A11V;TCF4:NM_001306208:exon1:c.32C>T:p.A11V

In rdrtx1's awk, are my comments below close?

Code:

awk '
  BEGIN { FS=OFS="\t" }  # define FS and OFS as tab and start processing
  $9 ~ /NM/ { # look for pattern NM in $9
       gsub(";", ":p=?;", $9);  # split by ; in $9
       sub("$", ":p=?", $9);  # add :p=? to end of each split by ;
  } 1' file  # update input

Thank you very much

cmccabe

01-23-2018

Your code had four minor bugs. If you change what you showed us in post #1 to:

Code:

awk '
BEGIN { FS=OFS="\t" }  # define FS and OFS as tab and start processing
$9 ~ /NM/ {            # look for pattern NM in $9
	# split $9 by ";" and cycle through them
	out=""   # output string starts empty
	i=split($9,NM,/;/)
	for (n=1; n<=i; n++) {
		sub(/$/, ":p=?", NM[n])   # add ":p=?" to end of each NM[n]
		out = (out=="" ? "" : out";") NM[n]  # add updated NM[n] to new output string, restoring ";"s.
	}
	$9 = out  # replace field #9 with updated output string
}1' file

you'll get the output you wanted.
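As for why the text repeated: split() returns the number of pieces, so inside the loop i never changes and NM[i] is always the last piece. Each pass appends to that same element and then copies it into out again. A small side-by-side sketch, assuming a made-up three-piece value A;B;C in place of the real NM_ entries (the shell variables field, buggy and fixed are illustrative names only):

```shell
field='A;B;C'   # hypothetical stand-in for the ";"-separated entries in field 9

# Indexing with i (the piece count): every pass touches the LAST element.
buggy=$(printf '%s\n' "$field" | awk '{
    i = split($0, NM, /;/); out = ""
    for (n = 1; n <= i; n++) {
        sub(/$/, ":p=?", NM[i])                  # appends to NM[3] three times
        out = (out == "" ? "" : out ";") NM[i]   # copies NM[3] three times
    }
    print out
}')

# Indexing with the loop variable n: each element is updated exactly once.
fixed=$(printf '%s\n' "$field" | awk '{
    i = split($0, NM, /;/); out = ""
    for (n = 1; n <= i; n++) {
        sub(/$/, ":p=?", NM[n])
        out = (out == "" ? "" : out ";") NM[n]
    }
    print out
}')

printf 'buggy: %s\nfixed: %s\n' "$buggy" "$fixed"
```

The buggy variant prints C:p=?;C:p=?:p=?;C:p=?:p=?:p=?, which has the same shape as the "current output" in post #1, and the fixed variant prints A:p=?;B:p=?;C:p=?.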

But, rdrtx1's code is easier to read and probably faster. Some of your comments on rdrtx1's code are a little bit off. Try changing:

Code:

       gsub(";", ":p=?;", $9);  # split by ; in $9
       sub("$", ":p=?", $9);  # add :p=? to end of each split by ;

to:

Code:

       gsub(";", ":p=?;", $9);  # prepend ":p=?" to each of the subfield separators.
       sub("$", ":p=?", $9);  # add ":p=?" to end of the last subfield
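To see the two steps separately, here is a small sketch that applies the same gsub() and sub() calls to one shortened field value and prints the string after each call. The sample value, the awk variable f and the shell variable steps are invented for illustration; the two function calls are the ones from post #2.

```shell
# Hypothetical shortened value of field 9 with two ";"-separated entries.
steps=$(printf '%s\n' 'NM_1:c.1A>T;NM_2:c.2A>T' | awk '{
    f = $0
    gsub(";", ":p=?;", f)       # every ";" becomes ":p=?;"
    print "after gsub: " f
    sub("$", ":p=?", f)         # the end of the string gets one more ":p=?"
    print "after sub:  " f
}')
printf '%s\n' "$steps"
```

The first printed line shows that gsub() alone leaves the last entry without a suffix, which is why the separate sub() on the end of the string is needed.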

Don Cragun
