awk to add text to matching pattern in field


 
# 1  
Old 01-23-2018

In the awk below I am trying to add :p=? to the end of each ;-separated entry in $9 when $9 matches the pattern NM_. The script executes and is close, but I cannot figure out why the :p=? repeats across the split entries, as shown in the current output. I have added comments as well. Thank you :).

file
Code:
R_Index	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	Inheritance	ExonicFunc.refGene	AAChange.refGene
1	chr1	155870416	155870416	G	A	splicing	RIT1	NM_001256821:exon6:c.481-7C>T;NM_001256820:exon5:c.322-7C>T;NM_006912:exon6:c.430-7C>T
9	chr10	112760138	112760138	A	-	splicing	SHOC2	NM_007373:exon4:c.842-35A>-;NM_001269039:exon2:c.704-35A>-
11	chr18	53070914	53070914	G	A	exonic	TCF4	.	AD	nonsynonymous SNV	TCF4:NM_001243232:exon1:c.32C>T:p.A11V;TCF4:NM_001306208:exon1:c.32C>T:p.A11V


awk
Code:
awk '
  BEGIN { FS=OFS="\t" }  # define FS and OFS as tab and start processing
  $9 ~ /NM/ {            # look for pattern NM in $9
       # split $9 by ";" and cycle through them
          out=""   # array out is empty
      i=split($9,NM,/;/)
         for (n=1; n<=i; n++) {
          sub(/$/, ":p=", NM[i])   # add :p. to end off each NM[i] before the ;
          out = (out=="" ? "" : out";") NM[i]  # add ? to each NM[i] and store in array out
         }
      $9 = out  # update with array out
}1' file

desired output
Code:
R_Index	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	Inheritance	ExonicFunc.refGene	AAChange.refGene
1	chr1	155870416	155870416	G	A	splicing	RIT1	NM_001256821:exon6:c.481-7C>T:p=?;NM_001256820:exon5:c.322-7C>T:p=?;NM_006912:exon6:c.430-7C>T:p=?
9	chr10	112760138	112760138	A	-	splicing	SHOC2	NM_007373:exon4:c.842-35A>-:p=?;NM_001269039:exon2:c.704-35A>-:p=?
11	chr18	53070914	53070914	G	A	exonic	TCF4	.	AD	nonsynonymous SNV	TCF4:NM_001243232:exon1:c.32C>T:p.A11V;TCF4:NM_001306208:exon1:c.32C>T:p.A11V

current output
Code:
R_Index	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	Inheritance	ExonicFunc.refGene	AAChange.refGene
1	chr1	155870416	155870416	G	A	splicing	RIT1	NM_006912:exon6:c.430-7C>T:p=?;NM_006912:exon6:c.430-7C>T:p=?:p=?;NM_006912:exon6:c.430-7C>T:p=?:p=?:p=?
9	chr10	112760138	112760138	A	-	splicing	SHOC2	NM_001269039:exon2:c.704-35A>-:p=?;NM_001269039:exon2:c.704-35A>-:p=?:p=?
11	chr18	53070914	53070914	G	A	exonic	TCF4	.AD	nonsynonymous SNV	TCF4:NM_001243232:exon1:c.32C>T:p.A11V;TCF4:NM_001306208:exon1:c.32C>T:p.A11V

# 2  
Old 01-23-2018
Code:
awk '
  BEGIN { FS=OFS="\t" }
  $9 ~ /NM/ {
       gsub(";", ":p=?;", $9);
       sub("$", ":p=?", $9);
  } 1' file

# 3  
Old 01-23-2018
Hi cmccabe,
I agree with rdrtx1 that the code suggested in post #2 should do what you want.

What I don't understand is how the code you showed us in post #1 could produce the output that you labeled as "current output" in that post. Are you absolutely positive that the code you showed us in post #1 produced the output you showed us when file had the contents you showed us in that post?

The code you showed us seems like it would produce the desired number of additions to your field #9, but would omit the desired question marks and just append :p= to each subfield. The code to which you appended the comment:
Code:
# add ? to each NM[i] and store in array out

does not add question marks; it rebuilds field number 9 by adding back the semicolons that were removed by the split(). Also note that the variable named out in your code is a string, not an array.
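
As a small illustration of that point (a sketch only, not code from this thread; the function name split_and_rejoin and the sample string are made up), the following shows that split() returns a count and drops the ";" separators, and that out is an ordinary string built up by concatenation:

```shell
# Hypothetical helper, for illustration only.
split_and_rejoin() {
    awk '{
        count = split($0, part, /;/)   # split() returns the number of pieces
        out = ""                       # out is a plain string, not an array
        for (n = 1; n <= count; n++)
            out = (out == "" ? "" : out ";") part[n]   # put back the ";" removed by split()
        print count, out
    }'
}

# Made-up sample input standing in for field 9.
printf '%s\n' 'NM_1:c.1A>T;NM_2:c.2C>G' | split_and_rejoin
# prints: 2 NM_1:c.1A>T;NM_2:c.2C>G
```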
# 4  
Old 01-23-2018
Thank you very much rdrtx1, that works perfectly :).

Don Cragun, you are correct in that:

Quote:
The code you showed us seems like it would produce the desired number of additions to your field #9, but would omit the desired question marks and just append :p= to each subfield. The code to which you appended the comment:
I forgot that I had changed
sub(/$/, ":p=", NM[i]) # add :p. to end off each NM[i] before the ;
to
sub(/$/, ":p=?", NM[i]) # add :p. to end off each NM[i] before the ;

However, the :p=? seemed to be repeated once more for each split entry. Maybe that is the wrong terminology, but I didn't understand why, no matter what I tried. Thank you for the correction that out is a string and not an array; I was confused.

awk
Code:
awk '
   BEGIN { FS=OFS="\t" }  # define FS and OFS as tab and start processing
   $9 ~ /NM/ {            # look for pattern NM in $9
        # split $9 by ";" and cycle through them
           out=""
       i=split($9,NM,/;/)
          for (n=1; n<=i; n++) {
           sub(/$/, ":p=", NM[i])   # add :p. to end off each NM[i] before the ;
           out = (out=="" ? "" : out";") NM[i]
          }
       $9 = out
}1' file

R_Index	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	Inheritance	ExonicFunc.refGene	AAChange.refGene
1	chr1	155870416	155870416	G	A	splicing	RIT1	NM_006912:exon6:c.430-7C>T:p=;NM_006912:exon6:c.430-7C>T:p=:p=;NM_006912:exon6:c.430-7C>T:p=:p=:p=
9	chr10	112760138	112760138	A	-	splicing	SHOC2	NM_001269039:exon2:c.704-35A>-:p=;NM_001269039:exon2:c.704-35A>-:p=:p=
11	chr18	53070914	53070914	G	A	exonic	TCF4	.AD	nonsynonymous SNV	TCF4:NM_001243232:exon1:c.32C>T:p.A11V;TCF4:NM_001306208:exon1:c.32C>T:p.A11V

Are my comments on rdrtx1's awk below close?

Code:
awk '
  BEGIN { FS=OFS="\t" }  # define FS and OFS as tab and start processing
  $9 ~ /NM/ { # look for pattern NM in $9
       gsub(";", ":p=?;", $9);  # split by ; in $9
       sub("$", ":p=?", $9);  # add :p=? to end of each split by ;
  } 1' file  # update input

Thank you very much :).
# 5  
Old 01-23-2018
Your code had four minor bugs. If you change what you showed us in post #1 to:
Code:
awk '
BEGIN { FS=OFS="\t" }  # define FS and OFS as tab and start processing
$9 ~ /NM/ {            # look for pattern NM in $9
	# split $9 by ";" and cycle through them
	out=""   # output string starts out empty
	i=split($9,NM,/;/)   # i is the number of subfields
	for (n=1; n<=i; n++) {
		sub(/$/, ":p=?", NM[n])   # add ":p=?" to the end of each NM[n]
		out = (out=="" ? "" : out";") NM[n]  # add updated NM[n] to new output string, restoring ";"s.
	}
	$9 = out  # replace field #9 with updated output string
}1' file

you'll get the output you wanted.
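
To see why the original loop piled up the suffix, here is a minimal sketch (the function name demo_index_bug and the input A;B;C are invented for this illustration). Because i holds the count returned by split(), NM[i] is always the last piece, so every pass through the loop edits that same element again:

```shell
# Hypothetical demo of the NM[i] versus NM[n] mix-up; not code from the thread.
demo_index_bug() {
    awk '{
        i = split($0, NM, /;/)
        for (n = 1; n <= i; n++) {
            sub(/$/, ":p=?", NM[i])   # bug: i is the count, so this always edits the LAST piece
            printf "pass %d: NM[%d] = %s\n", n, i, NM[i]
        }
    }'
}

printf '%s\n' 'A;B;C' | demo_index_bug
# pass 1: NM[3] = C:p=?
# pass 2: NM[3] = C:p=?:p=?
# pass 3: NM[3] = C:p=?:p=?:p=?
```

Each pass also appended NM[i] to out, which is why the output in post #1 showed the last transcript repeated with one, two, and then three suffixes.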

But, rdrtx1's code is easier to read and probably faster. Some of your comments on rdrtx1's code are a little bit off. Try changing:
Code:
       gsub(";", ":p=?;", $9);  # split by ; in $9
       sub("$", ":p=?", $9);  # add :p=? to end of each split by ;

to:
Code:
       gsub(";", ":p=?;", $9);  # prepend ":p=?" to each of the subfield separators.
       sub("$", ":p=?", $9);  # add ":p=?" to end of the last subfield
