SED: Can't Repeat Search Character in SED Output

09-14-2009

Registered User

Join Date: Sep 2009

Last Activity: 25 September 2009, 1:32 PM EDT

Posts: 4

Thanks Given: 0

Thanked 0 Times in 0 Posts

I'm not sure if the problem I'm seeing is an artifact of sed or simply a beginner's mistake. Here's the problem: I want to add a zero-width space following each underscore between XML tags. For example, if I had the following XML:

<MY_BIG_TAG>This_is_a_test</MY_BIG_TAG>

It should look like this after I run my sed script:

<MY_BIG_TAG>This_is_a_test</MY_BIG_TAG>

To accomplish this, I found an example from the web and modified it for my purposes. Unfortunately, the example does not allow me to search for the underscore and then re-use it in the output. The script works fine if I just want to search on the underscore character and replace it with a different character (in this case, the zero-width space); however, as soon as I try to search for the underscore and replace it with that same underscore followed by a zero-width space, sed stalls and never completes.

Here's the script:

#!/usr/bin/sh

sed 's/>[^>]*<\//\n&/g #This isolates the strings of interest by inserting newline characters.
:loop
s/\(\n>[^>]*\)_\([^>]*<\/\)/\1_\\2/ #This is supposed to replace "_" with "_" but it does not; instead, it stalls sed.
t loop
s/\n//g' 1.xml #This removes the newline characters

This doesn't look like a problem with the underscore character itself, since I have the same problem no matter what character I search on: I'm unable to find a character and replace it with the same character followed by something new.

This seems like such a basic sed feature that I'm inclined to think I'm doing something wrong.

Any ideas?

Thanks,

Rob

rhetoric101

09-14-2009

Registered User

Join Date: Feb 2007

Last Activity: 20 April 2020, 11:28 AM EDT

Location: The Netherlands

Posts: 7,747

Thanks Given: 139

Thanked 559 Times in 520 Posts

I don't see any difference between the given input and desired output.

To keep the forums high quality for all users, please take the time to format your posts correctly.

First of all, use Code Tags when you post any code or data samples so others can easily read your code. You can easily do this by highlighting your code and then clicking on the # in the editing menu. (You can also type code tags [code] and [/code] by hand.)

Second, avoid adding color or different fonts and font size to your posts. Selective use of color to highlight a single word or phrase can be useful at times, but using color, in general, makes the forums harder to read, especially bright colors like red.

Third, be careful when you cut and paste: edit out any odd characters and make sure all links are working properly.

Thank You.

The UNIX and Linux Forums

Franklin52

09-14-2009

Code correction

I made a mistake when I posted my code example. I neglected to show that I wanted to replace all underscores with that same underscore followed by a zero-width space. Here is the corrected example (I added "" in the sed loop):

Code:

#!/usr/bin/sh

sed 's/>[^>]*<\//\n&/g #This isolates the strings of interest by inserting newline characters.
:loop
s/\(\n>[^>]*\)_\([^>]*<\/\)/\1_\\2/ #This is supposed to replace "_" with "_" but it does not; instead, it stalls sed.
t loop
s/\n//g' 1.xml #This removes the newline characters

---------- Post updated at 03:09 PM ---------- Previous update was at 02:44 PM ----------

The forum's word-processing tool has been removing my example of the zero-width space. When I type an ampersand followed by the pound sign and "8203;", the forum editor renders it as a blank space.

Perhaps the moderator could help with the required escape characters in this editor?
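For reference, here is a minimal sketch of how that entity could be written in a sed replacement. It relies on one fact about sed itself: an unescaped & in the replacement stands for the matched text, so a literal ampersand has to be written as \&. The variable names are illustrative only, and whether the original script escaped the ampersand is not visible in this thread.

```shell
# Sample text from the thread; variable names are illustrative only.
input='This_is_a_test'

# Unescaped: & expands to the matched text (the underscore itself),
# so the entity is mangled.
plain=$(printf '%s\n' "$input" | sed 's/_/_&#8203;/g')

# Escaped: \& is a literal ampersand, so the entity survives intact.
entity=$(printf '%s\n' "$input" | sed 's/_/_\&#8203;/g')

printf '%s\n%s\n' "$plain" "$entity"
```

The first line printed shows the doubled underscore produced by the unescaped &; the second shows the entity after each underscore.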

In the meantime, I'll provide an example using different characters that aren't problematic for the text editor. Here's an example where I want to replace all underscores with that same underscore followed by the letter "Q":

Code:

#!/usr/bin/sh

sed 's/>[^>]*<\//\n&/g #This isolates the strings of interest by inserting newline characters.
:loop
s/\(\n>[^>]*\)_\([^>]*<\/\)/\1_Q\2/ #This is supposed to replace "_" with "_Q" but it does not; instead, it stalls sed.
t loop
s/\n//g' 1.xml #This removes the newline characters

Here is the code for the source file (1.xml)

Code:

<MY_BIG_TAG>This_is_a_test</MY_BIG_TAG>

My desired result with this revised example using "Q" is the following:

Code:

<MY_BIG_TAG>This_Qis_Qa_Qtest</MY_BIG_TAG>

The code above works fine as long as I don't repeat the underscore in the output; in other words, if I replace the underscore with only the letter "Q" sed is able to complete.

Is there a way I can have sed repeat the underscore followed by the letter "Q"?

Thanks,

Rob

rhetoric101

09-15-2009

To replace "_" with "_Q" with sed:

Code:

sed 's/_/_Q/g' file
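
As a quick check on the sample line from this thread (a sketch only; the variable names are illustrative), note that a global substitution also rewrites the underscores inside the tag names, whereas the desired result posted earlier leaves the tag names untouched:

```shell
# Sample line from the thread; variable names are illustrative only.
input='<MY_BIG_TAG>This_is_a_test</MY_BIG_TAG>'

# A global substitution has no notion of being inside or outside a tag,
# so every underscore on the line is rewritten.
out=$(printf '%s\n' "$input" | sed 's/_/_Q/g')
printf '%s\n' "$out"
```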

Franklin52

09-15-2009

Registered User

Join Date: Jul 2009

Last Activity: 23 February 2011, 6:20 AM EST

Posts: 140

Thanks Given: 0

Thanked 0 Times in 0 Posts

Hope this works for you... Slight modification from your solution:

Code:

Input:
<MY_BIG_TAG>This_is_a_test</MY_BIG_TAG>

Code:
sed '
s/\(<[^>]*>\)\([^>]*\)\(<[^>]*>\)/\1\n\2\3/g 
:loop
s/\n\([^<_]*\)_/\1_Q\n/g 
/\n[^<_]*_/b loop
s/\n//g' a

Output:
<MY_BIG_TAG>This_Qis_Qa_Qtest</MY_BIG_TAG>

Explanation:

1. s/\(<[^>]*>\)\([^>]*\)\(<[^>]*>\)/\1\n\2\3/g
This inserts a newline after the opening tag, giving <MY_BIG_TAG>\nThis_is_a_test</MY_BIG_TAG>
2. :loop marks the start of the loop.
3. Replace the first underscore after the \n (and before the next <) with _Q, and move the \n to just after the new _Q.
4. Check whether an underscore still follows the \n before the next <; if it does, go through the loop again.
5. Finally, remove the \n that was inserted in step 1.
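
The steps above can be traced with a small sketch. It is not the script as posted: it swaps the newline for a visible "|" marker (an assumption made purely for display, and only safe if "|" does not occur in the data) and prints the line after every pass, so the marker can be seen moving past each underscore it has handled.

```shell
# Sample line from the thread. The "|" marker is an assumption made for this
# sketch so the cursor is visible; the thread's script uses a newline instead.
input='<MY_BIG_TAG>This_is_a_test</MY_BIG_TAG>'

trace=$(printf '%s\n' "$input" | sed -n '
# Step 1: put the marker just after the opening tag.
s/\(<[^>]*>\)\([^>]*\)\(<[^>]*>\)/\1|\2\3/
p
:loop
# Steps 3 and 4: while an underscore sits between the marker and the next <,
# rewrite it and move the marker past it, printing each intermediate state.
/|[^<_]*_/{
s/|\([^<_]*\)_/\1_Q|/
p
b loop
}
# Step 5: drop the marker.
s/|//
p')
printf '%s\n' "$trace"
```

Because the marker always ends up to the right of the underscore that was just rewritten, and the pattern only looks to the right of the marker, an underscore is never matched twice.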

skmdu

09-15-2009

The code you suggested worked very well! It replaces only the underscores between the tags, which is what I wanted.

Code:

sed '
s/\(<[^>]*>\)\([^>]*\)\(<[^>]*>\)/\1\n\2\3/g 
:loop
s/\n\([^<_]*\)_/\1_Q\n/g 
/\n[^<_]*_/b loop
s/\n//g' a

What I'd like to know is which key differences in your script enable sed to reuse the underscore, whereas in mine sed completely hangs if I try to use "\1_Q" (but works if I just use "\1Q").

Any ideas?

rhetoric101

09-16-2009

Registered User

Join Date: May 2005

Last Activity: 28 October 2019, 4:59 PM EDT

Location: In the leftmost byte of /dev/kmem

Posts: 6,384

Thanks Given: 143

Thanked 2,214 Times in 1,548 Posts

Quote:

Originally Posted by rhetoric101

sed completely hangs if I try to use the "\1_Q" (but works if I just use "\1Q").

You basically search for "_" and replace it with "_Q". In the next round of the loop you find what you have just replaced and replace again thereby establishing an infinite loop.

You would have to search for "_<not followed by a Q>" (in regex "_[^Q]") to avoid that loop.
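
A rough sketch of both points, assuming GNU sed and the sample line from this thread (the function names are illustrative only). The first function unrolls two passes of the original loop body to show the same underscore being matched twice; the second adds the guard as "_[^Q]", with no trailing "*", because "[^Q]*" can match zero characters and would therefore still accept "_Q".

```shell
# Sample line from the thread; function names are illustrative only.
# Assumes GNU sed, as the scripts in the thread do (\n in the replacement).
input='<MY_BIG_TAG>This_is_a_test</MY_BIG_TAG>'

# Two unrolled passes of the original loop body. The greedy [^>]* lands
# on the same (last) underscore both times, so the Qs pile up.
unrolled() {
  sed 's/>[^>]*<\//\n&/g
s/\(\n>[^>]*\)_\([^>]*<\/\)/\1_Q\2/
s/\(\n>[^>]*\)_\([^>]*<\/\)/\1_Q\2/
s/\n//g'
}

# Same loop, but the underscore must be followed by a non-Q character,
# so an underscore that was already handled no longer matches.
guarded() {
  sed 's/>[^>]*<\//\n&/g
:loop
s/\(\n>[^>]*\)_\([^Q][^>]*<\/\)/\1_Q\2/
t loop
s/\n//g'
}

stuck=$(printf '%s\n' "$input" | unrolled)
fixed=$(printf '%s\n' "$input" | guarded)
printf '%s\n%s\n' "$stuck" "$fixed"
```

The guard has limits worth keeping in mind: an underscore that is already followed by a Q in the input, or one that sits directly before the closing tag, is skipped by this pattern.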

I hope this helps.

bakunin

bakunin
