Simple sed one-liner for fixing unencoded ampersands

I receive some XML files that consistently contain badly encoded content. Some ampersands are not encoded correctly, which causes my XML parser to halt.
I wrote a sed one-liner to fix any standalone "&":

sed -e 's/&[^amp;|^apos;|^quot;|^lt;|^gt;]/&/gi' input.xml

Test file for input:
<source> &quot; One &quot; </source>
<name>test &amp; test</name>
<address>test3 &apos; test3</address>
<area> test5 &lt; test5</area>
<post> test6 &gt; </post>
<test> test7 &</test>

My problem is that the character after the "&" is removed as well, destroying the XML tag:

<source> &quot; One &quot; </source>
<name>test &amp; test</name>
<address>test3 &apos; test3</address>
<area> test5 &lt; test5</area>
<post> test6 &gt; </post>
<test> test7 &amp;/test>

I tried the script on both Unix and Windows 2000 (with unixutil).
Any ideas?

sed -e 's/&[^amp;|^apos;|^quot;|^lt;|^gt;]/\&amp;/gi' input.xml

Try escaping the "&".
I get the same result with the "&" escaped:

sed -e 's/\&[^amp;|^apos;|^quot;|^lt;|^gt;]/\&amp;/gi' input.xml

<source> &quot; One &quot; </source>
<name>test &amp; test</name>
<address>test3 &apos; test3</address>
<area> test5 &lt; test5</area>
<post> test6 &gt; </post>
<test> test7&amp;/test>
Your match also catches the character after the "&". That is what the bracket expression does: it is a negated character class, so it matches exactly one character that is not in the list. Wrap that expression in parentheses (remember to escape them!), and then include a backreference in the substitution.
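To make the point above concrete, here is a minimal sketch using a shortened version of the bracket expression. The function name `demo_class` is made up for illustration, and the two input lines are sample data, not taken from the real XML files. It shows that `[^amp;]` consumes one character (here the `<`), and that it refuses to match when the next character happens to be `a`, `m`, `p` or `;`.

```shell
# Hypothetical demo helper: same idea as the one-liner above,
# with the bracket expression shortened to [^amp;].
demo_class() {
  # [^amp;] matches ONE character that is not a, m, p or ";".
  # That character is part of the match, so it is replaced too.
  sed 's/&[^amp;]/\&amp;/g'
}

printf '%s\n' 'test7 &</test>' 'R&apple' | demo_class
# prints:
#   test7 &amp;/test>   (the "<" was swallowed by the match)
#   R&apple             (no match, because "a" is in the excluded set)
```' 'R&apple' | demo_class)
expected=$(printf '%s\n' 'test7 &amp;/test>' 'R&apple')
[ "$out" = "$expected" ]
</test>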
OK, thanks for the advice.
In Perl the contents of the parentheses are available as $1, $2, etc.
What is the syntax in sed?

In sed's basic regular expression syntax, a group is written \( ... \) and referenced in the replacement as \1, \2, etc.:
echo 'Hi mom!' | sed 's/ mom\(.\)/\1  How are you?/'

Hi!  How are you?

That works OK:

echo "AB &CD&amp;EF" | sed -e 's/\&\([^\amp;]\)/\&amp;\1/'
AB &amp;CD&amp;EF
