Remove Unicode/special chars from XML


 
# 1  
Old 02-10-2012

Hi,

We receive an XML file on Unix that contains special characters, such as '^', in the text between tags. For example:

<Tag> 1e^O7f%<2304e.$d8f57e8^Bf-&e.^Zh7/327e^O7 </Tag>

We need to remove all special characters such as the ^ ones, and also any '&', '<' or '>' that appears between the start and close tags, i.e. in the tag text.

The upstream system is sending some Unicode characters that show up as caret symbols on Unix (in addition to the &, > and <). This causes my XML parser to abort or to drop rows that contain such data.

Please provide a Perl command to remove them. (We need to remove the '&', '<' and '>' that are present in the tag text.)

Thanks
DSR
# 2  
Old 02-10-2012
Some tools display unprintable characters as ^Z and the like, but that doesn't mean the file literally contains the character ^ followed by the character Z. That's just their way of showing characters they can't represent any other way.

So I think deleting the UTF-8 bytes themselves would be a good thing to try first; they're probably still there, unconverted. Since every byte of a multi-byte UTF-8 character is >= 128 (octal 200), we can use tr to strip out that entire range.

Code:
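# No brackets: POSIX tr treats [ and ] as literal characters, so '[\200-\377]' would also delete them.
tr -d '\200-\377' < input > output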
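
If the ^O, ^B and ^Z in the sample are caret notation, they stand for single control bytes below 32 rather than for bytes >= 128, so the range above would not touch them. A minimal sketch for that case follows. The function name strip_control_chars is made up for illustration, and it assumes that tab, newline and carriage return should be kept because XML 1.0 permits them.

```shell
# strip_control_chars: hypothetical helper name, not from the thread.
# Reads stdin, writes stdout. Deletes ASCII control bytes except
# tab (\011), newline (\012) and carriage return (\015).
strip_control_chars() {
  LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}

# \032 is the single byte that caret notation shows as ^Z.
printf 'e.\032h7\n' | strip_control_chars
```

A literal '^' followed by 'Z' passes through this filter unchanged, so if the carets survive it, they really are two ASCII characters in the file.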

# 3  
Old 02-10-2012
Thanks

I ran the command below but it didn't work, so I assume the special characters in the XML file on Unix are in fact plain ASCII:

tr -d '[\200-\377]' < MyOld.xml > MyNew.xml
# 4  
Old 02-10-2012
Can you post output of:
Code:
cat MyOld.xml | head

and
Code:
cat -e MyOld.xml | head

# 5  
Old 02-10-2012
The UTF characters are fine for now. I am more worried about the additional '&', '<' and '>' characters that I have to remove from the tag text, as they are failing my XML parser.
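
One possible approach to that part, as a sketch only: it uses awk from the shell rather than Perl, the function name strip_tag_text_markup is made up for illustration, and it assumes the element name is known, carries no attributes, is not nested, and has its open and close tag on the same line (true for a single-record file, provided the awk in use accepts very long lines).

```shell
# strip_tag_text_markup NAME: hypothetical helper, not from the thread.
# Reads XML on stdin and deletes &, < and > found between <NAME> and
# </NAME>. Everything outside those elements is copied unchanged.
strip_tag_text_markup() {
  awk -v otag="<$1>" -v ctag="</$1>" '
  {
    out = ""
    rest = $0
    while ((s = index(rest, otag)) > 0) {
      # Copy everything up to and including the opening tag.
      out = out substr(rest, 1, s - 1 + length(otag))
      rest = substr(rest, s + length(otag))
      e = index(rest, ctag)
      if (e == 0) break            # no closing tag on this line: leave as is
      text = substr(rest, 1, e - 1)
      gsub(/[&<>]/, "", text)      # drop the three markup characters
      out = out text ctag
      rest = substr(rest, e + length(ctag))
    }
    print out rest
  }'
}
```

It would be run as strip_tag_text_markup Tag < MyOld.xml > MyNew.xml. Any child element inside the named element would lose its angle brackets too, so this only suits elements that hold plain text.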

---------- Post updated at 03:57 PM ---------- Previous update was at 03:53 PM ----------

Hi Bartus11, unfortunately my XML is one big single record, so your commands pull in the whole file. Please let me know what specific information from the head of the file you are looking for.
# 6  
Old 02-10-2012
Would it be possible to re-download these XML files in an unconverted state? I think someone tried to remove the UTF-8 with cat -v, which rewrites unprintable bytes as literal ^X sequences, and ruined it.
# 7  
Old 02-10-2012
OK, so try this:
Code:
cut -b1-200 MyOld.xml | od -c

I hope there are some of those mysterious characters in the first 200 bytes of your file.
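
As a complement to reading the od -c dump by eye, the sketch below counts the two kinds of suspect bytes in a file. The function name count_suspect_bytes and its output format are made up for illustration. It scans the whole file rather than the first 200 bytes.

```shell
# count_suspect_bytes FILE: hypothetical helper, not from the thread.
# Prints how many control bytes (excluding tab, newline and carriage
# return) and how many high-bit bytes (>= octal 200) FILE contains.
count_suspect_bytes() {
  ctrl=$(LC_ALL=C tr -cd '\000-\010\013\014\016-\037' < "$1" | wc -c | tr -d ' ')
  high=$(LC_ALL=C tr -cd '\200-\377' < "$1" | wc -c | tr -d ' ')
  echo "control=$ctrl high=$high"
}
```

If both counts are 0, any ^ sequences seen in the file are literal caret characters. A non-zero control count means a viewer is showing real control bytes in caret notation, and a non-zero high count means multi-byte UTF-8 (or another 8-bit encoding) is still present.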
 