xmlstarlet parse non en_US characters


 
# 1  
Old 11-30-2010

I'm parsing around 600K XML files, each with roughly 1500 lines of text. Some of the lines include Chinese, Russian, or other non-English text. I'm using a bash script that runs
Code:
cat "$i" | xmlstarlet sel -t -m "//section1/section2/section3/section4/section5" -v "@VALUE" -n > somefile

which works, but I get parse errors like
Code:
-:2350: parser error : invalid character in attribute value
     <NODE NAME="something" VALUE="^Tï¿<96>ᄂ←ᄍ￰ï¿<8f>￲/dPï¾Â*ï¿<84>ï¿<88>mu" />
                                  ^
-:2350: parser error : attributes construct error
     <NODE NAME="something" VALUE="^Tï¿<96>ᄂ←ᄍ￰ï¿<8f>￲/dPï¾Â*ï¿<84>ï¿<88>mu" />
                                  ^
-:2350: parser error : Couldn't find end of Start Tag NODE line 2350
     <NODE NAME="something" VALUE="^Tï¿<96>ᄂ←ᄍ￰ï¿<8f>￲/dPï¾Â*ï¿<84>ï¿<88>mu" />
                                  ^
-:2350: parser error : PCDATA invalid Char value 20
     <NODE NAME="something" VALUE="^Tï¿<96>ᄂ←ᄍ￰ï¿<8f>￲/dPï¾Â*ï¿<84>ï¿<88>mu" />

I have installed all locales. Is there a way to bulk-convert all the files to UTF-8 (or some other encoding), or something I should install, or am I going about it the wrong way?
# 2  
Old 11-30-2010
I'm not sure what you are trying to achieve.
Maybe try
Code:
strings $i |

instead of the
Code:
cat $i |

By the way, are you sure xmlstarlet is reliable and up to date?
# 3  
Old 11-30-2010
xmlstarlet relies on libxml2, which uses UTF-8 internally. For more information, see LIBXML2 - Encodings support.
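
Since libxml2 works in UTF-8 internally, one quick thing to rule out is whether the files are valid UTF-8 at all. A minimal sketch, assuming an iconv that exits non-zero on an invalid byte sequence (glibc and libiconv both do); the function name `is_valid_utf8` is made up for illustration:

```shell
# is_valid_utf8 FILE   (hypothetical helper, not part of xmlstarlet)
# Succeeds (exit status 0) when FILE decodes cleanly as UTF-8.
# Converting UTF-8 to UTF-8 changes nothing, but iconv stops with a
# non-zero status at the first byte sequence that is not valid UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1
}

# Possible use inside a loop over file names held in "$i":
#   is_valid_utf8 "$i" || echo "not valid UTF-8: $i" >&2
```

Keep in mind that a file can pass this check and still be rejected by the XML parser, because some characters that are valid UTF-8 (most control characters, for instance) are not allowed in XML 1.0.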
# 4  
Old 11-30-2010
I tried xml2, which only converts XML to a flat file format. Beyond that I don't know what else to use for parsing XML from bash. I've written a couple of basic parsers for similar tasks, but I've found their error handling is poor. I think that if I could get xmlstarlet to read these files as extended ASCII it might work, but I don't know how to do that.

strings didn't seem to help either.

---------- Post updated at 05:59 PM ---------- Previous update was at 05:47 PM ----------

It seems I can't get libxml2 to handle extended ASCII without a lot of pain. I'm wondering if there's a way to convert each file when I read it in from a list, which I do with:
Code:
cat "${@:-somelist.txt}" |
while IFS= read -r i        # -r and IFS= keep backslashes and leading spaces in file names
do
        strings "$i" | xmlstarlet sel -t -m "//sec1/sec2/sec3/sec4/sec5" -v "@VALUE" -n > somefile
        value1=$(sed -n "1p" somefile)   # first extracted value
...

I know my looping probably isn't the most elegant, but it works, apart from the encoding. Is there some command I can use to convert the string before it gets read by xmlstarlet?

By the way, I'm using Debian Squeeze, which ships xmlstarlet 1.0.2-1.
# 5  
Old 11-30-2010
iconv might be helpful here. You can probably extract the document's charset from the XML declaration.


Code:
iconv -f "${SRC_CHARSET:-UTF-8}" -t UTF-8 "$i" | xmlstarlet sel -t -m "//sec1/sec2/sec3/sec4/sec5" -v "@VALUE" -n  | iconv -f UTF-8 -t "${SRC_CHARSET:-UTF-8}" > somefile
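
Pulling the charset out of the XML declaration could look roughly like the sketch below. It assumes the declaration sits on the first line and that the file uses an ASCII-compatible encoding (so it will not work for UTF-16 or UTF-32). The function name `xml_declared_encoding` is made up for illustration.

```shell
# xml_declared_encoding FILE   (hypothetical helper)
# Prints the encoding named in the XML declaration on the first line of
# FILE, or prints nothing when no encoding is declared there.
xml_declared_encoding() {
    # LC_ALL=C makes sed treat the input as plain bytes, so a stray
    # non-UTF-8 byte on the first line cannot break the match.
    head -n 1 < "$1" |
        LC_ALL=C sed -n "s/^.*<?xml[^>]*encoding=[\"']\([^\"']*\)[\"'].*/\1/p"
}

# Possible use with the pipeline above:
#   SRC_CHARSET=$(xml_declared_encoding "$i")
```

One caveat: after iconv has converted the bytes, the declaration inside the file still names the original encoding, so a file that really was in another charset may also need its declaration updated before the parser reads it.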

# 6  
Old 12-01-2010
Is there an encoding declaration at the top of your XML files? If so, what is it?

If no encoding declaration is present in the XML document, the assumed encoding of an XML document depends on the presence of a Byte-Order-Mark (BOM). A BOM is a Unicode special marker placed at the top of the file to indicate its encoding. A BOM is optional for UTF-8.

Code:
First bytes     Encoding assumed

EF BB BF        UTF-8
FE FF           UTF-16 (big-endian)
FF FE           UTF-16 (little-endian)
00 00 FE FF     UTF-32 (big-endian)
FF FE 00 00     UTF-32 (little-endian)
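
To see which of these a given file starts with, something along these lines should do. It is only a sketch: the function name `bom_encoding` and the labels it prints are my own, and it looks at nothing but the first four bytes.

```shell
# bom_encoding FILE   (hypothetical helper)
# Prints the encoding implied by a byte-order mark at the start of FILE,
# or "none" when the file does not start with one.
bom_encoding() {
    # First four bytes as one lowercase hex string, e.g. "efbbbf3c".
    sig=$(od -An -tx1 -N4 < "$1" | tr -d ' \n')
    case $sig in
        0000feff) echo "UTF-32BE" ;;
        fffe0000) echo "UTF-32LE" ;;   # checked before UTF-16LE, which shares FF FE
        efbbbf*)  echo "UTF-8" ;;
        feff*)    echo "UTF-16BE" ;;
        fffe*)    echo "UTF-16LE" ;;
        *)        echo "none" ;;
    esac
}

# Possible use:
#   bom_encoding "$i"
```

The UTF-32 patterns come first because the little-endian UTF-32 mark begins with the same two bytes as the little-endian UTF-16 one.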

# 7  
Old 12-01-2010
Yes: <?xml version="1.0" encoding="utf-8"?>

Am I correct in assuming that UTF-8 won't work for extended ASCII characters like Cyrillic, Chinese, etc.? Even though the XML encoding declaration says utf-8, the file still seems to contain extended ASCII characters. I converted to UTF-8 using iconv (Chubler_XL), but I still get parse errors, for example
Code:
-:2854: parser error : PCDATA invalid Char value 1
     <NODE NAME="OEToolbarPos" VALUE="^A" />

This makes xmlstarlet stop parsing the rest of the file. Is there a way to make it ignore or handle errors?
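
One observation on the errors themselves: "Char value 1" is Ctrl-A (shown as ^A) and "Char value 20" is decimal for Ctrl-T (shown as ^T). XML 1.0 forbids every character below 0x20 except tab, newline and carriage return, so these look like control-character problems rather than Cyrillic or Chinese text, which UTF-8 can represent. A possible workaround, assuming it is acceptable to drop those characters from the attribute values, is to delete them before the parser sees them. The function name `strip_xml_control_chars` is made up for illustration:

```shell
# strip_xml_control_chars   (hypothetical helper)
# Filter: copies stdin to stdout, deleting the control characters that
# XML 1.0 does not allow (0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F).
# Tab (0x09), newline (0x0A) and carriage return (0x0D) are kept.
strip_xml_control_chars() {
    # Byte-wise deletion is safe on UTF-8 input: every byte of a
    # multi-byte UTF-8 sequence is 0x80 or higher.
    LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}

# Possible use in place of "cat $i |":
#   strip_xml_control_chars < "$i" | xmlstarlet sel -t -m "//sec1/sec2/sec3/sec4/sec5" -v "@VALUE" -n > somefile
```

This changes the data (the offending characters are gone from the output), so it is only suitable when those values do not need to be preserved byte for byte.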

Last edited by unclecameron; 12-01-2010 at 01:23 PM..