I'm parsing around 600K XML files, each with roughly 1,500 lines of text. Some of the lines contain Chinese, Russian, or other non-English text. I'm doing this with a bash script built around xmlstarlet,
which works, but I get parse errors on some of the files.
I have installed all locales. Is there a way to bulk-convert all the files to UTF-8 (or some other encoding), or something I should install, or am I going about this the wrong way?
I tried xml2, but it only converts XML to a flat file format, and I don't know what else to use for XML parsing in bash. I've written a couple of basic parsers for similar tasks, but I've found their error handling is poor. I think it might work if I could get xmlstarlet to read these files as extended ASCII, but I don't know how to do that.
Running strings on the files didn't seem to help either.
---------- Post updated at 05:59 PM ---------- Previous update was at 05:47 PM ----------
It seems I can't get libxml2 to handle extended ASCII without a lot of pain. I'm wondering if there's a way to convert each file as I read it in from a list, which I do by:
I know my looping probably isn't the most elegant, but it works, apart from the encoding. Is there some command I can use to convert the text before it gets read by xmlstarlet?
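One way to convert on the fly is to pipe each file through iconv before the parser sees it. This is only a minimal sketch: it assumes the files are all in one known legacy encoding (ISO-8859-1 in the usage note is a placeholder guess, not something established in this thread), and `to_utf8`, `process_list` and `parse_one` are made-up names. `parse_one` stands in for whatever xmlstarlet command you already run, reading XML on stdin.

```shell
#!/bin/sh
# to_utf8 FROM_ENCODING FILE: write FILE to stdout re-encoded as UTF-8.
to_utf8() {
    iconv -f "$1" -t UTF-8 "$2"
}

# parse_one FILE: hypothetical placeholder for the real parser.
# It receives the converted XML on stdin; replace the body with your
# own xmlstarlet invocation. Here it just passes the text through.
parse_one() {
    cat
}

# process_list LIST FROM_ENCODING: read one filename per line from LIST,
# convert each file, and feed the result to parse_one.
process_list() {
    while IFS= read -r file; do
        to_utf8 "$2" "$file" | parse_one "$file" \
            || echo "failed: $file" >&2
    done < "$1"
}
```

It would be called as `process_list filelist.txt ISO-8859-1`. Note that if a file's XML declaration names some encoding other than UTF-8, the declaration would need to be updated as well after conversion, since the parser trusts it.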
By the way, I'm using Debian Squeeze, which ships xmlstarlet 1.0.2-1.
Is there an encoding declaration at the top of your XML files? If so, what is it?
If no encoding declaration is present in the XML document, the assumed encoding depends on the presence of a Byte-Order Mark (BOM). A BOM is a special Unicode marker placed at the very start of the file to indicate its encoding. A BOM is optional for UTF-8. If there is neither a declaration nor a BOM, an XML parser assumes UTF-8.
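To check for a BOM, you can look at the first few bytes of a file. A minimal sketch, where `detect_bom` is a made-up helper name and only the UTF-8 and UTF-16 marks are recognised:

```shell
#!/bin/sh
# detect_bom FILE: print which byte-order mark (if any) starts FILE.
# Limitation: a UTF-32LE BOM (ff fe 00 00) would be reported as UTF-16LE,
# because only the first three bytes are inspected.
detect_bom() {
    # First three bytes as lowercase hex, with spaces and newlines removed.
    sig=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    case "$sig" in
        efbbbf) echo "UTF-8" ;;
        feff*)  echo "UTF-16BE" ;;
        fffe*)  echo "UTF-16LE" ;;
        *)      echo "none" ;;
    esac
}
```

Running `detect_bom somefile.xml` together with `head -1 somefile.xml` shows both the BOM and the encoding declaration, which is usually enough to tell what the parser will assume.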
Am I correct in assuming that UTF-8 won't work for extended ASCII characters like Cyrillic, Chinese, etc.? Even though the XML encoding declaration says utf-8, the files still seem to have extended ASCII characters in them. I converted to UTF-8 using iconv (Chubler_XL), but I still get parse errors, for example
which makes xmlstarlet stop parsing the rest of the file. Is there a way to make it ignore or handle errors?
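One possible cause, offered as a guess since the failing bytes are not shown here: the files declare utf-8 but contain byte sequences that are not valid UTF-8. UTF-8 itself can represent Cyrillic and Chinese, so those characters alone should not be the problem. A minimal sketch for finding such files with iconv, and for scrubbing them at the cost of losing the bad bytes (both function names are made up for illustration):

```shell
#!/bin/sh
# is_valid_utf8 FILE: succeed only if FILE decodes cleanly as UTF-8.
# iconv exits non-zero when it meets an invalid byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# scrub_utf8 FILE: write FILE to stdout with invalid sequences dropped.
# This is lossy: -c silently discards whatever cannot be converted.
# Some iconv builds still exit non-zero with -c, so the status is ignored.
scrub_utf8() {
    iconv -c -f UTF-8 -t UTF-8 "$1" 2>/dev/null || true
}
```

For example, `is_valid_utf8 "$f" || echo "$f"` inside the file loop would list the files that need attention. Separately, libxml2's xmllint has a `--recover` option that outputs whatever is parsable; whether your xmlstarlet version offers an equivalent is worth checking in its help output.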
Last edited by unclecameron; 12-01-2010 at 01:23 PM..