iconv and xmllint


 
# 1  
Old 10-05-2007

Here is my question.

Volume of records processed: about 5M.

It is basically a very simple operation, and I already get the output I am interested in. What I am really looking for is a way to improve the performance, a more optimized way to do it.

With respect to iconv, I am checking whether each record can be converted from encoding format 'ef1' to another encoding format 'ef2'.
For this, I take one record at a time and run the 'iconv' command on it. The return value '$?' tells me whether the record could be converted to the other encoding format or not.

Similarly, with respect to xmllint, I am creating an XML file and validating it against the XSD.

As I said, these are simple operations, and I need your thoughts / input on improving the efficiency.

Is there a better way of doing these operations when the volume of records is really huge (5M)?
# 2  
Old 10-06-2007
Invoking iconv once on a single large input is far more efficient than invoking it many times on small ones. You can see that from this:

Code:
admin@np64gw:/dev/shm$ time perl -e 'while (<>) { open(ICONV, "| iconv -f big5 -t utf8 >/dev/null"); print ICONV $_; close ICONV }' <XLink.txt

real    0m4.224s
user    0m2.200s
sys     0m0.652s
admin@np64gw:/dev/shm$ time iconv -f big5 -t utf8 XLink.txt >/dev/null          
real    0m0.009s
user    0m0.008s
sys     0m0.000s

So, if you have some way to concatenate the records into a single file before passing them to iconv, it will go a lot faster. iconv reports the file position of the error, so if you keep an index that lets you accurately map a file position to a record number, that would likely work. If you are just doing validation and expect all records to pass normally, this may work for you.
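To make the position-to-record mapping concrete, here is a minimal sketch in Python (not something from my test above, just an illustration). It assumes each record is a byte string that already includes its trailing newline; the names `build_offsets` and `record_at` are made up for this example.

```python
from bisect import bisect_right


def build_offsets(records):
    """Return the starting byte offset of each record in the concatenated data."""
    offsets = []
    pos = 0
    for rec in records:
        offsets.append(pos)
        pos += len(rec)
    return offsets


def record_at(offsets, byte_pos):
    """Map a byte position in the concatenated data to a 0-based record index."""
    # bisect_right finds the first record that starts after byte_pos;
    # the record just before it is the one containing byte_pos.
    return bisect_right(offsets, byte_pos) - 1


records = [b"alpha\n", b"beta\n", b"gamma\n"]
offsets = build_offsets(records)
print(offsets)                # [0, 6, 11]
print(record_at(offsets, 7))  # byte 7 lies inside the second record, index 1
```

The offsets list is sorted by construction, so each lookup is a binary search and stays cheap even with millions of records.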

But can you rewrite that part of the script in C? I guess that with libiconv you can control the process better in case many alien bytes have sneaked in.
# 3  
Old 10-06-2007
Thanks for the reply,

that's a nice idea.

So to achieve that, I should work out a map between ranges of character positions and record numbers.

But there is a potential problem with this approach.

Say there are 'n' records.

If iconv fails at record 3 (3 < n),
then the 3rd record has to be removed from processing before continuing with the 4th record; until the 3rd record is removed, iconv will not get past the point where it failed.

So each time a record 'x' fails, it has to be removed from the 'n' records being processed.
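To spell out what I mean by "remove the failing record and continue", here is a rough sketch in Python. It is only an illustration under some assumptions: it uses Python's built-in codecs instead of the iconv command, it checks only that the input is valid in the source encoding, and it assumes a newline-separated, ASCII-compatible encoding in which the newline byte never occurs inside a multibyte character. The name `find_bad_records` is made up.

```python
def find_bad_records(data, encoding, sep=b"\n"):
    """Return the 0-based indexes of records in `data` that fail to decode."""
    bad = []
    start = 0   # byte offset where the current decoding pass begins
    index = 0   # record number of the record that starts at `start`
    while start < len(data):
        try:
            # One pass over everything that is left (the slice copies the data).
            data[start:].decode(encoding)
            break  # the rest decoded cleanly, nothing more to report
        except UnicodeDecodeError as err:
            fail = start + err.start             # absolute offset of the bad byte
            index += data.count(sep, start, fail)  # skip records that passed
            bad.append(index)
            end = data.find(sep, fail)           # end of the failing record
            start = len(data) if end == -1 else end + 1
            index += 1                           # resume at the next record
    return bad


data = b"ok\n\xff\xfe\nfine\n\xc3\n"
print(find_bad_records(data, "utf-8"))  # [1, 3]
```

Each failure costs one restart from the record after the bad one, so this stays fast only while failures are rare.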
# 4  
Old 10-06-2007
Yes, that's why this is a good fit if you are doing validation and would normally expect everything to pass.

Otherwise, if some records do have problems, this shortcut will get quite messy. That's why my other suggestion is to use libiconv: as far as I know, you can instruct it to ignore bytes that cannot be converted and carry on, without stopping the iconv process. This cannot be achieved with the iconv executable alone, because there are no "hooks" that let you do so from the command line.

Loading the character tables is a very expensive operation, so starting iconv many times is bound to be slow. If you really have that volume of records, you should invest in a C program with libiconv that works on a concatenated sequence of records. I have a good feeling that it could work, based on my earlier exploration of libiconv, although I have not built anything similar myself.
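I have not written the C version, but the same idea (one long-running process that checks every record itself, instead of one iconv process per record) can be sketched in Python. This swaps Python's built-in codecs in for libiconv, so treat it as an illustration of the structure only; `unconvertible_records` is a made-up name, and the encoding names are whatever Python's codec registry accepts, which may differ from iconv's.

```python
def unconvertible_records(records, src, dst):
    """Return 0-based indexes of records that cannot be converted from src to dst."""
    bad = []
    for i, rec in enumerate(records):
        try:
            # Decode from the source encoding, then encode to the target.
            # Both steps are strict, so any unconvertible byte or character raises.
            rec.decode(src).encode(dst)
        except UnicodeError:
            # Covers both UnicodeDecodeError and UnicodeEncodeError.
            bad.append(i)
    return bad


records = [
    b"caf\xc3\xa9",   # valid UTF-8, representable in Latin-1
    b"\xe4\xb8\xad",  # valid UTF-8, but not representable in Latin-1
    b"\xff",          # not valid UTF-8 at all
]
print(unconvertible_records(records, "utf-8", "latin-1"))  # [1, 2]
```

Because nothing is spawned per record, the cost of starting a process and loading the character tables is paid once, and a failing record does not stop the remaining ones from being checked.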