Quote: Originally Posted by Abhijit Sen
But when it is converted (using iconv), UTF-16 uses 2-byte values, so the file alignment changes. Is there any way to fix this alignment problem?
My file contains only UTF-8 characters that take 1 byte each, and each line can hold only a limited number of characters. But during conversion a few of the characters get pushed to the next line, which alters the file alignment.
In general, UNIX and its utilities are encoding-agnostic. That is: using sed (or awk, tr or similar text filters) you work on streams of bytes. In ASCII (and similar single-byte encodings) a "character" is a byte and a byte is a character. In other encodings this is not the case (in UTF-16, for example, a character takes at least 2 bytes). But these UNIX tools are not aware of this and treat each byte as if it represented a character.
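To put numbers on the difference between bytes and characters, here is a small Python sketch. It is only my illustration: the sample word is made up, and I picked big-endian UTF-16 without a byte order mark to keep the counts simple.

```python
# Illustration only: the sample word is hypothetical.
line = "na\u00efve"   # "naive" with a diaeresis on the i: 5 characters

char_count = len(line)                       # counts characters
utf8_count = len(line.encode("utf-8"))       # counts bytes in UTF-8
utf16_count = len(line.encode("utf-16-be"))  # counts bytes in UTF-16 (no BOM)

print(char_count)    # 5
print(utf8_count)    # 6  (the accented letter needs 2 bytes)
print(utf16_count)   # 10 (every character here needs 2 bytes)
```

A tool that counts bytes sees three different line lengths for the same five characters, which is the kind of shift that shows up as an alignment problem.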
Having said this: searching for a text like "abc" regardless of the encoding cannot be done with these text filters, because they will not recognize that two bytes containing (if memory serves correctly) the hex values "00:61" (the UTF-16 form of "U+0061") are the same letter "a" as a single byte with hex value "61" in ASCII.
Issuing grep "a" /some/file is basically telling grep to search for a byte containing the hex value 61, because this is how "a" is encoded in ASCII.
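The same effect can be reproduced in a few lines of Python. This is a sketch of my own, not something taken from the thread: the sample string is made up, and I assume big-endian UTF-16 without a byte order mark.

```python
# Illustration only: the sample text and names are hypothetical.
text = "abc"

ascii_bytes = text.encode("ascii")       # one byte per character
utf16_bytes = text.encode("utf-16-be")   # two bytes per character, no BOM

print(ascii_bytes.hex(":"))   # 61:62:63
print(utf16_bytes.hex(":"))   # 00:61:00:62:00:63

# A byte-oriented search for the ASCII pattern fails on the UTF-16 data,
# because a zero byte sits in front of every letter.
pattern = b"abc"
found_in_ascii = pattern in ascii_bytes
found_in_utf16 = pattern in utf16_bytes
print(found_in_ascii, found_in_utf16)   # True False
```

With little-endian UTF-16 the two bytes of each character are swapped (61:00 instead of 00:61), so a byte-wise match depends on yet another detail the tool knows nothing about.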
Somebody has mentioned that Python tools work differently and are encoding-aware. It seems you need to resort to those (or similarly encoding-aware tools) to get what you want. With UNIX tools you will not get over the aforementioned limitation, however cleverly you work around it. It will always be a less-than-genuine solution which is bound to break under some unexpected set of conditions.
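For what it is worth, this is roughly what "encoding-aware" could look like in Python. Take it as a minimal sketch under assumptions: the helper name contains_text and the sample data are invented for the example, and it only works if you already know which encoding the data is in.

```python
# Sketch only: the helper name and the sample data are hypothetical.
def contains_text(data: bytes, needle: str, encoding: str) -> bool:
    """Decode raw bytes with the given encoding, then search as text."""
    return needle in data.decode(encoding)

sample = "xx abc yy"
as_utf8 = sample.encode("utf-8")
as_utf16 = sample.encode("utf-16")   # Python prepends a byte order mark here

hit_utf8 = contains_text(as_utf8, "abc", "utf-8")
hit_utf16 = contains_text(as_utf16, "abc", "utf-16")
print(hit_utf8, hit_utf16)   # True True
```

The search runs on decoded characters instead of raw bytes, so the same needle matches in both files. Decoding with the wrong encoding would still give wrong results or raise an error.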
I hope this helps.
bakunin