extra character with iconv encoding


 
# 1  
Old 06-13-2011

hey,

I am trying to convert a sample file from a Russian encoding to an English encoding using the iconv utility.

It's almost done, but with each converted character I am getting one extra character that should not be there.

My sample Russian text is:

test.txt
Code:
А Б В Г Д Е Ж З И Й К ~

and the script I am using for the conversion is:

script
Code:
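# Start with an empty output file
: > out
# Try every target encoding name that iconv lists
for i in $(iconv -l)
do
    # Convert the CP866 input to the candidate encoding
    o=$(iconv -f cp866 -t "$i" test.txt)
    len=${#o}
    # Keep only conversions that produced more than 2 characters
    if [ "$len" -gt 2 ]
    then
        printf '%s#%s\n' "$o" "$i" >> out
    fi
done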

and sample output for a few of the almost successful conversions is:

out
Code:
ト@ トA トB トC トD トE トG トH トI トJ トK ~	CP932
ト@ トA トB トC トD トE トG トH トI トJ トK ~	CSIBM932
ト@ トA トB トC トD トE トG トH トI トJ トK ~	CSIBM943
ト@ トA トB トC トD トE トG トH トI トJ トK ~	CSSHIFTJIS
ト@ トA トB トC トD トE トG トH トI トJ トK ~	CSWINDOWS31J
ト@ トA トB トC トD トE トG トH トI トJ トK ~	IBM-932
ト@ トA トB トC トD トE トG トH トI トJ トK ~	IBM-943
ト@ トA トB トC トD トE トG トH トI トJ トK ~	IBM932
ト@ トA トB トC トD トE トG トH トI トJ トK ~	IBM943
ト@ トA トB トC トD トE トG トH トI トJ トK ~	MS932
ト@ トA トB トC トD トE トG トH トI トJ トK ~	MS_KANJI
ト@ トA トB トC トD トE トG トH トI トJ トK ~	SHIFT-JIS
ト@ トA トB トC トD トE トG トH トI トJ トK ~	SHIFT_JIS
ト@ トA トB トC トD トE トG トH トI トJ トK ~	SHIFT_JISX0213
ト@ トA トB トC トD トE トG トH トI トJ トK ~	SJIS-OPEN
ト@ トA トB トC トD トE トG トH トI トJ トK ~	SJIS-WIN
ト@ トA トB トC トD トE トG トH トI トJ トK ~	SJIS
ト@ トA トB トC トD トE トG トH トI トJ トK ~	WINDOWS-31J

Please suggest where I am going wrong in this encoding process.

Any help with that would be greatly appreciated.

---------- Post updated 06-14-11 at 08:08 AM ---------- Previous update was 06-13-11 at 09:20 PM ----------

Hey guys, can anyone help me with this?
# 2  
Old 06-14-2011
What is the output of
Code:
iconv -f CP866 -t UTF-8 test.txt
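
For anyone reproducing this, the sketch below builds a tiny CP866 sample and converts it to UTF-8, on the assumption that the terminal is set to UTF-8. The file name `sample_cp866.txt` and the variable names are made up for illustration; the bytes 0x80, 0x81 and 0x82 are the CP866 codes for А, Б and В.

```shell
# Hypothetical sample file: CP866 bytes for "А Б В" (0x80 0x81 0x82), written in octal
printf '\200 \201 \202\n' > sample_cp866.txt

# Convert to UTF-8, the encoding the terminal is assumed to use
utf8_text=$(iconv -f CP866 -t UTF-8 sample_cp866.txt)
echo "$utf8_text"

# Compare sizes: each Cyrillic letter is 1 byte in CP866 and 2 bytes in UTF-8
cp866_bytes=$(wc -c < sample_cp866.txt | tr -d ' ')
utf8_bytes=$(iconv -f CP866 -t UTF-8 sample_cp866.txt | wc -c | tr -d ' ')
echo "CP866: $cp866_bytes bytes, UTF-8: $utf8_bytes bytes"
```

If the letters show up correctly here, the input file and iconv are fine, and any stray characters seen with other target encodings come from how the terminal interprets the bytes.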

# 3  
Old 06-14-2011
It's alright. Change the encoding of your terminal or your editor to shift_jis and you will see "pure" Cyrillic letters. Sometimes you may see ツ (like so: ツА ツБ ツВ ツГ ツД ツИ ツЙ ツК ツЛ); it's the leading symbol for Cyrillic (and some other) letters.
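
A rough way to check this from the shell is to look at the raw bytes rather than at what the terminal draws. The sketch below assumes an iconv that knows the names `CP866` and `SHIFT_JIS`; the variable names are illustrative only. In Shift_JIS the letter А is the two-byte sequence 0x84 0x40, and 0x40 on its own is the ASCII character "@", which is where an "extra" character can come from when the terminal does not decode the output as Shift_JIS.

```shell
# CP866 byte 0x80 (octal 200) is the Cyrillic capital letter А
sjis_hex=$(printf '\200' | iconv -f CP866 -t SHIFT_JIS | od -An -tx1 | tr -d ' \n')
echo "Shift_JIS bytes: $sjis_hex"   # lead byte 84, then 40 (ASCII "@")

# Decoding those bytes as Shift_JIS and re-encoding as UTF-8 gives А back (d0 90)
utf8_hex=$(printf '\200' | iconv -f CP866 -t SHIFT_JIS \
    | iconv -f SHIFT_JIS -t UTF-8 | od -An -tx1 | tr -d ' \n')
echo "UTF-8 bytes after round trip: $utf8_hex"
```

So the conversion itself loses nothing; the output only looks wrong when it is viewed under a different encoding than the one it was written in.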

Last edited by yazu; 06-14-2011 at 10:23 AM..
# 4  
Old 06-15-2011
hi Yazu,

Quote:
Change coding on your terminal or in your editor to shift_jis
I am new to Unix; can you please explain in detail the steps to do that?
  • The extra characters I am getting are in the converted English file, which is in the default ASCII mode.
  • I am using PuTTY with the encoding set to UTF-8.

---------- Post updated at 04:08 PM ---------- Previous update was at 11:43 AM ----------

Hey, can anyone help me with this?
# 5  
Old 06-15-2011
Quote:
Originally Posted by yazu
It's alright. Change the encoding of your terminal or your editor to shift_jis and you will see "pure" Cyrillic letters. Sometimes you may see ツ (like so: ツА ツБ ツВ ツГ ツД ツИ ツЙ ツК ツЛ); it's the leading symbol for Cyrillic (and some other) letters.
Out of curiosity, why did you recommend changing to Shift-JIS, which is a Japanese-language encoding? CP866 does not map to Shift-JIS. In Shift-JIS, the single-byte characters from 0xA1 to 0xDF map to the half-width katakana characters in JIS X 0201.
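
As a small illustration of that point, the sketch below decodes the same single byte under both encodings. It assumes an iconv that accepts the names `SHIFT_JIS` and `CP866`; the byte 0xC4 and the variable names are chosen only for this example.

```shell
# One byte, 0xC4 (octal 304), inside the single-byte range 0xA1-0xDF
# Read as Shift_JIS it is a half-width katakana (U+FF84, UTF-8 ef be 84)
kana_hex=$(printf '\304' | iconv -f SHIFT_JIS -t UTF-8 | od -An -tx1 | tr -d ' \n')
echo "0xC4 as Shift_JIS -> UTF-8 bytes $kana_hex"

# Read as CP866 the same byte is a box-drawing line (U+2500, UTF-8 e2 94 80)
box_hex=$(printf '\304' | iconv -f CP866 -t UTF-8 | od -An -tx1 | tr -d ' \n')
echo "0xC4 as CP866     -> UTF-8 bytes $box_hex"
```

The two results differ, so bytes written in one of these encodings cannot simply be displayed under the other; a conversion step through iconv is needed in between.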