Convert UTF-8 file to ASCII/ISO8859-1 OR replace characters Post: 302980614

Forum: Shell Programming and Scripting. Post 302980614 by hemkiran.s on Tuesday 30th of August 2016 06:30:26 PM

I am trying to develop a script that works on a UTF-8 source file and does one or more of the following.
It will accept the target encoding as an argument, e.g. US-ASCII or ISO-8859-1.
1. It should replace every character outside the target character set with " " (space) or whatever character we define, one replacement per character, so that character positions do not shift in fixed-width files.
2. It should also be able to remove characters outside the target character set altogether.
3. It should be able to translate certain characters, e.g. "¥" to "JPY", "£" to "GBP", or an accented letter such as "À" to "A".

What I have tried
1. This removes the unwanted characters, but it does not work for fixed-width files because nothing is put in their place. iconv also has the //TRANSLIT option, but it gives no control over which replacement is chosen.

Code:

 iconv -f UTF-8 -t US-ASCII//IGNORE -c utf8_file.txt

2. This lists the characters above \xff:

Code:

 echo "!\"#$%&0@ABab����ほぼぽま~अ" | grep -P -o '[^\x00-\xff]'
ほ
ぼ
ぽ
ま
अ

Next I tried to replace them with a space or any other character:

Code:

 echo "!\"#$%&0@ABab����ほぼぽま~अ" | sed 's/[^\x00-\xff]/ /g'

Error: "sed: -e expression #1, char 18: Invalid collation character"

Passing the expression with the sed -e option gives the same error.
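
The error most likely comes from the UTF-8 locale: there, \xff on its own is not a valid character, so sed rejects it as a range endpoint. A possible workaround, sketched here assuming GNU sed, is to run sed in the C locale and match each UTF-8 character as one lead byte followed by its continuation bytes. The function names are made up for illustration.

```shell
# Sketch, assuming GNU sed. In the C locale sed works on bytes, so a UTF-8
# character can be matched as one lead byte plus its continuation bytes
# and replaced by a single space, which keeps column positions stable.

# Characters above U+00FF start with a lead byte of 0xC4 or higher.
blank_above_latin1() {
    LC_ALL=C sed 's/[\xc4-\xf4][\x80-\xbf]*/ /g'
}

# For a US-ASCII target, every multibyte character starts at 0xC2 or higher.
blank_non_ascii() {
    LC_ALL=C sed 's/[\xc2-\xf4][\x80-\xbf]*/ /g'
}
```

Replacing the single space with an empty replacement would delete the characters instead. This assumes the input is valid UTF-8; stray bytes in a malformed file are not handled.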

Using hexdump to get the bytes of a single character:

Code:

 echo -n अ | hexdump -ve '1/1 " %.2x"' 
Output: e0 a4 85

If I plug these byte values into sed, the replacement works:

Code:

 echo "!\"#$%&0@ABab����ほぼぽま~अ" | sed 's/\xe0\xa4\x85/" "/g'
Output: !"#$%&0@ABab����ほぼぽま~" "

So my final approach is to parse the file, use grep to find the characters outside the range of the target encoding, and write only the matching characters to a temp file together with their hexdump values.

Code:

grep -P -o '[^\x00-\x7f]' utf_sample.txt > temp_file

Then loop through the temp file and, for each character, replace or suppress it one by one.

I am stuck here:
1. Is there a way to store the hex range of characters in a variable? It would be used with the grep command to create the temp file.
The attempts below do not work for me.

Code:

hex_range=$'\x00-\xff'
echo "!\"#$%&0@ABab����ほぼぽま~अ" | grep -P -o '[^$(hex_range)]'

Code:

 echo "!\"#$%&0@ABab����ほぼぽま~अ" | grep -P -o '[^`eval $hex_range`]'
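
Three things probably get in the way in the attempts above: single quotes stop the shell from expanding anything inside the pattern, `$(hex_range)` is command substitution rather than variable expansion, and `$'\x00-\xff'` turns the escapes into raw bytes, including a NUL that a shell variable cannot hold. A minimal sketch of one way around this is to keep the escapes as plain text and use double quotes. The helper name `list_outside` is hypothetical, and GNU grep with -P in a UTF-8 locale is assumed.

```shell
# Keep the escapes as plain text; grep -P interprets \xHH itself.
hex_range='\x00-\xff'
pattern="[^${hex_range}]"     # double quotes, so the variable is expanded

# list_outside is a hypothetical helper; it assumes GNU grep with -P support
# and a UTF-8 locale, and prints each matching character on its own line.
list_outside() {
    grep -P -o "$pattern" "$@"
}
```

It could then be called as `list_outside utf_sample.txt`, with `hex_range` set to `'\x00-\x7f'` for a US-ASCII target.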

2. Is there a way to store the hex value in a variable and use sed or anything else to perform the find and replace?
Here find_char is each character read from the temp file, one at a time.
Even this does not work.

Code:

 find_char=`echo -n अ | hexdump -ve '1/1 " %.2x"' | sed 's/ /\\\x/g'`
echo "!\"#$%&0@ABab����ほぼぽま~अ" | sed 's/$find_char/ /g'
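
In the attempt above, the single quotes around the sed script keep `$find_char` from being expanded, and backslashes inside backticks are hard to get right. A hedged sketch of an alternative follows. It assumes GNU sed, uses od instead of hexdump, and the helper names `to_hex_escapes` and `replace_char` are made up for illustration.

```shell
# to_hex_escapes is a hypothetical helper: prints the bytes of $1 as \xHH escapes.
to_hex_escapes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n' | sed 's/../\\x&/g'
}

# replace_char CHAR REPLACEMENT < input   (hypothetical helper, assumes GNU sed)
replace_char() {
    local find_char
    find_char=$(to_hex_escapes "$1")
    # Double quotes let the shell expand the variables before sed sees the script.
    LC_ALL=C sed "s/${find_char}/$2/g"
}

# U+0905 written as octal bytes (e0 a4 85) to keep this example ASCII-only.
printf 'ab\340\244\205cd\n' | replace_char "$(printf '\340\244\205')" ' '
```

The replacement text is pasted into the sed script unchanged, so a replacement containing `/`, `&` or a backslash would need escaping first.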

I have tried a lot of options and searched at least 100 web pages for clues. I am open to ideas and suggestions.

I am using

Code:

$ uname -a
Linux <server_name> 2.6.32-573.26.1.el6.x86_64 #1 SMP Tue Apr 12 01:47:01 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
$ locale
LANG=C
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

Moderator's Comments:

Please use CODE tags as required by forum rules!

Last edited by RudiC; 08-31-2016 at 03:46 AM.. Reason: Changed CODE tags.
