Removal Extended ASCII using awk

01-02-2015

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

When I run the command:

Code:

printf '%s\n' 'testing_Š_testing' 'testing__testing'|tr -d '\145\147\128-\140'

I get the output:

Code:

tstinŠtstintstintstin$

(Note that the $ at the end of the output is my shell's prompt.) The arguments you are giving to tr are treated as octal values (not decimal): \145 is the character e; \147 is the character g; \128 is treated as \12 (the newline character) followed by the character 8; and the range 8-\140 in ASCII removes the characters 8 and 9, the punctuation characters :, ;, <, =, >, ?, and @, all upper-case alphabetic characters, and the [, \, ], ^, _, and ` characters.
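As a quick way to check the octal interpretation described above, here is a minimal sketch that converts decimal byte values to the octal escapes tr expects. The helper name dec_to_octal is made up for this illustration and is not part of the thread.

```shell
# Hypothetical helper (not from the thread): print a decimal byte value
# as the three-digit octal escape that tr and awk expect.
dec_to_octal() {
    printf '\\%03o\n' "$1"
}

# Show the octal form of the decimal values used in the original command.
for n in 128 140 145 147; do
    printf '%s decimal = %s octal\n' "$n" "$(dec_to_octal "$n")"
done
```

Under that conversion, the decimal byte range 128-140 would be written for tr as '\200-\214'.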

And the command:

Code:

printf '%s\n' 'testing_Š_testing' 'testing__testing'|sed -e 's/\d145//g' -e 's/\d147//g'  -e s'/\d128-\d140//g'

Produces the output:

Code:

testing_Š_testing
testing__testing
$

because, as I said before, the two-byte character Š in UTF-8 is made up of bytes with the decimal values 197 and 160 (neither of which is in your list of byte values to be deleted by the sed command). (Note also that while \dx (where x is a one-, two-, or three-digit decimal number) works on some systems, it is an extension to the standards and, on many systems, will give you a syntax error or will be treated as a literal d followed by those digits rather than as a byte value.)
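If you want to confirm those byte values yourself, something like the following should do it. This is only a sketch: byte_values is a hypothetical helper name, and the character is written with octal escapes so the result does not depend on how the script file itself is encoded.

```shell
# Hypothetical helper (not from the thread): list stdin's bytes in decimal,
# separated by single spaces.
byte_values() {
    od -A n -t u1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# \305\240 is the UTF-8 encoding of the character Š (U+0160).
printf '\305\240' | byte_values
echo
```

Neither value reported for Š falls in the decimal range 128-140, which is why a byte-range deletion limited to that range leaves the character untouched.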

Please show us the output you get when you run the commands above!

I repeat:
What OS (including version) and shell are you using?

What locale are you using when you run your script?

Last edited by Don Cragun; 01-02-2015 at 08:58 PM.. Reason: Fix typo (your s/b you).

Don Cragun

01-02-2015

Registered User

73, 0

Join Date: Aug 2007

Last Activity: 1 September 2016, 12:26 PM EDT

Posts: 73

Thanks Given: 7

Thanked 0 Times in 0 Posts

Hi Don,

I am making these changes using Korn shell (Version AJM 93t+ 2010-06) on Linux (kernel 2.6.18).

I will check and provide the results on Monday, when I have the system in front of me.

Thanks & Regards

tostay2003

01-02-2015

Registered User

11,728, 1,345

Join Date: Feb 2004

Last Activity: 8 May 2020, 9:07 AM EDT

Location: NM

Posts: 11,728

Thanks Given: 903

Thanked 1,345 Times in 1,201 Posts

The example you gave does not match what you say you want to remove. The character is made of TWO bytes, not one, and neither of them is an ASCII character. Please post the output of

Code:

locale
echo $LANG

jim mcnamara

01-05-2015

echo $LANG produces the result below:

Code:

en_US.UTF-8

Shell and OS

Code:

korn shell (Version AJM 93t+ 2010-06) on Linux OS (2.6.18)

Apologies, I am unable to copy and paste results to and from the client network.

The "�" was a character that I picked to show the kind of characters I want to remove. It's good to have learned something new: that there are multi-byte characters as well.

I just noticed that, as Don mentioned, I was not removing the characters properly even with the tr command.

PS: I am manually typing the results

Code:

printf "testing_\x80\x81\x82\x88_testing" > test.txt
cat -v test.txt | tr -d '[\d128-\d130]' | tr -d '[\d136]'

results in

Code:

testing----testing

We did not notice until now that the results were incorrect.

Can you please help me remove such UTF-8 characters using awk and tr as well?

tostay2003

01-05-2015

Registered User

15,129, 5,008

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

In your last example, check the output of cat -v first; then you know where the dashes come from. And, not all systems/commands accept the \dnnn sequences. Try

Code:

cat test.txt | tr -d '\200-\202\210'
testing__testing
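To see where the dashes came from, it may help to compare what cat -v emits with what tr does on the raw file. The sketch below recreates the test file with octal escapes (printf's \x is not portable) and adds LC_ALL=C to force byte-wise handling; that setting is my assumption and was not in the original command.

```shell
# Recreate the test data with octal escapes for the bytes 0x80 0x81 0x82 0x88.
printf 'testing_\200\201\202\210_testing\n' > test.txt

# cat -v rewrites each high byte as printable text such as M-^@,
# so a tr placed after it never sees the original bytes.
cat -v test.txt

# Feeding the file to tr directly deletes the raw bytes.
# LC_ALL=C is an added assumption to force byte-wise processing.
LC_ALL=C tr -d '\200-\202\210' < test.txt
```

Each M- sequence contributes one dash, which would account for the four dashes in the earlier output once tr had deleted the letters and carets around them.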

In UTF-8 (and the other UTF encodings), characters above the ASCII range are never stored as a single byte; they take two or more bytes. So you could
- delete ALL chars above ASCII
- explicitly list the chars to be removed
- use iconv or recode to convert to e.g. "extended ASCII" (of which several char sets exist) and then remove those unwanted chars.

Example for option 2:

Code:

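#!/bin/sh
# Usage: ./remscript FILE CHAR...
FN=$1
shift
# Join the remaining arguments with "|" to build an ERE alternation.
# Arguments containing regex metacharacters would need escaping first.
TBD=$(IFS='|'; printf '%s' "$*")
# -r is GNU sed's extended-regex flag (newer seds also accept -E).
sed -r "s/$TBD//g" "$FN"

Running this on your first test file: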

Code:

./remscript testfile Š � �
testing__testing

Last edited by RudiC; 01-05-2015 at 07:02 AM..

RudiC

01-05-2015

If you just want to get rid of non-ASCII characters (rather than a particular list of single- and/or multi-byte UTF-8 characters), the following awk and tr commands should work:

Code:

LANG=C awk '{gsub(/[\200-\377]/, "")}1' input_file > output_file   
LANG=C tr -d '\200-\377' < input_file > output_file

as evidenced by these examples:

Code:

$ printf '%s\n' 'testing_�_testing' 'testing__testing'| LANG=C awk '{gsub(/[\200-\377]/, "")}1'|od -c        
0000000    t   e   s   t   i   n   g   _   _   t   e   s   t   i   n   g
0000020   \n   t   e   s   t   i   n   g   _   _   t   e   s   t   i   n
0000040    g  \n                                                        
0000042
$ printf '%s\n' 'testing_�_testing' 'testing__testing'| LANG=C tr -d '\200-\377'|od -c        
0000000    t   e   s   t   i   n   g   _   _   t   e   s   t   i   n   g
0000020   \n   t   e   s   t   i   n   g   _   _   t   e   s   t   i   n
0000040    g  \n                                                        
0000042
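One caveat worth sketching: LC_ALL takes precedence over LANG, so if LC_ALL (or LC_CTYPE) is already set in the environment, LANG=C alone will not switch the tools to byte-wise matching. The wrappers below are a hedged variant that sets LC_ALL=C instead; the function names are hypothetical and the octal-escaped test input stands in for the sample character.

```shell
# Hypothetical wrappers (names are mine, not from the thread).
# LC_ALL overrides LANG and LC_CTYPE, so it forces the C locale
# regardless of what the caller's environment has set.
strip_non_ascii_tr() {
    LC_ALL=C tr -d '\200-\377'
}

strip_non_ascii_awk() {
    LC_ALL=C awk '{gsub(/[\200-\377]/, "")}1'
}

# \305\240 is the UTF-8 encoding of Š.
printf 'testing_\305\240_testing\n' | strip_non_ascii_tr
printf 'testing_\305\240_testing\n' | strip_non_ascii_awk
```

Both functions read standard input and write standard output, so they can be used with the same redirections as the commands above, for example strip_non_ascii_tr < input_file > output_file.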

Don Cragun

04-20-2015

Hi,

Sorry for digging up this old thread. Please let me know if I should open a new one instead.

Can you please explain how you arrived at the number 200 instead of decimal 128?

I want to remove selected characters, including multi-byte ones.

I am making these changes using Korn shell (Version AJM 93t+ 2010-06) on Linux (kernel 2.6.18).
LANG=en_US.UTF-8

tostay2003
