Removal Extended ASCII using awk


 
# 8  
Old 01-02-2015
When I run the command:
Code:
printf '%s\n' 'testing_Š_testing' 'testing__testing'|tr -d '\145\147\128-\140'

I get the output:
Code:
tstinŠtstintstintstin$

(Note that the $ at the end of the output is my shell's prompt.) The arguments you are giving to tr are treated as octal values (not decimal): \145 is the character e; \147 is the character g; \128 is treated as \12 (the newline character) followed by the character 8; and the range 8-\140 in ASCII removes the characters 8 and 9, the punctuation characters :, ;, <, =, >, ?, and @, all upper-case alphabetic characters, and the [, \, ], ^, _, and ` characters.
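
A small sketch of the octal-versus-decimal point, assuming a POSIX printf: the first command shows which characters the escapes \145 and \147 name, and the second converts the decimal values 128, 140, 145, and 147 into the octal form that tr expects.

```shell
# \145 and \147 are octal escapes; they name the letters e and g.
printf '\145\147\n'
# prints: eg

# Octal spellings of the decimal byte values 128, 140, 145 and 147.
printf '\\%o\n' 128 140 145 147
# prints:
# \200
# \214
# \221
# \223
```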

And the command:
Code:
printf '%s\n' 'testing_Š_testing' 'testing__testing'|sed -e 's/\d145//g' -e 's/\d147//g'  -e s'/\d128-\d140//g'

Produces the output:
Code:
testing_Š_testing
testing__testing
$

because, as I said before, the two-byte character Š in UTF-8 is made up of bytes with the decimal values 197 and 160 (neither of which is in your list of byte values to be deleted by the sed command). (Note also that while \dx (where x is a one-, two-, or three-digit decimal number) works on some systems, it is an extension to the standards and, on many systems, will give you a syntax error or delete the characters d, 0, 1, 4, 5, 7, and 8.)
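
One way to check those byte values yourself is to dump the character with od. In this sketch the character is written as the octal escapes \305\240 (the UTF-8 encoding of Š) so that the result does not depend on how your terminal or editor encodes the text.

```shell
# Dump the UTF-8 encoding of Š as unsigned decimal bytes.
# -An suppresses the offset column; -tu1 prints one decimal value per byte.
printf '\305\240' | od -An -tu1
# shows the two values 197 and 160 (exact spacing varies between od versions)
```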

Please show us the output you get when you run the commands above!

I repeat:
What OS (including version) and shell are you using?

What locale are you using when you run your script?

Last edited by Don Cragun; 01-02-2015 at 08:58 PM.. Reason: Fix typo (your s/b you).
# 9  
Old 01-02-2015
Hi Don,

I am making these changes using korn shell (Version AJM 93t+ 2010-06) on Linux OS (2.6.18)

I will check and provide the results on Monday, when I have the system in front of me.

Thanks & Regards
# 10  
Old 01-02-2015
The example you gave does not match what you say you want to remove. The character is made up of TWO bytes, not one. Please post the output of
Code:
locale
echo $LANG

# 11  
Old 01-05-2015
echo $LANG produces the following result:
Code:
en_US.UTF-8

Shell and OS
Code:
korn shell (Version AJM 93t+ 2010-06) on Linux OS (2.6.18)


Apologies, I am unable to copy and paste results to and from the client network.

The "Š" was a character that I picked to show the kind of character I want to remove. It was good to learn that multi-byte characters exist as well.

I just noticed that, as Don mentioned, I was not removing the characters properly even with the tr command.


PS: I am manually typing the results
Code:
printf "testing_\x80\x81\x82\x88_testing" > test.txt
cat -v test.txt | tr -d '[\d128-\d130]' | tr -d '[\d136]'

results in
Code:
testing----testing

We did not notice until now that the results were incorrect.

Can you please help me remove such UTF-8 characters using awk, and using tr as well?
# 12  
Old 01-05-2015
In your last example, check the output of cat -v first; then you will see where the dashes come from. Also, not all systems or commands accept \dnnn sequences. Try
Code:
cat test.txt | tr -d '\200-\202\210'
testing__testing
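
To see where the dashes come from, here is a sketch of what cat -v does to the four bytes in that test string (written with octal escapes, and assuming a cat that supports -v, such as GNU cat). Each byte above 127 is turned into visible "M-" notation, so the later tr commands only ever see ordinary ASCII characters, including the "-" signs.

```shell
# cat -v rewrites each high byte as M- plus a caret form of the low 7 bits.
printf 'testing_\200\201\202\210_testing\n' | cat -v
# prints with GNU cat: testing_M-^@M-^AM-^BM-^H_testing
```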

In UTF-8 (and other UTFs), characters above the ASCII range are never stored as a single byte. They are encoded as sequences of two or more bytes. So you could
- delete ALL chars above ASCII
- explicitly list the chars to be removed
- use iconv or recode to convert to e.g. "extended ASCII" (of which several char sets exist) and then remove those unwanted chars.

Example for option 2:
Code:
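FN=$1                 # first argument: the file to process
shift
TBD=$*                # remaining arguments: the characters to remove
TBD=${TBD// /\|}      # join them with | to form an ERE alternation
sed -r "s/$TBD//g" "$FN"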

Running this on your first test file:
Code:
./remscript testfile Š ü ß
testing__testing
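
Since the thread asks about awk, the same idea can be sketched with awk instead of sed. The function name remove_chars is hypothetical, and the sketch assumes that at least one character is given and that none of the arguments contain regex metacharacters or backslashes (they are used as a pattern without any escaping).

```shell
# remove_chars FILE CHAR...
# Hypothetical helper: delete every occurrence of the listed characters.
remove_chars() {
    file=$1
    shift
    # Join the remaining arguments with | to build an ERE alternation.
    pattern=$(printf '%s|' "$@")
    pattern=${pattern%|}          # drop the trailing |
    awk -v pat="$pattern" '{ gsub(pat, "") } 1' "$file"
}

# Example call, mirroring the script above:
# remove_chars testfile Š ü ß
```

Because the multi-byte characters are matched as whole alternatives rather than as a byte range, this should behave the same whether awk works on bytes or on characters.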


Last edited by RudiC; 01-05-2015 at 07:02 AM..
# 13  
Old 01-05-2015
If you just want to get rid of non-ASCII characters (rather than a particular list of single- and/or multi-byte UTF-8 characters), the following awk and tr commands should work:
Code:
LANG=C awk '{gsub(/[\200-\377]/, "")}1' input_file > output_file   
LANG=C tr -d '\200-\377' < input_file > output_file

as evidenced by these examples:
Code:
$ printf '%s\n' 'testing_Š_testing' 'testing__testing'| LANG=C awk '{gsub(/[\200-\377]/, "")}1'|od -c        
0000000    t   e   s   t   i   n   g   _   _   t   e   s   t   i   n   g
0000020   \n   t   e   s   t   i   n   g   _   _   t   e   s   t   i   n
0000040    g  \n                                                        
0000042
$ printf '%s\n' 'testing_Š_testing' 'testing__testing'| LANG=C tr -d '\200-\377'|od -c        
0000000    t   e   s   t   i   n   g   _   _   t   e   s   t   i   n   g
0000020   \n   t   e   s   t   i   n   g   _   _   t   e   s   t   i   n
0000040    g  \n                                                        
0000042
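
One caveat, offered as a hedge rather than a correction: if LC_ALL is set in your environment, it takes precedence over LANG, so LANG=C alone would not switch these commands to the C locale. A variant that sets LC_ALL instead, with Š written as the octal escapes \305\240 so the example does not depend on the terminal encoding:

```shell
# LC_ALL overrides LANG and the individual LC_* variables.
printf 'testing_\305\240_testing\n' | LC_ALL=C tr -d '\200-\377'
printf 'testing_\305\240_testing\n' | LC_ALL=C awk '{ gsub(/[\200-\377]/, "") } 1'
# each command prints: testing__testing
```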

# 14  
Old 04-20-2015
Hi,

Sorry for digging up this old thread. Please let me know if I should open a new one.

Can you please explain how you arrived at the number 200 instead of decimal 128?

I want to remove selected characters, including multi-byte characters.

I am making these changes using korn shell (Version AJM 93t+ 2010-06) on Linux OS (2.6.18)
LANG=en_US.UTF-8