Removal Extended ASCII using awk Post: 302930296

Post 302930296 by Don Cragun on Friday 2nd of January 2015 01:12:10 AM

01-02-2015

It appears that your strings are UTF-8, not extended ASCII. Furthermore, printing your strings through od shows that the byte values that you said you wanted to remove are not present in your input string or output string samples:

Code:

printf '%s' 'testing_Š_testing' | od -t cu1
printf '%s' 'testing__testing' | od -t cu1

The output shows that the unsigned decimal values of the two bytes you want to remove are 197 and 160:

Code:

0000000    t   e   s   t   i   n   g   _   Š  **   _   t   e   s   t   i
          116 101 115 116 105 110 103  95 197 160  95 116 101 115 116 105
0000020    n   g                                                        
          110 103                                                        
0000022
printf '%s' 'testing__testing' | od -t cu1
0000000    t   e   s   t   i   n   g   _   _   t   e   s   t   i   n   g
          116 101 115 116 105 110 103  95  95 116 101 115 116 105 110 103
0000020
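
To see why bytes 197 and 160 are one UTF-8 character rather than two extended ASCII characters, here is a small sketch of the decoding arithmetic using POSIX shell arithmetic. The variable names are only for illustration. A two-byte UTF-8 sequence has the bit pattern 110xxxxx 10yyyyyy, and the code point is built from the x and y bits:

```shell
# Decode the two-byte UTF-8 sequence 197 160 into its Unicode code point.
# The payload is the low 5 bits of the lead byte followed by the
# low 6 bits of the continuation byte.
lead=197    # binary 11000101
cont=160    # binary 10100000
codepoint=$(( ((lead & 31) << 6) | (cont & 63) ))
printf 'U+%04X\n' "$codepoint"    # prints U+0160
```

U+0160 is LATIN CAPITAL LETTER S WITH CARON (Š), which od shows as one character followed by `**` for its continuation byte.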

If you are working with UTF-8 input and want "extended ASCII" output (where you may be removing 1 or more bytes out of a multi-byte UTF-8 character, but might not be removing complete characters), you may end up with an unintelligible mess. If you want to remove a specific set of UTF-8 characters, that is easy to do. If you want to remove all non-(7-bit)ASCII characters, that is easy to do on some systems (depending on how well your version of awk handles locales and multi-byte characters).
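
As a rough sketch of the "specific set of UTF-8 characters" case, the following removes just the one character from your sample. The function name strip_s_caron is made up for this example, and I am assuming an awk that works on bytes when LC_ALL=C is set (an awk that always decodes UTF-8 may need the literal character in the pattern instead of the octal escapes):

```shell
# strip_s_caron is a hypothetical helper name, not part of any existing script.
# LC_ALL=C asks awk to treat the input as single bytes, so the octal
# escapes \305\240 (decimal 197 160) match the two-byte encoding of U+0160.
strip_s_caron() {
    LC_ALL=C awk '{ gsub(/\305\240/, ""); print }'
}

# Sample input built from octal escapes so it does not depend on the
# encoding of this script file.
printf 'testing_\305\240_testing\n' | strip_s_caron
```

To remove more than one character, the same pattern can be extended with alternation, one byte sequence per character.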

What OS (including version) and shell are you using?

What locale are you using when you run this script?

Is it OK to just remove all bytes from your input stream that have the high order bit set? If not, is there a specific list of UTF-8 characters you want to remove? If not, and you really want to remove individual bytes from strings containing multi-byte characters, this may be hard to do in some versions of awk.
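
If removing every byte with the high order bit set turns out to be acceptable, a minimal sketch could look like this. Both function names are invented for the example, and the awk version again assumes an awk that treats input as bytes in the C locale:

```shell
# Hypothetical helpers: delete every byte in the range 128-255 (octal 200-377).

# Byte-oriented awk version; LC_ALL=C keeps the bracket range in byte order.
strip_high_bit_awk() {
    LC_ALL=C awk '{ gsub(/[\200-\377]/, ""); print }'
}

# Equivalent using tr, which deletes the same byte range.
strip_high_bit_tr() {
    LC_ALL=C tr -d '\200-\377'
}

printf 'testing_\305\240_testing\n' | strip_high_bit_awk
printf 'testing_\305\240_testing\n' | strip_high_bit_tr
```

Note that this deletes whole multi-byte characters only because every byte of a non-ASCII UTF-8 character has the high order bit set. It would not be a safe way to remove just some of those bytes.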

You said you know how to do what you want using sed. Show us the sed substitute command that does what you want and we can show you how to easily change that into an awk sub() or gsub() function call.
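
Until we see your real sed command, here is the general shape of the conversion using a made-up substitution (underscore to hyphen) on a made-up input line:

```shell
# Hypothetical example; the pattern and input are placeholders.
line='a_b_c'

# sed 's/_/-/' replaces only the first match on each line, like awk sub().
first_sed=$(printf '%s\n' "$line" | sed 's/_/-/')
first_awk=$(printf '%s\n' "$line" | awk '{ sub(/_/, "-"); print }')

# sed 's/_/-/g' replaces every match on each line, like awk gsub().
all_sed=$(printf '%s\n' "$line" | sed 's/_/-/g')
all_awk=$(printf '%s\n' "$line" | awk '{ gsub(/_/, "-"); print }')

printf '%s %s %s %s\n' "$first_sed" "$first_awk" "$all_sed" "$all_awk"
# prints: a-b_c a-b_c a-b-c a-b-c
```

Keep in mind that sed uses basic regular expressions by default while awk uses extended regular expressions, so characters such as +, ?, |, ( and ) may need their backslashes added or removed when you move a pattern from one to the other.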

Don Cragun
