Extended ASCII Characters keep on getting reintroduced to text files


 
# 1  
Old 07-09-2016

I am working with a log file that I am trying to clean up by removing non-ASCII (extended) characters. I am using Bash via Cygwin on Windows.

Before I start I set:
Code:
export LC_ALL=C

I clean it up by removing all non-ASCII characters with the following command:
Code:
grep -v $'[^\t\r -~]' filename_01.csv > filename_02.csv

I then check whether there are any non-ASCII characters left with the following command. It returns nothing, indicating that no non-ASCII characters are left.

Code:
perl -ane '{ if(m/[[:^ascii:]]/) { print  } }' filename_02.csv

I then delete the first line with the following command:
Code:
tail -n +2 filename_02.csv > filename_03.csv

When I check filename_03.csv again, the check returns quite a few lines containing non-ASCII characters. Why is this happening, and what am I doing wrong? They somehow got reintroduced when I ran the tail command. How is this possible?

Code:
perl -ane '{ if(m/[[:^ascii:]]/) { print  } }' filename_03.csv

Here is an example of the characters that reappeared in my cleaned text file after I ran the tail command:

Code:
▒▒Gateway1,Gateway2 ▒▒4▒d▒U4▒'E"▒morel
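One way to narrow this down is to look at the raw bytes rather than at what the terminal renders. The following is only a sketch on a made-up two-line sample: the helper name `show_non_ascii_lines`, the temporary file and the 0xFF byte are assumptions for illustration, not the actual log data. It also assumes a grep that treats every byte as a valid character in the C locale, as recent GNU grep does.

```shell
#!/bin/bash
export LC_ALL=C

# Hypothetical helper: print the numbers of lines that contain any byte
# outside tab, CR and the printable ASCII range (space through tilde).
show_non_ascii_lines() {
    grep -n $'[^\t\r -~]' "$1" | cut -d: -f1
}

sample=$(mktemp)                               # stand-in for filename_03.csv
printf 'ok line\nGate\377way1\n' > "$sample"   # line 2 holds one 0xFF byte

hits=$(show_non_ascii_lines "$sample")
echo "lines with unexpected bytes: $hits"

# Dump the first offending line as hex so the exact byte values are visible.
first=$(printf '%s\n' "$hits" | head -n 1)
dump=$(sed -n "${first}p" "$sample" | od -An -tx1 | tr -d ' \n')
echo "hex bytes of line $first: $dump"
rm -f "$sample"
```

Running the same check on filename_02.csv and filename_03.csv would show whether the two files really differ in the bytes they contain.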

# 2  
Old 07-09-2016
I think you'll get a better response if you post representative samples of your log file, indicating which characters you want to remove, or a sample of the result file.
# 3  
Old 07-10-2016
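One thing I noticed is that this:
Code:
[[:^ascii:]]

is valid in Perl, which allows a POSIX class to be negated with a caret after the colon, but the more conventional spelling negates the bracket expression itself:
Code:
[^[:ascii:]]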

Also, this:
Code:
grep -v $'[^\t\r -~]' filename_01.csv > filename_02.csv

does not just remove non-ASCII characters; it discards every line that contains at least one character outside [\t\r -~].
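To illustrate the difference, here is a minimal sketch on a made-up two-line sample (the file contents and variable names are assumptions, not the poster's data). It assumes a grep that treats every byte as a valid character in the C locale, as recent GNU grep does.

```shell
#!/bin/bash
export LC_ALL=C

sample=$(mktemp)   # hypothetical sample file
# Line 1 is pure ASCII; line 2 contains a single 0xE9 byte (octal 351).
printf 'plain,line\nbad\351byte,line\n' > "$sample"

# grep -v drops every line holding a byte outside tab, CR and space..tilde.
dropped=$(grep -v $'[^\t\r -~]' "$sample")

# tr -d keeps every line and deletes only the bytes 0x80-0xFF.
stripped=$(tr -d '\200-\377' < "$sample")

printf 'grep -v result:\n%s\n' "$dropped"
printf 'tr -d result:\n%s\n' "$stripped"
rm -f "$sample"
```

With grep -v the second line disappears entirely, while tr -d keeps it and removes only the offending byte.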

Last edited by Scrutinizer; 07-10-2016 at 10:28 AM..
# 4  
Old 07-10-2016
Working from the very limited info, here it is done longhand.
Each line in the sample contains a single ' and a single ".
I am not sure whether this is just a very small part of the string, but here goes.
Code:
#!/bin/bash
# nonascii.sh
# MacBook Pro, circa August 2012, OSX 10.7.5, default bash terminal.
> /tmp/nonascii.dat
> /tmp/newascii.txt
# Create 3 lines of data as per the very limited info.
printf "%s\n" "▒▒Gateway1,Gateway2 ▒▒4▒d▒U4▒"\'"E\"▒morel" >> /tmp/nonascii.dat
printf "%s\n" "▒▒Gateway1,Gateway2 ▒▒4▒d▒U4▒"\'"E\"▒morel" >> /tmp/nonascii.dat
printf "%s\n" "▒▒Gateway1,Gateway2 ▒▒4▒d▒U4▒"\'"E\"▒morel" >> /tmp/nonascii.dat
cat /tmp/nonascii.dat
# Now remove non-ASCII characters; the DEL character (127 decimal) is also removed here.
length=$( wc -c < /tmp/nonascii.dat )
deci_string=( $( od -tu1 -An < /tmp/nonascii.dat ) )
for n in $( seq 0 1 $((length-1)) )
do
	if [ "${deci_string[$n]}" -le "126" ]
	then
		printf '\x'$( printf "%x" "${deci_string[$n]}" )
	fi
done > /tmp/newascii.txt
# Prove extended characters have gone.
cat /tmp/newascii.txt

Results:-
Code:
Last login: Sun Jul 10 18:34:52 on ttys000
AMIGA:barrywalker~> cd Desktop/Code/Shell
AMIGA:barrywalker~/Desktop/Code/Shell> ./nonascii.sh
▒▒Gateway1,Gateway2 ▒▒4▒d▒U4▒'E"▒morel
▒▒Gateway1,Gateway2 ▒▒4▒d▒U4▒'E"▒morel
▒▒Gateway1,Gateway2 ▒▒4▒d▒U4▒'E"▒morel
Gateway1,Gateway2 4dU4'E"morel
Gateway1,Gateway2 4dU4'E"morel
Gateway1,Gateway2 4dU4'E"morel
AMIGA:barrywalker~/Desktop/Code/Shell> _

# 5  
Old 07-11-2016
Shooting in the dark: How about
Code:
tr -dc '[:alnum:][:punct:][:cntrl:][:space:]' <file

or
Code:
tr -dc '[:print:][:cntrl:]' <file

EDIT:
or even
Code:
tr -dc '\000-\177' <file
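As a rough sketch of how the last variant could be applied and then verified, the script below strips the bytes and counts how many bytes in the range 0x80-0xFF remain. The helper names `strip_non_ascii` and `count_non_ascii`, the temporary files and the sample text are all assumptions for illustration.

```shell
#!/bin/bash
export LC_ALL=C

# Hypothetical helper: copy file $1 to $2, deleting bytes outside 0x00-0x7F.
strip_non_ascii() {
    tr -dc '\000-\177' < "$1" > "$2"
}

# Hypothetical helper: count the bytes in the range 0x80-0xFF in file $1.
count_non_ascii() {
    tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

in_file=$(mktemp)
out_file=$(mktemp)
# Sample line containing the three bytes E2 96 92 (the UTF-8 form of U+2592).
printf 'Gateway1,\342\226\222Gateway2\n' > "$in_file"

strip_non_ascii "$in_file" "$out_file"
before=$(count_non_ascii "$in_file")
after=$(count_non_ascii "$out_file")
cleaned=$(cat "$out_file")

echo "non-ASCII bytes before: $before, after: $after"
echo "cleaned line: $cleaned"
rm -f "$in_file" "$out_file"
```

Unlike the grep -v approach, this keeps every line and removes only the individual bytes, so the line count of the file does not change.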
