Cleaning up incorrect/unknown characters


 
# 8  
Old 07-11-2013
Quote:
Originally Posted by m69w
. . .
Is there a way to identify, and possibly replace, those characters using sed, awk or else? . . .
Replace with what?
# 9  
Old 07-12-2013
Quote:
Originally Posted by vbe
It's more a question of knowing the input format (seems UTF) and the format needed on your Unix/Linux system, then translating the character codes...
I'm using UTF-8 on the destination (Linux Mint), just like on input. I'm not sure what charset would be more appropriate to set before transferring the files. Any suggestions?

Quote:
Originally Posted by Corona688
No wildcards needed, look at how UTF8 works, simply remove any characters >=128.

Code:
tr -d '\200-\377' < inputfile > outputfile

Doesn't work, but thanks for the idea
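
A possible reason this does not help here: the range '\200-\377' only covers bytes with the high bit set, which in UTF-8 means every byte of every non-ASCII character. Control bytes below 128, such as the octal 031 byte identified later in this thread, pass through untouched, while legitimate accented letters are removed. A minimal sketch on a made-up sample (the string and the file names tr_sample.txt and tr_stripped.txt are hypothetical, not from the original poster's data):

```shell
# Hypothetical sample: "café" in UTF-8 (é = octal bytes 303 251),
# followed by a stray control byte (octal 031).
printf 'caf\303\251 \031x\n' > tr_sample.txt

# Delete every byte >= 128, as suggested above.
# LC_ALL=C asks tr to work on raw bytes rather than locale characters.
LC_ALL=C tr -d '\200-\377' < tr_sample.txt > tr_stripped.txt

# Dump the result as hex: the two bytes of é are gone,
# but 19 (hex for octal 031) is still there.
od -An -tx1 tr_stripped.txt
```

So the command is a blunt way to force a file down to 7-bit ASCII, not a way to target one specific bad byte.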

Quote:
Originally Posted by RudiC
Replace with what?
Ideally with the correct corresponding character, but a space or some other default character would also be OK.
# 10  
Old 07-12-2013
Did you consider the recode and/or iconv tools?
# 11  
Old 07-12-2013
I didn't know about these until you mentioned them, so I gave recode a try with every available charset, like this, but none seems to work: each one either returns the original character or an error.

Code:
recode --list|egrep -v '^\/|^:'|while read s; do recode $s < unknown_character.txt 2>/dev/null ; done|egrep ^Unknown

# 12  
Old 07-12-2013
If I remember correctly, you need to specify a source charset AND a target charset to recode. There should be readymade source-target-pairs, though...
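
To illustrate the source-and-target point, here is a minimal sketch using iconv, which takes the two charsets as separate options. It assumes the input really is ISO-8859-1, which is only a guess for illustration; the sample text and file names are hypothetical. With recode, the same pair would be written in its `BEFORE..AFTER` request form.

```shell
# Hypothetical sample: "café" encoded in ISO-8859-1, where é is the single byte 351 (octal).
printf 'caf\351\n' > latin1_sample.txt

# Name BOTH charsets: -f is the source encoding, -t is the target encoding.
iconv -f ISO-8859-1 -t UTF-8 latin1_sample.txt > utf8_sample.txt

# Dump the converted bytes as hex: é is now the two-byte UTF-8 sequence c3 a9.
od -An -tx1 utf8_sample.txt
```

If the source charset is guessed wrong, iconv either reports an illegal input sequence or silently produces the wrong characters, so it is worth checking a known non-ASCII word in the output.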
# 13  
Old 07-15-2013
Thanks for pointing this out. I've tried different combinations but didn't have much luck. It looks like the original charset wasn't UTF-8 after all. Perhaps it is set internally by the app generating the XML files.

Following Corona688's suggestion, I think I'm just going to delete the unwanted character using
Code:
tr -d '\031' < inputfile > outputfile

It seems to be the only one that gives trouble (for now) so removing it this way is the easiest solution.
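
For the "identify, and possibly replace" part of the original question, a minimal sketch along the same lines: grep can list the lines that contain control characters, and tr without -d can swap the byte for a space instead of deleting it. The sample content and the file names sample.xml and cleaned.xml are hypothetical. Note that a tab is also a control character, so lines containing tabs will match as well.

```shell
# Hypothetical sample: one clean line and one line containing the byte 031 (octal).
printf 'good line\nbad\031line\n' > sample.xml

# Identify: print the numbered lines that contain any control character.
# LC_ALL=C makes grep classify raw bytes rather than locale characters.
LC_ALL=C grep -n '[[:cntrl:]]' sample.xml

# Replace: map the offending byte to a space rather than removing it.
tr '\031' ' ' < sample.xml > cleaned.xml
```

Replacing with a space keeps word boundaries intact, which may matter if the bad byte stood in for a separator or a punctuation mark in the original data.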

Thanks all for your help