    #1  
Old 06-18-2012
Registered User
 
Join Date: May 2012
Posts: 58
Thanks: 5
Thanked 9 Times in 9 Posts
Issue with UTF-8 BOM character in text file

Sometimes we receive Excel files containing French/Japanese characters by email, and these files are manually transferred to the server using SFTP (security is not a huge concern here). Before the transfer, the data is converted to text format using Notepad.

Problem is: when saving the files on our Windows machine in UTF-8 format, Notepad inserts a BOM (byte order mark, the bytes EF BB BF) at the beginning of the text. ETL tools such as Informatica have no problem reading the files with this character, but unfortunately we validate the data before loading it, and this validation is performed by a shell script. Since the first field of the first line is no longer a valid field, the shell script fails.

One solution we tried was removing the BOM characters from the text file in Unix before processing it. This worked fine as far as the shell script was concerned, but then the ETL tool failed to read the UTF characters in the file.

My questions:
1. Is there a way to fix this issue at the root, i.e. can I stop Notepad from writing the BOM when saving in UTF-8 format?
2. Can some other tool make this change instead of Notepad... DOS maybe?
3. What are my options in Unix? Is there a way to remove the BOM characters without "breaking" the file in Unix? There must be, because I have seen a lot of UTF files without BOM being processed just fine earlier. I just don't know how to do it.
    #2  
Old 06-18-2012
Mead Rotor
 
Join Date: Aug 2005
Location: Saskatchewan
Posts: 16,384
Thanks: 491
Thanked 2,535 Times in 2,418 Posts
Make a copy of the file with the BOM character removed. Use that to validate the file in UNIX. You don't have to actually save it permanently.
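A minimal sketch of that idea, assuming Python is available on the server (the thread does not say so). The function name `copy_without_bom` and the `.nobom` suffix are made up for illustration. It writes a temporary copy with a leading UTF-8 BOM removed and leaves the original file untouched, so the ETL tool can still read the original.

```python
import codecs
import shutil
import tempfile


def copy_without_bom(src_path):
    """Return the path of a temporary copy of src_path without a leading UTF-8 BOM.

    The original file is not modified. The caller is responsible for
    deleting the temporary copy once validation is done.
    """
    with open(src_path, "rb") as src:
        # codecs.BOM_UTF8 is the three bytes EF BB BF.
        head = src.read(len(codecs.BOM_UTF8))
        tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".nobom")
        with tmp:
            if head != codecs.BOM_UTF8:
                # No BOM present: keep the bytes we already consumed.
                tmp.write(head)
            # Copy the rest of the file unchanged.
            shutil.copyfileobj(src, tmp)
    return tmp.name
```

The validation script would then be pointed at the returned path instead of the uploaded file.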
    #3  
Old 06-18-2012
Registered User
 
Join Date: May 2012
Posts: 58
Thanks: 5
Thanked 9 Times in 9 Posts
Apologies for not mentioning this....

Validation is not the only thing the script does. It also shuffles the columns around to an order which the ETL tool will understand.
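One possible way to handle both requirements, sketched under assumptions: the BOM is ignored while the columns are reordered, and written back on output so the ETL tool sees the same kind of file it read successfully before. That the ETL tool depends on the BOM is only inferred from the failure described in the first post. The function name `reorder_columns`, the comma delimiter, and the `\n` line terminator are hypothetical choices, not details from the thread.

```python
import csv


def reorder_columns(src_path, dst_path, order, delimiter=","):
    """Write the columns of src_path to dst_path in the given index order.

    The "utf-8-sig" codec skips a leading BOM when reading (if there is one)
    and writes a BOM at the start of the output file.
    """
    with open(src_path, newline="", encoding="utf-8-sig") as src, \
         open(dst_path, "w", newline="", encoding="utf-8-sig") as dst:
        # lineterminator="\n" is an assumption about what the ETL tool expects.
        writer = csv.writer(dst, delimiter=delimiter, lineterminator="\n")
        for row in csv.reader(src, delimiter=delimiter):
            writer.writerow([row[i] for i in order])
```

For example, `reorder_columns("in.txt", "out.txt", [2, 0, 1])` would move the third column to the front. Because the BOM is stripped before parsing, the first field of the first line is a normal field again during processing.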
    #4  
Old 06-18-2012
...@...
 
Join Date: Feb 2004
Location: NM
Posts: 9,657
Thanks: 164
Thanked 645 Times in 622 Posts
Can you not transfer it using FTP in ASCII mode instead of using a Windows app?

Unix tools for Windows do exist, and they are free. You can install Cygwin on your PC or simply download Unix tools for Windows:

UnxUtils | Free software downloads at SourceForge.net

Cygwin
    #5  
Old 06-18-2012
Registered User
 
Join Date: May 2012
Posts: 58
Thanks: 5
Thanked 9 Times in 9 Posts
Unfortunately, downloading and installing tools is not an option (a controlled environment at work means I would have to go through at least half a dozen people to get something as basic as PuTTY installed on my system).

Question regarding your first point: wouldn't transferring the file in ASCII mode incorrectly transmit the UTF-8 (Japanese/Spanish) characters? Also, are you suggesting skipping the "copy data from Excel - paste to Notepad - save in UTF-8 format" step? That might again not be possible in my current situation, unless I find a way to convert the data to a proper UTF-8 text file without BOM characters using a pre-installed application.
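A small illustration of what is at stake in that question, not a statement about how any particular FTP client behaves. In UTF-8, every byte of a non-ASCII character has its high bit set, so plain line-ending translation leaves such characters intact, while a path that keeps only 7 bits per byte would damage them. The helper names `crlf_to_lf` and `strip_high_bit` are made up for this sketch.

```python
def crlf_to_lf(data: bytes) -> bytes:
    """Simulate line-ending translation only (Windows CRLF to Unix LF)."""
    return data.replace(b"\r\n", b"\n")


def strip_high_bit(data: bytes) -> bytes:
    """Simulate a hypothetical transfer path that keeps only 7 bits per byte."""
    return bytes(b & 0x7F for b in data)


sample = "日本語,français\r\n".encode("utf-8")

# Line-ending translation alone does not touch the multibyte characters.
print(crlf_to_lf(sample).decode("utf-8") == "日本語,français\n")  # True

# Dropping the high bit changes the bytes of every non-ASCII character.
print(strip_high_bit(sample) == sample)  # False
```

Whether a given ASCII-mode transfer does only the first kind of change depends on the client and server in use, so a test transfer of a sample file would be the safest check.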