Determining the encoding of a file | Unix Linux Forums | UNIX for Dummies Questions & Answers

    #1  
01-04-2013
MIA651

Hi, I am trying to determine the encoding of a file, because to convert it to UTF-8 it seems I have to know the encoding of the source.

I tried this:

Code:
 file <filename>

which gives me this:
<filename>:data or International Language text

I also checked the locale, and this is the output:
LANG=C
LC_COLLATE="C"
LC_CTYPE="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_MESSAGES="C"
LC_ALL=

Not really much help there either. Any help will be appreciated!
    #2  
01-04-2013
DGPickett
Try using 'od' on it to see if there is a pattern you can recognize. Is it Unicode, EUC, JIS, EBCDIC, BCDIC, or just an odd code page? Hard to say! I use 'od -bc' because I was octal-raised, but there are options for hex and decimal offsets. But yes, really, you should know!

Often, 'C' is linked to ISO-8859-1 (Latin-1), but your file is not that.
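
For what it's worth, here is a minimal sketch of the od approach, assuming an od that accepts the POSIX -A, -N and -t options (check your man page, since platforms differ). The file name sample.txt and the helper names first_bytes_hex and guess_bom are made up for illustration. It only looks for a byte-order mark, so "no BOM" does not tell you what the encoding actually is.

```shell
# Hypothetical demo file: "café" encoded as UTF-8, preceded by a UTF-8 BOM.
printf '\357\273\277caf\303\251\n' > sample.txt

# Hex bytes with a character row underneath; -A x prints hex offsets.
od -A x -t x1 -t c sample.txt

# Print the first four bytes of a file as one hex string.
first_bytes_hex() {
    od -A n -N 4 -t x1 "$1" | tr -d ' \n'
}

# Map a leading byte-order mark, if any, to the encoding it usually signals.
# The UTF-32 patterns are tested first because they share a prefix with UTF-16.
guess_bom() {
    case "$(first_bytes_hex "$1")" in
        efbbbf*)  echo "UTF-8 (BOM)" ;;
        0000feff) echo "UTF-32BE" ;;
        fffe0000) echo "UTF-32LE" ;;
        feff*)    echo "UTF-16BE" ;;
        fffe*)    echo "UTF-16LE" ;;
        *)        echo "no BOM" ;;
    esac
}

guess_bom sample.txt
```

Many files carry no BOM at all, so treat this as a first look and not as a detector.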
    #3  
01-04-2013
MIA651
Sorry DGPickett, I tried that and it looked all Greek to me (not in a literal sense, lol).
    #4  
01-04-2013
DGPickett
Well, UTF-8 and Unicode have a pattern in their encoding. The dd command has an EBCDIC decoder I have used. Might it be from big blue land?

Googling around the subject, one source suggests file -i, another mentions enca (http://linux.die.net/man/1/enca), and for Solaris, auto_ef. There is also 'chardet', a Python-based tool.
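
A rough sketch of both ideas, assuming your iconv knows the encoding name UTF-8 and your dd supports conv=ascii (both are common, but names and tables vary by platform, AIX included). The file names and the helper is_valid_utf8 are hypothetical. The first part leans on the fact that UTF-8 has a strict byte pattern, so iconv fails on input that breaks it. The second part runs a few EBCDIC bytes through dd's decoder.

```shell
# Succeeds only if the whole file decodes as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

printf 'caf\303\251\n' > utf8.txt      # "café" as valid UTF-8
printf 'caf\351\n'     > latin1.txt    # a lone 0xE9 byte is not valid UTF-8

is_valid_utf8 utf8.txt   && echo "utf8.txt: valid UTF-8"
is_valid_utf8 latin1.txt || echo "latin1.txt: not UTF-8"

# The octal bytes below are EBCDIC for the letters H E L L O.
printf '\310\305\323\323\326' > ebcdic.bin

# Decode EBCDIC to ASCII with dd's built-in table; dd's statistics go to stderr.
dd if=ebcdic.bin conv=ascii 2>/dev/null; echo
```

If the dd output turns into readable text, that is a strong hint the file is EBCDIC, although the exact code page may still matter for punctuation and accented characters.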

Last edited by DGPickett; 01-04-2013 at 03:35 PM..
    #5  
01-04-2013
MIA651
Quote:
Originally posted by DGPickett
Well, UTF-8 and Unicode have a pattern in their encoding. The dd command has an EBCDIC decoder I have used. Might it be from big blue land?

Googling around the subject, one source suggests file -i, another mentions enca (http://linux.die.net/man/1/enca), and for Solaris, auto_ef. There is also 'chardet', a Python-based tool.

Yes, I tried file -i and it tells me it is a regular file. By big blue land, I assume you mean IBM? If so, yes: I am using an AIX machine, so auto_ef and enca are unrecognized commands. I have yet to try chardet; I'll have to dig deeper. Thanks though!
    #6  
01-04-2013
DGPickett
Yes, IBM is a world unto itself, and EBCDIC is the dominant charset there; even then, to print correctly you may need to know the code page. BCDIC was the 6-bit code, Binary Coded Decimal Interchange Code, so called because it was closely related to card codes with a decimal basis: A is 21 in base 8, B is 22, I is 31 (20+9), then J is 41 through R at 51, then / is 61, and S is 62 through Z at 71. The r-x-0 rows of the card became the upper bits, and 1-9 were binary coded. EBCDIC is BCDIC extended to 8 bits.

You can probably get the enca binary or source, and Python and chardet, for free, and install them: http://www.perzl.org/aix/index.php?n=Main.Enca http://www.python.org/getit/other/ http://pypi.python.org/pypi/chardet

Last edited by DGPickett; 01-04-2013 at 04:20 PM..
    #7  
01-05-2013
RudiC
Did you consider using iconv or recode? It may come down to trial and error, but I think they complain if an unsuitable from-charset is given.
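
The trial-and-error idea might look something like the sketch below. The function name try_encodings and the file name are invented for illustration, and the encoding names are the ones glibc iconv uses; AIX spells some of them differently, so run iconv -l there to see what is available. Keep in mind that single-byte charsets such as ISO-8859-1 accept any byte sequence, so "iconv did not complain" only narrows the candidates; you still have to eyeball the converted text.

```shell
# Usage: try_encodings FILE ENC1 [ENC2 ...]
# Prints every candidate source encoding that iconv converts without error.
try_encodings() {
    f=$1
    shift
    for enc in "$@"; do
        if iconv -f "$enc" -t UTF-8 "$f" > /dev/null 2>&1; then
            echo "$enc"
        fi
    done
}

# Hypothetical test file: "café" with é stored as the single Latin-1 byte 0xE9.
printf 'caf\351\n' > mystery.txt

try_encodings mystery.txt UTF-8 ISO-8859-1
```

Once one candidate looks right, something like iconv -f ISO-8859-1 -t UTF-8 mystery.txt > mystery.utf8.txt does the actual conversion.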