UNIX for Dummies Questions & Answers If you're not sure where to post a UNIX or Linux question, post it here. All UNIX and Linux newbies welcome !!

Determining the encoding of a file

    #1  
01-04-2013
MIA651

Hi, I am trying to determine the encoding of a file. To convert it to UTF-8, it seems I have to know the source encoding first.

I tried this:

Code:
 file <filename>

which gives me this:
<filename>: data or International Language text

I also checked the locale, and this is the output:
LANG=C
LC_COLLATE="C"
LC_CTYPE="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_MESSAGES="C"
LC_ALL=

Not really much help there either. Any help will be appreciated!
    #2  
01-04-2013
DGPickett
Try running 'od' on it to see if there is a pattern you can recognize. Is it Unicode, EUC, JIS, EBCDIC, BCDIC, or just an odd code page? Hard to say! I use 'od -bc' because I was octal-raised, but there are options for hex output and for hex or decimal offsets. But yes, really, you should know!

Often, the 'C' locale is tied to ISO-8859-1 (Latin-1), but your file does not seem to be that. In any case, the locale only describes your session, not the contents of the file.
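
For readers who find raw 'od' output hard to read, here is a minimal Python sketch of the same kind of inspection. It assumes Python 3 is available, which is not a given on a stock AIX box, and the names hexdump and high_bit_ratio are made up for illustration. It lays out offsets, hex bytes, and printable ASCII side by side, and reports what share of the bytes fall outside 7-bit ASCII.

```python
def hexdump(data: bytes, width: int = 16) -> list:
    """Return 'offset  hex bytes  |printable|' lines, in the spirit of od."""
    pad = width * 3 - 1                      # width of a full hex column
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Printable ASCII is shown as-is; every other byte becomes '.'
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  |{text_part}|")
    return lines


def high_bit_ratio(data: bytes) -> float:
    """Fraction of bytes >= 0x80; 0.0 means the sample is pure 7-bit ASCII."""
    return sum(b >= 0x80 for b in data) / len(data) if data else 0.0


# Demo on an in-memory sample: 'caf', one Latin-1 byte (0xe9), a newline.
sample = b"caf\xe9\n"
print("\n".join(hexdump(sample)))
print(f"bytes >= 0x80: {high_bit_ratio(sample):.0%}")
```

To look at a real file, read the first few hundred bytes in binary mode, for example open(path, "rb").read(256), and pass them in. A ratio of zero means plain ASCII. Spaces showing up as 0x40 and letters in the 0x81-0xE9 range hint at EBCDIC.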
    #3  
01-04-2013
MIA651
Sorry DGPickett, I tried that and it all looked Greek to me (not in the literal sense, lol).
    #4  
01-04-2013
DGPickett
Well, UTF-8 and Unicode have a pattern in their encoding. The dd command has an EBCDIC decoder I have used. Might it be from Big Blue land?

Googling around the subject, one result suggests 'file -i', another mentions enca (http://linux.die.net/man/1/enca) and, for Solaris, auto_ef. There is also 'chardet', a Python-based tool.
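
The "pattern" in UTF-8 and the other Unicode encodings can be checked mechanically. The sketch below, assuming Python 3 and using a made-up function name sniff_unicode, looks for a byte-order mark first and then tests whether the bytes form strictly valid ASCII or UTF-8. It is a rough filter, not a detector: a result of None only says "none of the above".

```python
import codecs

# Byte-order marks, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_unicode(data: bytes):
    """Return a BOM-based name, 'ascii', 'utf-8', or None if nothing fits."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")      # strict: raises on any malformed sequence
        return "utf-8"
    except UnicodeDecodeError:
        return None


for sample in (b"hello", b"caf\xc3\xa9", b"caf\xe9", b"\xff\xfeh\x00i\x00"):
    print(sample, "->", sniff_unicode(sample))
```

Pass in the whole file rather than a fixed-size slice where possible, since cutting a sample in the middle of a multi-byte character makes valid UTF-8 look broken.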

Last edited by DGPickett; 01-04-2013 at 03:35 PM..
    #5  
01-04-2013
MIA651
Quote:
Originally Posted by DGPickett
Well, UTF-8 and Unicode have a pattern in their encoding. The dd command has an EBCDIC decoder I have used. Might it be from Big Blue land?

Googling around the subject, one result suggests 'file -i', another mentions enca (http://linux.die.net/man/1/enca) and, for Solaris, auto_ef. There is also 'chardet', a Python-based tool.
Yes, I tried 'file -i' and it just tells me it is a regular file. By Big Blue land, I assume you mean IBM? If so, yes, I am on an AIX machine, so auto_ef and enca are unrecognized commands. I have yet to try chardet... I'll have to dig deeper. Thanks though!
    #6  
01-04-2013
DGPickett
Yes, IBM is a world unto itself. EBCDIC is the dominant charset on its mainframes (AIX itself is ASCII-based), and even then you may need the right code page to print correctly. BCDIC was the 6-bit code, Binary Coded Decimal Interchange Code, so called because it was closely related to card codes with a decimal basis, where A is 21 base 8, B is 22, I is 31 (20+9), then J is 41 through R at 51, then / is 61, and S is 62 through Z at 71. The r-x-0 rows of the card became the upper bits, and 1-9 were binary coded. EBCDIC is BCDIC extended to 8 bits.
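
The same zone-plus-digit layout is still visible in EBCDIC, and Python's built-in codecs make it easy to see. This is only a sketch: cp037 is one EBCDIC code page among several (cp500 and cp1140 are others Python ships), picked here as an assumption, and the helper names zone_digit and ebcdic_to_text are made up for illustration.

```python
EBCDIC = "cp037"   # assumption: US/Canada EBCDIC; the real file may use another page


def zone_digit(ch: str, codepage: str = EBCDIC):
    """Split a character's EBCDIC byte into (upper nibble, lower nibble)."""
    byte = ch.encode(codepage)[0]
    return divmod(byte, 16)


def ebcdic_to_text(data: bytes, codepage: str = EBCDIC) -> str:
    """Decode EBCDIC bytes to a Python string using the given code page."""
    return data.decode(codepage)


# The three letter groups sit in three zones, with the digit counting up inside.
for ch in "AIJRSZ":
    zone, digit = zone_digit(ch)
    print(f"{ch}: zone {zone:X}, digit {digit}")

print(ebcdic_to_text(b"\xc8\x85\x93\x93\x96"))
```

If a file decodes to readable text with one of these code pages and to garbage with everything else, that is a strong hint it came from the EBCDIC world, though telling the individual code pages apart usually needs a look at the punctuation and accented characters.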

You can probably get an enca binary or source, as well as Python and chardet, for free and install them: http://www.perzl.org/aix/index.php?n=Main.Enca http://www.python.org/getit/other/ http://pypi.python.org/pypi/chardet

Last edited by DGPickett; 01-04-2013 at 04:20 PM..
    #7  
01-05-2013
RudiC
Did you consider using iconv or recode? You may have to proceed by trial and error, but I think they complain if an unsuitable from-charset is given.
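
The trial-and-error idea can be sketched in a few lines of Python 3, using strict decoding as a stand-in for iconv. This is an illustration under assumptions, not what iconv or recode do internally, and the function name encodings_that_decode is made up. One caveat: single-byte encodings such as latin-1 and cp037 assign a character to every byte value, so they never "complain". Surviving the test means "possible", and the decoded text still has to be read by a human.

```python
def encodings_that_decode(data: bytes, candidates):
    """Return the candidate encodings that decode data without an error."""
    survivors = []
    for name in candidates:
        try:
            data.decode(name)          # strict mode: raise on undecodable bytes
            survivors.append(name)
        except (UnicodeDecodeError, LookupError):
            # LookupError means this Python build does not know the codec name.
            continue
    return survivors


sample = b"caf\xc3\xa9"                # 'café' encoded as UTF-8
for name in encodings_that_decode(sample, ["ascii", "utf-8", "latin-1", "cp037"]):
    # Print what each surviving candidate would turn the bytes into.
    print(name, "->", repr(sample.decode(name)))
```

Ordering the candidates from strictest (ascii, utf-8) to most permissive (single-byte code pages) makes the first survivor the most informative one.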