Determining the encoding of a file


 
# 1  
Old 01-04-2013

Hi, I am trying to determine the encoding of a file, because to convert it to UTF-8 it seems I have to know the encoding of the source.

I tried this:
Code:
 file <filename>

which gives me this:
<filename>: data or International Language text

I also checked the locale, and this is the output:
LANG=C
LC_COLLATE="C"
LC_CTYPE="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_MESSAGES="C"
LC_ALL=

Not really much help there either. Any help will be appreciated!
# 2  
Old 01-04-2013
Try using 'od' on it to see if there is a pattern you can recognize. Is it Unicode, EUC, JIS, EBCDIC, BCDIC, or just an odd code page? Hard to say! I use 'od -bc' because I was octal-raised, but there are options for hex and decimal offsets. But yes, really, you should know!

Often, 'C' is linked to ISO-8859-1 (Latin-1), but your file is not that.
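
One pattern that is easy to recognize in an 'od' dump is a byte-order mark (BOM) at the very start of the file. The following is only a sketch of that idea, not something posted in this thread: bom_of is a hypothetical helper name, and it assumes an od that supports the POSIX options -A, -t and -N. Many files carry no BOM at all, so a "no BOM" answer says nothing about the encoding.

```shell
# Hypothetical helper (not from the thread): report a byte-order mark, if any.
bom_of() {
    # Read at most 4 bytes and print them as hex, without offsets or whitespace.
    sig=$(od -An -t x1 -N 4 "$1" | tr -d ' \t\n')
    case "$sig" in
        efbbbf*)  echo "UTF-8 BOM" ;;
        fffe0000) echo "UTF-32LE BOM" ;;   # checked before UTF-16LE, which also starts ff fe
        0000feff) echo "UTF-32BE BOM" ;;
        fffe*)    echo "UTF-16LE BOM" ;;
        feff*)    echo "UTF-16BE BOM" ;;
        *)        echo "no BOM" ;;
    esac
}

# Usage: bom_of myfile
# To eyeball the first 64 bytes in hex instead: od -A x -t x1 -N 64 myfile
```

If no BOM turns up, the raw dump is still worth a look: text in which every other byte is zero suggests UTF-16, for example.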
# 3  
Old 01-04-2013
Sorry DGPickett, I tried that and it all looked Greek to me (not in a literal sense, lol).
# 4  
Old 01-04-2013
Well, UTF-8 and the other Unicode encodings have a recognizable pattern in their bytes. The dd command has an EBCDIC decoder that I have used. Might the file be from big blue land?

Googling around the subject, one result suggests file -i, another mentions enca (http://linux.die.net/man/1/enca) and, for Solaris, auto_ef. There is also 'chardet', a Python-based tool.
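
To test the EBCDIC guess, dd can run the file through its built-in decoder. This is a minimal sketch under assumptions: ebcdic_to_ascii is a hypothetical wrapper name, and conv=ascii applies one fixed EBCDIC-to-ASCII table, so characters that differ between EBCDIC code pages may still come out wrong.

```shell
# Hypothetical helper (not from the thread): decode a file as EBCDIC.
ebcdic_to_ascii() {
    # conv=ascii applies dd's built-in EBCDIC-to-ASCII table.
    # dd writes its transfer statistics to stderr, so they are discarded here.
    dd if="$1" conv=ascii 2>/dev/null
}

# Usage: ebcdic_to_ascii myfile | od -c | head
```

If the decoded output reads as ordinary text, EBCDIC is a good guess; if it is still gibberish, the file is probably something else.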

# 5  
Old 01-04-2013
Yes, I tried file -i and it tells me it is a regular file. By big blue land, I assume you mean IBM? If so, yes, I am using an AIX machine, so auto_ef and enca are unrecognized commands. I have yet to try chardet... I'll have to dig deeper. Thanks though!
# 6  
Old 01-04-2013
Yes, IBM is a world unto itself, and EBCDIC is the dominant charset there; even then, to print correctly you may need to know the code page. BCDIC was the 6-bit code, Binary Coded Decimal Interchange Code, so called because it was closely related to card codes with a decimal basis: in octal, A is 21, B is 22, and I is 31 (20 + 9); J is 41 through R at 51; / is 61; and S is 62 through Z at 71. The R-X-0 rows of the card became the upper bits, and rows 1-9 were binary coded. EBCDIC is BCDIC extended to 8 bits.

You can probably get an enca binary or source, plus Python and chardet, for free and install them:
http://www.perzl.org/aix/index.php?n=Main.Enca
http://www.python.org/getit/other/
http://pypi.python.org/pypi/chardet

# 7  
Old 01-05-2013
Have you considered using iconv or recode? It may come down to trial and error, but I think they complain if an unsuitable from-charset is given.
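
The trial-and-error idea could be sketched roughly as follows. The helper name try_encodings and the encoding names in the usage line are assumptions; the names iconv accepts differ between systems, and on many of them iconv -l lists the supported ones.

```shell
# Hypothetical helper (not from the thread): print each candidate encoding
# that iconv can convert the whole file from without reporting an error.
try_encodings() {
    f=$1
    shift
    for enc in "$@"; do
        # The converted text is thrown away; only iconv's exit status matters.
        if iconv -f "$enc" -t UTF-8 "$f" >/dev/null 2>&1; then
            echo "$enc"
        fi
    done
}

# Usage: try_encodings myfile UTF-8 ISO-8859-1
```

A name in the output only means the conversion did not fail. Strict multi-byte encodings such as UTF-8 reject invalid input, but most single-byte code pages accept nearly any byte sequence, so the converted text still has to be checked by eye.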
 