

Problem identifying charset of a file


 
# 1  
Old 03-08-2009

Hi all,

My objective is to find out which charset a file is encoded in. (The OS is SunOS.)
I set NLS_LANG to AR8MSWIN1256 and spooled the file.

When I viewed the file using vi, I saw the following:
\307\341\321\355\307\326

I then inserted the line containing these codes into a table with NLS_LANG set to AL32UTF8 and saw the Arabic text:
الرياض

Now, what are these numbers (307, 341, ...)? Are they the code points? If so, they should be Windows-1256 code points, since I set NLS_LANG to AR8MSWIN1256. Also, are they in decimal, hex, or octal?

Can anyone tell me how I can arrive at the Arabic text using those numbers?
I tried something like this in an HTML page, without any luck:
& #307;& #341;& #321;& #355;& #307;& #326;
& #775;& #833;& #801;& #853;& #775;& #806; (I have kept a space between & and # to avoid the browser rendering them as symbols/characters)

Thanks,
Sridhar

Last edited by sridhar_423; 03-08-2009 at 05:23 PM..
# 2  
Old 03-09-2009
Quote:
Originally Posted by sridhar_423
Hi all,

My objective is to find out which charset a file is encoded in. (The OS is SunOS.)
try:
Code:
$ file filename.txt

for example:
Code:
yogeshs@yogesh-laptop:~/temp$ 
yogeshs@yogesh-laptop:~/temp$ cat chars.txt 
لرياض
yogeshs@yogesh-laptop:~/temp$ file chars.txt 
chars.txt: UTF-8 Unicode text
yogeshs@yogesh-laptop:~/temp$

# 3  
Old 03-09-2009
Hi Yogesh,

Thanks a lot for the reply.
I tried the "file" command as well, but I don't know why it displays only "text". On my Unix box the output is not as descriptive as what you showed in your post.

Can you please try this with a file that was generated using the Windows-1256 code page?

Also, do you have any idea about those numbers? I found on some site that these numbers are octal. So I converted them into decimal and then tried &#DECIMAL; in an HTML page, without any luck.

You can check this in your example by doing "vi chars.txt"

Any pointers in this direction would be very helpful.

Thanks again
Sridhar
# 4  
Old 03-28-2009
I think I found what I was looking for after a series of tests.
file: This may not give the correct output. In the post above, chars.txt was reported as UTF-8 because it was saved to disk as UTF-8, so its non-ASCII bytes form valid UTF-8 sequences. (A UTF-8 file may also begin with a 3-byte marker, the byte order mark, but that marker is optional.)

In my case, the file was generated using cp1256. If the first 512 bytes are all ASCII characters, file reports the file as ASCII, because cp1256 is identical to ASCII for code points <= 127. (I am guessing that file only checks the first 512 bytes; I'm not 100% sure. I tested this simply by adding 1000 English characters to the beginning of the file.)
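The behaviour described above can be illustrated with a rough sketch. This is not how file is actually implemented; the function name `guess_encoding` and the 512-byte `sample_size` default are assumptions taken from the guess in this post. It only shows why a leading run of ASCII hides the cp1256 bytes that follow, and why cp1256 bytes are not mistaken for UTF-8.

```python
import codecs


def guess_encoding(data: bytes, sample_size: int = 512) -> str:
    """Rough guess based only on a leading sample (hypothetical helper)."""
    sample = data[:sample_size]

    # Bytes 0-127 mean the same thing in ASCII, cp1256, and UTF-8,
    # so an all-ASCII sample says nothing about the rest of the file.
    if all(b < 0x80 for b in sample):
        return "ascii"

    # Check whether the sample is well-formed UTF-8. If the sample was cut
    # short, tolerate a multi-byte sequence truncated at the very end.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=len(data) <= sample_size)
        return "utf-8"
    except UnicodeDecodeError:
        # Some single-byte code page (cp1256, ISO-8859-x, ...); the bytes
        # alone do not say which one.
        return "unknown-8bit"


# The six bytes vi displayed as \307\341\321\355\307\326
arabic_cp1256 = bytes([0o307, 0o341, 0o321, 0o355, 0o307, 0o326])

print(guess_encoding(arabic_cp1256))                  # unknown-8bit
print(guess_encoding(arabic_cp1256.decode("cp1256").encode("utf-8")))  # utf-8
print(guess_encoding(b"x" * 1000 + arabic_cp1256))    # ascii
```

The last call mirrors the experiment with 1000 English characters at the start of the file: the sample never reaches the non-ASCII bytes.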

Coming to the numbers shown when the file is opened in vi: they are the code points written in octal (base 8). I performed the test below to confirm it:
1. Opened the file using vi and copied some of those numbers.
2. Wrote a PHP program to convert the octal values into decimal and print the corresponding characters.
As my computer uses cp1256 to represent characters outside the ASCII range, it displayed the Arabic data. So these numbers are simply the cp1256 code points, one byte per character, written in octal.
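The PHP program is not shown in this thread, so here is a minimal Python sketch of the same test. The helper names `decode_vi_octal` and `to_html_references` are made up for illustration. It also shows why the earlier HTML attempt failed: &#N; expects the Unicode code point in decimal, not the cp1256 byte value.

```python
import re


def decode_vi_octal(escaped: str, encoding: str = "cp1256") -> str:
    """Turn a string of \\NNN escapes, as vi displays them, into text.

    Anything that is not a three-digit octal escape is ignored.
    """
    # Each \NNN is one byte written in base 8, e.g. 307 (octal) = 199.
    raw = bytes(int(digits, 8) for digits in re.findall(r"\\([0-7]{3})", escaped))
    # Interpret the bytes in the code page the file was written with.
    return raw.decode(encoding)


def to_html_references(text: str) -> str:
    """Build decimal numeric character references from Unicode code points."""
    return "".join(f"&#{ord(ch)};" for ch in text)


text = decode_vi_octal(r"\307\341\321\355\307\326")

print([f"U+{ord(ch):04X}" for ch in text])
# ['U+0627', 'U+0644', 'U+0631', 'U+064A', 'U+0627', 'U+0636']

print(to_html_references(text))
# &#1575;&#1604;&#1585;&#1610;&#1575;&#1590;
```

Pasting the second line of output into an HTML page renders الرياض, whereas &#199; (octal 307 in decimal) is the Unicode character Ç, not the Arabic letter.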

Thanks,
Sridhar
