Problem identifying charset of a file
Post 302301835 by sridhar_423 on Saturday 28th of March 2009 04:00:49 PM
I think I found what I was looking for after a series of tests.
file: This may not give the correct output. In the post above, chars.txt was reported as UTF-8 because it was saved to disk as UTF-8 with a byte order mark (BOM): the first 3 bytes of the file (EF BB BF) mark it as Unicode text encoded in UTF-8. The BOM is optional, so not every UTF-8 file starts with these bytes.
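
A minimal sketch of that 3-byte check, assuming the marker in question is the UTF-8 byte order mark. The helper name has_utf8_bom and the sample byte strings are made up for illustration; they are not the contents of chars.txt.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


# Hypothetical sample contents, not the poster's actual file.
with_bom = codecs.BOM_UTF8 + b"abc"
without_bom = b"abc"

print(codecs.BOM_UTF8.hex())      # the three marker bytes, in hex
print(has_utf8_bom(with_bom))
print(has_utf8_bom(without_bom))
```

A file without the marker can still be valid UTF-8, so a negative result here says nothing about the encoding on its own.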

In my case, the file was generated using cp1256. If the first 512 bytes are ASCII characters, file reports the file as ASCII, because the code points of cp1256 are the same as ASCII for values <= 127. (I guess file only checks the first 512 bytes; I'm not 100% sure though. I tested this by simply adding 1000 English characters to the beginning of the file.)
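
The effect can be modelled with a short sketch. This is not how file is actually implemented: looks_ascii is a hypothetical helper, the 512-byte window is only the guess from the paragraph above, and the Arabic bytes are made-up sample data.

```python
def looks_ascii(data: bytes, window: int = 512) -> bool:
    """Return True if none of the first `window` bytes has the high bit set."""
    return all(byte < 0x80 for byte in data[:window])


# cp1256 and ASCII decode bytes 0x00-0x7F to the same characters.
low_bytes = bytes(range(0x80))
same_low_half = low_bytes.decode("cp1256") == low_bytes.decode("ascii")

# Hypothetical cp1256 payload: five Arabic letters as raw bytes.
arabic_cp1256 = b"\xe3\xd1\xcd\xc8\xc7"
padded = b"a" * 1000 + arabic_cp1256    # ASCII prefix hides the Arabic bytes
unpadded = arabic_cp1256 + b"a" * 1000  # Arabic bytes fall inside the window

print(same_low_half)
print(looks_ascii(padded))
print(looks_ascii(unpadded))
```

Any detector that inspects only a prefix has this blind spot: bytes outside the inspected window cannot influence the result.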

Coming to the numbers shown when the file is opened in the vi editor: they are the octal (base 8) values of the code points. I performed the test below to confirm it:
1. Opened the file in vi and copied some of those numbers.
2. Wrote a PHP program to convert the octal values into decimal and print the corresponding characters.
As my computer uses cp1256 to represent characters that fall outside the ASCII range, it displayed Arabic data. So these numbers are nothing but the code points.
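
The same test can be sketched in Python instead of PHP. The function name decode_octal_escapes and the sample string are assumptions for illustration: the sample assumes vi displays each non-ASCII byte as a backslash followed by three octal digits, and it is not taken from the original file.

```python
import re
import unicodedata


def decode_octal_escapes(text: str, encoding: str = "cp1256") -> str:
    """Convert backslash + 3 octal digit escapes into characters of `encoding`."""
    # Only values up to octal 377 (decimal 255) fit in a single byte.
    octal_values = re.findall(r"\\([0-3][0-7]{2})", text)
    raw = bytes(int(value, 8) for value in octal_values)
    return raw.decode(encoding)


# Hypothetical text copied from the editor display.
sample = r"\343\321\315\310\307"
decoded = decode_octal_escapes(sample)

# Print code point and Unicode name rather than the glyph itself,
# so the output does not depend on the terminal's own charset.
for ch in decoded:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Passing a different encoding name to decode_octal_escapes shows how the same byte values map to other characters under another code page.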

Thanks,
Sridhar
 

Encode::Detect::Detector(3)    User Contributed Perl Documentation    Encode::Detect::Detector(3)

NAME
       Encode::Detect::Detector - Detects the encoding of data

SYNOPSIS
         use Encode::Detect::Detector;
         my $charset = detect($octets);

         my $d = new Encode::Detect::Detector;
         $d->handle($octets);
         $d->handle($more_octets);
         $d->eof;
         my $charset = $d->getresult;

DESCRIPTION
       This module provides an interface to Mozilla's universal charset detector, which detects the charset used to encode data.

METHODS
   $charset = Encode::Detect::Detector->detect($octets)
       Detect the charset used to encode the data in $octets and return the charset's name. Returns undef if the charset cannot be determined with sufficient confidence.

   $d = Encode::Detect::Detector->new()
       Creates a new "Encode::Detect::Detector" object and returns it.

   $d->handle($octets)
       Provides an additional chunk of data to be examined by the detector. May be called multiple times. Returns zero on success, nonzero if a memory allocation failed.

   $d->eof
       Informs the detector that there is no more data to be examined. In many cases, this is necessary in order for the detector to make a decision on the charset.

   $d->reset
       Resets the detector to its initial state.

   $d->getresult
       Returns the name of the detected charset or "undef" if no charset has (yet) been decided upon. May be called at any time.

SEE ALSO
       Encode::Detect

AUTHOR
       John Gardiner Myers <jgmyers@proofpoint.com>

SUPPORT
       For help and thank you notes, e-mail the author directly. To report a bug, submit a patch, or add to the wishlist please visit the CPAN bug manager at: http://rt.cpan.org

perl v5.18.2                           2017-10-06                           Encode::Detect::Detector(3)