perl sort unicode non-ascii letters Post: 302317239

Post 302317239 by cbkihong on Monday 18th of May 2009 11:07:54 AM

This is tricky (sort of), because your program looks sane but does not behave that way. For more consistent behaviour, make sure both files (the script and the data file) are saved in a Unicode encoding, preferably UTF-8.

The following assumes Perl 5.8 or later. Perl did not have really good Unicode support prior to 5.8.

Then fix your script so that it is parsed as UTF-8. This is important because your script itself (not the data file!) contains non-ASCII characters. If you followed my advice, your script file will have the special characters encoded in UTF-8. But Perl will not automatically parse it as UTF-8. Unless you tell it otherwise with use utf8; it reads the source byte by byte, so every multi-byte character in a literal turns into several separate characters.
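To see the effect for yourself, here is a minimal sketch (not part of your script; the sub names are made up for illustration) that measures the same literal with and without the pragma. It assumes the file is saved as UTF-8.

```perl
use strict;
use warnings;

# Literal parsed under "use utf8": one character per letter.
sub len_with_utf8 {
    use utf8;
    return length("šū");
}

# Same literal parsed without the pragma: one character per byte.
sub len_without_utf8 {
    no utf8;
    return length("šū");
}

printf "with use utf8: %d, without: %d\n",
    len_with_utf8(), len_without_utf8();
```

Saved as UTF-8, this should print "with use utf8: 2, without: 4", because š and ū each take two bytes in UTF-8. That is why a tr/// list containing them does not line up the way it looks on screen unless use utf8; is in effect.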

Finally, make sure the data file is read as UTF-8 and the results are written out as UTF-8.

Code:

use utf8;       # the script itself contains UTF-8 literals (š, ū)
use strict;
use warnings;

# Read the data file as UTF-8 and write UTF-8 to STDOUT.
open(my $fh, '<:encoding(UTF-8)', 'test.txt') or die "Failed to read file: $!";
binmode(STDOUT, ':encoding(UTF-8)');
my @not_sorted = <$fh>;
close($fh);

# Replace each letter by its rank in the custom alphabet so that cmp
# puts š right after s and ū right after the plain u.
sub normalize {
    my $in = lc($_[0]);
    # 28 letters (a-z plus š and ū) map onto \x01-\x1C
    $in =~ tr<abcdefghijklmnopqrsštuūvwxyz><\x01-\x1C>;
    return $in;
}

# Sort on the first '&'-separated field; break ties with a plain comparison.
my @sorted = map  { $_->[0] }
             sort { normalize($a->[1]) cmp normalize($b->[1]) or $a->[1] cmp $b->[1] }
             map  { chomp; [$_, split(/&/)] } @not_sorted;
print "$_\n" for @sorted;

Then I got your expected result on my Windows machine.
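If you want to check the ordering without a data file, here is a small self-contained sketch of the same idea. The word list and the name sort_key are made up for illustration, and the key is computed once per word instead of on every comparison, which is a variation on the code above rather than a required change.

```perl
use utf8;
use strict;
use warnings;

# Hypothetical sample words standing in for the first field of each line.
my @words = ('ūdra', 'tada', 'šuo', 'sala', 'uoga', 'vanduo');

# Replace each letter by its rank in the custom alphabet (a-z plus š and ū).
sub sort_key {
    my $key = lc($_[0]);
    $key =~ tr<abcdefghijklmnopqrsštuūvwxyz><\x01-\x1C>;
    return $key;
}

# Compute each key once, sort on it, then drop it again.
my @sorted = map  { $_->[0] }
             sort { $a->[1] cmp $b->[1] }
             map  { [$_, sort_key($_)] } @words;

binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for @sorted;
```

This should print sala, šuo, tada, uoga, ūdra, vanduo, one per line. A plain sort by code point would instead push šuo and ūdra to the end, after vanduo.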

locale

locale(3pm)						 Perl Programmers Reference Guide					       locale(3pm)

NAME

       locale - Perl pragma to use or avoid POSIX locales for built-in operations

SYNOPSIS

	   @x = sort @y;       # Unicode sorting order
	   {
	       use locale;
	       @x = sort @y;   # Locale-defined sorting order
	   }
	   @x = sort @y;       # Unicode sorting order again

DESCRIPTION

       This pragma tells the compiler to enable (or disable) the use of POSIX locales for built-in operations (for example, LC_CTYPE for regular
       expressions, LC_COLLATE for string comparison, and LC_NUMERIC for number formatting).  Each "use locale" or "no locale" affects statements
       to the end of the enclosing BLOCK.

       Starting in Perl 5.16, a hybrid mode for this pragma is available,

	   use locale ':not_characters';

       which enables only the portions of locales that don't affect the character set (that is, all except LC_COLLATE and LC_CTYPE).  This is
       useful when mixing Unicode and locales, including UTF-8 locales.

	   use locale ':not_characters';
	   use open ":locale";		 # Convert I/O to/from Unicode
	   use POSIX qw(locale_h);	 # Import the LC_ALL constant
	   setlocale(LC_ALL, "");	 # Required for the next statement
					 # to take effect
	   printf "%.2f\n", 12345.67;	 # Locale-defined formatting
	   @x = sort @y;		 # Unicode-defined sorting order.
					 # (Note that you will get better
					 # results using Unicode::Collate.)

       See perllocale for more detailed information on how Perl supports locales.
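       As a rough illustration of the Unicode::Collate remark in the example above, the following sketch compares the built-in sort with the module's sort method. The word list is made up, and the collator is created with default options only; language-specific orderings would need further tailoring.

```perl
use utf8;
use strict;
use warnings;
use Unicode::Collate;

# Hypothetical sample data.
my @words = ('zebra', 'Émile', 'apple', 'eagle');

# Built-in sort without "use locale": compares code points, so 'É' (U+00C9)
# lands after every unaccented ASCII letter.
my @by_codepoint = sort @words;

# Unicode::Collate with default settings: 'É' is ordered alongside 'e'.
my $collator = Unicode::Collate->new;
my @by_uca   = $collator->sort(@words);

binmode(STDOUT, ':encoding(UTF-8)');
print "code point order: @by_codepoint\n";
print "collation order:  @by_uca\n";
```

       The first line should list apple, eagle, zebra, Émile, and the second apple, eagle, Émile, zebra.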

NOTE

       If your system does not support locales, then loading this module will cause the program to die with a message:

	   "Your vendor does not support locales, you cannot use the locale
	   module."

perl v5.18.2							    2013-11-04							       locale(3pm)
