perl sort unicode non-ascii letters Post: 302316917

Post 302316917 by ahsog on Sunday 17th of May 2009 10:07:37 AM

In another thread (field separator in Perl) I nearly solved my sorting problem, and I finally understood the Schwartzian transform, especially thanks to KevinADC. After that I found out that the sorting was not done the way I need it. I did not notice it at first because I used only vowels as a test, but once I added consonants I saw the problem. In fact, the š (U+0161) was sorted as expected, but not the ū (U+016B): I need the ū to be treated as a separate letter, placed after all the "normal" 'u'.
I've tried to change the script to this:

Code:

use strict;
use warnings;
open (_file_, "< path-to-file")  or  die "Failed to read file : $! ";
my @not_sorted = <_file_>;
sub normalize {
   my $in = $_[0];
      $in = lc($in);
      $in =~ tr<abcdefghijklmnopqrsštuūvwxyz><\x01-\x1C>; # I've put the ū here after the 'normal' 'u' and increased the letter count to 28 (\x1C)
   return $in;
}
my @sorted = map {$_->[0]}
        sort{ normalize($a->[1]) cmp normalize($b->[1]) or $a->[1] cmp $b->[1]}
        map {chomp;[$_,split(/\&/)]} @not_sorted;
print "$_\n" for @sorted;
close (_file_);

I thought it would work, but from this file:

Code:

bbc&aaa&aaa
mmn&aaa&ccc
lmn&bbb&aaa
aaa&ccc&ddd
ššš&&
sss&&aaa
zzz&&
aaa&bbb&ccc
aaa&aaa&bbb
uuu&&
šas&&
saš&&
cab&&
tuuū&&
tūmbi&&
tūūū&&
tuuu&&
tuaa&&
tuwakiyambi&&
tuttu&&

I get this result:

Code:

aaa&ccc&ddd
aaa&bbb&ccc
aaa&aaa&bbb
bbc&aaa&aaa
cab&&
lmn&bbb&aaa
mmn&aaa&ccc
saš&&
sss&&aaa
šas&&
ššš&&
tūmbi&&
tūūū&&
tuaa&&
tuttu&&
tuuū&&
tuuu&&
tuwakiyambi&&
uuu&&
zzz&&

but I need this:

Code:

aaa&ccc&ddd
aaa&bbb&ccc
aaa&aaa&bbb
bbc&aaa&aaa
cab&&
lmn&bbb&aaa
mmn&aaa&ccc
saš&&
sss&&aaa
šas&&
ššš&&
tuaa&&
tuttu&&
tuuu&&
tuuū&&
tuwakiyambi&&
tūmbi&&
tūūū&&
uuu&&
zzz&&

What am I doing wrong?

ahsog
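
A likely cause, assuming both the script and the data file are saved as UTF-8: without `use utf8` and a decoding layer on the file handle, Perl sees š and ū as two bytes each, so `tr` maps bytes rather than letters. Both letters start with the byte 0xC5, which lands in the table right after `s` and therefore ranks below `t` and `u`; that would explain why š looks correct while ū sorts before the plain 'u'. The sketch below keeps the same `normalize` idea but works on decoded characters. `sort_records` and the inline sample list are names introduced here for illustration and are not part of the original script.

```perl
use strict;
use warnings;
use utf8;                               # the source itself contains š and ū as literals
binmode(STDOUT, ':encoding(UTF-8)');    # print characters as UTF-8, not raw bytes

# Map each letter to its rank in the custom alphabet (š after s, ū after u).
sub normalize {
    my $in = lc $_[0];
    $in =~ tr<abcdefghijklmnopqrsštuūvwxyz><\x01-\x1C>;   # 28 letters -> 28 ranks
    return $in;
}

# Schwartzian transform: sort '&'-separated records by their first field,
# computing the normalized key only once per record.
sub sort_records {
    my @records = @_;
    return map  { $_->[0] }
           sort { $a->[1] cmp $b->[1] or $a->[2] cmp $b->[2] }
           map  { my ($key) = split /&/; $key //= ''; [ $_, normalize($key), $key ] }
           @records;
}

# In the real script the lines would come from a file opened with a decoding
# layer (assumption: the file is UTF-8), for example:
#   open(my $fh, '<:encoding(UTF-8)', $path) or die "Failed to read $path: $!";
#   chomp(my @not_sorted = <$fh>);
my @sample = qw(tuuū&& tūmbi&& tūūū&& tuuu&& tuaa&& tuwakiyambi&& tuttu&& uuu&& šas&& sss&&aaa);
print "$_\n" for sort_records(@sample);
```

With decoded input, every letter in the `tr` search list is a single character, so the 28 ranks line up with the 28 letters and ū falls between `u` and `v`.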
