Post 302316917 by ahsog on Sunday 17th of May 2009 10:07:37 AM
perl sort unicode non-ascii letters

In another thread (field separator in Perl) I nearly solved my sorting problem, and I finally understood the Schwartzian transform, thanks especially to KevinADC. After that I found out that the sorting was not done the way I need. I did not notice it at first because I used only vowels as a test, but once I add consonants I see the problem. In fact, the š (U+0161) was sorted as expected, but not the ū (U+016B): I need ū to be treated as a separate letter that comes after every "normal" 'u'.
I've tried to change the script to this:
Code:
use strict;
use warnings;
open (_file_, "< path-to-file")  or  die "Failed to read file : $! ";
my @not_sorted = <_file_>;
sub normalize {
   my $in = $_[0];
      $in = lc($in);
      $in =~ tr<abcdefghijklmnopqrsštuūvwxyz><\x01-\x1C>; #I've put the ū here after the 'normal' 'u'  and increased the letter number to 28 (x1c)
   return $in;
}
my @sorted = map {$_->[0]}
        sort{ normalize($a->[1]) cmp normalize($b->[1]) or $a->[1] cmp $b->[1]}
        map {chomp;[$_,split(/\&/)]} @not_sorted;
print "$_\n" for @sorted;
close (_file_);

I thought it would work, but from this file:
Code:
bbc&aaa&aaa
mmn&aaa&ccc
lmn&bbb&aaa
aaa&ccc&ddd
ššš&&
sss&&aaa
zzz&&
aaa&bbb&ccc
aaa&aaa&bbb
uuu&&
šas&&
saš&&
cab&&
tuuū&&
tūmbi&&
tūūū&&
tuuu&&
tuaa&&
tuwakiyambi&&
tuttu&&

I get this result:
Code:
aaa&ccc&ddd
aaa&bbb&ccc
aaa&aaa&bbb
bbc&aaa&aaa
cab&&
lmn&bbb&aaa
mmn&aaa&ccc
saš&&
sss&&aaa
šas&&
ššš&&
tūmbi&&
tūūū&&
tuaa&&
tuttu&&
tuuū&&
tuuu&&
tuwakiyambi&&
uuu&&
zzz&&

but I need this:
Code:
aaa&ccc&ddd
aaa&bbb&ccc
aaa&aaa&bbb
bbc&aaa&aaa
cab&&
lmn&bbb&aaa
mmn&aaa&ccc
saš&&
sss&&aaa
šas&&
ššš&&
tuaa&&
tuttu&&
tuuu&&
tuuū&&
tuwakiyambi&&
tūmbi&&
tūūū&&
uuu&&
zzz&&

What am I doing wrong?
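
A likely cause, assuming both the script and the data file are saved as UTF-8: without `use utf8` and without an encoding layer on the input, Perl sees š and ū as two bytes each. The tr search list is then 30 bytes long instead of 28 characters, so it no longer lines up with \x01-\x1C, and the lead byte shared by š and ū maps to a value below the one 'u' receives. That would explain why every ū lands before u. Below is a minimal sketch of one way around it. The inline sample data and the precomputed sort key are my own illustration, not part of the original script; with a real file the handle would be opened with an '<:encoding(UTF-8)' layer instead.

```perl
use strict;
use warnings;
use utf8;                              # the source contains literal š and ū
binmode(STDOUT, ':encoding(UTF-8)');   # print wide characters as UTF-8

# Illustrative records taken from the sample file. With a real file
# (assumption) they would be read through something like:
#   open(my $fh, '<:encoding(UTF-8)', $path) or die "Failed to read file: $!";
my @not_sorted = (
    'tūmbi&&', 'tuuū&&', 'tuwakiyambi&&', 'šas&&',
    'sss&&aaa', 'tuuu&&', 'uuu&&', 'saš&&',
);

# Replace each letter by its rank in the custom alphabet:
# š comes right after s, ū right after u, 28 letters in total (\x01-\x1C).
sub normalize {
    my $in = lc $_[0];
    $in =~ tr<abcdefghijklmnopqrsštuūvwxyz><\x01-\x1C>;
    return $in;
}

# Schwartzian transform: compute the key once per record,
# sort on the key, then recover the original line.
my @sorted = map  { $_->[0] }
             sort { $a->[1] cmp $b->[1] }
             map  { [ $_, normalize((split /&/)[0]) ] } @not_sorted;

print "$_\n" for @sorted;
```

Computing the normalized key inside the inner map, rather than calling normalize() in the sort block, means each record is translated only once, which is the point of the Schwartzian transform.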
 
