perl sort unicode non-ascii letters


 
# 1  
Old 05-17-2009

In another thread (field separator in Perl) I nearly solved my sorting problem, and I finally understood the Schwartzian transform, especially thanks to KevinADC. After that I found out that the sorting was not done the way I need it. I did not notice it at first because I used only vowels as a test, but once I added consonants I saw the problem. The š (U+0161) was sorted as expected, but not the ū (U+016B): I need ū to be treated as a separate letter that comes after the plain 'u'.
I tried changing the script to this:
Code:
use strict;
use warnings;
open (_file_, "< path-to-file")  or  die "Failed to read file : $! ";
my @not_sorted = <_file_>;
sub normalize {
   my $in = $_[0];
      $in = lc($in);
      $in =~ tr<abcdefghijklmnopqrsštuūvwxyz><\x01-\x1C>; #I've put the ū here after the 'normal' 'u'  and increased the letter number to 28 (x1c)
   return $in;
}
my @sorted = map {$_->[0]}
        sort{ normalize($a->[1]) cmp normalize($b->[1]) or $a->[1] cmp $b->[1]}
        map {chomp;[$_,split(/\&/)]} @not_sorted;
print "$_\n" for @sorted;
close (_file_);

I thought it would work, but from this file:
Code:
bbc&aaa&aaa
mmn&aaa&ccc
lmn&bbb&aaa
aaa&ccc&ddd
ššš&&
sss&&aaa
zzz&&
aaa&bbb&ccc
aaa&aaa&bbb
uuu&&
šas&&
saš&&
cab&&
tuuū&&
tūmbi&&
tūūū&&
tuuu&&
tuaa&&
tuwakiyambi&&
tuttu&&

I get this result:
Code:
aaa&ccc&ddd
aaa&bbb&ccc
aaa&aaa&bbb
bbc&aaa&aaa
cab&&
lmn&bbb&aaa
mmn&aaa&ccc
saš&&
sss&&aaa
šas&&
ššš&&
tūmbi&&
tūūū&&
tuaa&&
tuttu&&
tuuū&&
tuuu&&
tuwakiyambi&&
uuu&&
zzz&&

but I need this:
Code:
aaa&ccc&ddd
aaa&bbb&ccc
aaa&aaa&bbb
bbc&aaa&aaa
cab&&
lmn&bbb&aaa
mmn&aaa&ccc
saš&&
sss&&aaa
šas&&
ššš&&
tuaa&&
tuttu&&
tuuu&&
tuuū&&
tuwakiyambi&&
tūmbi&&
tūūū&&
uuu&&
zzz&&

What am I doing wrong?
# 2  
Old 05-17-2009
You can set collation sequences by defining a locale, then calling setlocale().
Let the underlying sort code handle the problem. You define what you want once, and it is there forever.

See man localedef.
# 3  
Old 05-17-2009
Well, I'm not a programmer, so I don't know if I'm able to define a custom locale right now, but of course I'm willing to learn if there's no other option. But isn't the line:
Code:
$in =~ tr<abcdefghijklmnopqrsštuūvwxyz><\x01-\x1C>

a collation sequence? If not, why? If it is, why does it not work?
# 4  
Old 05-18-2009
This is tricky (sort of), because your program looks sane but it does not act that way. For more consistent behaviour, make sure both files (the script and the data file) are saved in some form of Unicode, preferably UTF-8.

The following assumes Perl 5.8 or later. Perl did not have really good Unicode support before 5.8.

Then fix your script so that it is parsed as UTF-8. This is important because your script (not only the data file!) contains non-ASCII characters. If you followed my advice, your script file will have the special characters encoded in UTF-8, but Perl will not automatically parse it as UTF-8. Unless you add "use utf8;", Perl reads the source one byte at a time, so each š and ū in the tr list counts as two separate bytes instead of one letter, and the mapping to \x01-\x1C no longer lines up with your alphabet.
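To make the byte-versus-character difference concrete, here is a small sketch. It writes the special letters as escapes so it behaves the same however the script file is saved; the names normalize_bytes and normalize_chars are made up for this illustration. The first tr list is what Perl effectively sees in the original script (š and ū as the UTF-8 byte pairs C5 A1 and C5 AB), the second is what it sees once script and data are treated as characters.

```perl
use strict;
use warnings;
use Encode qw(encode);

# ū is U+016B; encoded as UTF-8 it becomes the two bytes 0xC5 0xAB.
my $chars = "t\x{16B}mbi";              # decoded string: 5 characters
my $bytes = encode('UTF-8', $chars);    # undecoded string: 6 bytes

printf "characters: %d, bytes: %d\n", length($chars), length($bytes);  # 5 and 6

# Byte view (no "use utf8", no decoding): š and ū are two list entries each,
# so the search list has 30 entries for a 28-entry replacement list.
sub normalize_bytes {
    my ($in) = @_;
    $in =~ tr<a-s\xC5\xA1tu\xC5\xABv-z><\x01-\x1C>;
    return $in;
}

# Character view: exactly one entry per letter of the 28-letter alphabet.
sub normalize_chars {
    my ($in) = @_;
    $in =~ tr<a-s\x{161}tu\x{16B}v-z><\x01-\x1C>;
    return $in;
}

# Compare "tūmbi" with "tuaa" under both views.
my $byte_order = normalize_bytes($bytes) cmp normalize_bytes('tuaa');
my $char_order = normalize_chars($chars) cmp normalize_chars('tuaa');

print "byte view: $byte_order\n";   # -1: "tūmbi" sorts before "tuaa" (the wrong result)
print "char view: $char_order\n";   #  1: "tūmbi" sorts after "tuaa" (the wanted result)
```

In the byte view the lead byte C5 gets a rank below 't' and 'u', which is why the ū words jumped ahead of "tuaa" in the first output.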

Finally, make sure the data file is interpreted as UTF-8, and the results being output in UTF-8.

Code:
use utf8;
use strict;
use warnings;
open (_file_, "<test.txt")  or  die "Failed to read file : $! ";
binmode(_file_, ':utf8');
binmode(STDOUT, ':utf8');
my @not_sorted = <_file_>;
sub normalize {
   my $in = $_[0];
      $in = lc($in);
      $in =~ tr<abcdefghijklmnopqrsštuūvwxyz><\x01-\x1C>; # ū is placed after the plain 'u', giving 28 letters (\x1C)
   return $in;
}
my @sorted = map {$_->[0]}
        sort{ normalize($a->[1]) cmp normalize($b->[1]) or $a->[1] cmp $b->[1]}
        map {chomp;[$_,split(/\&/)]} @not_sorted;
print "$_\n" for @sorted;
close (_file_);

Then I got your expected result on my Windows machine.
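A possible refinement, offered only as a sketch: the script above calls normalize() inside the sort block, so the key is recomputed on every comparison. The variant below computes the key once per line, which is the usual point of the Schwartzian transform. It uses an in-memory list as a stand-in for the decoded, chomped lines of the data file, and it breaks ties on the whole line, which is my choice and not something the original script does. The file must be saved as UTF-8 because the tr list and the sample data contain š and ū.

```perl
use utf8;                                   # the source contains š and ū
use strict;
use warnings;
binmode(STDOUT, ':encoding(UTF-8)');

# Map each letter to its rank in the custom 28-letter alphabet.
sub normalize {
    my ($in) = @_;
    $in = lc $in;
    $in =~ tr<abcdefghijklmnopqrsštuūvwxyz><\x01-\x1C>;
    return $in;
}

# Stand-in for lines read from the data file through a UTF-8 layer.
my @not_sorted = ('tūmbi&&', 'tuwakiyambi&&', 'šas&&', 'sss&&aaa',
                  'tuuū&&', 'tuaa&&', 'saš&&');

my @sorted =
    map  { $_->[0] }                                   # 3. recover the line
    sort { $a->[1] cmp $b->[1] or $a->[0] cmp $b->[0] } # 2. compare cached keys
    map  {                                             # 1. build [line, key] once
        my ($first) = split /&/;
        [ $_, normalize(defined $first ? $first : '') ]
    } @not_sorted;

print "$_\n" for @sorted;
```

With this sample the š words land after all plain 's' words and the ū words after all plain 'u' combinations, matching the order asked for in the first post.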
# 5  
Old 05-18-2009
I have perl 5.8.8 on unix
Works great! I did not know you have to tell perl to use utf8.
Thanks a lot!
# 6  
Old 05-18-2009
No, sorry, it is still not right. Now the 'š' goes to the very end of the sort order:
Code:
aaa&ccc&ddd
aaa&bbb&ccc
aaa&aaa&bbb
bbc&aaa&aaa
cab&&
lmn&bbb&aaa
mmn&aaa&ccc
saš&&
sss&&aaa
tuaa&&
tuttu&&
tuuu&&
tuuū&&
tuwakiyambi&&
tūmbi&&
tūūū&&
uuu&&
zzz&&
šas&&
ššš&&

but the 'u' letters are fine, as you can see.
I used the "file" command to check the encoding, and it reports UTF-8 for both the test file and the script file.
# 7  
Old 05-19-2009
No, sorry again, it does work. Because of a typo there was an extra '}' in the script, which is why it failed.
So, finally, it works.
Thank you again.