perl sort unicode non-ascii letters

05-17-2009

In another thread (field separator in Perl) I nearly solved my sorting problem, and I finally understood the Schwartzian transform, thanks especially to KevinADC. After that I found out that the sorting was not done the way I need. I did not notice at first because I had tested only with vowels; once I added consonants, the problem showed up. The š (U+0161) was sorted as expected, but the ū (U+016B) was not: I need it treated as a separate letter that sorts after every plain 'u'.
I tried changing the script to this:

Code:

use strict;
use warnings;
open (_file_, "< path-to-file")  or  die "Failed to read file : $! ";
my @not_sorted = <_file_>;
sub normalize {
   my $in = $_[0];
      $in = lc($in);
      $in =~ tr<abcdefghijklmnopqrsštuūvwxyz><\x01-\x1C>; # I've put the ū here after the 'normal' 'u' and increased the letter count to 28 (\x1C)
   return $in;
}
my @sorted = map {$_->[0]}
        sort{ normalize($a->[1]) cmp normalize($b->[1]) or $a->[1] cmp $b->[1]}
        map {chomp;[$_,split(/\&/)]} @not_sorted;
print "$_\n" for @sorted;
close (_file_);

I thought it would work, but from this file:

Code:

bbc&aaa&aaa
mmn&aaa&ccc
lmn&bbb&aaa
aaa&ccc&ddd
ššš&&
sss&&aaa
zzz&&
aaa&bbb&ccc
aaa&aaa&bbb
uuu&&
šas&&
saš&&
cab&&
tuuū&&
tūmbi&&
tūūū&&
tuuu&&
tuaa&&
tuwakiyambi&&
tuttu&&

I get this result:

Code:

aaa&ccc&ddd
aaa&bbb&ccc
aaa&aaa&bbb
bbc&aaa&aaa
cab&&
lmn&bbb&aaa
mmn&aaa&ccc
saš&&
sss&&aaa
šas&&
ššš&&
tūmbi&&
tūūū&&
tuaa&&
tuttu&&
tuuū&&
tuuu&&
tuwakiyambi&&
uuu&&
zzz&&

but I need this:

Code:

aaa&ccc&ddd
aaa&bbb&ccc
aaa&aaa&bbb
bbc&aaa&aaa
cab&&
lmn&bbb&aaa
mmn&aaa&ccc
saš&&
sss&&aaa
šas&&
ššš&&
tuaa&&
tuttu&&
tuuu&&
tuuū&&
tuwakiyambi&&
tūmbi&&
tūūū&&
uuu&&
zzz&&

What am I doing wrong?

ahsog

05-17-2009

You can set collation sequences by defining a locale, then calling setlocale().
Let the underlying sort code handle the problem. You define what you want once, and it is there forever.

See man localedef.
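
A minimal sketch of this suggestion in Perl, assuming the POSIX module's setlocale() and the "use locale" pragma. The helper name sort_in_locale is made up for illustration, and the demo uses the "C" locale only because it is always installed; a custom locale built with localedef would be passed by its own name instead.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_COLLATE);

# Hypothetical helper: sort a list using the collation rules of a named locale.
sub sort_in_locale {
    my ($locale_name, @items) = @_;

    # setlocale() returns a false value when the locale is not installed.
    setlocale(LC_COLLATE, $locale_name)
        or die "Locale '$locale_name' is not available\n";

    use locale;    # from here to the end of the block, cmp follows LC_COLLATE
    return sort { $a cmp $b } @items;
}

# The "C" locale compares by byte value, so uppercase sorts before lowercase.
print join(" ", sort_in_locale("C", qw(b a B))), "\n";
```

With a custom locale the comparison rules live outside the script, so the same definition can be reused by other programs that honour LC_COLLATE.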

jim mcnamara

05-17-2009

Well, I'm not a programmer, so I don't know whether I'm able to define a custom locale right now, but of course I'm willing to learn if there's no other option. But isn't the line:

Code:

$in =~ tr<abcdefghijklmnopqrsštuūvwxyz><\x01-\x1C>

a collation sequence? If not, why? If it is, why does it not work?

ahsog

05-18-2009

This is tricky (sort of), because your program looks sane but does not behave that way. For more consistent behaviour, make sure both files (the script and the data file) are saved in a Unicode encoding, preferably UTF-8.

The following may not work if you do not have Perl 5.8 or later. Perl did not have really good Unicode support prior to 5.8.

Then fix your script so that it is parsed as UTF-8. This is important because your script (not just the data file!) contains non-ASCII characters. If you followed my advice, the special characters in your script file are encoded in UTF-8. But Perl will not automatically parse the script as UTF-8. By default it treats the source as raw bytes, so each non-ASCII letter is seen as several separate bytes, unless you tell it otherwise with "use utf8".

Finally, make sure the data file is read as UTF-8 and the results are written out as UTF-8.

Code:

use utf8;    # the script itself contains non-ASCII characters (š, ū)
use strict;
use warnings;
open(my $fh, '<', 'test.txt') or die "Failed to read file: $!";
binmode($fh, ':encoding(UTF-8)');       # decode the data file as UTF-8
binmode(STDOUT, ':encoding(UTF-8)');    # encode the output as UTF-8
my @not_sorted = <$fh>;
close($fh);
# Map each letter to its position in the desired alphabet (a-z plus š and ū).
sub normalize {
   my $in = $_[0];
   $in = lc($in);
   $in =~ tr<abcdefghijklmnopqrsštuūvwxyz><\x01-\x1C>; # ū comes after the plain 'u'; 28 letters, so the range ends at \x1C
   return $in;
}
# Sort on the first field; fall back to the raw field when the keys are equal.
my @sorted = map {$_->[0]}
        sort{ normalize($a->[1]) cmp normalize($b->[1]) or $a->[1] cmp $b->[1]}
        map {chomp;[$_,split(/\&/)]} @not_sorted;
print "$_\n" for @sorted;

Then I got your expected result on my Windows machine.
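
To see why the missing "use utf8" misplaced the ū, here is a small sketch that spells out both readings of the tr list with escapes, so it does not depend on how the file is saved. The sub names normalize_bytes and normalize_chars are made up for illustration. It assumes the script was saved as UTF-8, where š is the bytes C5 A1 and ū is the bytes C5 AB.

```perl
use strict;
use warnings;

# What Perl sees WITHOUT "use utf8": š and ū are two bytes each, so the
# search list has 30 entries instead of 28 and the mapping shifts.
sub normalize_bytes {
    my ($in) = @_;
    $in = lc $in;
    $in =~ tr<abcdefghijklmnopqrs\xC5\xA1tu\xC5\xABvwxyz><\x01-\x1C>;
    return $in;
}

# What Perl sees WITH "use utf8": š (U+0161) and ū (U+016B) are one character each.
sub normalize_chars {
    my ($in) = @_;
    $in = lc $in;
    $in =~ tr<abcdefghijklmnopqrs\x{161}tu\x{16B}vwxyz><\x01-\x1C>;
    return $in;
}

# Compare "tūmbi" with "tuaa" under both readings (-1 means it sorts first).
my $bytes_order = normalize_bytes("t\xC5\xABmbi") cmp normalize_bytes("tuaa");
my $chars_order = normalize_chars("t\x{16B}mbi")  cmp normalize_chars("tuaa");
print "bytes: $bytes_order, characters: $chars_order\n";
# prints "bytes: -1, characters: 1"
```

In the byte reading, the lead byte C5 is mapped to the slot right after 's'. That happens to put š in the right place, but it also drags ū in front of the plain 'u', which matches the output reported in the question.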

cbkihong

05-18-2009

I have Perl 5.8.8 on Unix.
It works great! I did not know you have to tell Perl to use UTF-8.
Thanks a lot!

ahsog

05-18-2009

No, sorry, it is still not right: now the 'š' goes to the very end of the sort order:

Code:

aaa&ccc&ddd
aaa&bbb&ccc
aaa&aaa&bbb
bbc&aaa&aaa
cab&&
lmn&bbb&aaa
mmn&aaa&ccc
saš&&
sss&&aaa
tuaa&&
tuttu&&
tuuu&&
tuuū&&
tuwakiyambi&&
tūmbi&&
tūūū&&
uuu&&
zzz&&
šas&&
ššš&&

but the 'u' entries are fine, as you can see.
I used the command "file" to check the file encoding, and it reports utf-8 for both the test file and the script file.

ahsog

05-19-2009

No, sorry again, it does work. Because of a typo there was an extra '}' in the script, which is why it failed.
So it finally works.
Thank you again.
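
One possible refinement of the working script, given only as a sketch: the sort block above calls normalize() on every comparison, whereas a Schwartzian transform normally computes the key once per line. The names sort_key and sort_by_first_field are made up for illustration. The letters are written as escapes so the sketch does not need "use utf8", and ties are broken on the whole line, which differs slightly from the original tie-break on the raw first field.

```perl
use strict;
use warnings;

# Map each letter to its position in the desired alphabet
# (a-z plus š after 's' and ū after 'u', 28 letters in total).
sub sort_key {
    my ($text) = @_;
    my $key = lc $text;
    $key =~ tr<abcdefghijklmnopqrs\x{161}tu\x{16B}vwxyz><\x01-\x1C>;
    return $key;
}

# Decorate each line with its key once, sort on the key, then undecorate.
sub sort_by_first_field {
    my @lines = @_;
    return map  { $_->[0] }
           sort { $a->[1] cmp $b->[1] or $a->[0] cmp $b->[0] }
           map  { my ($first) = split /&/, $_, 2; [ $_, sort_key($first // '') ] }
           @lines;
}

binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for sort_by_first_field(
    "t\x{16B}mbi&&", "tuwakiyambi&&", "tuaa&&", "\x{161}as&&", "sss&&aaa",
);
```

The input lines are assumed to be already decoded character strings, as they are when read through a UTF-8 input layer.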

ahsog
