Help with Unicode identification using PERL or AWK


 
# 1  
Old 02-15-2012

Hello,
I have a large UTF-8 file containing more than 200,000 strings in a large number of scripts (Unicode code blocks).
I need to extract from the file only the following:
All strings having Basic Latin characters: U+0021 to U+007E
All strings in the Devanagari range: U+0900 to U+097F
Has someone written a script in Perl or AWK to handle this? I have never tried character identification in Perl or AWK and do not want to reinvent the wheel.
Many thanks in advance.
# 2  
Old 02-15-2012
Well, UTF-8 sets the high bit on every byte of any character above U+007F: http://en.wikipedia.org/wiki/UTF-8
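
To make the high-bit point concrete, here is a minimal Perl sketch that dumps the UTF-8 bytes of one character and checks whether each byte has its high bit set. The helper names utf8_hex and all_high_bits are made up for illustration; only the core Encode module is assumed.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hex dump of the UTF-8 bytes that encode a single character.
sub utf8_hex {
    my ($char) = @_;
    my $bytes = encode('UTF-8', $char);
    return join ' ', map { sprintf('%02X', $_) } unpack('C*', $bytes);
}

# True (1) if every byte of the UTF-8 encoding has bit 0x80 set, else 0.
sub all_high_bits {
    my ($char) = @_;
    my $bytes = encode('UTF-8', $char);
    for my $byte (unpack('C*', $bytes)) {
        return 0 unless $byte & 0x80;
    }
    return 1;
}

print utf8_hex("A"), "\n";          # 41        (ASCII, high bit clear)
print utf8_hex("\x{0915}"), "\n";   # E0 A4 95  (DEVANAGARI LETTER KA)
```

So a byte-level scan can tell ASCII from non-ASCII, but it cannot tell Devanagari from other scripts without decoding the multi-byte sequences.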

Last edited by DGPickett; 02-15-2012 at 04:30 PM..
# 3  
Old 02-15-2012
Hello,
Many thanks for the help.
I have written a Perl script which sorts, but I cannot define a range within the script. I have to literally feed in the characters:
$in =~ tr<abcdefghijklmnopqrsštuuvwxyz><\x01-\x1C>
The same goes for Devanagari, which is not a good idea.
Moreover, Perl does not accept UTF-8 characters even when I put use utf8 in the program. That has left me stumped, hence my request for someone who can help me pipe out the two ranges.
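
For what it is worth, tr does accept ranges written as \x{...} escapes, and use utf8 only tells Perl that the script source is UTF-8; it does not decode data read from a file. A rough sketch of both points follows. The function names are hypothetical, and the sample bytes stand in for data read from a file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Count characters in the Devanagari block with a tr/// range.
# An empty replacement list with no modifiers counts without changing the string.
sub count_devanagari {
    my ($text) = @_;
    return ($text =~ tr/\x{0900}-\x{097F}//);
}

# Count printable Basic Latin characters (U+0021 to U+007E).
sub count_basic_latin {
    my ($text) = @_;
    return ($text =~ tr/\x{21}-\x{7E}//);
}

# Raw UTF-8 bytes, as they would arrive from a file opened without a layer.
my $raw  = "\xE0\xA4\x95\xE0\xA4\xBE ka";
# Decoding turns the byte sequences into real characters.
my $text = decode('UTF-8', $raw);

print count_devanagari($raw),  "\n";   # 0: undecoded bytes are all below U+0100
print count_devanagari($text), "\n";   # 2: U+0915 and U+093E
print count_basic_latin($text), "\n";  # 2: "k" and "a" (the space is U+0020)
```

When reading a real file, opening it with the '<:encoding(UTF-8)' layer does the decoding step for you.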
# 4  
Old 02-17-2012
ord (perldoc.perl.org) converts a byte or a wide Unicode character into an integer, and chr (perldoc.perl.org) does the reverse. However, it looks like Perl will handle the grisly bit details for you if you follow the caveats (Latin for warnings) in perlunicode (perldoc.perl.org). Just be careful when you read about byte arrays and character arrays: they are sometimes synonyms and sometimes not. A sane way to handle UTF-8 is to convert it to an array of 16- or 32-bit unsigned integer characters. (I do not know of any language that needs 64 bits. Unicode started out with 65K code points and now extends to U+10FFFF, but other Asian handlers had up to 32-bit characters! See Extended Char Intro in The GNU C Library manual.)
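
As a sketch of the ord approach, the snippet below classifies an already-decoded string by the code points of its characters. The original request does not say how mixed strings or whitespace should be treated, so two assumptions are made here: whitespace is ignored, and a string qualifies only if all remaining characters fall in one of the two ranges. The name classify is hypothetical.

```perl
use strict;
use warnings;

# Return 'latin', 'devanagari', or '' for a decoded string.
# Assumptions (not from the original post): whitespace is neutral,
# and mixed or out-of-range strings are rejected.
sub classify {
    my ($str) = @_;
    my ($latin, $deva) = (0, 0);
    for my $ch (split //, $str) {
        next if $ch =~ /\s/;
        my $cp = ord $ch;
        if    ($cp >= 0x0021 && $cp <= 0x007E) { $latin++ }
        elsif ($cp >= 0x0900 && $cp <= 0x097F) { $deva++ }
        else  { return '' }               # character from some other block
    }
    return 'latin'      if $latin && !$deva;
    return 'devanagari' if $deva  && !$latin;
    return '';
}

# Filter any files named on the command line, decoding them as UTF-8.
if (@ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';
    for my $file (@ARGV) {
        open my $fh, '<:encoding(UTF-8)', $file
            or die "Cannot open $file: $!";
        while (my $line = <$fh>) {
            chomp $line;
            print "$line\n" if classify($line);
        }
        close $fh;
    }
}
```

Saved under a name of your choosing and run with the input file as its argument, this prints only the qualifying lines. Splitting the output into two files by the returned label would be a small change.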

The wiki on Devanagari seems to be missing some glyphs! http://en.wikipedia.org/wiki/Devanagari

Last edited by DGPickett; 02-17-2012 at 12:37 PM..