[awk]Chinese words!!


 
# 1  
Old 06-25-2015
Is there a way to extract Chinese words from a text written in a European language? I want to create a glossary, and finding a way to do this would save me time!

Thank you!
# 2  
Old 06-25-2015
I suppose if you knew which part of the code set holds the Chinese characters, you could try deleting the other characters. For starters, see what something like this brings:
Code:
tr -d '[:alnum:][:punct:]' < file

# 3  
Old 06-26-2015
The range of Chinese Unicode characters is U+4E00 through U+9FFF (in UTF-8, octal bytes 344 270 200 through 351 277 277), so the test on the first byte should be >"\343" and <"\352" (to avoid picking up any 4-byte UTF-8 sequences):

Code:
{
f=0;
for ( i=1; i<=length; i++)
 if(substr($0, i, 1)>"\343" &&substr($0, i, 1)<"\352")
 print $f

but there is at least one error in it, and I can't find it.
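
For reference, here is one way the loop above might be completed. It is only a sketch under a few assumptions: the input is UTF-8, awk is run with LC_ALL=C so that substr() works on single bytes, and the function name extract_cjk is made up for this example. Compared with the attempt above, it closes the outer brace, collects the lead byte together with its two continuation bytes instead of calling print $f (with f=0 that prints the whole line), and skips past the bytes it has already taken.

```shell
# Hypothetical helper: reads stdin, prints the CJK characters found on each line.
# Assumes UTF-8 input; LC_ALL=C makes awk treat the line as a sequence of bytes.
extract_cjk() {
  LC_ALL=C awk '{
    out = ""
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      # Lead bytes \344..\351 start 3-byte sequences for U+4000..U+9FFF,
      # which is slightly wider than U+4E00..U+9FFF.
      if (c > "\343" && c < "\352") {
        out = out substr($0, i, 3)   # lead byte plus two continuation bytes
        i += 2                       # skip the continuation bytes
      }
    }
    if (out != "") print out         # one output line per input line with matches
  }'
}
```

It would be called as extract_cjk < file. Lines without any matching bytes produce no output.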


# 4  
Old 06-26-2015
Well, I tried this on a download of google.hk, and it spat out a line that is unreadable (to Western eyes, at least):
Code:
awk '{for (i=1; i<=NF; i++) if ($i >= "\344" && $i <= "\351") {printf "%s", $i$(i+1)$(i+2); i+=2}}' FS="" file
搜尋圖片地圖新聞更多登入香港中顯示的語言為中文简体私隱權政策條款設定廣告企業關於
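
Since the goal is a glossary, a variant that prints each run of consecutive Chinese characters on its own line and drops duplicates may be more convenient than one long line. The following is a sketch using the same lead-byte test as above; it assumes UTF-8 input, and the name cjk_runs is invented for this example. Note that a run of adjacent characters is not necessarily a single word, since Chinese text does not separate words with spaces.

```shell
# Hypothetical helper: one line per run of consecutive CJK characters, duplicates removed.
# Assumes UTF-8 input; LC_ALL=C makes awk and sort operate on bytes.
cjk_runs() {
  LC_ALL=C awk '{
    run = ""
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      if (c >= "\344" && c <= "\351") {
        run = run substr($0, i, 3)   # append the whole 3-byte character
        i += 2
      } else if (run != "") {
        print run                    # a non-CJK byte ends the current run
        run = ""
      }
    }
    if (run != "") print run         # flush a run that reaches end of line
  }' | LC_ALL=C sort -u
}
```

Usage would be cjk_runs < file > glossary.txt, after which the list can be reviewed by hand.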
