Text statistics


# 1  
Old 05-23-2013

Hello everybody,

I want to get the following statistics from a text file:

1- the n most frequent words, sorted
2- the n most frequent characters, sorted
3- the n most frequent digrams, sorted (two letters together, like th or he)
4- the n most frequent trigrams, sorted (like the or all, etc.)
5- every character that occurs before or after a specific character
For example, if I want to know all the characters that occur before r and after r,
I should get 2 lists: the characters that occur before r and those that occur after r.
For instance, in "tree":
t occurs before r
e follows r
and so on.

I can do it in a C, C++, or Java program, which can take a long time, but is there a Unix command line that can simply do that?

Thanks a lot
Khaled

Last edited by khaled79; 05-23-2013 at 08:35 PM..
# 2  
Old 05-24-2013
For 1 you could do (n=5; the numeric reverse sort key puts the highest counts first):

Code:
grep -Ewo "[a-zA-Z]{2,}" file.txt | sort | uniq -c | sort -k1,1nr | head -5

For 2 try:

Code:
grep -Eo "[a-zA-Z]" file.txt | sort | uniq -c | sort -k1,1nr | head -5

For 3 and 4:

Code:
grep -Eo "[a-zA-Z]{2}" file.txt | sort | uniq -c | sort -k1,1nr | head -5
grep -Eo "[a-zA-Z]{3}" file.txt | sort | uniq -c | sort -k1,1nr | head -5

---------- Post updated at 01:06 PM ---------- Previous update was at 12:08 PM ----------

Improvement for 3&4 ("one" should give "on" AND "ne"):

Code:
 awk '{ gsub("[^a-zA-Z]"," ");for(i=1;i<=NF;i++) for(j=1;j<length($i);j++) print substr($i,j,2)}' file.txt | sort | uniq -c | sort -k1,1nr | head -5
 awk '{ gsub("[^a-zA-Z]"," ");for(i=1;i<=NF;i++) for(j=1;j<length($i)-1;j++) print substr($i,j,3)}' file.txt | sort | uniq -c | sort -k1,1nr | head -5
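
Item 5 from the original question (the characters seen before and after a given character) is not covered by the commands above. A minimal sketch of one way to do it with the same awk/sort/uniq pattern follows. The function name `neighbours` is hypothetical, and spaces and punctuation are counted as neighbours too. It assumes an awk whose `substr` works on characters rather than bytes if the text is not plain ASCII.

```shell
# neighbours CHAR FILE
# Print every character found directly before or after CHAR,
# with a count, most frequent first. (Hypothetical helper, not from the thread.)
neighbours() {
    awk -v c="$1" '{
        n = length($0)
        for (i = 1; i <= n; i++) {
            if (substr($0, i, 1) != c) continue            # not the character we want
            if (i > 1) print "before", substr($0, i - 1, 1)  # left neighbour
            if (i < n) print "after", substr($0, i + 1, 1)   # right neighbour
        }
    }' "$2" | sort | uniq -c | sort -k1,1nr
}
```

Called as `neighbours r file.txt`, it prints one line per (position, character) pair, so for the word "tree" it reports t before r and e after r.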

# 3  
Old 10-05-2013
The text is Arabic, so it does not consist of ABC or abc letters.

Any help?
# 4  
Old 10-06-2013
Use [[:alpha:]] instead of [a-zA-Z]
and [^[:alpha:]] instead of [^a-zA-Z]

Last edited by MadeInGermany; 10-06-2013 at 04:04 AM.. Reason: :
# 5  
Old 10-06-2013
Thank you

I have tried it but it doesn't work
# 6  
Old 10-06-2013
"It doesn't work" is not a useful response to anyone trying to help you.

HOW doesn't it work? No output? Error message? Your computer explodes?

Post the relevant part of the code that you're running, and the output you get (and preferably some sample input as well, if possible - although that might be problematic with Arabic characters).
# 7  
Old 10-06-2013
Sorry for that reply.

The input is a .txt file in Arabic.

Sample of the input text:

Code:
 
عونك اللهم وصلى الله على سيدنا محمد وعلى آله وصحبه وسلم كما ذكره الذاكرون، وكما غفل عن ذكره الغافلون.
الحمد لله الذي برأ سماوات طباقا رفيعات، ولما دونها محيطات، وجعلها في الأقدار متفاوتات، وبالحركة متباينات، وفي التركيب مختلفات، ذات بروج معدودة، وأقسام مقدرة محدودة، وكواكب نيرة موارة، في أفلاك بها دوارة، تتحرك لنفسها تارة فتردها أفلاكها بقدرته تعالى مقسورة؛ كل ذلك يجري على ما قدر له من إسراع وتأثير، وإبطاء وتدبير، وإنماء وتغيير، بأمر الحكيم القدير، وتقدير العليم الخبير؛ ودحا الأرض فسطحها مهادا، وأرسى عليها الجبال فصارت أوتادا.
ثم خلق الإنسان من طين، وأنشأ منه البشر من سلالة من ماء مهين، واستعمرهم في الأرض لينظر كيف يعملون، وسخر لهم ما في السموات وما في الأرض لعلهم يشكرون، ومكنهم من النعماء، وتبسطوا في فنون الأفضال والآلاء، وأثاروا الأرض وعمروها، واتخذوا المدائن واستوطنوها، وقهروا الأعداء ممن ناوأهم، وخضدوا بالقهر شوكة من عاندهم أو شانأهم

The code I tried is:

Code:
 
grep -Ewo "[:alpha:]" all.txt | sort | uniq -c | sort -k1,1rd | head -30>>result.txt

and the result was:

Code:
 
601469 :
      1 a

Thanks a lot

Last edited by khaled79; 10-06-2013 at 07:19 PM..
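
The result in post #7 is consistent with how bracket expressions work: written with single brackets, `[:alpha:]` is an ordinary set containing the characters `:`, `a`, `l`, `p` and `h`, so only those can match (some grep versions report an error for this spelling instead). The `-w` flag in that command also restricts matches to whole words, which is not wanted when counting single letters. A minimal sketch of a corrected word count for item 1 follows, assuming a UTF-8 locale is installed. The locale name `en_US.UTF-8` is an assumption (`locale -a` lists what is available) and the function name `top_words` is hypothetical.

```shell
# top_words N FILE
# Print the N most frequent words in FILE with their counts.
# Assumption: en_US.UTF-8 exists on this system; any UTF-8 locale should do.
top_words() {
    # grep needs a UTF-8 locale so that [[:alpha:]] covers non-ASCII letters.
    LC_ALL=en_US.UTF-8 grep -Eo '[[:alpha:]]+' "$2" |
        # Group identical words bytewise, independent of locale collation rules.
        LC_ALL=C sort | LC_ALL=C uniq -c |
        # Highest count first, then keep the first N lines.
        sort -k1,1nr | head -n "$1"
}
```

Called as `top_words 30 all.txt > result.txt`, this would replace the command from post #7. Running `sort` and `uniq` under `LC_ALL=C` is a cautious choice: bytewise comparison only treats two words as equal when they are byte-for-byte identical, whatever collation rules the text locale defines.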