Shell scripting: frequency of specific word in a string and statistics


 
# 8  
Old 05-15-2013
Hi Chubler_XL,

many thanks for your help:

here is the output of your script:

Code:
 WORD,TOTFrequency,(1)Frequency,(0)Frequency,(-1)Frequency,(1%)Frequency,(0%)Frequency,(-1%)Frequency
have,38,10,0,28,26,3%,0,0%,73,7%
know,31,3,3,25,9,7%,9,7%,80,6%
please,30,6,0,24,20,0%,0,0%,80,0%
life,29,3,0,26,10,3%,0,0%,89,7%
with,26,13,3,10,50,0%,11,5%,38,5%
paye,25,1,0,24,4,0%,0,0%,96,0%
shit,25,0,0,25,0,0%,0,0%,100,0%
having,24,0,0,24,0,0%,0,0%,100,0%
little,24,0,0,24,0,0%,0,0%,100,0%

very nice!

So, I don't know why, but the percentage results are comma-separated instead of point-separated:
(9,7%,9,7%,80,6%) instead of (9.7%,9.7%,80.6%)

How can I limit the results to the 50 most frequent words?

How can I add an option to search case-insensitively (like "grep -i") for specific words (that|really|great|day|.......)?
Code:
my @wanted=qw(day) result= today days monday etc.......


Many thanks for your attention and BIG HELP!
# 9  
Old 05-15-2013
The commas for decimal point are most likely related to your locale.

Unfortunately, different Unix flavors have different ways of setting this. From the shell, try:
Code:
$ LC_ALL=en_US.UTF-8 ; export LC_ALL
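
If that locale is not installed on your system, a possible alternative is the C locale, which always uses "." as the decimal point. A minimal sketch, where fmt_pct is a hypothetical helper name used only for illustration and the format string is the one from the script:

```shell
# fmt_pct VALUE: print VALUE as a one-decimal percentage.
# LC_ALL=C is set only for this awk call, so the decimal separator is ".".
fmt_pct() {
    LC_ALL=C awk -v x="$1" 'BEGIN { printf "%0.1f%%\n", x }'
}

fmt_pct 9.7
```

Setting the variable on the command itself (LC_ALL=C ./your_script.sh) has the same effect without changing the locale of the rest of the shell session.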

The best way to search for particular entries, or to limit the results to the first 2, is to use other Unix tools like this:

Code:
bash $ ./your_script.sh | grep -E "^(WORD|that|really),"
WORD,TOTFrequency,(1)Frequency,(0)Frequency,(-1)Frequency,(1%)Frequency,(0%)Frequency,(-1%)Frequency
really,1,1,0,0,0.0%,100.0%,0.0%
that,1,0,1,0,100.0%,0.0%,0.0%

Code:
bash $ ./your_script.sh | head -3
WORD,TOTFrequency,(1)Frequency,(0)Frequency,(-1)Frequency,(1%)Frequency,(0%)Frequency,(-1%)Frequency
have,38,10,0,28,26.3%,0.0%,73.7%
know,31,3,3,25,9.7%,9.7%,80.6%
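
For the 50 most frequent words the same idea applies: the output is already sorted by total frequency, so keeping the header plus the first 50 rows should be enough. A small sketch, assuming the header is always the first output line (top_n is a hypothetical helper name):

```shell
# top_n N: keep the header line plus the first N data rows read from stdin.
top_n() {
    head -n "$(( $1 + 1 ))"
}

# Intended use: ./your_script.sh | top_n 50
# Small demo with inline data:
printf '%s\n' 'WORD,TOTFrequency' 'have,38' 'know,31' 'please,30' | top_n 2
```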

# 10  
Old 06-13-2013
Hi Chubler_XL

many thanks again for your help,

I have two questions:

1)
When I use grep to get the values for specific words, I lose a lot of information about subpatterns.

I'd like to ask how to obtain values for patterns (ignoring case) instead of values for specific words.

Example:
Code:
./your_script.sh | grep -E "^(day|real),"
WORD,TOTFrequency,(1)Frequency,(0)Frequency,(-1)Frequency,(1%)Frequency,(0%)Frequency,(-1%)Frequency
real,1,1,0,0,0.0%,100.0%,0.0%
day,1,0,1,0,100.0%,0.0%,0.0%

where the real line includes the values for all terms: real, really, reality, etc.
where the day line includes the values for all terms: day, today, everyday, etc.

2)
How can I get an additional column with a normalized value between 0 and 1,
where a -1 value maps to 0, a 0 value maps to 0.5 and a 1 value maps to 1?

Example:
Code:
./your_script.sh | grep -E "^(day|real),"
WORD,TOTFrequency,(1)Frequency,(0)Frequency,(-1)Frequency,(1%)Frequency,(0%)Frequency,(-1%)Frequency,NV
real,1,1,0,0,0.0%,100.0%,0.0%,1
day,1,0,1,0,100.0%,0.0%,0.0%,0.5
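
One possible reading of this request is a weighted average in which every (1) occurrence counts 1, every (0) occurrence counts 0.5 and every (-1) occurrence counts 0. The sketch below follows that assumption (it reproduces the two NV values in the example above); add_nv01 is a hypothetical name and the column positions are those of the script's output with "." as the decimal point:

```shell
# add_nv01: append NV = (n1 + 0.5*n0) / total to each data row read from stdin.
# Columns: $2 = TOTFrequency, $3 = (1)Frequency, $4 = (0)Frequency.
add_nv01() {
    LC_ALL=C awk -F, '
    NR == 1 { print $0 ",NV"; next }           # extend the header
    {
        nv = ($2 ? ($3 + 0.5 * $4) / $2 : 0)   # guard against a zero total
        printf "%s,%.2f\n", $0, nv
    }'
}

printf '%s\n' 'WORD,TOT,(1),(0),(-1)' 'real,1,1,0,0' 'day,1,0,1,0' | add_nv01
```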

Many thanks for your attention and big help.

Hope to hear from you soon.
# 11  
Old 06-13-2013
1) In the expression "^(day|real)," the ^ symbol matches the beginning of the line, i.e. day or real must be at the front of the line and be followed by a comma. To match the heading line or any sub-string containing day or real, simply use "(^WORD|day|real)"

2) you can use awk like this:
Code:
./your_script.sh | awk -F, 'NR==1 || ($5==0 && $4==0.5 && $3==1)'

prints out line if
rownumber = 1 (header row) OR
field#5 (-1 value) is 0 AND field#4 (0 value) is 0.5 AND field#3 (1 value) is 1
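
For item 1, a case-insensitive variant that only looks at the first field can be sketched with awk instead of grep. This is just an illustration under the assumption that the word is always in column 1; match_words is a hypothetical name:

```shell
# match_words PATTERN: keep the header plus every row whose first field
# contains PATTERN (an extended regex), ignoring case.
match_words() {
    awk -F, -v pat="$1" 'NR == 1 || tolower($1) ~ tolower(pat)'
}

# Intended use: ./your_script.sh | match_words "day|real"
printf '%s\n' 'WORD,TOTFrequency' 'today,5' 'Really,3' 'have,38' | match_words 'day|real'
```

Unlike a plain grep on the whole line, this cannot match by accident in one of the other columns.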
# 12  
Old 06-13-2013
Hi Chubler_XL,

1)
This way I obtain:

Code:
day,574266,0,0,0,0,0%,0,0%,0,0%
today,77679,0,0,0,0,0%,0,0%,0,0%
everyday,40810,0,0,0,0,0%,0,0%,0,0%
real,77679,0,0,0,0,0%,0,0%,0,0%
really,40810,0,0,0,0,0%,0,0%,0,0%
reality,20082,0,0,0,0,0%,0,0%,0,0%

I'd like to get all cases of each grepped pattern combined into one line with aggregated values!


2)
In the second case I'd like to calculate an additional value (a kind of average of the values 1, 0, -1) in order to understand the behaviour of each word (more positive or more negative).

Example:
Code:
./your_script.sh | grep -E "^(day|real),"
WORD,TOTFrequency,(1)Frequency,(0)Frequency,(-1)Frequency,(1%)Frequency,(0%)Frequency,(-1%)Frequency,NV
real,1,1,0,0,0.0%,100.0%,0.0%,1

Where NV = [(1)Frequency - (-1)Frequency] / [(1)Frequency + (-1)Frequency]

In this case the index will be between -1 and 1

Do you have some suggestions?

Many thanks for your attention and big help!
# 13  
Old 06-13-2013
1) This is getting a little long for the command line and should probably be put in another .sh script but:

Code:
./your_script.sh | awk -F, -vW="day,real" '
BEGIN { split(W,w,",") ; for(i in w) N[w[i]]}
{ for(word in N) if ($1 ~ word) for(i=2;i<=NF;i++) v[word,i]+=$i }
END {
   for(word in N) {
       printf "%s", word
       for(i=2; word SUBSEP i in v; i++)
           printf ",%s", v[word,i] ((i==7||i==9||i==11)?"%":"")
       printf "\n"
   }
}'

2) I think you are asking for this:

Code:
./your_script.sh | awk -F, '
  NR==1 { print $0 ",NV" }
  $5==0 && $4==0.5 && $3==1 { print $0","$3-$4/($3+$5) }'

However, won't NV always be 0.5 [i.e. 1 - 0.5/(1+0)]?
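
For item 1, a more portable sketch is given here. It assumes the input uses "." as the decimal point, sums only the count columns (2 to 5) and recomputes the percentages from the sums instead of adding percentages together. Some awk implementations appear to reject -vW=... written without a space, so the variable is passed as -v W=... here. aggregate_words is a hypothetical name:

```shell
# aggregate_words "day,real": for each comma-separated pattern, sum the
# count columns of every row whose word contains it, then recompute the
# percentage columns from those sums.
aggregate_words() {
    LC_ALL=C awk -F, -v W="$1" '
    BEGIN { n = split(W, pats, ",") }
    NR == 1 { print; next }                      # pass the header through
    {
        for (p = 1; p <= n; p++)
            if (tolower($1) ~ pats[p])
                for (i = 2; i <= 5; i++) sum[p, i] += $i
    }
    END {
        for (p = 1; p <= n; p++) {
            tot = sum[p, 2]
            printf "%s,%d,%d,%d,%d", pats[p], tot, sum[p, 3], sum[p, 4], sum[p, 5]
            for (i = 3; i <= 5; i++)
                printf ",%.1f%%", (tot ? sum[p, i] * 100 / tot : 0)
            printf "\n"
        }
    }'
}

# Intended use: ./your_script.sh | aggregate_words "day,real"
```

Patterns are printed in the order given, and a pattern without any match yields a row of zeros.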
# 14  
Old 06-14-2013
Hi Chubler_XL

thanks for your help:

1) doesn't work; I get this error:

Code:
awk: invalid -v option

2) doesn't work; I get an empty file:

Code:
WORD,TOTFrequency,(1)Frequency,(0)Frequency,(-1)Frequency,(1%)Frequency,(0%)Frequency,(-1%)Frequency,NV

Anyway, I guess we have to look at your first version:

Code:
#!/bin/bash
awk -F, '
NR>1 {
   split(tolower(substr($0, length($1","$2",")+1)), words, "[^A-Za-z_\047]")
   for(wnum in words) {
       w=words[wnum]
       if(length(w)>=4) {
          counts[w]=counts[w]+1
          freq[$1,w]++
       }
    }
}
END {
  OFS=","
  print "WORD,TOTFrequency,(1)Frequency,(0)Frequency,"\
        "(-1)Frequency,(1%)Frequency,(0%)Frequency,(-1%)Frequency"
  for (w in counts) printf ("%s,%d,%d,%d,%d,%0.1f%%,%0.1f%%,%0.1f%%\n",
         w, counts[w],
         freq[1,w]+0, freq[0,w]+0, freq[-1,w]+0,
         freq[1,w]*100/counts[w],
         freq[0,w]*100/counts[w],
         freq[-1,w]*100/counts[w]) | "sort -t, -k2,2nr -k1,1"
}
' file

From here, can you implement all the features in one .sh?

About NV: its value depends on the values in the columns (1)Frequency and (-1)Frequency!

example:
Code:
WORD,TOTFrequency,(1)Frequency,(0)Frequency,(-1)Frequency,(1%)Frequency,(0%)Frequency,(-1%)Frequency,NV
day,6832,3562,1117,2153,52.1%,16.3%,31.5%,0.24

Where NV = (3562-2153)/(3562+2153) = 1409/5715 ≈ 0.246 (shown as 0.24 above)
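
As a rough sketch, this NV column could be appended with a small awk filter after the script. It assumes "." as the decimal point and the column order shown in the header; add_nv is a hypothetical name. Rows where both counts are 0 get NV 0 to avoid a division by zero:

```shell
# add_nv: append NV = (n1 - nm1) / (n1 + nm1) to each data row from stdin.
# Columns: $3 = (1)Frequency, $5 = (-1)Frequency.
add_nv() {
    LC_ALL=C awk -F, '
    NR == 1 { print $0 ",NV"; next }             # extend the header
    {
        denom = $3 + $5
        nv = (denom ? ($3 - $5) / denom : 0)     # 0 when there is nothing to compare
        printf "%s,%.2f\n", $0, nv
    }'
}

# Intended use: ./your_script.sh | add_nv
printf '%s\n' 'WORD,TOT,(1),(0),(-1),(1%),(0%),(-1%)' 'day,6832,3562,1117,2153,52.1%,16.3%,31.5%' | add_nv
```

With %.2f the day row is rounded to 0.25 rather than truncated to 0.24.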

Thanks for your help and attention.