Shell scripting: frequency of specific word in a string and statistics

05-15-2013

Hi Chubler_XL,

Many thanks for your help.

Here is the output of your script:

Code:

 WORD,TOTFrequency,(1)Frequency,(0)Frequency,(-1)Frequency,(1%)Frequency,(0%)Frequency,(-1%)Frequency
have,38,10,0,28,26,3%,0,0%,73,7%
know,31,3,3,25,9,7%,9,7%,80,6%
please,30,6,0,24,20,0%,0,0%,80,0%
life,29,3,0,26,10,3%,0,0%,89,7%
with,26,13,3,10,50,0%,11,5%,38,5%
paye,25,1,0,24,4,0%,0,0%,96,0%
shit,25,0,0,25,0,0%,0,0%,100,0%
having,24,0,0,24,0,0%,0,0%,100,0%
little,24,0,0,24,0,0%,0,0%,100,0%

very nice!

I don't know why, but the percentages use a comma instead of a point as the decimal separator:
(9,7%,9,7%,80,6%) instead of (9.7%,9.7%,80.6%)

How can I limit the results to the 50 most frequent words?

How can I add an option to search for specific words case-insensitively, like "grep -i" (that|really|great|day|.......)?

Code:

my @wanted=qw(day);   # result: today, days, monday, etc.......

Many thanks for your attention and BIG HELP!

kraterions

05-15-2013

The commas used as decimal points are most likely caused by your locale.

Unfortunately, different Unix flavors set this in different ways. From the shell, try:

Code:

$ LC_ALL=en_US.UTF-8 ; export LC_ALL
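If changing the locale for the whole session is not wanted, the setting can be limited to a single command. This is only a minimal sketch: it assumes the comma really does come from the locale and that the C locale is acceptable for the run. `fmt_pct` is a hypothetical helper used purely for illustration.

```shell
# Hypothetical helper: format a number as a percentage with one decimal,
# forcing the C locale so the decimal separator is always a point.
fmt_pct() {
    LC_ALL=C awk -v n="$1" 'BEGIN { printf "%.1f%%\n", n }'
}

# The same idea applied to the whole script, for one invocation only:
#   LC_ALL=C ./your_script.sh
```

Calling `fmt_pct 9.7` prints the value with a point, whatever locale the surrounding shell uses.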

The best way to search for particular entries, or to limit the results to the first 2, is to use other Unix tools like this:

Code:

bash $ ./your_script.sh | grep -E "^(WORD|that|really),"
WORD,TOTFrequency,(1)Frequency,(0)Frequency,(-1)Frequency,(1%)Frequency,(0%)Frequency,(-1%)Frequency
really,1,1,0,0,0.0%,100.0%,0.0%
that,1,0,1,0,100.0%,0.0%,0.0%

Code:

bash $ ./your_script.sh | head -3
WORD,TOTFrequency,(1)Frequency,(0)Frequency,(-1)Frequency,(1%)Frequency,(0%)Frequency,(-1%)Frequency
have,38,10,0,28,26.3%,0.0%,73.7%
know,31,3,3,25,9.7%,9.7%,80.6%
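For the 50 most frequent words the same approach should work with a larger count, since the header takes up one extra line. A minimal sketch, assuming the script's output is already sorted by total frequency; `top_n` is a hypothetical helper name.

```shell
# Hypothetical helper: keep the header line plus the first N data rows.
# Assumes the input is already sorted by descending total frequency.
top_n() {
    head -n "$(( $1 + 1 ))"
}

# Usage:
#   ./your_script.sh | top_n 50
```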

Chubler_XL

06-13-2013

Hi Chubler_XL

Many thanks again for your help.

I have two questions:

1)
When I use grep to get the values for specific words, I lose a lot of information about sub-patterns.

I'd like to know how to get values for patterns (ignoring case) instead of values for exact words.

Example:

Code:

./your_script.sh | grep -E "^(day|real),"
WORD,TOTFrequency,(1)Frequency,(0)Frequency,(-1)Frequency,(1%)Frequency,(0%)Frequency,(-1%)Frequency
real,1,1,0,0,0.0%,100.0%,0.0%
day,1,0,1,0,100.0%,0.0%,0.0%

where the real line includes the values for all matching terms: real, really, reality, etc.
and the day line includes the values for all matching terms: day, today, everyday, etc.

2)
How can I get an additional column with a normalized value between 0 and 1,
where a value of -1 maps to 0, 0 maps to 0.5 and 1 maps to 1?

Example:

Code:

./your_script.sh | grep -E "^(day|real),"
WORD,TOTFrequency,(1)Frequency,(0)Frequency,(-1)Frequency,(1%)Frequency,(0%)Frequency,(-1%)Frequency,NV
real,1,1,0,0,0.0%,100.0%,0.0%,1
day,1,0,1,0,100.0%,0.0%,0.0%,0.5

Many thanks for your attention and big help.

Hope to hear from you soon.

kraterions

06-13-2013

1) In the expression "^(day|real)," the ^ symbol matches the beginning of the line, i.e. day or real must be at the start of the line and be followed by a comma. To match the heading line or any sub-string containing day or real, simply use "(^WORD|day|real)"

2) you can use awk like this:

Code:

./your_script.sh | awk -F, 'NR==1 || ($5==0 && $4==0.5 && $3==1)'

This prints a line if
the row number is 1 (header row) OR
field #5 (-1 value) is 0 AND field #4 (0 value) is 0.5 AND field #3 (1 value) is 1
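The sub-string pattern from 1) can also be made case-insensitive and restricted to the first field, so that only the word column is tested. This is a hedged sketch rather than part of the original script; `match_words` is a hypothetical helper name.

```shell
# Hypothetical helper: keep the header plus every row whose first field
# (the word) contains one of the given patterns, ignoring case.
# $1 is an extended-regex alternation such as 'day|real'.
match_words() {
    grep -i -E "^(WORD|[^,]*($1)[^,]*),"
}

# Usage:
#   ./your_script.sh | match_words 'day|real'
```

The `[^,]*` parts let the pattern sit anywhere inside the word while keeping the match from spilling into the numeric columns.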

Chubler_XL

06-13-2013

Hi Chubler_XL,

1)
This way I get:

Code:

day,574266,0,0,0,0,0%,0,0%,0,0%
today,77679,0,0,0,0,0%,0,0%,0,0%
everyday,40810,0,0,0,0,0%,0,0%,0,0%
real,77679,0,0,0,0,0%,0,0%,0,0%
really,40810,0,0,0,0,0%,0,0%,0,0%
reality,20082,0,0,0,0,0%,0,0%,0,0%

I'd like to get all matches for each grepped pattern on one line, with aggregated values!

2)
In the second case I'd like to calculate an additional value (a kind of average of the values 1, 0, -1) in order to understand the behaviour of each word (more positive or more negative).

Example:

Code:

./your_script.sh | grep -E "^(day|real),"
WORD,TOTFrequency,(1)Frequency,(0)Frequency,(-1)Frequency,(1%)Frequency,(0%)Frequency,(-1%)Frequency,NV
real,1,1,0,0,0.0%,100.0%,0.0%,1

Where NV = [(1)Frequency - (-1)Frequency] / [(1)Frequency + (-1)Frequency]

In this case the index will be between -1 and 1.

Do you have some suggestions?

Many thanks for your attention and big help!

kraterions

06-13-2013

1) This is getting a little long for the command line and should probably be put in another .sh script but:

Code:

./your_script.sh | awk -F, -vW="day,real" '
BEGIN { split(W,w,",") ; for(i in w) N[w[i]]}
{ for(word in N) if ($1 ~ word) for(i=2;i<=NF;i++) v[word,i]+=$i }
END {
   for(word in N) {
       printf "%s", word
       for(i=2; word SUBSEP i in v; i++)
           printf ",%s", v[word,i] ((i==7||i==9||i==11)?"%":"")
       printf "\n"
   }
}'

2) I think you are asking for this:

Code:

./your_script.sh | awk -F, '
  NR==1 { print $0 ",NV" }
  $5==0 && $4==0.5 && $3==1 { print $0","$3-$4/($3+$5) }'

However, won't NV always be 0.5 [i.e. 1 - 0.5/(1+0)]?
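An alternative way to aggregate per pattern is to sum only the count columns and recompute the percentages from the sums, since adding percentages together gives meaningless totals. The sketch below is hedged: it assumes point-decimal output with the eight columns shown in the header, and `aggregate_words` is a hypothetical helper name. It passes the variable as `-v W=...` with a space, because some awk implementations reject `-vW=...` written as one word.

```shell
# Hypothetical helper: sum the count columns of every row whose word contains
# a pattern (ignoring case), then recompute the percentages from the sums.
# $1 is a comma-separated pattern list such as "day,real".
aggregate_words() {
    LC_ALL=C awk -F, -v W="$1" '
    BEGIN { n = split(tolower(W), pat, ",") }
    NR == 1 { print; next }                 # pass the header through
    {
        word = tolower($1)
        for (p = 1; p <= n; p++)
            if (index(word, pat[p]))        # plain substring test
                for (i = 2; i <= 5; i++) sum[p, i] += $i
    }
    END {
        for (p = 1; p <= n; p++) {
            tot = sum[p, 2]
            if (tot == 0) continue          # pattern never matched
            printf "%s,%d,%d,%d,%d,%.1f%%,%.1f%%,%.1f%%\n", pat[p],
                tot, sum[p, 3], sum[p, 4], sum[p, 5],
                sum[p, 3] * 100 / tot, sum[p, 4] * 100 / tot, sum[p, 5] * 100 / tot
        }
    }'
}

# Usage:
#   ./your_script.sh | aggregate_words "day,real"
```

A word that contains more than one pattern is counted once under each of them.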

Chubler_XL

06-14-2013

Hi Chubler_XL

thanks for your help:

1) doesn't work; I get this error:

Code:

awk: invalid -v option

2) doesn't work either; I get a file with nothing but the header:

Code:

WORD,TOTFrequency,(1)Frequency,(0)Frequency,(-1)Frequency,(1%)Frequency,(0%)Frequency,(-1%)Frequency,NV

Anyway, I guess we have to go back to your first version:

Code:

#!/bin/bash
awk -F, '
NR>1 {
   split(tolower(substr($0, length($1","$2",")+1)), words, "[^A-Za-z_\047]")
   for(wnum in words) {
       w=words[wnum]
       if(length(w)>=4) {
          counts[w]=counts[w]+1
          freq[$1,w]++
       }
    }
}
END {
  OFS=","
  print "WORD,TOTFrequency,(1)Frequency,(0)Frequency,"\
        "(-1)Frequency,(1%)Frequency,(0%)Frequency,(-1%)Frequency"
  for (w in counts) printf ("%s,%d,%d,%d,%d,%0.1f%%,%0.1f%%,%0.1f%%\n",
         w, counts[w],
         freq[1,w]+0, freq[0,w]+0, freq[-1,w]+0,
         freq[1,w]*100/counts[w],
         freq[0,w]*100/counts[w],
         freq[-1,w]*100/counts[w]) | "sort -t, -k2,2nr -k1,1"
}
' file

From here, can you implement all the features in one .sh script?

About NV: its value depends on the values in the (1)Frequency and (-1)Frequency columns!

example:

Code:

WORD,TOTFrequency,(1)Frequency,(0)Frequency,(-1)Frequency,(1%)Frequency,(0%)Frequency,(-1%)Frequency,NV
day,6832,3562,1117,2153,52.1%,16.3%,31.5%,0.24

Where NV=(3562-2153)/(3562+2153)=0.2465 (truncated to 0.24 above)
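One possible way to append that index to every row is a small awk filter over the script's output. This is only a sketch: `add_nv` is a hypothetical helper name, and printing `NA` when a word has neither positive nor negative hits is an assumption, not something specified in the thread. With rounding to two decimals, the example row gives 0.25 rather than the truncated 0.24.

```shell
# Hypothetical helper: append NV = (pos - neg) / (pos + neg) to every row,
# where pos is column 3, (1)Frequency, and neg is column 5, (-1)Frequency.
add_nv() {
    LC_ALL=C awk -F, '
    NR == 1 { print $0 ",NV"; next }        # extend the header
    {
        d = $3 + $5
        if (d == 0) print $0 ",NA"          # assumption: no hits gives NA
        else printf "%s,%.2f\n", $0, ($3 - $5) / d
    }'
}

# Usage:
#   ./your_script.sh | add_nv
```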

Thanks for your help and attention.

kraterions
