Shell scripting: frequency of specific word in a string and statistics

05-10-2013

Hello friends, I need some BIG help from the UNIX collective intelligence:

I have a CSV file like this:

Code:

VALUE,TIMESTAMP,TEXT
1,Sun May 05 16:13:05 +0000 2013,"RT @gracecheree: Praying God sends me a really great man one day. Gotta trust in his timing. 
0,Sun May 05 16:13:05 +0000 2013,@sendi__ we're seeing that on 25th x,azzeslam,Azhar :),sendi__,,,,,
-1,Sun May 05 16:13:05 +0000 2013,still BN. in BaganSerai,Time_Lock,Azrif Asmi,,,,,,
0,Sun May 05 16:13:07 +0000 2013,Can't trust NO bitch!,_SoSoftWilliams,Kenya ..,,,,,,
0,Sun May 05 16:13:07 +0000 2013," me, i'll take some.",_blasianBOMB,JohnnyRocket.,,,,,,
1,Sun May 05 16:13:07 +0000 2013,"she'll be okay,  dear @tweetsfrmleyka_",elisyax,-

Now, to get some statistical info, I'd like to extract specific words from the third field (TEXT) together with their values from the first field (VALUE), and then produce two CSV files.

CSV 1:

Search for words in the TEXT field, case-insensitively.
List the 50 most frequent words in decreasing order of frequency, with their values, as follows:

OUTPUT CSV 1:

Code:

WORD,TOTFrequency,(1)Frequency,(0)Frequency,(-1)Frequency,1%Frequency,0%Frequency,-1%Frequency
that,456,150,13,258,10%,40%,50%
really,345,212,115,100,52%,33%,15%
great,245,111,65,23,15%,15%,60%
day,123,55,25,32,20%,20%,60%

.............

Any tool is fine: shell, awk, Python, etc.

Many thanks for your BIG help in advance.

Moderator's Comments:

Please use code tags when posting data and code samples!

---------- Post updated 05-10-13 at 11:11 AM ---------- Previous update was 05-09-13 at 11:29 AM ----------

I found this very handy script from radoulov, and I guess it could be a good starting point:

Code:

awk 'END {
  print f ":"
    for (Z in z)
      printf "Total number of %s records = %d\n", \
      Z, z[Z]
    if (sc) {
      print "-----------------------------------"
      printf "Total number of Special records = %d\n", \
      sc  
      for (S in sa)
        printf "Total number of %s records = %d\n", \
        S, sa[S]
        }        
    print RS
    }
FNR == 1 {
  if (f) {
    print f ":"
    for (Z in z)
      printf "Total number of %s records = %d\n", \
      Z, z[Z]
    if (sc) {
      print "-----------------------------------"
      printf "Total number of Special records = %d\n", \
      sc
      for (S in sa)
        printf "Total number of %s records = %d\n", \
        S, sa[S]
        }        
    print RS
    split(x, z)
    split(x, sa)
    s = sc = 0
    }
    f = FILENAME
  }    
$3 ~ /^(PTR|MX|NS|CNAME|A)$/ && !s { z[$3]++ }
s && $2 == "IN" { sc++; sa[$3]++ }
/SPECIALS/ { s = 1 }' db*

Could someone help me adapt it?

Many thanks, friends!

Last edited by Scott; 05-10-2013 at 01:25 PM.. Reason: CODE tags, not ICODE tags, please.

kraterions

05-10-2013

I'd try a Perl solution to be honest:

Code:

perl -ne '
chomp;
# VALUE, TIMESTAMP, and the rest of the line as TEXT
@rec=split(/,/, $_, 3);
@words=split/\b\s*/,$rec[2];
# count each word, ignoring case
map {$counts{lc($_)}++ if /^\w+$/;}@words;
END{
  @wanted=qw(that really great day);
  # most frequent first
  for (sort {$counts{$b}<=>$counts{$a}} @wanted){
    print "$_ $counts{$_}\n";
  }
} ' tmp/tmp.dat

I'm heading out now, but you could extend the counts data structure into a hash per word, something like count{total => running total, appeared => number of records the word appeared in, 0 => incremented when $rec[0]==0, ...}, and that would allow you to produce the extended table you require.
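The nested structure described above could be sketched as follows. This is a minimal Python illustration rather than the Perl the post has in mind: the names `tally`, `new_entry` and `row`, the sample records, and the choice to compute each percentage as a share of the word's own total are all assumptions, not part of the original suggestion.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"\w+")  # runs of word characters, as in the Perl /^\w+$/ test

def new_entry():
    # One entry per word: overall total, number of records it appeared in,
    # and a per-VALUE breakdown keyed by "1", "0" and "-1".
    return {"total": 0, "appeared": 0, "by_value": defaultdict(int)}

def tally(records):
    """records: iterable of (value, text) pairs; returns {word: entry}."""
    counts = defaultdict(new_entry)
    for value, text in records:
        words = [w.lower() for w in WORD_RE.findall(text)]
        for w in words:
            counts[w]["total"] += 1
            counts[w]["by_value"][value] += 1
        for w in set(words):  # count each record only once per word
            counts[w]["appeared"] += 1
    return counts

def row(word, entry):
    """Format one line of the requested table; percentages are per word."""
    per_value = [entry["by_value"].get(v, 0) for v in ("1", "0", "-1")]
    total = max(entry["total"], 1)  # avoid dividing by zero for unseen words
    shares = ["%.0f%%" % (100.0 * n / total) for n in per_value]
    return ",".join([word, str(entry["total"])] + [str(n) for n in per_value] + shares)

# Hypothetical records, made up for illustration only.
sample = [
    ("1", "a really great day"),
    ("0", "Really? That day again"),
    ("-1", "not that day, not THAT one"),
]
counts = tally(sample)
for w in ("that", "really", "great", "day"):
    print(row(w, counts[w]))
```

With the made-up records above, the line printed for "that" is `that,3,0,1,2,0%,33%,67%`: three occurrences in total, none in a record with VALUE 1, one with VALUE 0 and two with VALUE -1.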

Last edited by Skrynesaver; 05-10-2013 at 01:46 PM.. Reason: added wanted array and how to approach the rest of the requirements

Skrynesaver

05-10-2013

Hello Skrynesaver, first of all, many thanks for your attention.

I ran your script and here is its output:

Code:

$ ./wfs.pl finaltest.csv > finaltest1.csv
Bareword found where operator expected at ./wfs.pl line 12, near "} ' tmp"
  (Might be a runaway multi-line '' string starting on line 2)
	(Missing operator before tmp?)
syntax error at ./wfs.pl line 2, near "-ne"

Then I wrote the path inside your script:

Code:

}print "$_ $counts{$_}\n";
 }
} '  /Desktop/finaltest1.csv

and the output was:

Code:

syntax error at ./wfs.pl line 2, near "-ne"
Execution of ./wfs.pl aborted due to compilation errors.

I don't know Perl and I can't improve your code.

Hope to hear from you soon!!!!

Have a nice night.

Last edited by Scott; 05-15-2013 at 07:27 PM.. Reason: Code tags

kraterions

05-11-2013

Hi Kraterions,

That wasn't a script but a command-line one-off, meant to be typed at the shell prompt rather than saved to a file. As a script it would look like this, with the suggested additions:

Code:

#!/usr/bin/perl
use strict;

open(my $tweets, $ARGV[0])|| die "Couldn't open $ARGV[0] $!\n";
my %counts;
my %freq;
my $tot;
while(<$tweets>){
        chomp;
        my @rec=split(/,/, $_, 3);
        my @words=split/\b\s*/,$rec[2];
        map {   $counts{lc($_)}++ if /^\w+$/;
                $freq{$rec[0]}{lc($_)}++ if /^\w+$/;
                $tot++}@words;
}
my @wanted=qw(that really great day);
print "WORD,TOTFrequency,(1)Frequency,(0)Frequency,(-1)Frequency,1%Frequency,0%Frequency,-1%Frequency\n";
for (sort {$counts{b}<=>$counts{a}} @wanted){
        print "$_,$counts{$_},",$freq{1}{$_}||0,",",$freq{0}{$_}||0,",",$freq{-1}{$_}||0,sprintf("%0.2f",($counts{$_}/$tot)*100),"%,",sprintf("%0.2f",($freq{0}{$_}/$tot)*100),"%,",sprintf("%0.2f",($freq{-1}{$_}/$tot)*100),"%\n";
}

Last edited by Skrynesaver; 05-11-2013 at 07:10 AM..

Skrynesaver

05-11-2013

Hi Skrynesaver,

many many thanks for your BIG HELP.

Impressive job, really!

I ran your script and here is its output:

Code:

WORD,TOTFrequency,(1)Frequency,(0)Frequency,(-1)Frequency,1%Frequency,0%Frequency,-1%Frequency
that,20,11,1,80.43%,0.02%,0.17%
great,1,1,0,00.02%,0.00%,0.00%
really,9,5,2,20.19%,0.04%,0.04%
day,1,1,0,00.02%,0.00%,0.00%

So, I'd like to ask you about 4 points:

1) As you can see, the field "(-1)Frequency" is not present.
2) The words column is not sorted in decreasing order.
3) Is it possible to set an option to search for words case-insensitively, like "grep -i", so that they match as follows:

Code:

my @wanted=qw(day) result= today days monday etc.......

4) Is it possible to search case-insensitively, like "grep -i", and list in decreasing order the 50 most frequent words, counting only words of at least 3 characters?
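Points 1 and 2 appear to come from the script as posted. It prints no "," between the (-1) count and the first percentage, so for "that" the count 8 and 0.43% run together as 80.43%. Its sort block compares `$counts{b}` and `$counts{a}` (the literal keys "b" and "a") instead of `$counts{$b}` and `$counts{$a}`, so the comparison never looks at the words being sorted. For points 3 and 4, here is a minimal Python sketch, assuming a word-to-total mapping has already been built; the helper names `top_words` and `matching` and the numbers are hypothetical.

```python
from collections import Counter

def top_words(counts, n=50, min_len=3):
    """Return the n most frequent words with at least min_len characters."""
    long_enough = Counter({w: c for w, c in counts.items() if len(w) >= min_len})
    return long_enough.most_common(n)

def matching(counts, pattern):
    """Sum the counts of every word containing pattern, ignoring case."""
    needle = pattern.lower()
    hits = {w: c for w, c in counts.items() if needle in w.lower()}
    return sum(hits.values()), sorted(hits)

# Hypothetical totals, made up for illustration only.
counts = Counter({"day": 5, "today": 3, "Monday": 2, "days": 1, "to": 9, "that": 7})

print(top_words(counts, n=3))   # "to" is skipped because it is shorter than 3 characters
print(matching(counts, "day"))  # substring match, like grep -i
```

Here `matching(counts, "day")` adds up day, today, Monday and days, giving 11 for the made-up totals. Whether such substring hits should be merged into one row or listed separately is a design choice the thread leaves open.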

Again, many many thanks for your attention and for your BIG HELP.

Hope to hear from you soon!!!!

Have a good time.

Last edited by kraterions; 05-12-2013 at 02:55 AM..

kraterions

05-14-2013

News?

kraterions

05-15-2013

You could also try an awk solution:

Code:

awk -F, '
NR>1 {
   split(tolower(substr($0, length($1","$2",")+1)), words, "[^A-Za-z_\047]")
   for(wnum in words) {
       w=words[wnum]
       if(length(w)>0) {
          counts[w]=counts[w]+1
          freq[$1,w]++
       }
    }
}
END {
  OFS=","
  print "WORD,TOTFrequency,(1)Frequency,(0)Frequency,"\
        "(-1)Frequency,1%Frequency,0%Frequency,-1%Frequency"
  for (w in counts) printf ("%s,%d,%d,%d,%d,%0.1f%%,%0.1f%%,%0.1f%%\n",
         w, counts[w],
         freq[1,w]+0, freq[0,w]+0, freq[-1,w]+0,
         freq[1,w]*100/counts[w],
         freq[0,w]*100/counts[w],
         freq[-1,w]*100/counts[w]) | "sort -t, -k2,2nr -k1,1"
}
' infile
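Both the Perl and the awk approach treat everything after the second comma as TEXT, so any trailing columns in the sample rows (what look like user names) are counted as words too. If only the third column should be searched, Python's standard `csv` module is one way to isolate it, since it honours the double quotes around text that contains commas. This is a hedged sketch: the `text_words` helper and the in-memory sample are assumptions, and the word pattern only approximates the awk split above.

```python
import csv
import io
import re

WORD_RE = re.compile(r"[a-z_']+")  # roughly the characters the awk split keeps

def text_words(csv_file):
    """Yield (value, words) for each data row, using only the TEXT column."""
    reader = csv.reader(csv_file)
    next(reader, None)                 # skip the VALUE,TIMESTAMP,TEXT header
    for fields in reader:
        if len(fields) < 3:
            continue                   # ignore short or blank rows
        value, text = fields[0], fields[2]
        yield value, WORD_RE.findall(text.lower())

# Two rows taken from the sample in the first post.
sample = io.StringIO(
    'VALUE,TIMESTAMP,TEXT\n'
    '1,Sun May 05 16:13:07 +0000 2013,"she\'ll be okay,  dear @tweetsfrmleyka_",elisyax,-\n'
    '-1,Sun May 05 16:13:05 +0000 2013,still BN. in BaganSerai,Time_Lock,Azrif Asmi,,,,,,\n'
)
for value, words in text_words(sample):
    print(value, words)
```

For the second row this yields only still, bn, in and baganserai; the trailing Time_Lock and Azrif Asmi columns are left out of the count.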

Chubler_XL
