Shell scripting: frequency of specific word in a string and statistics


 
# 15  
Old 06-14-2013
1) Try a space between -v and W= (some awks require it).

2) I think I misunderstood your requirement. Try:
Code:
#!/bin/bash
awk -F, '
NR>1 {
   split(tolower(substr($0, length($1","$2",")+1)), words, "[^A-Za-z\047]")
   for(wnum in words) {
       w=words[wnum]
       if(length(w)>=4) {
          counts[w]=counts[w]+1
          freq[$1,w]++
       }
    }
}
END {
  OFS=","
  print "WORD,TOTFrequency,(1)Frequency,(0)Frequency,"\
        "(-1)Frequency,(1%)Frequency,(0%)Frequency,(-1%)Frequency,NV"
  for (w in counts) printf ("%s,%d,%d,%d,%d,%0.1f%%,%0.1f%%,%0.1f%%,%0.2f\n",
         w, counts[w],
         freq[1,w]+0, freq[0,w]+0, freq[-1,w]+0,
         freq[1,w]*100/counts[w],
         freq[0,w]*100/counts[w],
         freq[-1,w]*100/counts[w],
         (0+freq[1,w]+freq[-1,w])?
         freq[1,w]-freq[0,1]/(freq[1,w]+freq[-1,w]):0) | "sort -t, -k2,2nr -k1,1"
}
'  infile

# 16  
Old 06-16-2013
Hi Chubler_XL,

Many thanks for your help.

This is exactly what I need, but there is an error:


Code:
awk: syntax error at source line 23
 context is
	         (0+freq[1,w]+freq[-1,w])? >>> 
 <<<          freq[1,w]-freq[0,1]/(freq[1,w]+freq[-1,w]):0) | "sort -t, -k2,2nr -k1,1"
awk: illegal statement at source line 24

Many thanks for all your help.
# 17  
Old 06-16-2013
Works OK here. It could be the newline; try joining lines 23 and 24 into one, like this:

Code:
(0+freq[1,w]+freq[-1,w])?freq[1,w]-freq[0,1]/(freq[1,w]+freq[-1,w]):0) | "sort -t, -k2,2nr -k1,1"

or put a backslash at the end of line 23, like this:
Code:
(0+freq[1,w]+freq[-1,w])?\
freq[1,w]-freq[0,1]/(freq[1,w]+freq[-1,w]):0) | "sort -t, -k2,2nr -k1,1"
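
The backslash form can be tried on its own with a small sketch like the one below. The function name continuation_demo and the variables total and ratio are made up for illustration and are not part of the script above. The point is only that a backslash at the very end of a line lets an expression continue on the next line, which matters for awks that reject a bare newline after the ? of a ternary.

```shell
#!/bin/bash
# Hypothetical demo: continue a ternary expression across two lines.
continuation_demo() {
  awk 'BEGIN {
     total = 4
     # The trailing backslash joins this line with the next one,
     # so the newline after "?" is not seen as the end of the statement.
     ratio = (total) ? \
             10 / total : 0
     print ratio
  }'
}

continuation_demo
```

Note that nothing, not even a space, may follow the backslash on that line.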

# 18  
Old 06-16-2013
Hi Chubler_XL,

you are great!!!

One last thing: how can I set this script:

Code:
#!/bin/bash
awk -F, -v W="day,really" '
BEGIN { split(W,w,",") ; for(i in w) N[w[i]]}
{ for(word in N) if ($1 ~ word) for(i=2;i<=NF;i++) v[word,i]+=$i }
END {
   for(word in N) {
       printf "%s", word
       for(i=2; word SUBSEP i in v; i++)
           printf ",%s", v[word,i] ((i==7||i==9||i==11)?"%":"")
       printf "\n"
   }
}'

to count just the first occurrence in each analyzed line?

There are some duplicates in the final values!

It's possible for the same word to appear several times in the same line, and I need to count it only once!


Many thanks for your big help!!
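
One possible way to count a word only once per line, offered as a sketch and not as a reply from the thread: keep a per-line set of words already seen and skip repeats. The function name count_once_per_line and the array name seen are assumptions made for illustration. The sketch also assumes the same layout as the word-counting script earlier in the thread (two leading comma-separated fields, then the text) and an input with no header line; add NR>1 in front of the main block if the file has one.

```shell
#!/bin/bash
# Hypothetical sketch: count each word at most once per input line.
count_once_per_line() {
  awk -F, '
  {
     # The text is everything after the first two comma-separated fields.
     n = split(tolower(substr($0, length($1 "," $2 ",") + 1)), words, "[^a-z\047]+")
     split("", seen)                  # empty the per-line set (portable idiom)
     for (i = 1; i <= n; i++) {
        w = words[i]
        if (length(w) >= 4 && !(w in seen)) {
           seen[w] = 1                # remember the word for this line only
           counts[w]++
        }
     }
  }
  END { for (w in counts) print w "," counts[w] }
  ' "$@" | sort -t, -k2,2nr -k1,1
}

# Usage with inline sample data: "good" appears twice in the first line
# but is counted once there.
printf '%s\n' '1,Mon,good day good day' '0,Tue,really good' | count_once_per_line
```

The same seen test could guard the freq[$1,w]++ line in the full script, so that the per-class counts are also incremented once per line.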
# 19  
Old 07-02-2013
Hi Chubler_XL,

I'm using your script.

Input file:

Code:
1,Tue Jul 02 15:14:55 +0000 2013,"RT @IamReallTyga: Our generation is fucked up when it comes to 
0,Tue Jul 02 15:14:56 +0000 2013,"If you're ever thinking about ordering insanity
0,Tue Jul 02 15:14:57 +0000 2013,Busy morning as Mayor meeting Holywood traders &amp; visiting Holywood Family Young person recommended sell chain on eBay when told value!
0,Tue Jul 02 15:14:57 +0000 2013,Panama Private Foundation | A stronger way to protect your assets. A combination of an Offshore Company &amp;t http://t.co/S9JJ4c64co
0,Tue Jul 02 15:14:58 +0000 2013,"I just might you w my drugs
0,Tue Jul 02 15:14:58 +0000 2013,#ICantLiveWithout its :D
0,Tue Jul 02 15:14:59 +0000 2013,"You can play hoes
1,Tue Jul 02 15:15:01 +0000 2013,"RT @Kev2Player: Our generation is fucked up when it comes t
1,Tue Jul 02 15:15:06 +0000 2013,"We you'll find SmartMoney
0,Tue Jul 02 15:15:07 +0000 2013,Gettin money now that

Your code is used as follows:

Code:
./script.sh | head -100 > outfile

Code:
#!/bin/bash
awk -F, '
NR>1 {
   split(tolower(substr($0, length($1","$2",")+1)), words, "[^A-Za-z\047]")
   for(wnum in words) {
       w=words[wnum]
       if(length(w)>=4) {
          counts[w]=counts[w]+1
          freq[$1,w]++
       }
    }
}
END {
  OFS=","
  print "WORD,TOTFrequency,(1)Frequency,(0)Frequency,"\
        "(-1)Frequency,(1%)Frequency,(0%)Frequency,(-1%)Frequency,NV"
  for (w in counts) printf ("%s,%d,%d,%d,%d,%0.1f%%,%0.1f%%,%0.1f%%,%0.2f\n",
         w, counts[w],
         freq[1,w]+0, freq[0,w]+0, freq[-1,w]+0,
         freq[1,w]*100/counts[w],
         freq[0,w]*100/counts[w],
         freq[-1,w]*100/counts[w],
         (0+freq[1,w]+freq[-1,w])?\
         freq[1,w]-freq[0,1]/(freq[1,w]+freq[-1,w]):0) | "sort -t, -k2,2nr -k1,1"
}
'  infile

Here is its output:

Code:
WORD,TOTFrequency,(1)Frequency,(0)Frequency,(-1)Frequency,(1%)Frequency,(0%)Frequency,(-1%)Frequency,NV
when,525,517,6,2,98.5%,1.1%,0.4%,517.00
comes,516,516,0,0,100.0%,0.0%,0.0%,516.00
generation,516,516,0,0,100.0%,0.0%,0.0%,516.00
fucked,502,502,0,0,100.0%,0.0%,0.0%,502.00
yepitstrey,451,451,0,0,100.0%,0.0%,0.0%,451.00
money,27,4,19,4,14.8%,70.4%,14.8%,4.00
http,25,0,25,0,0.0%,100.0%,0.0%,0.00
your,22,4,18,0,18.2%,81.8%,0.0%,4.00
iamrealltyga,20,20,0,0,100.0%,0.0%,0.0%,20.00

The NV value should be a number between -1 and 1.

This seems strange. Do you have any ideas?

Thanks for your attention.

Last edited by kraterions; 07-02-2013 at 01:14 PM..
# 20  
Old 07-02-2013
Looks like we are missing some brackets, and freq[0,1] should read freq[-1,w]. Replace:
Code:
freq[1,w]-freq[0,1]/(freq[1,w]+freq[-1,w]):0) | "sort -t, -k2,2nr -k1,1"

with

Code:
(freq[1,w]-freq[-1,w])/(freq[1,w]+freq[-1,w]):0) | "sort -t, -k2,2nr -k1,1"
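
As a quick sanity check, here is a small sketch that assumes NV is meant to be the count for class 1 minus the count for class -1, divided by their sum. That is an interpretation based on the stated -1 to 1 range; the thread does not spell the formula out. The function name nv_check and the three-column input (word, count for 1, count for -1) are made up for illustration, with counts taken from the output posted above.

```shell
#!/bin/bash
# Hypothetical check of NV = (pos - neg) / (pos + neg), with 0 when pos + neg is 0.
nv_check() {
  awk -F, '{
     pos = $2; neg = $3
     # Guard against division by zero when the word never appears in class 1 or -1.
     nv = (pos + neg) ? (pos - neg) / (pos + neg) : 0
     printf "%s,%0.2f\n", $1, nv
  }' "$@"
}

# word,(1)Frequency,(-1)Frequency
printf '%s\n' 'when,517,2' 'money,4,4' 'http,0,0' | nv_check
```

With these counts the sketch prints 0.99 for when and 0.00 for money and http, so every value stays between -1 and 1.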
