Shell scripting: frequency of specific word in a string and statistics
Hello friends, I need some BIG help from the UNIX collective intelligence.
I have a CSV file like this:
Code:
VALUE,TIMESTAMP,TEXT
1,Sun May 05 16:13:05 +0000 2013,"RT @gracecheree: Praying God sends me a really great man one day. Gotta trust in his timing.
0,Sun May 05 16:13:05 +0000 2013,@sendi__ we're seeing that on 25th x,azzeslam,Azhar :),sendi__,,,,,
-1,Sun May 05 16:13:05 +0000 2013,still BN. in BaganSerai,Time_Lock,Azrif Asmi,,,,,,
0,Sun May 05 16:13:07 +0000 2013,Can't trust NO bitch!,_SoSoftWilliams,Kenya .. ♡,,,,,,
0,Sun May 05 16:13:07 +0000 2013," me, i'll take some. 💂",_blasianBOMB,JohnnyRocket.,,,,,,
1,Sun May 05 16:13:07 +0000 2013,"she'll be okay, dear @tweetsfrmleyka_",elisyax,-
Now, to get some statistics, I'd like to extract specific words from the TEXT field, together with their values from the first field (VALUE), and then produce 2 CSV files:
CSV 1:
Words to search for in field TEXT, case insensitive.
Search and list, in decreasing order of frequency, the first 50 words and their values, as follows:
CSV 2:
Words to search for in field TEXT, case insensitive, where | means "or" (that|really|great|day|.......).
Search and list, in decreasing order of frequency, the specific words and their values, as follows:
Please use code tags when posting data and code samples!
---------- Post updated 05-10-13 at 11:11 AM ---------- Previous update was 05-09-13 at 11:29 AM ----------
I found this very handy script from radoulov, and I guess it could be a good starting point:
Code:
awk 'END {
  print f ":"
  for (Z in z)
    printf "Total number of %s records = %d\n", \
      Z, z[Z]
  if (sc) {
    print "-----------------------------------"
    printf "Total number of Special records = %d\n", \
      sc
    for (S in sa)
      printf "Total number of %s records = %d\n", \
        S, sa[S]
  }
  print RS
}
FNR == 1 {
  if (f) {
    print f ":"
    for (Z in z)
      printf "Total number of %s records = %d\n", \
        Z, z[Z]
    if (sc) {
      print "-----------------------------------"
      printf "Total number of Special records = %d\n", \
        sc
      for (S in sa)
        printf "Total number of %s records = %d\n", \
          S, sa[S]
    }
    print RS
    split(x, z)
    split(x, sa)
    s = sc = 0
  }
  f = FILENAME
}
$3 ~ /^(PTR|MX|NS|CNAME|A)$/ && !s { z[$3]++ }
s && $2 == "IN" { sc++; sa[$3]++ }
/SPECIALS/ { s = 1 }' db*
Could someone help me adapt it?
Many thanks, friends!
Last edited by Scott; 05-10-2013 at 01:25 PM..
Reason: CODE tags, not ICODE tags, please.
Code:
perl -ne '
  chomp;
  @rec=split(/,/, $_, 3);
  @words=split/\b\s*/,$rec[2];
  map {$counts{lc($_)}++ if /^\w+$/;}@words;
  END{
    @wanted=qw(that really great day);
    for (sort {$counts{b}<=>$counts{a}} @wanted){
      print "$_ $counts{$_}\n";
    }
  } ' tmp/tmp.dat
I'm heading out now, but you could extend the counts data structure to count{total=>${TOTAL COUNTS TO DATE}, appeared=>{++ for each record it appeared in},0=>${+1 if $rec[0]==0}...}, and that would allow you to produce the extended table you require.
Last edited by Skrynesaver; 05-10-2013 at 01:46 PM..
Reason: added wanted array and how to approach the rest of the requirements
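A rough awk sketch of that extended structure, for anyone who prefers to stay in the shell: for each wanted word it keeps the total occurrences, the number of records the word appeared in, and one counter per VALUE. Assumptions on my side, not taken from the posts above: the function name word_stats, the output column order, the header line being skipped, and TEXT being everything after the second comma (same idea as the split limit of 3 in the one-liner), so trailing columns are counted too.

```shell
#!/bin/sh
# Sketch only: word_stats is an illustrative name, not part of the thread.

# word_stats FILE WORD...
# Prints one line per WORD: word,total,appeared,count(1),count(0),count(-1)
word_stats() {
    file=$1; shift
    awk -v wanted="$*" '
    BEGIN {
        # wanted words, lowercased so matching is case insensitive
        n = split(tolower(wanted), list, " ")
        for (i = 1; i <= n; i++) want[list[i]] = 1
    }
    NR == 1 { next }                        # skip the VALUE,TIMESTAMP,TEXT header
    {
        val = $0; sub(/,.*/, "", val)       # first field: VALUE
        text = tolower($0)
        sub(/^[^,]*,[^,]*,/, "", text)      # drop VALUE and TIMESTAMP
        gsub(/[^a-z0-9_]+/, " ", text)      # everything else separates words
        m = split(text, w, " ")
        split("", seen)                     # reset the per-record marker
        for (i = 1; i <= m; i++) {
            if (!(w[i] in want)) continue
            total[w[i]]++                   # every occurrence
            byval[w[i], val]++              # occurrences per VALUE
            if (!(w[i] in seen)) {          # first hit in this record
                seen[w[i]] = 1
                appeared[w[i]]++
            }
        }
    }
    END {
        for (i = 1; i <= n; i++) {
            k = list[i]
            printf "%s,%d,%d,%d,%d,%d\n", k, total[k], appeared[k], \
                byval[k, 1], byval[k, 0], byval[k, -1]
        }
    }' "$file" | sort -t, -k2,2nr -k1,1
}
```

Called as `word_stats finaltest.csv that really great day`, it prints one line per wanted word, highest total first. Percentages are left out on purpose, since it is not clear from the thread what the denominator should be.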
Hello Skrynesaver, first of all many thanks for your attention,
I ran your script and here is its output:
Code:
$ ./wfs.pl finaltest.csv > finaltest1.csv
Bareword found where operator expected at ./wfs.pl line 12, near "} ' tmp"
(Might be a runaway multi-line '' string starting on line 2)
(Missing operator before tmp?)
syntax error at ./wfs.pl line 2, near "-ne"
That wasn't a script but a command-line one-off. As a script it would look like this, with the suggested additions:
Code:
#!/usr/bin/perl
use strict;
open(my $tweets, $ARGV[0])|| die "Couldn't open $ARGV[0] $!\n";
my %counts;
my %freq;
my $tot;
while(<$tweets>){
    chomp;
    my @rec=split(/,/, $_, 3);
    my @words=split/\b\s*/,$rec[2];
    map { $counts{lc($_)}++ if /^\w+$/;
          $freq{$rec[0]}{lc($_)}++ if /^\w+$/;
          $tot++}@words;
}
my @wanted=qw(that really great day);
print "WORD,TOTFrequency,(1)Frequency,(0)Frequency,(-1)Frequency,1%Frequency,0%Frequency,-1%Frequency\n";
for (sort {$counts{b}<=>$counts{a}} @wanted){
    print "$_,$counts{$_},",$freq{1}{$_}||0,",",$freq{0}{$_}||0,",",$freq{-1}{$_}||0,sprintf("%0.2f",($counts{$_}/$tot)*100),"%,",sprintf("%0.2f",($freq{0}{$_}/$tot)*100),"%,",sprintf("%0.2f",($freq{-1}{$_}/$tot)*100),"%\n";
}
Last edited by Skrynesaver; 05-11-2013 at 07:10 AM..
1) As you can see, the field "(-1)Frequency" is not present.
2) The words column is not sorted in decreasing order.
3) Is it possible to set an option to search for words case-insensitively, like grep -i, so that they match as follows:
Code:
my @wanted=qw(day) result= today days monday etc.......
4) Is it possible to search case-insensitively, like grep -i, and list in decreasing order the 50 most frequent words, counting only words of 3 or more characters?
Again, many many thanks for your attention and for your BIG HELP.
Hope to hear from you soon!!!!
Have a good time.
Last edited by kraterions; 05-12-2013 at 02:55 AM..
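On points 1 and 2, two things in the script as posted are worth checking. The sort block compares `$counts{b}` with `$counts{a}` (the literal keys "b" and "a") rather than `$counts{$b}` with `$counts{$a}`, so the order never changes. The print list also has no "," between `$freq{-1}{$_}||0` and the first sprintf, so the (-1) count runs into the following percentage and one column seems to be missing.

For points 3 and 4, here is a minimal awk sketch rather than a change to the Perl script. The function names (word_counts, top_words, match_words) and the output layout are my own choices. It assumes the header line is skipped, that TEXT is everything after the second comma (so trailing columns are counted too), and that a word is a run of letters, digits, or underscores.

```shell
#!/bin/sh
# Sketch only: the function names below are illustrative.

# word_counts FILE [MINLEN]
# Prints word,total,count(1),count(0),count(-1) for every word of at least
# MINLEN characters (default 1), most frequent first.
word_counts() {
    awk -v minlen="${2:-1}" '
    NR == 1 { next }                        # skip the VALUE,TIMESTAMP,TEXT header
    {
        val = $0; sub(/,.*/, "", val)       # first field: VALUE
        text = tolower($0)                  # case-insensitive counting
        sub(/^[^,]*,[^,]*,/, "", text)      # drop VALUE and TIMESTAMP
        gsub(/[^a-z0-9_]+/, " ", text)      # everything else separates words
        n = split(text, w, " ")
        for (i = 1; i <= n; i++)
            if (length(w[i]) >= minlen) {
                tot[w[i]]++
                cnt[w[i], val]++
            }
    }
    END {
        for (k in tot)
            printf "%s,%d,%d,%d,%d\n", k, tot[k], cnt[k, 1], cnt[k, 0], cnt[k, -1]
    }' "$1" | sort -t, -k2,2nr -k1,1
}

# top_words FILE [N] [MINLEN]
# Point 4: the N most frequent words (default 50) of MINLEN+ characters (default 3).
top_words() {
    word_counts "$1" "${3:-3}" | head -n "${2:-50}"
}

# match_words FILE REGEX
# Point 3: keep every word that contains REGEX, so "day" also matches
# "today" and "monday". The regex is lowercased to match the lowercased words.
match_words() {
    word_counts "$1" | awk -F, -v re="$2" 'BEGIN { re = tolower(re) } $1 ~ re'
}
```

For example, `top_words finaltest.csv 50 3 > csv1.csv` and `match_words finaltest.csv 'that|really|great|day' > csv2.csv` (the output file names are placeholders). Sorting is delegated to sort(1) so the sketch does not depend on gawk-only features.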