I did not run the code, only skimmed it, but it seems to me that if a word that meets the frequency threshold does not occur in a line, the printf statement will not print a 0. I'm assuming a 0 would be desirable as opposed to an empty string. Perhaps a "+0" or a format string with a numeric conversion specifier would be in order?
For example:
Regards,
Alister
Well spotted - my test data didn't have a line with zero count, fixed below.
It also matches words regardless of case and removes common punctuation (e.g. commas, full stops, semicolons, colons, brackets, etc.):
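Chubler_XL's actual code isn't reproduced here, but a minimal sketch of that kind of normalization (lower-case everything with tolower() and strip punctuation before counting; the sample input and variable names are my own) could look like:

```shell
# Sketch only: fold case and remove common punctuation so that
# "Hello," and "hello" count as the same word.
echo 'Hello, hello; (WORLD). world:' | awk '
{
    $0 = tolower($0)             # case-insensitive matching
    gsub(/[.,;:()!?]/, "", $0)   # strip common punctuation
    for (i = 1; i <= NF; i++) cnt[$i]++
}
END { for (w in cnt) print w, cnt[w] }' | sort
```

This prints hello 2 and world 2, since all four tokens collapse to two words after normalization.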
Chubler_XL's solution works perfectly... in gawk, nawk, and awk. Trying to see how...
It probably does need some explanation.
Consider the following input
It produces 2 arrays from this: g is a global word count,
l is a word count for each line.
The whole file is processed like this; note that t is also counting the number of lines in the file. At the end we go through the g array and delete any entries below our 40 limit, which turns g into the popular-word list. Then for each line (i = 1 through t) we print the count in l[i,w], where w is each word remaining in g.
If no entry exists for the line (i.e. this popular word is not on line i), l[i,w] will be null, but the + in front of +l[i,w] causes awk to treat it as numeric and print a zero for us instead of a blank.
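That coercion is easy to demonstrate in isolation: an awk array element that was never assigned prints as an empty string, while a leading + forces numeric context (the array and subscripts below are just for illustration):

```shell
# An unset element prints as "", but +element coerces it to 0.
awk 'BEGIN {
    printf "bare: [%s]\n", l[1,"word"]    # never assigned -> empty
    printf "plus: [%s]\n", +l[1,"word"]   # numeric context -> 0
}'
```

The first line prints [], the second [0], which is exactly why the + fixes the missing-zero problem alister spotted.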
Hi experts, I've been struggling to format a large genetic dataset. It's complicated to explain, so I'll simply post example input/output.
$ cat input.txt
ID GENE pos start end
blah1 coolgene 1 3 5
blah2 coolgene 1 4 6
blah3 coolgene 1 4 ... (4 Replies)
Hello friends, I need BIG help from the UNIX collective intelligence:
I have a CSV file like this:
VALUE,TIMESTAMP,TEXT
1,Sun May 05 16:13:05 +0000 2013,"RT @gracecheree: Praying God sends me a really great man one day. Gotta trust in his timing.
0,Sun May 05 16:13:05 +0000 2013,@sendi__... (19 Replies)
Hi, I wanted to calculate the cumulative frequency distribution of my data, which involves several arithmetic steps. I did it in Excel but it's taking me forever. This is what I want to do:
var1.txt contains n observations, for which I have to compute the frequency, given by 1/n, and subsequently... (7 Replies)
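One hedged sketch of that calculation in awk (assuming var1.txt holds one observation per line; the sample values here are made up). Each observation contributes a relative frequency of 1/n, and a running sum of those gives the cumulative frequency. The file is read twice because n is not known until the end of the first pass:

```shell
# Sample data (made up) -- one observation per line.
printf '5\n7\n9\n11\n' > var1.txt

# Pass 1 counts the observations; pass 2 prints each value with its
# relative frequency 1/n and the running cumulative frequency.
awk 'NR == FNR { n++; next }
     { cum += 1/n
       printf "%s %.4f %.4f\n", $1, 1/n, cum }' var1.txt var1.txt
```

With 4 observations each frequency is 0.2500 and the cumulative column climbs to 1.0000 on the last line.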
Hi
I have a file like below
############################################
# ParentFolder Flag SubFolders
Colateral 1 Source1/Checksum
CVA 1 Source1/Checksum
Flexing 1 VaR/Checksum
Flexing 1 SVaR/Checksum
FX 1 ... (5 Replies)
hello,
Here is a program for creating a word-frequency list:
# wf.gk --- program to generate word frequencies from a file
{
# remove punctuation: delete anything that is not a letter, digit,
# underscore, or blank
gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
# Start frequency analysis: count each word on the line
for (i = 1; i <= NF; i++)
    freq[$i]++
}
END
#Print output... (11 Replies)
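Condensed to a self-contained one-liner (the sample text, the exact gsub() character class, and the tab-separated output format are my own reading of the intent, not necessarily the original poster's), the same idea runs like this:

```shell
# Strip punctuation, count every remaining word, then dump
# word/count pairs at end of input; sort makes the order stable.
echo 'the cat, and the hat.' | awk '
{
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
    for (i = 1; i <= NF; i++) freq[$i]++
}
END { for (w in freq) printf "%s\t%d\n", w, freq[w] }' | sort
```

Here "the" comes out with a count of 2 and the other three words with 1 each.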
Hello everyone,
I am using a chunk of code to display the frequency of a file name in a list of directories. The code looks like this:
find . -name "*.log" | cut -d/ -f4 | cut -d. -f1 | awk '{print $1}' | sort | uniq -c | sort -nr
The file paths would look something like this:... (1 Reply)
Hello,
I require a Perl script that will read a .txt file containing records like
224.199.207.IN-ADDR.ARPA. IN NS NS1.internet.com.
4.200.162.207.in-addr.arpa. IN PTR beeriftw.internet.com.
arroyoeinternet.com. IN A 200.199.227.49
I want to focus on words:
IN... (23 Replies)