Dear all,
I have an AWK script which provides the frequency of words. However, I am interested in getting the frequency of chunked data. This means that I have already produced valid chunks of running text, with each chunk on its own line. What I need is a script to count the frequency of each such string. A pseudo sample is provided below.
The output would be
I have been able to sort the data so that all identical strings are clubbed together.
My question is: how do I write the script so that a whole line is treated as a single entity, and lines that match (I have come that far) can be treated as one unit with a frequency counter set up for them? My awk script handles space as the delimiter, but I do not know how to make it recognise start of line and end of line (CRLF) as the delimiters.
I am sure this tool will be useful to people who work with chunked big data.
Many thanks
The following may help you. I am still not completely sure about your requirement, so if this doesn't fulfil it, please let us know your expected output, what you have tried, and the name of the OS you are using.
Output will be as follows.
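The posted code did not survive here; a sketch of the kind of awk one-liner likely intended, where the file name `Input_file`, the array name, and the `-->` output format are placeholders:

```shell
# Count how many times each complete line occurs in the input,
# then print each distinct line with its count.
awk '{ seen[$0]++ } END { for (line in seen) print line " --> " seen[line] }' Input_file
```

For an input containing `I have a cat` twice and `dogs bark` once, this prints each distinct line with its count (in no particular order, since awk's `for (x in array)` iteration order is unspecified).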
Thanks,
R. Singh
anbu23's suggestion using uniq -c is an excellent choice for this task, but if you want to know how to do it with awk, read on...
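For reference, the `sort | uniq -c` pipeline suggested above looks like this (the file name `chunks.txt` is an assumed placeholder):

```shell
# sort groups identical lines together; uniq -c then prefixes each
# distinct line with the number of times it occurs
sort chunks.txt | uniq -c
```

Note that `uniq -c` pads the count with leading blanks, so you may want to pipe through `sed 's/^ *//'` if you need clean columns.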
When you were dealing with words (instead of lines), your awk script probably had a loop going from 1 to NF on each input line, treating each field as a "word" and counting its occurrences, something like:
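A sketch of such a word-counting loop (the file name `words.txt` and the array name are assumed):

```shell
# Count each whitespace-separated field as a word
awk '
{
    for (i = 1; i <= NF; i++)
        count[$i]++
}
END {
    for (word in count)
        print count[word], word
}' words.txt
```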
To count occurrences of lines, it is simpler:
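A sketch of the whole-line version: instead of looping over fields, index the array with $0, the entire input line (file name assumed):

```shell
# $0 is the whole input line, so each distinct line gets its own bucket
awk '
{
    count[$0]++
}
END {
    for (line in count)
        print count[line], line
}' chunks.txt
```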
Note that awk assumes input files have LF line terminators, not CR/LF. But if every line is CR/LF terminated, that won't matter when you're working on whole lines, since every key carries the same trailing CR. It would, however, skew individual word counts, because the last word on each line would be stored in a different bucket (one keyed "word" plus a trailing CR) from the "word" bucket used for the same word elsewhere on a line.
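If you do need per-word counts from CR/LF-terminated input, one common fix (a sketch; file name assumed) is to delete the trailing CR from the record before counting. Assigning to $0 via sub() also forces awk to re-split the fields:

```shell
# Remove a trailing carriage return, if present, before counting words
awk '
{
    sub(/\r$/, "")
    for (i = 1; i <= NF; i++)
        count[$i]++
}
END {
    for (word in count)
        print count[word], word
}' dosfile.txt
```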
Note that your awk script can't have CR/LF line terminators; the CR will be treated as part of whatever awk command is on that line, frequently generating syntax errors.
RavinderSingh13 provided an example of how to use awk to do this. It can be simplified a little bit to just:
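The simplified version was likely close to this one-liner (file and variable names assumed):

```shell
# One array lookup per line; print each distinct line with its count at the end
awk '{ c[$0]++ } END { for (l in c) print c[l], l }' chunks.txt
```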
Last edited by Don Cragun; 12-11-2014 at 06:23 AM..
Reason: Fix typos.
Many thanks to all for their help. Especially to Don for his kind and helpful explanation. One is never too old to learn (just turned 65) and this forum is a wonderful place to learn with helpful people.