Post 302928332 by gimley on Thursday 11th of December 2014 04:37:13 AM (Shell Programming and Scripting)
Frequency Count of chunked data

Dear all,
I have an AWK script that reports the frequency of words. However, I am interested in the frequency of chunked data: I have already split running text into valid chunks, with each chunk on its own line. What I need is a script that counts how often each chunk (that is, each whole line) occurs. A pseudo sample is provided below:
Code:
this interesting event
has been going on
since years
in this country
the two actors
met
one another
in this country
Mary
met
her husband
in this country

The output would be each chunk followed by a tab and its count:
Code:
Mary	1
has been going on	1
her husband	1
in this country	3
met	2
one another	1
since years	1
the two actors	1
this interesting event	1
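
A minimal sketch of one way to produce this output, assuming a POSIX awk and sort are available. The function name count_chunks is hypothetical. The idea is to use the whole line ($0) as the key of an associative array, so spaces inside a chunk never split it. awk does not guarantee any order when looping over an array, so the result is piped through sort.

```shell
# count_chunks: hypothetical helper; reads chunks (one per line)
# from the named files, or from standard input if none are given.
count_chunks() {
    awk '
        { freq[$0]++ }    # $0 is the entire line, used as the array key
        END {
            # print "chunk<TAB>count" for every distinct line seen
            for (chunk in freq)
                printf "%s\t%d\n", chunk, freq[chunk]
        }
    ' "$@" | LC_ALL=C sort    # byte-order sort: uppercase before lowercase
}
```

It could be called as `count_chunks chunks.txt`, where chunks.txt stands for the file holding one chunk per line.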

I have been able to sort the data so that all identical strings are grouped together:
Code:
Mary
has been going on
her husband
in this country
in this country
in this country
met
met
one another
since years
the two actors
this interesting event
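
Since identical lines are already adjacent here, a hedged alternative is to let uniq -c do the counting, because it prefixes each run of adjacent identical lines with the number of repeats. The sketch below assumes sorted input on standard input; the function name count_sorted_chunks is hypothetical. The awk step only moves the count from the front of the line to the end, after a tab.

```shell
# count_sorted_chunks: hypothetical helper; expects already-sorted
# lines on standard input, so identical chunks are adjacent.
count_sorted_chunks() {
    uniq -c |
    awk '{
        n = $1                        # count that uniq -c put first
        sub(/^[ \t]*[0-9]+ /, "")     # strip padding, count, one space
        printf "%s\t%d\n", $0, n      # emit "chunk<TAB>count"
    }'
}
```

For unsorted data the same function can be fed from sort, for example `LC_ALL=C sort chunks.txt | count_sorted_chunks`, with chunks.txt again a placeholder file name.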

My question is: how do I modify the script so that a whole line is treated as a single entity, and lines that match (I have got as far as grouping them) are counted as one unit by a frequency counter?
My awk script uses the space as its delimiter, but I do not know how to make it recognise the start of a line and the end of a line (CRLF) as the delimiters instead.
I am sure this tool will be useful to people who work with chunked big data.
Many thanks
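
On the delimiter question above: by default awk already splits its input into records at each newline, and the field separator only controls how a record is broken into fields. $0 holds the whole record, so no delimiter change is needed to treat a line as one unit. What can matter is a CRLF line ending, because the carriage return stays at the end of $0 and makes otherwise identical lines differ. A minimal sketch, assuming some lines may end in CR and using the hypothetical name normalize_lines:

```shell
# normalize_lines: hypothetical helper; removes one trailing carriage
# return from each line so CRLF and LF files yield the same chunks.
normalize_lines() {
    awk '{ sub(/\r$/, ""); print }' "$@"
}
```

Its output can be piped into whichever counting command is used, so that a chunk read from a CRLF file and the same chunk read from an LF file are counted together.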
 
