I have a ton of files, each in excess of 3 million lines. I need to find the top 3 products and then, for each of them, the top 5 skews with a count of how many times each skew was viewed.
Here is a sample file, shortened for readability; each row is counted as one view.
I can get the first top, but I'm having a tough time getting the second part with counts. I imagine I will have to create 2 arrays and loop through them to get the correct counts.
Can anybody provide any guidance?
For the first part I can do this, piping sort and head, but I'm stuck on the rest.
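The command itself didn't make it into the post; a sketch of what that first step might look like, assuming whitespace-separated columns with the product in $1 (the file names here are made up):

```shell
# Tiny stand-in for one of the real files: product in column 1, skew in column 2.
cat > views.txt <<'EOF'
p1 12345
p1 12345
p1 99999
p2 23456
p2 23456
p3 11111
p4 22222
EOF

# Count the rows per product, then keep the 3 largest counts.
# Note: a plain head -3 silently drops any product tied with 3rd place.
awk '{ count[$1]++ } END { for (p in count) print count[p], p }' views.txt |
    sort -rn | head -3 > top3.txt
cat top3.txt
```

On the sample above this prints "3 p1", "2 p2", and one of the tied 1-view products, which is exactly the tie problem discussed later in the thread.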
which prints:
Expected result should be something like
Last edited by JoshCrosby; 03-05-2013 at 11:23 PM..
Thank you so much; based on the dataset it is perfect. However, on the larger files it takes quite a while, which is fine. The one issue is that it's not taking product (column 1) into account and seems to be grouping on the count alone.
Any ideas? I sincerely appreciate your help.
Some logic:
-> product (top 3, ranked by how many rows it appears on)
----> skew (top 5 skews within each of the products found above)
----> count of each skew
I hope that helps explain a bit more.
A bit more, yes, but still not clear. In your sample input there are 6 occurrences each of products p1, p2, p3, and p4. You say you want the top 3 products, but your sample output only shows 2. (And since four products have the same number of occurrences, you don't say how to choose which 3 of those 4 should be kept.) Your sample output also didn't show the top 2 product/skew pairs, p1/12345 and p2/23456, both of which appear 6 times, even though p1 and p2 appear the same number of times as p3 and p4.
From what you did with your 1-count sample, you chose the last two of the set of four most common products based on the fact that their product names sorted last. Is that really what you want?
If there are ties, should your results include all products that match the number of occurrences of the third most common product? If there are ties in the number of appearances of a skew within a product, should the results include all skews with the fifth most common skew within that product?
Will a single skew ever appear with more than one product, or are skews supposed to be unique to a product?
Quote:
A bit more, yes, but still not clear. In your sample input there are 6 occurrences each of products p1, p2, p3, and p4. You say you want the top 3 products, but your sample output only shows 2. (And since four products have the same number of occurrences, you don't say how to choose which 3 of those 4 should be kept.) Your sample output also didn't show the top 2 product/skew pairs, p1/12345 and p2/23456, both of which appear 6 times, even though p1 and p2 appear the same number of times as p3 and p4.
I apologize, I should have included a bit more data.
Quote:
From what you did with your 1-count sample, you chose the last two of the set of four most common products based on the fact that their product names sorted last. Is that really what you want?
I didn't sort the output, so my bad.
Quote:
If there are ties, should your results include all products that match the number of occurrences of the third most common product? If there are ties in the number of appearances of a skew within a product, should the results include all skews with the fifth most common skew within that product?
Yes to the last question: if there are ties among skews within those products, then those should roll up under that product.
Quote:
Will a single skew ever appear with more than one product, or are skews supposed to be unique to a product?
Not necessarily; a different product can have the same skew, even though it may mean something completely different.
Here is some more data; I hope this is enough to give you an idea. Like I said, these files are huge and contain some pretty sensitive data, otherwise I would upload the entire file.
Thanks again for your help!
In SQL Server, it may look like this...
I'm not sure I understand all of your requirements, but here is an awk script that I think does what you want. It looks long, but most of this proposed solution is comments rather than running code:
As always, if you are using a Solaris/SunOS system, use /usr/xpg4/bin/awk or nawk instead of awk. I used the Korn shell while testing this script, but any shell that accepts basic Bourne shell syntax will work.
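The script itself was lost from the post; based on the requirements worked out above, a hypothetical reconstruction might look like this (product in $1 and skew in $2 assumed; sample.txt and report.txt are placeholder names, and output order within a tier is unspecified, as with any "for (p in array)" loop):

```shell
# Stand-in input; p1=3 views, p2=2, p3=2, p4=1, so p4 falls below the
# 3rd-place count and the p2/p3 tie is kept.
cat > sample.txt <<'EOF'
p1 12345
p1 12345
p1 99999
p2 23456
p2 23456
p3 11111
p3 11111
p4 22222
EOF

awk '
function sort_desc(a, n,    i, j, t) {     # insertion sort, descending
    for (i = 2; i <= n; i++) {
        t = a[i]
        for (j = i - 1; j >= 1 && a[j] < t; j--)
            a[j + 1] = a[j]
        a[j + 1] = t
    }
}
{
    prod[$1]++                             # views per product
    sku[$1 SUBSEP $2]++                    # views per product/skew pair
}
END {
    np = 0
    for (p in prod) pc[++np] = prod[p]
    sort_desc(pc, np)
    pmin = pc[np < 3 ? np : 3]             # count held by the 3rd-ranked product
    for (p in prod) {
        if (prod[p] < pmin) continue       # ties with 3rd place are kept
        printf "%s %d\n", p, prod[p]
        ns = 0
        for (k in sku) {
            split(k, f, SUBSEP)
            if (f[1] == p) sc[++ns] = sku[k]
        }
        sort_desc(sc, ns)
        smin = sc[ns < 5 ? ns : 5]         # count held by the 5th-ranked skew
        for (k in sku) {
            split(k, f, SUBSEP)
            if (f[1] == p && sku[k] >= smin)
                printf "    %s %d\n", f[2], sku[k]
        }
    }
}' sample.txt > report.txt
cat report.txt
```

On the sample above it lists p1, p2, and p3 with their view counts (p4 is dropped) and indents each product's skews with their counts underneath.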
This script produces the following output when given the input data shown in message #5 in this thread.
If the number of hits for later products matches the number of hits for the 3rd highest number of hits, more products will be listed. And, if the number of hits for later skews matches the number of hits for the 5th highest number of hits for a skew within that product, more skews will be listed.
I haven't tested this on a file with millions of entries, but it works the way I expected with a file containing a few hundred entries.
This looks awesome!! I have a really dumb question though: in the variables, is it expecting 2 files? One with just the products and counts, and one with the skews, products, and counts?
---------- Post updated at 07:35 PM ---------- Previous update was at 07:12 PM ----------
Works perfectly!!
For those who want to know.
To create the .Skew_Count use this one-liner:
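The one-liner didn't survive the copy; a guess at its shape, assuming product in $1 and skew in $2 of the raw file (the name file and the sample rows are stand-ins):

```shell
# Stand-in input; the real files run to millions of lines.
cat > file <<'EOF'
p1 12345
p1 12345
p1 99999
p2 23456
EOF

# One "product skew count" line per distinct product/skew pair.
awk '{ c[$1 " " $2]++ } END { for (k in c) print k, c[k] }' file > .Skew_Count
cat .Skew_Count
```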
To create the .Product_Counts use this:
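Again a guess at the lost one-liner, under the same assumptions (product in $1; file is a stand-in name):

```shell
# Stand-in input, same shape as above.
cat > file <<'EOF'
p1 12345
p1 12345
p1 99999
p2 23456
EOF

# One "product count" line per distinct product.
awk '{ c[$1]++ } END { for (p in c) print p, c[p] }' file > .Product_Counts
cat .Product_Counts
```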
Don't forget to change file to products.txt at the end of the awk command
Don - HUGE THANK YOU!!!!!!!
By the way, I'm using a Mac, so I don't have the Korn shell; I used Bash without issue.