I have a ton of files, each in excess of 3 million lines. I need to find the top 3 products and then, for each of them, the top 5 skews with a count of how many times each skew was viewed.
Here is a sample file, shortened for readability; each row is counted as one view.
I can get the first top, but I'm having a tough time getting the second part with counts. I imagine I will have to create 2 arrays and loop through them to get the correct counts.
Can anybody provide any guidance?
For the first part I can do this, piping sort and head, but I'm stuck on the rest.
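The command itself didn't make it into the post; a sketch of what that first step might look like, assuming whitespace-separated columns with the product in $1 (the file names here are made up):

```shell
# Tiny stand-in for one of the real files: product in column 1, skew in column 2.
cat > views.txt <<'EOF'
p1 12345
p1 12345
p1 99999
p2 23456
p2 23456
p3 11111
p4 22222
EOF

# Count the rows per product, then keep the 3 largest counts.
# Note: a plain head -3 silently drops any product tied with 3rd place.
awk '{ count[$1]++ } END { for (p in count) print count[p], p }' views.txt |
    sort -rn | head -3 > top3.txt
cat top3.txt
```

On the sample above this prints "3 p1", "2 p2", and one of the tied 1-view products, which is exactly the tie problem discussed later in the thread.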
which prints:
Expected result should be something like
Last edited by JoshCrosby; 03-05-2013 at 11:23 PM..
Thank you so much; based on the dataset it is perfect. However, on the larger files it takes quite a while, which is fine. The one issue is that it's not taking product (column 1) into account and seems to be grouping on the count alone.
Any ideas? I sincerely appreciate your help.
Some logic:
-> product (top 3, ranked by how many rows it appears on)
----> skew (top 5 skews within each of the products found above)
----> count of each skew
I hope that helps explain a bit more.
A bit more, yes, but still not clear. In your sample input there are 6 occurrences each of products p1, p2, p3, and p4. You say you want the top 3 products, but your sample output only shows 2. (And since four products have the same number of occurrences, you don't say how to choose which 3 of those 4 should be kept.) Your sample output also didn't show the top 2 product/skew pairs, p1/12345 and p2/23456, both of which appear 6 times, even though p1 and p2 appear the same number of times as p3 and p4.
From what you did with your 1-count sample, you chose the last two of the set of four most common products based on the fact that their product names sorted last. Is that really what you want?
If there are ties, should your results include all products that match the number of occurrences of the third most common product? If there are ties in the number of appearances of a skew within a product, should the results include all skews with the fifth most common skew within that product?
Will a single skew ever appear with more than one product, or are skews supposed to be unique to a product?
Quote:
A bit more, yes, but still not clear. In your sample input there are 6 occurrences each of products p1, p2, p3, and p4. You say you want the top 3 products, but your sample output only shows 2. (And since four products have the same number of occurrences, you don't say how to choose which 3 of those 4 should be kept.) Your sample output also didn't show the top 2 product/skew pairs, p1/12345 and p2/23456, both of which appear 6 times, even though p1 and p2 appear the same number of times as p3 and p4.
I apologize, I should have included a bit more data.
Quote:
From what you did with your 1-count sample, you chose the last two of the set of four most common products based on the fact that their product names sorted last. Is that really what you want?
I didn't sort the output, so my bad.
Quote:
If there are ties, should your results include all products that match the number of occurrences of the third most common product? If there are ties in the number of appearances of a skew within a product, should the results include all skews with the fifth most common skew within that product?
Yes to the last question: if there are ties among skews within those products, then those should roll up under that product.
Quote:
Will a single skew ever appear with more than one product, or are skews supposed to be unique to a product?
Not necessarily; a different product can have the same skew, even though it may mean something completely different.
Here is some more data; I hope this is enough to give you an idea. Like I said, these files are huge and contain some pretty sensitive data, otherwise I would upload the entire file.
Thanks again for your help!
In SQL Server, it may look like this...
I'm not sure I understand all of your requirements, but here is an awk script that I think does what you want. It looks long, but most of this proposed solution is comments rather than running code:
As always, if you are using a Solaris/SunOS system, use /usr/xpg4/bin/awk or nawk instead of awk. I used the Korn shell while testing this script, but any shell that accepts basic Bourne shell syntax will work.
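The script itself was lost from the post; based on the requirements worked out above, a hypothetical reconstruction might look like this (product in $1 and skew in $2 assumed; sample.txt and report.txt are placeholder names, and output order within a tier is unspecified, as with any "for (p in array)" loop):

```shell
# Stand-in input; p1=3 views, p2=2, p3=2, p4=1, so p4 falls below the
# 3rd-place count and the p2/p3 tie is kept.
cat > sample.txt <<'EOF'
p1 12345
p1 12345
p1 99999
p2 23456
p2 23456
p3 11111
p3 11111
p4 22222
EOF

awk '
function sort_desc(a, n,    i, j, t) {     # insertion sort, descending
    for (i = 2; i <= n; i++) {
        t = a[i]
        for (j = i - 1; j >= 1 && a[j] < t; j--)
            a[j + 1] = a[j]
        a[j + 1] = t
    }
}
{
    prod[$1]++                             # views per product
    sku[$1 SUBSEP $2]++                    # views per product/skew pair
}
END {
    np = 0
    for (p in prod) pc[++np] = prod[p]
    sort_desc(pc, np)
    pmin = pc[np < 3 ? np : 3]             # count held by the 3rd-ranked product
    for (p in prod) {
        if (prod[p] < pmin) continue       # ties with 3rd place are kept
        printf "%s %d\n", p, prod[p]
        ns = 0
        for (k in sku) {
            split(k, f, SUBSEP)
            if (f[1] == p) sc[++ns] = sku[k]
        }
        sort_desc(sc, ns)
        smin = sc[ns < 5 ? ns : 5]         # count held by the 5th-ranked skew
        for (k in sku) {
            split(k, f, SUBSEP)
            if (f[1] == p && sku[k] >= smin)
                printf "    %s %d\n", f[2], sku[k]
        }
    }
}' sample.txt > report.txt
cat report.txt
```

On the sample above it lists p1, p2, and p3 with their view counts (p4 is dropped) and indents each product's skews with their counts underneath.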
This script produces the following output when given the input data shown in message #5 in this thread.
If the number of hits for later products matches the number of hits for the 3rd highest number of hits, more products will be listed. And, if the number of hits for later skews matches the number of hits for the 5th highest number of hits for a skew within that product, more skews will be listed.
I haven't tested this on a file with millions of entries, but it works the way I expected with a file containing a few hundred entries.
This looks awesome!! I have a really dumb question though: in the variables, is it expecting 2 files? One with just the products and counts, and one with the skews, products, and counts?
---------- Post updated at 07:35 PM ---------- Previous update was at 07:12 PM ----------
Works perfectly!!
For those who want to know.
To create the .Skew_Count use this one-liner:
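The one-liner didn't survive the copy; a guess at its shape, assuming product in $1 and skew in $2 of the raw file (the name file and the sample rows are stand-ins):

```shell
# Stand-in input; the real files run to millions of lines.
cat > file <<'EOF'
p1 12345
p1 12345
p1 99999
p2 23456
EOF

# One "product skew count" line per distinct product/skew pair.
awk '{ c[$1 " " $2]++ } END { for (k in c) print k, c[k] }' file > .Skew_Count
cat .Skew_Count
```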
To create the .Product_Counts use this:
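Again a guess at the lost one-liner, under the same assumptions (product in $1; file is a stand-in name):

```shell
# Stand-in input, same shape as above.
cat > file <<'EOF'
p1 12345
p1 12345
p1 99999
p2 23456
EOF

# One "product count" line per distinct product.
awk '{ c[$1]++ } END { for (p in c) print p, c[p] }' file > .Product_Counts
cat .Product_Counts
```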
Don't forget to change file to products.txt at the end of the awk command
Don - HUGE THANK YOU!!!!!!!
By the way, I'm using a Mac, so I don't have the Korn shell; I used Bash without issue.