Generating summary data (use awk?)


Login or Register for Dates, Times and to Reply

 
Thread Tools Search this Thread
# 1  
Generating summary data (use awk?)

I have a data file similar to this (but many millions of lines long). You can assume that it is totally unsorted but has no duplicate rows.

Code:
Date     ,Tool_Type  ,Tool_ID      ,Time_Used
3/13/2014,Screwdriver,Screwdriver02, 6
3/13/2014,Screwdriver,Screwdriver02,20
3/13/2014,Screwdriver,Screwdriver02, 1
3/13/2014,Screwdriver,Screwdriver03,14
3/13/2014,Screwdriver,Screwdriver01, 9
3/13/2014,Screwdriver,Screwdriver01,10
3/13/2014,Screwdriver,Screwdriver03,16
3/13/2014,Screwdriver,Screwdriver02,14
3/14/2014,Screwdriver,Screwdriver01,16
3/14/2014,Screwdriver,Screwdriver01, 6
3/14/2014,Screwdriver,Screwdriver01, 7
3/14/2014,Screwdriver,Screwdriver02,15
3/14/2014,Screwdriver,Screwdriver02, 7
3/14/2014,Screwdriver,Screwdriver01, 5
3/14/2014,Screwdriver,Screwdriver01, 4
3/13/2014,Hammer     ,Hammer03     ,16
3/13/2014,Hammer     ,Hammer02     , 9
3/13/2014,Hammer     ,Hammer01     ,11
3/13/2014,Hammer     ,Hammer01     ,13
3/13/2014,Hammer     ,Hammer02     ,15
3/13/2014,Hammer     ,Hammer02     ,19
3/13/2014,Hammer     ,Hammer02     ,17
3/13/2014,Hammer     ,Hammer01     ,14
3/13/2014,Hammer     ,Hammer01     , 4
3/14/2014,Hammer     ,Hammer02     , 9
3/14/2014,Hammer     ,Hammer02     ,16
3/14/2014,Hammer     ,Hammer01     , 4
3/14/2014,Hammer     ,Hammer01     ,11
3/14/2014,Hammer     ,Hammer03     , 5
3/14/2014,Hammer     ,Hammer03     ,17

I want to get summary statistics by day-tool_type and day-tool_ID. Order is not important and I can concatenate the result of two queries if I need to.

Code:
Date     ,Tool             ,Times_Used,Total_Time_Used
3/13/2014,Hammer (all)     ,9         ,118
3/13/2014,Hammer01         ,4         , 42
3/13/2014,Hammer02         ,4         , 60
3/13/2014,Hammer03         ,1         , 16
3/13/2014,Screwdriver (all),8         , 90
3/13/2014,Screwdriver01    ,2         , 19
3/13/2014,Screwdriver02    ,4         , 41
3/13/2014,Screwdriver03    ,2         , 30
3/14/2014,Hammer (all)     ,6         , 62
3/14/2014,Hammer01         ,2         , 15
3/14/2014,Hammer02         ,2         , 25
3/14/2014,Hammer03         ,2         , 22
3/14/2014,Screwdriver (all),7         , 60
3/14/2014,Screwdriver01    ,5         , 38
3/14/2014,Screwdriver02    ,2         , 22

I think AWK is the best tool.

I think I could run through the file NR>1 building three associative arrays
Code:
count[$1 "-" $2]++
count[$1 "-" $3]++
sum[$1 "-" $2] += $4
sum[$1 "-" $3] += $4
tags[$1 "-" $2] = $1 "," $2 " (all)"
tags[$1 "-" $3] = $1 "," $3

What I don't know how to do is barf up the existing array elements at the end. Any suggestions? Am I going to run out of memory with this approach and is there a better way?

Mike

---------- Post updated at 03:59 AM ---------- Previous update was at 03:25 AM ----------

Will this work?

Code:
OFS = ","; for ( var in count) { print tags[var],count[var],sum[var] }

Mike

---------- Post updated at 09:20 AM ---------- Previous update was at 03:59 AM ----------

It works!

Code:
awk -F, 'NR == 1 {print "Headers,Redacted,For,IP,Reasons,Demand,Cum_Demand"}
         NR > 1 {byToolType = ($1 "-" $2 "-" $10); byTool = ($1 "-" $3 "-" $10)
                 count[byToolType]++; count[byTool]++
                 sum[byToolType] += $7; sum[byTool] += $7
                 tags[byToolType] = $10","$1","$2","$2"xxx,"$1"-"$2"xxx"
                 tags[byTool] = $10","$1","$2","$3","$1"-"$3
                }
         END {OFS = ","; for ( var in count) { print tags[var], count[var], sum[var] }
             }' "$scratchDir""$filename" > "$scratchDir""$outputNameString""$filename"

Never estimate the havoc leaving out -F, can wreck on an AWK script!

Last edited by Michael Stora; 03-14-2014 at 09:26 AM..
This User Gave Thanks to Michael Stora For This Post:
# 2  
Thanks for the updates!Smilie
Smilie
# 3  
I've already added a mean demand column and will be adding a standard deviation column soon. I'm going to concatenate values to the array element and the split and loop over them in the END block to calculate standard deviation.

Mike

Last edited by Michael Stora; 03-14-2014 at 04:27 PM..
Login or Register for Dates, Times and to Reply

Previous Thread | Next Thread
Thread Tools Search this Thread
Search this Thread:
Advanced Search

Test Your Knowledge in Computers #752
Difficulty: Medium
BIOS is an acronym for Bootstrap Input/Output System.
True or False?

7 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Is there a way to handle commas inside the data when generating a csv file from shell script?

I am extracting data via sql query and some of the data has commas. Output File must be csv and I cannot update the data in the db (as it is used by other application). Example table FavoriteThings Person VARCHAR2(25), Favorite VARCHAR2(100) Sample Data Greta rain drop on... (12 Replies)
Discussion started by: patk625
12 Replies

2. Shell Programming and Scripting

Generating CSV from Column data

Hi List, I have a chunk of data like so: User Account Control: User Account Control: User Account Control: User Account Control: Disabled User Account Control: User Account Control: User Account Control: Disabled User Account Control: User Account Control: ... (3 Replies)
Discussion started by: landossa
3 Replies

3. UNIX for Dummies Questions & Answers

Generating 512MB file with dd using random data

Hello. Could anyone help me with my little annoying problem? I have to generate a 512 MB file made up with random data using DD. After some internet digging I found out that the command is: dd if=/dev/urandom of=/exemple/file bs=512MB After running this command the... (2 Replies)
Discussion started by: razolo13
2 Replies

4. Shell Programming and Scripting

Help generating a script for next-generation sequencing data

I am not sure if this is entirely possible, but I want to compare data in a particular column in several .txt files and have a new file generated. I am a biologist with limited unix knowledge. There are currently no programs written for this type of analysis. First I would like to define the... (1 Reply)
Discussion started by: kellywilliams
1 Replies

5. Shell Programming and Scripting

Summary using awk

Displaying information using awk Hey guys, i am using awk to display my information in a certain order. Database : Persia:42:John France:50:Mabel Persia:50:Rach Germany:60:JohnMy expected output is : ... (25 Replies)
Discussion started by: gregarion
25 Replies

6. Shell Programming and Scripting

generating reports based on time field of network data

hi i have data extracted in the following format ranging around 300000 to 800000 records in a text file , the format is of network data . No. Time Source Destination Protocol 1 1998-06-05 17:20:23.569905 HP_61:aa:c9 HP_61:aa:c9 ... (1 Reply)
Discussion started by: renukaprasadb
1 Replies

7. Shell Programming and Scripting

generating data for 1 hour

Hi Folks, The reqirement is that i need to generate 1 hr file with a time interval of five minutes.. For ex: my i/p is 0000-0000 and desired o/p is 0000-0005 0005-0010 0010-0015 0015-0020 0020-0025 0025-0030 0030-0035 0040-0045 0050-0055 0055-0100 Script neede urgent ... (0 Replies)
Discussion started by: aajan
0 Replies

Featured Tech Videos