Yearly Grouping of Data


 
# 1  
Old 02-15-2016

I need some logic to group records that fall between two dates.

Input Data

Code:
COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7      COL_8      COL_9    COL_10 COL_11 COL_12
C     ABC   ABCD  3     ZZ    WLOA  2015-12-01 2015-12-15 975.73   ZZZ    P      147018.64
C     ABC   ABCD  3     ZZ    WLOA  2015-12-01 2016-01-31 898.86   ZZZ    P      129018.66
C     ABC   ABCD  3     ZZ    WLOA  2015-12-01 2016-02-29 788.81   ZZZ    P      110912.18
C     ABC   ABCD  3     ZZ    WLOA  2016-12-10 2017-02-29 18106.48 ZZZ    P      110912.18
C     ABC   ABCD  3     ZZ    WLOA  2016-12-10 2017-05-31 1652.2   ZZZ    P      55947.43
C     ABC   ABCD  3     ZZ    WLOA  2016-12-10 2017-08-31 650.05   ZZZ    P      45500.00
C     ABC   ABCD  3     ZZ    WLOA  2016-12-10 2017-09-20 500.15   ZZZ    P      37525.00
C     ABC   ABCD  3     ZZ    WLOA  2016-12-10 2017-10-01 357.05   ZZZ    P      12385.00

VAR_DATE will be passed to the script as a parameter, and the groupings are worked out from that value.

For each value in COL_4 (the grouping column), we need to group the records that fall within a given date range.

In the example above, we would like to group records using the logic below:

Code:
COL_7 >= VAR_DATE AND COL_8 <= VAR_DATE + 1 Year

Example Output (Year 1)

Code:
C  ABC  ABCD 3  ZZ  WLOA VAR_DATE   VAR_DATE + 1 YEAR SUM ALL VALUES ZZZ  P FINAL RECORD IN COL_12 BALANCE FOR GIVEN ID (COL_4)
C  ABC  ABCD 3  ZZ  WLOA 2015-12-01 2016-12-01        2663.40        ZZZ  P 110912.18
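
A minimal sketch of the Year 1 rule for a single COL_4 group. The function name `year1_summary` is hypothetical. It assumes the file starts with a header line, that all dates are in YYYY-MM-DD form (so they can be compared as plain strings), that a record whose COL_7 equals VAR_DATE belongs to the window (the sample output above includes such records), and that the "final record" is simply the last matching line in the file:

```shell
# year1_summary VAR_DATE FILE   (hypothetical helper, single COL_4 group only)
year1_summary() {
    awk -v start="$1" '
    BEGIN {
        # VAR_DATE + 1 year: bump the year, keep "-MM-DD" as is.
        end = (substr(start, 1, 4) + 1) substr(start, 5)
    }
    NR == 1 { next }                  # skip the header line
    $7 >= start && $8 <= end {        # ISO dates compare correctly as strings
        sum += $9                     # running total of COL_9
        last12 = $12                  # assumption: last matching line wins
        head = $1 " " $2 " " $3 " " $4 " " $5 " " $6
        tail = $10 " " $11
        found = 1
    }
    END {
        if (found)
            printf "%s %s %s %.2f %s %s\n", head, start, end, sum, tail, last12
    }' "$2"
}
```

Called as `year1_summary 2015-12-01 input.txt` (the file name is a placeholder), this prints one summary line for the window starting at VAR_DATE.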

We would then like to repeat the same for the next year (Year 2), using the logic below:

Code:
COL_7 >= VAR_DATE + 1 Year AND COL_8 <= VAR_DATE + 2 Year

Example Output (Year 2)

Code:
C  ABC  ABCD 3  ZZ  WLOA VAR_DATE + 1 YEAR VAR_DATE + 2 YEAR SUM ALL VALUES ZZZ  P FINAL RECORD IN COL_12 BALANCE FOR GIVEN ID (COL_4)
C  ABC  ABCD 3  ZZ  WLOA 2016-12-01        2017-12-01        21265.93       ZZZ  P 12385.00

This logic needs to be repeated for up to 5 years (groups), but the number of years could change, so ideally the number of groups should be parameterised rather than hard-coded.

The final output would look like the following, with the two new records generated by the above logic appended to the end of the file:

Code:
COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7 COL_8 COL_9 COL_10 COL_11 COL_12
C ABC ABCD 3 ZZ WLOA 2015-12-01 2015-12-15 975.73 ZZZ P 147018.64
C ABC ABCD 3 ZZ WLOA 2015-12-01 2016-01-31 898.86 ZZZ P 129018.66
C ABC ABCD 3 ZZ WLOA 2015-12-01 2016-02-29 788.81 ZZZ P 110912.18
C ABC ABCD 3 ZZ WLOA 2016-12-10 2017-02-29 18106.48 ZZZ P 110912.18
C ABC ABCD 3 ZZ WLOA 2016-12-10 2017-05-31 1652.2 ZZZ P 55947.43
C ABC ABCD 3 ZZ WLOA 2016-12-10 2017-08-31 650.05 ZZZ P 45500.00
C ABC ABCD 3 ZZ WLOA 2016-12-10 2017-09-20 500.15 ZZZ P 37525.00
C ABC ABCD 3 ZZ WLOA 2016-12-10 2017-10-01 357.05 ZZZ P 12385.00
C ABC ABCD 3 ZZ WLOA 2015-12-01 2016-12-01 2663.40 ZZZ P 110912.18 ** (New record from above logic - Year 1 ) **
C ABC ABCD 3 ZZ WLOA 2016-12-01 2017-12-01 21265.93 ZZZ P 12385.00 ** (New record from above logic - Year 2 ) **

As this logic will run over a large number of records, I assume awk would be the most efficient solution. However, my experience with awk is extremely limited, so I am unsure how to get started.

Last edited by Ads89; 02-16-2016 at 04:20 AM. Reason: Expanding with further information on requirements.
# 2  
Old 02-15-2016
Here are ways to compute the yearly boundary dates in shell and in awk:
Code:
#!/bin/ksh
VAR_DATE=${1:-2016-02-15}

year=${VAR_DATE%%-*}
mon_day=${VAR_DATE#*-}
VAR_DATE_PLUS_1YEAR="$((year + 1))-$mon_day"
VAR_DATE_PLUS_2YEAR="$((year + 2))-$mon_day"
VAR_DATE_PLUS_3YEAR="$((year + 3))-$mon_day"
VAR_DATE_PLUS_4YEAR="$((year + 4))-$mon_day"
VAR_DATE_PLUS_5YEAR="$((year + 5))-$mon_day"

echo 'From shell calculations:'
printf 'VAR_DATE is %s\n' "$VAR_DATE"
printf 'VAR_DATE_PLUS_1YEAR is %s\n' "$VAR_DATE_PLUS_1YEAR"
printf 'VAR_DATE_PLUS_2YEAR is %s\n' "$VAR_DATE_PLUS_2YEAR"
printf 'VAR_DATE_PLUS_3YEAR is %s\n' "$VAR_DATE_PLUS_3YEAR"
printf 'VAR_DATE_PLUS_4YEAR is %s\n' "$VAR_DATE_PLUS_4YEAR"
printf 'VAR_DATE_PLUS_5YEAR is %s\n' "$VAR_DATE_PLUS_5YEAR"

awk -v VAR_DATE="$VAR_DATE" '
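BEGIN {	# Addition binds tighter than concatenation in awk, so each
	# assignment below yields the new year followed by "-MM-DD".
	y = substr(VAR_DATE, 1, 4)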
	VAR_DATE_PLUS_1YEAR = y + 1 substr(VAR_DATE, 5)
	VAR_DATE_PLUS_2YEAR = y + 2 substr(VAR_DATE, 5)
	VAR_DATE_PLUS_3YEAR = y + 3 substr(VAR_DATE, 5)
	VAR_DATE_PLUS_4YEAR = y + 4 substr(VAR_DATE, 5)
	VAR_DATE_PLUS_5YEAR = y + 5 substr(VAR_DATE, 5)
	print "From awk calculations:"
	printf "VAR_DATE_PLUS_1YEAR is %s\n", VAR_DATE_PLUS_1YEAR
	printf "VAR_DATE_PLUS_2YEAR is %s\n", VAR_DATE_PLUS_2YEAR
	printf "VAR_DATE_PLUS_3YEAR is %s\n", VAR_DATE_PLUS_3YEAR
	printf "VAR_DATE_PLUS_4YEAR is %s\n", VAR_DATE_PLUS_4YEAR
	printf "VAR_DATE_PLUS_5YEAR is %s\n", VAR_DATE_PLUS_5YEAR
}'
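
Since the number of years may change, the five hard-coded variables above could be replaced by a loop. This is only a sketch: `year_boundaries` is a hypothetical helper name, and the default of 5 years is taken from the requirement stated in the first post.

```shell
# year_boundaries VAR_DATE [YEARS]   (hypothetical helper)
# Prints VAR_DATE, VAR_DATE + 1 year, ..., VAR_DATE + YEARS years, one per line.
year_boundaries() {
    var_date=$1
    years=${2:-5}                 # number of yearly groups, default 5
    year=${var_date%%-*}          # text before the first "-"
    mon_day=${var_date#*-}        # text after the first "-"
    i=0
    while [ "$i" -le "$years" ]; do
        printf '%s-%s\n' "$((year + i))" "$mon_day"
        i=$((i + 1))
    done
}
```

For example, `year_boundaries 2015-12-01 2` prints 2015-12-01, 2016-12-01 and 2017-12-01. Note that a VAR_DATE of February 29 produces strings such as 2017-02-29, which are not real calendar dates but still sort between 2017-02-28 and 2017-03-01 when compared as strings.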

# 3  
Old 02-16-2016
Hi Don,

Thanks for the above. That solution works well for generating the different years I require, but I also need a way to group the records as described above.

We need to group the records that fall between VAR_DATE and VAR_DATE + 1 year and calculate some values from them, i.e. the sum of COL_9 across all records in the group and the COL_12 value of the final record.

Any help would be appreciated.
# 4  
Old 02-16-2016
What have you tried to solve this problem?

What are the names of the input and output files you want to process?

Give us more detail about your input file:
  1. What is the sort order for your input file?
  2. What is supposed to happen if the input date is something like 2015-01-01 with a line like the 3rd line in your sample input:
    C ABC ABCD 3 ZZ WLOA 2015-12-01 2016-01-31 898.86 ZZZ P 129018.66
    where the apparent start date in COL_7 is within the range of 1 year, but the apparent end date in COL_8 is not in that same range?
  3. Your stated logic isn't clear about what constitutes the final record that determines how COL_12 is supposed to be set. Is it the last line found in the file, the line with the highest COL_7 value, the line with the highest COL_8 value, or something else?
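
Until those questions are answered, here is one possible sketch of the grouping under explicit assumptions, not a definitive solution. The function name `group_by_year` is hypothetical. It assumes: the first line is a header; dates are YYYY-MM-DD and can be compared as strings; a record belongs to year N only if COL_7 is on or after the start of that year and COL_8 is on or before its end (records straddling two years are left out of every group); and the "final record" for COL_12 is the last matching line in file order.

```shell
# group_by_year VAR_DATE YEARS FILE   (hypothetical helper)
# Copies FILE to standard output and appends one summary line per
# COL_4 value and yearly window that has at least one matching record.
group_by_year() {
    awk -v start="$1" -v years="$2" '
    BEGIN {
        y = substr(start, 1, 4); md = substr(start, 5)
        # Window i runs from bound[i-1] to bound[i].
        for (i = 0; i <= years; i++)
            bound[i] = (y + i) md
    }
    { print }                    # pass every input line through unchanged
    NR == 1 { next }             # the header carries no data
    {
        for (i = 1; i <= years; i++) {
            if ($7 >= bound[i-1] && $8 <= bound[i]) {
                key = $4 SUBSEP i
                if (!(key in sum)) order[++n] = key   # keep first-seen order
                sum[key] += $9
                last12[key] = $12                    # assumption: last line wins
                head[key] = $1 " " $2 " " $3 " " $4 " " $5 " " $6
                tail[key] = $10 " " $11
                win[key] = i
                break
            }
        }
    }
    END {
        for (k = 1; k <= n; k++) {
            key = order[k]; i = win[key]
            printf "%s %s %s %.2f %s %s\n", head[key], bound[i-1], bound[i],
                   sum[key], tail[key], last12[key]
        }
    }' "$3"
}
```

It could be called as `group_by_year 2015-12-01 5 input.txt > output.txt`, where both file names are placeholders. Summary lines are appended in the order their groups first appear in the input; if the final record should instead be the one with the highest COL_8, the `last12` assignment would need a date comparison.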