Aggregation of Huge files

03-14-2014

Registered User

Join Date: Aug 2012

Last Activity: 7 April 2014, 1:56 AM EDT

Location: Chennai

Posts: 16

Thanks Given: 12

Thanked 0 Times in 0 Posts

Quote:

Originally Posted by Don Cragun

Making the assumption that that error code is coming from bc, you could try:

Code:

awk -F'|' -v dqANDms='["-]' '
BEGIN { f=156
        printf("s=0\n")
}
NR > 2 {gsub(dqANDms, "", $f)
        printf("s+=%s\n",  $f)
}
END {   printf("s\n")
}' file | bc

Hi Don !

Thanks for your quick help!

It works fine for a few files, but not for all!

Here is the analysis I did for 3 different files:

The hash total computed by the script and the hash total in the header are given below:

File 1:

Code:

Script - 23840949434129.13
Header - 23840949436509.39

File 2:

Code:

Script - 7305817379402102.5619993295
Header - 7305817400402102.5619993295

File 3:

Code:

Script - 23558431740937.266
Header - 23558431741074.536

Surprised to see that it works fine for the higher precision, but not for the others!

Kindly share your thoughts!

Regards,
Ravichander

Last edited by Scrutinizer; 03-14-2014 at 05:03 AM.. Reason: code tags also for data samples

Ravichander

03-14-2014

Registered User

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

My thoughts mirror those stated by Corona688 before... You have shown us an awk script that doesn't work. You have not shown us any input nor even given us a verbal description of the format of that input. I guessed at what your input looks like based on your non-working awk script. Obviously I guessed wrong.

If your original program wasn't able to correctly identify the data to be processed, the code I provided won't either. If you can't show us some sample input so we can figure out how to get what you want out of field 156 in your input, and show us at least one line of input that the script processes incorrectly, there isn't much we can do to guess at what might be going wrong when processing your huge data files.

Don Cragun

03-14-2014

Hi Don !

Code:

 
TR|XXX|2010-11-30|254367|23840949436509.39
XXX|ACTUAL01|2013|365|XXX|2013-11-30|XXX|XXXR|XX|TR|BAXXXNLBB|XXX9012|23000000|| | |950|289278|999|110|7245| ||| |||||| |||400204828| | |CE SCBUK|66482|||||664| ||800|||||||| || |||||||||||||||||| |110|7245|||| || | |HKD| |HKD|0||0||||||||||||||||||||||||TA|||289278|7245|664|800| |110|950|EEEEEE|7245|664|800| |950|EEEEEE|7245|664|800| |950|HKD|||0|-2380.26|||||||-307.03|-307.03|-307.03|-307.03|-307.03|-307.03|-2380.26|-2380.26|-2380.26|-2380.26|-2380.26|-2380.26||-307.02|-307.02|-307.02|-307.02|-307.02|-307.02|||||||||||||||||||||||||||||||||||||||||||||||||PCP|PTH| | | |L1|||||NB| |400204828|400454430| | | ||| || | ||| | | || | ||||| | ||||I|||| |N|PT| | | |||||||||||||||||||||||||||

This is the header and first record, and the total number of records is 254368.

Are any further details needed, Don?

Ravichander

03-14-2014

In your original (non-working) code:

Code:

tail -n +2 <File_Name> |nawk -F"|" -v '%.2f' qq='"' '{gsub(qq,"");sa+=($156<0)?-$156:$156}END{print sa}' OFMT='%.5f'

why did you bother adding code to remove all of the " characters from your input when there aren't any double-quote characters in your input file? Please explain in English what the format is for this file and please explain what the format is for the numbers that will be processed by this code. Do some fields sometimes have double quoted strings containing pipe symbols (|)?

Please explain what algorithm is supposed to be used to compute the result that is printed at the end of processing.

In my last message I asked you to:

Quote:

... show us at least one line of input that the script processes incorrectly.

Is this single data line processed incorrectly? (Or is the correct result from processing this line 2380.26?)
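
One thing that can be checked from the figures already posted in this thread: subtracting the script total from the header total for File 1 shows how much is missing. The sketch below does that with bc; `header_minus_script` is a hypothetical helper name, not something from the thread. If the gap equals the absolute value of field 156 in the first data record, that would be consistent with the `NR > 2` test in the quoted script skipping the first record after a single header line. That is only an inference; nobody in the thread confirms it.

```shell
# Hypothetical helper: print (header total) - (script total) using bc,
# so that no digits are lost to floating-point rounding.
header_minus_script() {
    printf '%s - %s\n' "$1" "$2" | bc
}

# File 1 figures as posted: header total first, script total second.
header_minus_script 23840949436509.39 23840949434129.13
```

For File 1 this prints 2380.26, the same figure as the value asked about above. The same call with the File 2 and File 3 totals gives the gap for those files, which can then be compared with their first data records.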

I assume that you're using a Solaris system. What is the length (in bytes) of the longest line in your 254368 line file?
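
A minimal sketch for measuring the longest line, assuming an awk that can itself read the file's longest line (on Solaris that may mean trying `/usr/xpg4/bin/awk` rather than `nawk`; that choice is an assumption, not something stated in this thread). `longest_line_bytes` is a hypothetical helper name. `LC_ALL=C` is set so that `length` counts bytes rather than multibyte characters.

```shell
# Hypothetical helper: print the length in bytes of the longest line in
# the file named by $1, counting the terminating newline character.
longest_line_bytes() {
    LC_ALL=C awk '
    { n = length($0) + 1          # +1 for the newline that awk strips
      if (n > max) max = n }
    END { print max + 0 }         # +0 prints 0 for an empty file
    ' "$1"
}
```

Calling `longest_line_bytes file` prints a single number that can be compared directly with the 2048-byte figure.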

Don Cragun

03-14-2014

Hi Don !

Below are the requirements on my side for the Unix script:

1. The number of records may vary from 200000 to 4500000.
2. The 156th column, which has a decimal range of (38,10), needs to be summed.
3. The file will be pipe-delimited. For now, double quotes won't appear, but they may in future. So, currently we can treat it as pipe-delimited only.
4. While performing the aggregation, we need to take the absolute sum of the 156th column.
5. The maximum expected precision is (38,10); normally, the 156th column comes as (24,10).

If the code I have used is erroneous or does not suit the requirement, kindly help me arrive at a command that meets the requirements stated above.

I am finding it quite difficult to find the reason for this difference!

Regards,
Ravichander

Ravichander

03-14-2014

Quote:

Originally Posted by Ravichander

You didn't answer my question about the length of the longest line in your file! If you have any lines longer than 2048 bytes (including the terminating newline character), nawk may fail.

1. The number of records doesn't matter for this script.
2. The code that you provided did NOT calculate the sum of the numbers in the 156th field; it calculated the sum of the absolute values of the numbers in the 156th field!
3. The quote removal slows down the processing, but doesn't affect the results unless there is a pipe symbol (|) between quotes that is not to be treated as a field separator. If there is any possibility that a | between double quotes (") should not be treated as a field separator, this awk script will not work! If there will never be a | between " characters and there will never be " characters in the 156th field, the script should ignore " characters completely.
4. The absolute value of the sum is not the same as the sum of the absolute values!!! You need to clearly describe the calculation to be performed!
5. Using bc to calculate the sum of a set of numbers can easily handle sums with a hundred digits before and after the radix character with no loss of precision.
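
To make the difference between the two calculations concrete, here is a small sketch that reads one number per line on standard input and feeds bc. The function names `sum_of_abs` and `abs_of_sum` are hypothetical, and the input values 5 and -3 are made up for illustration.

```shell
# Sum of the absolute values: drop a leading minus sign from each number,
# then add everything up in bc.
sum_of_abs() {
    { printf 's = 0\n'
      sed 's/^-//; s/.*/s += (&)/'
      printf 's\n'; } | bc
}

# Absolute value of the sum: add the signed numbers, then flip the sign
# of the total if it is negative.
abs_of_sum() {
    { printf 's = 0\n'
      sed 's/.*/s += (&)/'
      printf 'if (s < 0) s = -s\ns\n'; } | bc
}

printf '%s\n' 5 -3 | sum_of_abs    # prints 8
printf '%s\n' 5 -3 | abs_of_sum    # prints 2
```

The two results agree only when every input has the same sign, which is why the requirement has to say which of the two is wanted.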

The script assumes that the contents of field 156 will be a string of digits with an optional leading minus sign (-) and no more than one decimal point character (.). If there is a decimal point character and a minus sign, the minus sign must still be the 1st character in the string. If the contents of field 156 contains more than one minus sign, more than one decimal point, or contains any other non-numeric characters, the results are unspecified.
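
Those assumptions can be tested before any summing is done. The following is a sketch only: `report_bad_amounts` is a hypothetical helper, it assumes a single header line, and it treats an empty field as invalid because an empty value would produce an incomplete `s+=` statement for bc.

```shell
# Hypothetical helper: for every line after the header, print
# "line_number: value" when the chosen field is not an optionally signed
# decimal number with at most one decimal point.
# $1 = file name, $2 = field number (defaults to 156).
report_bad_amounts() {
    awk -F'|' -v f="${2:-156}" '
    NR > 1 && $f !~ /^-?([0-9]+\.?[0-9]*|\.[0-9]+)$/ {
        printf("%d: %s\n", NR, $f)
    }' "$1"
}
```

If `report_bad_amounts file` prints nothing, every data line satisfies the format described above; any line it does print is a candidate for the kind of sample input requested earlier.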

When extracting data from your database, are you absolutely sure that you are getting the records and the sum that you have in your header in a single transaction? If you are getting the data in one transaction and the sum in another transaction, changes to your database between those two transactions could easily cause the differences you are seeing.

Don Cragun

03-26-2014

Hi Don !

Thanks for your valuable time and analysis! I have simplified the requirement:

I have extracted the amount column alone into a separate file, and the data in it looks like the sample shown below:

Code:

 
18781426.84
-2010820
-668398.44
-285369
-253957.7
-272.88
-2732931.94

The maximum amount value in the file is:

Code:

 
-90005467876809.567342220989

Now, I need to take the absolute value of each amount and then sum them up. The total number of records will be around 7 million.

Kindly help me with code to fulfill the above requirement.

Thanks
Ravichander

Ravichander
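
For the simplified one-amount-per-line file described in this post, a minimal sketch in the same awk-plus-bc style used earlier in the thread might look like the following. It is an illustration under stated assumptions, not a reply from any poster: `abs_sum` is a hypothetical name, blank or whitespace-only lines are skipped, and each remaining line is assumed to hold a single plain decimal number with at most one leading minus sign.

```shell
# Hypothetical helper: print the sum of the absolute values of the
# amounts in the file named by $1 (one amount per line).
abs_sum() {
    awk '
    BEGIN { print "s = 0" }
    NF    { sub(/^[ \t]*-/, "")       # drop a leading minus sign
            gsub(/[ \t\r]/, "")       # drop stray blanks and carriage returns
            print "s += " $0 }        # let bc do the exact arithmetic
    END   { print "s" }
    ' "$1" | bc
}
```

awk only rewrites text here; the addition itself is done by bc, so a 28-digit amount such as the maximum value shown above is not rounded to floating point. A call would look like `abs_sum amounts.txt`, where `amounts.txt` is a placeholder file name.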
