Aggregation of Huge files


 
# 1  
Old 02-28-2014

Hi Friends !!

I am facing a hash-total mismatch when summing a column over a set of very large files:

Command used:

Code:
tail -n +2 <File_Name> |nawk -F"|" -v '%.2f' qq='"' '{gsub(qq,"");sa+=($156<0)?-$156:$156}END{print sa}' OFMT='%.5f'

The file is pipe-delimited, and column 156 is the one used for hash totalling.

File 1:

Record count is 254368

Absolute Sum in DB is 23840949436509.39

Absolute Sum using above script is 23840949436510.18750

File 2:

Record count is 2580100

Absolute Sum in DB is 7305817400402102.5619993295

Absolute Sum using above script is 7305817400403184.00000

Kindly help me resolve this issue, and please suggest a better way to do absolute hash totalling on high-volume files.

Thanks in advance,
Ravi

# 2  
Old 02-28-2014
awk does its arithmetic in floating point (typically a 64-bit IEEE 754 double), which does not have infinite precision -- at best about 15-16 significant decimal digits, and your totals run longer than that. If you want arbitrary precision like a database provides, try the bc utility.
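A quick illustration of the difference (a minimal sketch using a made-up amount close to the totals above, not your actual data):

```shell
# awk stores the running sum as a C double, so the low-order
# digits of a 20-digit amount are rounded away:
awk 'BEGIN { printf "%.4f\n", 7305817400402102.5619 }'
# prints 7305817400402103.0000

# bc works in arbitrary-precision decimal, so nothing is lost:
echo '7305817400402102.5619 + 0' | bc
# prints 7305817400402102.5619
```

The awk result has already been rounded to the nearest representable double before printf ever sees it, which is exactly the kind of drift showing up in your totals.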
# 3  
Old 02-28-2014
Hi Corona..

Can you help me with the bc utility for this scenario? I am just new to these tools!

Regards,
Ravi
# 4  
Old 02-28-2014
Assuming that you're using the tail in the command line:
Code:
tail -n +2 <File_Name> |nawk -F"|" -v '%.2f' qq='"' '{gsub(qq,"");sa+=($156<0)?-$156:$156}END{print sa}' OFMT='%.5f'

to discard the first line of your input file because it contains headers that you don't want included in your total, that the '%.2f' didn't really appear in the command line you executed (since that would be a syntax error for nawk), that you don't really want the output rounded to five digits after the decimal point (as would be done in your command line by OFMT='%.5f'), and assuming that field #156 in the other lines of your input file contains a double-quoted string of digits with no more than one period and an optional leading minus sign (which you want to be ignored), you could try something like:
Code:
nawk -F'|' -v dqANDms='["-]' '
BEGIN { f = 156            # field to total
        printf("0")        # start of the bc expression
}
NR > 1 {gsub(dqANDms, "", $f)   # strip the quotes and the sign
        printf("+%s", $f)       # append the absolute amount
}
END {   printf("\n")
}' <File_Name> | bc


# 5  
Old 02-28-2014
Quote:
Originally Posted by Ravichander
Hi Corona..
Can you help me with bc utility for this scenario?
That depends on what your scenario is, which I don't know yet. All I have is a program which doesn't do what you want...
# 6  
Old 03-06-2014
Quote:
Originally Posted by Don Cragun
Assuming that you're using the tail in the command line:
Code:
tail -n +2 <File_Name> |nawk -F"|" -v '%.2f' qq='"' '{gsub(qq,"");sa+=($156<0)?-$156:$156}END{print sa}' OFMT='%.5f'

to discard the first line of your input file because it contains headers that you don't want included in your total, that the '%.2f' didn't really appear in the command line you executed (since that would be a syntax error for nawk), that you don't really want the output rounded to five digits after the decimal point (as would be done in your command line by OFMT='%.5f'), and assuming that field #156 in the other lines of your input file contains a double-quoted string of digits with no more than one period and an optional leading minus sign (which you want to be ignored), you could try something like:
Code:
nawk -F'|' -v dqANDms='["-]' '
BEGIN { f = 156            # field to total
        printf("0")        # start of the bc expression
}
NR > 1 {gsub(dqANDms, "", $f)   # strip the quotes and the sign
        printf("+%s", $f)       # append the absolute amount
}
END {   printf("\n")
}' <File_Name> | bc

Hi Don !

Thanks for the workaround! It works fine for small files, but when I run it on large files I get the error below:

Code:
0705-001: bundling space exceeded on line 1 stdin

Kindly help me in this regard.

Regards,
Ravichander
# 7  
Old 03-06-2014
Making the assumption that that error code is coming from bc (the previous script feeds bc the entire sum as one enormous expression on a single line, which can exceed bc's input limits on AIX), you could try having awk print one short statement per record instead:
Code:
awk -F'|' -v dqANDms='["-]' '
BEGIN { f = 156              # field to total
        printf("s=0\n")      # initialize the bc accumulator
}
NR > 1 {gsub(dqANDms, "", $f)     # strip the quotes and the sign
        printf("s+=%s\n", $f)     # one short statement per record
}
END {   printf("s\n")        # print the final sum
}' file | bc
