The UNIX and Linux Forums  
#1, 09-24-2008, naoseionome (Registered User)
performance issue using gzcat, awk and sort

Hi all,
I managed to write a script that gathers a few files and sorts them.

Here it is:
Code:
#!/usr/bin/ksh


ls *mainFile* |cut -c20-21 | sort > temp

set -A line_array
i=0
file_name='temp'

while read file_line
do
 line_array[i]=${file_line}
 let i=${i}+1
  


# mainFile
gzcat *mainFile-dsa${file_line}* | awk '   
BEGIN { FS = "," } ; 
{if($1!="") {mykey=$1} else {mykey=prev}}  # reuse the previous key when field 1 is empty
{if(mykey != prev) 
    {print mykey",1,"NR","$0; prev=mykey} 
else 
    {print prev",1,"NR","$0; prev=mykey}}
' > final
# line
gzcat *line-dsa${file_line}* | awk '   
BEGIN { FS = "," } ; 
{if($1!="") {mykey=$1} else {mykey=prev}}  # reuse the previous key when field 1 is empty
{if(mykey != prev) 
    {print mykey",2,"NR","$0; prev=mykey} 
else 
    {print prev",2,"NR","$0; prev=mykey}}
' >> final
# ss
gzcat *ss-dsa${file_line}* | awk '   
BEGIN { FS = "," } ; 
    {print $1",3,"NR","$0;} 
' >> final
#bsginfo
gzcat *bsginfo-dsa${file_line}* | awk '   
BEGIN { FS = "," } ; 
    {print $1",4,"NR","$0;} 
' >> final
#gprs
gzcat *gprs-dsa${file_line}* | awk '   
BEGIN { FS = "," } ; 
{if($1!="") {mykey=$1} else {mykey=prev}}  # reuse the previous key when field 1 is empty
{if(mykey != prev) 
    {print mykey",5,"NR","$0; prev=mykey} 
else 
    {print prev",5,"NR","$0; prev=mykey}}
function isnum(n) { return n ~ /^[0-9]+$/ }
' >> final
#odbdata
gzcat *odbdata-dsa${file_line}* | awk '   
BEGIN { FS = "," } ; 
    {print $1",6,"NR","$0;} 
' >> final

ls *mainFile* |cut -c1-8 | sort | read data

#sort -t "," +0 -2 -n final > final2
sort -t ',' -k1,1n -k2,2n -k3,3n final > final2
#sort final > final2
rm  final
rm  temp
gzip final2
mv final2.gz ${data}-final-dsa${file_line}.csv.gz


done < ${file_name}
My problems:
- when the number of lines in a file exceeds a few million, "NR" is printed in scientific notation instead of as a plain number, so I can't rely on sort to guarantee the line order;
- the server has a very heavy I/O load, so I should be able to do the whole process in memory (there are idle processors and free memory).
- can I feed the output of several gzcat commands into a single awk script, or is that not possible?
- can I use a pipe to send the previous result to the next command without writing to the "final" file?
- when it reaches the sort command, I/O use goes from 30% to 100% while memory use stays the same. Why?

Can someone help me with any of these questions?
It is getting really hard for a newbie like me to solve these problems: a job that should take one day is taking five, and I'm looking for solutions in areas I don't really understand yet.

Best regards,
Ricardo Tomás
#2, 09-25-2008, era (Forum Advisor)
Quote:
Originally Posted by naoseionome
- when the number of lines in a file exceeds a few million, "NR" is printed in scientific notation instead of as a plain number, so I can't rely on sort to guarantee the line order;
That's really pesky. You can avoid the scientific format with printf but if the line numbers exceed the capacity of the data type used internally by awk for integers, the output will be bogus.

Code:
borkstation$ awk 'END { print 123456789123456 }' /dev/null
1.23457e+14
borkstation$ awk 'END { printf "%i\n", 123456789123456 }' /dev/null
2147483647
borkstation$ perl -le 'print 123456789123456'
123456789123456
So the only workaround I can suggest is to switch to Perl in order to solve this. There is a script a2p in the Perl distribution which can convert awk scripts to Perl scripts, although I hear it's not perfect.

Quote:
Originally Posted by naoseionome
- the server has a very heavy I/O load, so I should be able to do the whole process in memory (there are idle processors and free memory).
I'm sorry, I can't find a question in that. Can you rephrase?

Quote:
Originally Posted by naoseionome
- can I feed the output of several gzcat commands into a single awk script, or is that not possible?
The scripts seem to be different for each file, so it seems a bit dubious. Certainly you could try to refactor the code to reduce duplication. It seems hard to write an awk script which could decide which fields to select purely based on the looks of the input (remember, file names are not visible when you receive data from a pipe), but if you know how to do that, by all means give it a try. Perhaps you could marshal the output from gzcat into a form where you can also include headers with information about which field numbers to use, or something. (Think XML format, although you don't have to use the specifics of XML, of course. Something simple like a prefix on each line which says which fields to look at is probably a lot easier to code and understand.)
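The prefix idea could look roughly like the sketch below. It is only an illustration under assumptions: the function names `tag_stream` and `merge_tagged` are made up, the sample records are inline instead of coming from gzcat, and the type numbers (1, 2, 5 carry the key forward; the others don't) are taken from the script in the first post.

```shell
#!/bin/sh
# Sketch only: tag each stream with a type prefix, then let a single awk
# program branch on that prefix. Function names and sample data are made up.

tag_stream() {
    # $1 = type tag to prepend to every line read from stdin
    awk -v tag="$1" '{ print tag "," $0 }'
}

merge_tagged() {
    # stdin: lines of the form "<tag>,<original record>"
    awk -F',' '
    {
        tag = $1
        rec = substr($0, length(tag) + 2)   # original record without the tag
        n[tag]++                            # per-stream line counter, replaces NR
        key = $2                            # first field of the original record
        if (tag == 1 || tag == 2 || tag == 5) {
            # these types reuse the last non-empty key of their own stream
            if (key != "") { prev[tag] = key } else { key = prev[tag] }
        }
        printf "%s,%s,%.0f,%s\n", key, tag, n[tag], rec
    }'
}

# Demo with inline data; in the thread each printf would be a gzcat call.
demo_out=$(
    (
        printf 'k1,a\n,b\n' | tag_stream 1
        printf 'k1,x\n'     | tag_stream 3
    ) | merge_tagged
)
printf '%s\n' "$demo_out"
```

With real data each line of the group would be something like `gzcat *ss-dsa${file_line}* | tag_stream 3`, and the output of `merge_tagged` could go straight into sort.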

Quote:
Originally Posted by naoseionome
- can I use a pipe to send the previous result to the next command without writing to the "final" file?
Group the commands into a subshell and pipe the output from that shell to sort.

Code:
( awk one; awk too; awk some more ) | sort
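A slightly fuller sketch of that grouping, with made-up inline data in place of the gzcat calls and a hypothetical helper name `emit_all`. The sort keys follow the key, type, line number layout of the original script, written in the `-k` form.

```shell
#!/bin/sh
# Sketch: run several awk passes inside one group and stream everything to
# sort, so no intermediate "final" file is written. Sample data is made up.

emit_all() {
    printf '20,x\n10,y\n' | awk -F',' '{ printf "%s,1,%.0f,%s\n", $1, NR, $0 }'
    printf '10,z\n'       | awk -F',' '{ printf "%s,3,%.0f,%s\n", $1, NR, $0 }'
}

# The parentheses run the group in a subshell; its combined output is one stream.
sorted=$( ( emit_all ) | sort -t',' -k1,1n -k2,2n -k3,3n )
printf '%s\n' "$sorted"
```

The sorted stream can be piped on to gzip in the same way, which would also remove the uncompressed final2 file from the disk.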
Quote:
Originally Posted by naoseionome
- when it reaches the sort command, I/O use goes from 30% to 100% while memory use stays the same. Why?
sort uses temporary files on disk if the input is too big to fit in its memory buffer, which is why the I/O goes up while memory use stays flat.
#3, 09-25-2008, Annihilannic (Forum Advisor)
I usually find printf "%.f\n",variablename does the trick in awk.
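A minimal sketch of that, with a hard-coded number standing in for a large NR. It assumes awk stores numbers as double-precision floats, as the common implementations do, so integers stay exact up to 2^53.

```shell
#!/bin/sh
# Sketch: print a large counter as a plain integer instead of scientific
# notation. The value is hard-coded here to stand in for a big NR.
plain=$(awk 'BEGIN { n = 123456789123456; printf "%.0f\n", n }')
printf '%s\n' "$plain"
```

In the original script this would mean replacing the print statements with something like `printf "%s,1,%.0f,%s\n", mykey, NR, $0`.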

sort usually has a command-line option to change the amount of memory it will allocate... usually the default is quite small, so you may see some benefit by increasing it. You can also sometimes control where it will store temporary files, so you may be able to specify some faster disks, or some that do not contain the original data so that they are not competing with each other. See man sort for details...
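As a rough sketch, assuming GNU sort: there `-S` sets the size of the in-memory buffer and `-T` the directory for temporary files. Other sort implementations may use different options, so check man sort first. The buffer size, the temporary directory and the sample data below are placeholders.

```shell
#!/bin/sh
# Sketch, assuming GNU sort: -S sets the in-memory buffer size and -T the
# directory for temporary files. Size, path and data are placeholders.
tmpdir=$(mktemp -d)    # stand-in for a directory on a faster or less busy disk
sorted=$(printf '3,1,1\n1,2,5\n1,1,9\n' |
    sort -S 64M -T "$tmpdir" -t',' -k1,1n -k2,2n -k3,3n)
rmdir "$tmpdir"
printf '%s\n' "$sorted"
```

For the real job the buffer would be far larger than 64M, within whatever the machine can spare without swapping.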
#4, 09-25-2008, era (Forum Advisor)
Neat, printf with %.f works for me, didn't know that one, thanks!
#5, 09-25-2008, naoseionome (Registered User)
changes

Hi,
I'm already making some changes.
I'm running the test now with printf and trying to work out how much memory to give sort (1, 2 or 3 GB :P)

Quote:
Originally Posted by naoseionome
- the server has a very heavy I/O load, so I should be able to do the whole process in memory (there are idle processors and free memory).
I meant that the hard disk is working at its maximum while memory and CPU are still available! I will start using some of the available memory for sort.

I'm planning to gunzip the files at the beginning. That way I can send all the files into the same script and only need an if for each file, like: IF FILENAME== /*line*/ "line code". Then I can send the result to sort instead of writing the "final" file.
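A rough sketch of how that plan might look. Everything specific here is an assumption for illustration: the file names and contents are made up, only two of the six file types are shown, and FNR is used instead of NR so that numbering restarts for each file as it did with the separate awk calls.

```shell
#!/bin/sh
# Sketch: one awk program over several already-decompressed files, choosing
# the record type from FILENAME. File names and contents are made up.
workdir=$(mktemp -d)
printf '10,a\n,b\n' > "$workdir/mainFile-dsa01.csv"
printf '10,s\n'     > "$workdir/ss-dsa01.csv"

sorted=$(
    cd "$workdir" &&
    awk -F',' '
    FNR == 1 {                      # first line of each file: pick its type
        if (FILENAME ~ /^mainFile/) {
            type = 1
        } else if (FILENAME ~ /^ss/) {
            type = 3
        }
        prev = ""
    }
    {
        key = $1
        if (type == 1) {            # reuse the last non-empty key
            if (key != "") { prev = key } else { key = prev }
        }
        printf "%s,%d,%.0f,%s\n", key, type, FNR, $0
    }' mainFile-dsa01.csv ss-dsa01.csv |
    sort -t',' -k1,1n -k2,2n -k3,3n
)
rm -r "$workdir"
printf '%s\n' "$sorted"
```

Note that decompressing everything first writes the uncompressed data to disk, which adds I/O of its own on a disk that is already busy.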


thanks for the help.

Best regards,
Ricardo Tomás
#6, 09-25-2008, otheus (Forum Staff, Moderator)
Quote:
Originally Posted by naoseionome
Hi,
I'm already making some changes.
I'm running the test now with printf and trying to work out how much memory to give sort (1, 2 or 3 GB :P)
Some architectures/OSes allow only 2 GB of memory per process. Keep that in mind.

Also, if your system starts swapping, you'll lose the memory advantage.

Quote:
I meant that the hard disk is working at its maximum while memory and CPU are still available! I will start using some of the available memory for sort.
You can also sort within awk. Just load the values into a hashed array and foreach() the values to get them out. It uses more memory, but less CPU time.
If the data does not fit in memory, using an external sort would be better.
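For what it's worth, a small sketch of sorting inside awk. Plain awk has no built-in sort, and `for (k in arr)` visits keys in an unspecified order, so this illustration uses a simple insertion sort on a numeric first field. The array names and sample data are made up, and an insertion sort is only reasonable for small inputs, not for files with millions of lines.

```shell
#!/bin/sh
# Sketch: sort records inside awk by a numeric first field. Uses a simple
# insertion sort; fine for small inputs, far too slow for millions of lines.
sorted=$(printf '30,c\n10,a\n20,b\n' | awk -F',' '
    {
        # insert the new record into the already sorted range line[1..NR-1]
        i = NR - 1
        while (i >= 1 && keyof[i] > $1 + 0) {
            line[i + 1] = line[i]; keyof[i + 1] = keyof[i]
            i--
        }
        line[i + 1] = $0; keyof[i + 1] = $1 + 0
    }
    END { for (j = 1; j <= NR; j++) print line[j] }')
printf '%s\n' "$sorted"
```

Every record is held in memory until END, so for the file sizes in this thread the external sort is probably still the safer route.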

Quote:
I'm planning to gunzip the files at the beginning. That way I can send all the files into the same script and only need an if for each file, like: IF FILENAME== /*line*/ "line code". Then I can send the result to sort instead of writing the "final" file.
Sounds like a good plan.