awk eating too much memory?


 
# 8  
Old 09-30-2011
I'm assuming the main contenders are these lines:

Code:
awk '/^[^ ]+ IN NS/ && !_[$1]++{print $1; tot++}END{print "\nTotal",tot,"Domains"}' $RootPath/$today.biz > $RootPath/zonefile/$today.biz
awk '/Total/ {print $2}' $RootPath/zonefile/$today.biz > $RootPath/$today.count

This script can't cause a load average of 8 unless you're running 8 at once. Do you really want this to run slower? Even more of them might pile up.

The second awk is pointless and doubles the workload of your script since it scans the entire output of the first awk to find one line. The first awk can easily create that file by itself, on-the-fly.

Code:
awk -v TOTALFILE="$RootPath/$today.count" '/^[^ ]+ IN NS/ && !_[$1]++{print $1; tot++}
END{
        print "\nTotal",tot,"Domains";
        print tot > TOTALFILE
}' $RootPath/$today.biz > $RootPath/zonefile/$today.biz
# no longer needed, the first awk generates it for us
# awk '/Total/ {print $2}' $RootPath/zonefile/$today.biz > $RootPath/$today.count

---------- Post updated at 10:28 AM ---------- Previous update was at 10:19 AM ----------

Depending on how many cores you have, this might complete even faster:

Code:
gunzip < zone.gz | awk -v TOTALFILE="$RootPath/$today.count" -v BIZFILE="$RootPath/$today.biz" '
# print ALL lines into BIZFILE
{ print $0 > BIZFILE }
/^[^ ]+ IN NS/ && !_[$1]++{print $1; tot++}
END{
        print "\nTotal",tot,"Domains";
        print tot > TOTALFILE
}' > $RootPath/zonefile/$today.biz

Do you even need $RootPath/$today.biz at all anymore? If not, this could be simplified further.
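If it turns out you don't need it, the simplification would just drop the BIZFILE copy. A self-contained sketch (printf stands in for `gunzip < zone.gz`, and the filenames are placeholders for your real paths):

```shell
# Hypothetical demo of the simplified one-pass awk -- no BIZFILE copy,
# only the unique-domain list plus the count file:
printf '%s\n' \
  'EXAMPLEDOM IN NS NS1.EXAMPLE.NET.' \
  'EXAMPLEDOM IN NS NS2.EXAMPLE.NET.' \
  'OTHERDOM IN NS NS1.EXAMPLE.NET.' |
awk -v TOTALFILE=today.count '
/^[^ ]+ IN NS/ && !_[$1]++ { print $1; tot++ }
END {
        print "\nTotal", tot, "Domains"
        print tot > TOTALFILE
}' > today.biz
```

With your real data that would be `gunzip < zone.gz | awk -v TOTALFILE="$RootPath/$today.count" ... > $RootPath/zonefile/$today.biz`, same as above but without the extra full-file write.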

Last edited by Corona688; 09-30-2011 at 05:18 PM..
# 9  
Old 09-30-2011
Thanks a lot for your effort, brother.

Code:
gunzip < biz.zone.gz | awk -v TOTALFILE="$RootPath/$today.count" -v BIZFILE="$RootPath/$today.biz"
# print ALL lines into BIZFILE
{ print $0 > BIZFILE }
/^[^ ]+ IN NS/ && !_[$1]++{print $1; tot++}
END{
        print "\nTotal",tot,"Domains";
        print tot > TOTALFILE
} $RootPath/$today.biz > $RootPath/zonefile/$today.biz



###### It will start the Count #####


### Calculation Part
a=$(< $RootPath/$today.count)
b=$(< $RootPath/$yesterday.count)
c=$(awk 'NR==FNR{a[$0];next} $0 in a{tot++}END{print tot}' $RootPath/zonefile/$today.biz $RootPath/zonefile/$yesterday.biz)


echo "$current_date Today Count For BIZ TlD $a" >> $LOG
echo "$current_date New Registration Domain Counts $((c - a))" >> $LOG
echo "$current_date Deleted Domain Counts $((c - b))" >> $LOG
cat $LOG | mail -s "BIZ Tld Count log" 07anis@gmail.com

This is exactly the code I am using, but I get an error when I execute the script :-(

Quote:
+ awk -v TOTALFILE=/var/domaincount/biz//01102011.count -v BIZFILE=/var/domaincount/biz//01102011.biz
Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
Usage: awk [POSIX or GNU style options] [--] 'program' file ...
POSIX options: GNU long options:
-f progfile --file=progfile
-F fs --field-separator=fs
-v var=val --assign=var=val
-m[fr] val
-W compat --compat
-W copyleft --copyleft
-W copyright --copyright
-W dump-variables[=file] --dump-variables[=file]
-W exec=file --exec=file
-W gen-po --gen-po
-W help --help
-W lint[=fatal] --lint[=fatal]
-W lint-old --lint-old
-W non-decimal-data --non-decimal-data
-W profile[=file] --profile[=file]
-W posix --posix
-W re-interval --re-interval
-W source=program-text --source=program-text
-W traditional --traditional
-W usage --usage
-W version --version

To report bugs, see node `Bugs' in `gawk.info', which is
section `Reporting Problems and Bugs' in the printed version.

gawk is a pattern scanning and processing language.
By default it reads standard input and writes standard output.

Examples:
gawk '{ sum += $1 }; END { print sum }' file
gawk -F: '{ print $1 }' /etc/passwd
mew.sh: line 31: syntax error near unexpected token `$RootPath/$today.biz'
mew.sh: line 31: `} $RootPath/$today.biz > $RootPath/zonefile/$today.biz
# 10  
Old 09-30-2011
Take a closer look at Corona's post. What you posted is missing single-quotes around the awk script.
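To illustrate with a hypothetical one-liner: the single quotes make the entire program one argument to awk; without them the shell consumes the braces itself and awk gets no program at all, which is exactly the usage message you saw.

```shell
# WRONG: unquoted -- the shell parses { print ... } itself, so awk is
# invoked with only its -v option and no program (hence "Usage: awk ..."):
#   awk -v GREETING=world
#   { print $0, GREETING }
#
# RIGHT: the whole program is one single-quoted argument to awk:
echo hello | awk -v GREETING=world '{ print $0, GREETING }'
# prints: hello world
```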

Regards,
Alister
# 11  
Old 09-30-2011
To make it clearer:

Code:
gunzip < biz.zone.gz | awk -v TOTALFILE="$RootPath/$today.count" -v BIZFILE="$RootPath/$today.biz" '
# print ALL lines into BIZFILE
{ print $0 > BIZFILE }
/^[^ ]+ IN NS/ && !_[$1]++{print $1; tot++}
END{
        print "\nTotal",tot,"Domains";
        print tot > TOTALFILE
}' > $RootPath/zonefile/$today.biz

I also corrected a mistake in it -- awk needs no input filename when fed by a stream!
# 12  
Old 10-01-2011
Three ways of computing the number of deletions, additions, and unchanged entries; experiment with your data and OS to see which is best:
Code:
# generate raw data
awk -v n=1e6 '
BEGIN {
  srand()
  while (--n > 0)
    printf("abc%dzzz\n", n*rand()) > ARGV[1 + (rand() < 0.55)]
  exit
}
' old.raw new.raw

printf "method:\tdeleted\tadded\tunchanged\n"

# method 1
awk '
NR == FNR {
  if (!($0 in a)) { ++o; a[$0] = -1 }
  next
}
{
  if ((x = ++a[$0]) > 1) next
  if (x < 1) { ++c; a[$0] = 1 }
  else if (x < 2) ++e
  #print
}
END { printf("awk:\t%d\t%d\t%d\n", o - c, e, c) }
' old.raw new.raw #> n.awku

# method 2
sort -u old.raw > o.sortu
oc=$(wc -l < o.sortu)
sort -u new.raw > n.sortu
nc=$(wc -l < n.sortu)
all=$(sort -mu o.sortu n.sortu |wc -l)
printf "sort:\t%d\t%d\t%d\n" $((all-nc)) $((all-oc)) $((oc+nc-all)) 

# method 3
comm o.sortu n.sortu | awk -F'\t' '
 { if ($1)++a; else if ($2) ++b; else ++c }
 END { printf("comm:\t%d\t%d\t%d\n", a, b, c) }'
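In case comm(1) is unfamiliar: given two sorted files, its three tab-separated columns are lines unique to the first file, lines unique to the second, and lines common to both, which is what the awk above tallies. A tiny illustration with made-up data:

```shell
# comm needs sorted input; columns: only-in-old, only-in-new, in-both.
printf 'a\nb\nc\n' > old.sorted
printf 'b\nc\nd\n' > new.sorted
comm -23 old.sorted new.sorted   # a  (deleted)
comm -13 old.sorted new.sorted   # d  (added)
comm -12 old.sorted new.sorted   # b, c  (unchanged)
```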

# 13  
Old 10-03-2011
Thanks, all, for your prompt replies.

Quote:
Originally Posted by Corona688
To make it clearer:

Code:
gunzip < biz.zone.gz | awk -v TOTALFILE="$RootPath/$today.count" -v BIZFILE="$RootPath/$today.biz" '
# print ALL lines into BIZFILE
{ print $0 > BIZFILE }
/^[^ ]+ IN NS/ && !_[$1]++{print $1; tot++}
END{
        print "\nTotal",tot,"Domains";
        print tot > TOTALFILE
}' > $RootPath/zonefile/$today.biz

I also corrected a mistake in it -- awk needs no input filename when fed by a stream!

Ya, it's working fine, dude, but my problem was only the file size.

Quote:
; The use of the Data contained in
; .com, and .net top-level domain zone files (including the checksum
; files) is subject to the restrictions described in the access Agreement
;

$ORIGIN COM.
$TTL 900
@ IN SOA anish.servers.net. anish.servers.net. (
1317183572 ;serial
1800 ;refresh every 30 min
900 ;retry every 15 min
604800 ;expire after a week
86400 ;minimum of 15 min
)
$TTL 172800
NS A.ANISH.SERVERS.NET.
NS G.ANISH.SERVERS.NET.
NS H.ANISH.SERVERS.NET.
NS C.ANISH.SERVERS.NET.
NS I.ANISH.SERVERS.NET.
NS B.ANISH.SERVERS.NET.
NS D.ANISH.SERVERS.NET.
NS L.ANISH.SERVERS.NET.
NS F.ANISH.SERVERS.NET.
NS J.ANISH.SERVERS.NET.
NS K.ANISH.SERVERS.NET.
NS E.ANISH.SERVERS.NET.
NS M.ANISH.SERVERS.NET.
COM. 86400 DNSKEY 256 3 8 AQPEwTIOsHspGTJb1CGweIPOEak/zuDi4ZaDJleSYa/7yoTzmfF9K21W5YRsm5C8F3jGvQbS8kcCqVE1IOiuQ1RNIdq603eSZv68Pzhn43Dhc7NBAEdtygb6cmlGHYvmIcYdYy1hSsj18P1 QTGTxmdlXnFQDDol1wwjS4/RwlKwgsQ==
COM. 86400 DNSKEY 257 3 8 AQPDzldNmMvZFX4NcNJ0uEnKDg7tmv/F3MyQR0lpBmVcNcsIszxNFxsBfKNW9JYCYqpik8366LE7VbIcNRzfp2h9OO8HRl+H+E08zauK8k7evWEmu/6od+2boggPoiEfGNyvNPaSI7FOIroDsnw/taggzHRX1Z7SOiOiPWPNIwSUyWOZ79VmcQ1GLkC6NlYvG3HwYmynQv6oFwGv/KELSw7ZSdrbTQ0HXvZbqMUI7BaMskmvgm1G7oKZ1YiF7O9ioVNc0+7ASbqmZN7Z98EGU/Qh2K/BgUe8Hs0XVcdPKrtyYnoQHd2ynKPcMMlTEih2/2HDHjRPJ2aywIpKNnv4oPo/
COM. 86400 NSEC3PARAM 1 0 0 -
COM. RRSIG NS 8 1 172800 20111003041551 20110926030551 41798 COM. bnfhmBWvn+dT+0cFJDf6PtbpXjLoVfd7DxMnm1/6loge1uaLBaIs6/kMOqATZ2TKl2NtfnjHTcekzKUAfDGDcCmSmvjMD4BVOLmHI0Sw5fnedTH+/V0a3EdoslGz64Xj1wLaPdQNEZOpS+zhNY+RD4nI/it+AekIxcLpelICohs=
COM. 86400 RRSIG NSEC3PARAM 8 1 86400 20111003041551 20110926030551 41798 COM. kZJE4UhCffn1QcdyOOP+SUXfRgy8AOVbAIm6FDAZ5KHPny/qvISB5sluDWUFIai1CuugVbgVgUIaWaQqP9X+DP47hmqS8qyCNSQ2fekc2McQlu+dGaTwqcHmSwCrxV7Av6+trzYPkA2X/1m6tVT+T62x1ly/q+GT5DSVUNO/VnQ=
COM. 900 RRSIG SOA 8 1 900 20111005041932 20110928030932 41798 COM. JHj4he/55NXCrGrm7xzrTjsGbgVYlll1YLTUMWw5IPchpJUTe+PhLKZ93Kn3N6lWQ7gNAU/kwzWNa7cBEdfLROB22iCvfCG1S+j2YOKCejvDdsAy+g8yANZOiqaW/2ZAALqJCL2mCXUqyBXYRsgwvks+Ur8bYyM14xUF8KG+cjs=
COM. 86400 RRSIG DNSKEY 8 1 86400 20111001182533 20110924182033 30909 COM. hGnWUsF7zYK3iJODN/HtrcyiPQGFaEMgAHoFNtspTvYFrvDgoZvy/Clt+PuPXoa5k1fl7O1qFXJluuky+9xcGE/E+wqpwoayMah0Xw3lcr3k+MEFky1ofBDFPiN1DWpbPrsR9SwUcndobRETH/cNyujB7B0Mtf10U7+UOK1+CsNmCcrYl8RGgpzPPsIhQnyyI7YnwS7htCo4Ksxx6QjOOahJnDb5IjQt6x4DVXDUJKkpPVPdbOMcwW zNPYDyaBGPVHkyb+lU+G6xLibRPytRhV1dstHyFf6nbIeFudUpG9qNyIQL7N5eRXFUKsWQoAOV+ThNPee2NFkjQmRk0By1Dw==
ENERCONTECHNOLOGIES NS NS1.BIZ.RR
ENERCONTECHNOLOGIES NS NS2.BIZ.RR
SELF-DRIVE-CAR-RENTAL NS NS3.IZP
SELF-DRIVE-CAR-RENTAL NS NS6.IZP
SELF-DRIVE-CAR-RENTAL NS NS7.IZP
SELF-DRIVE-CAR-RENTAL NS IZA.HOSTING.DIGIWEB.IE.
SELF-DRIVE-CAR-RENTAL NS NS4.IZP
NANCYVRAINE NS NS1.IMCONLINE.NET.
NANCYVRAINE NS NS2.IMCONLINE.NET.
SELFDRIVECARRENTAL NS NS3.IZP
SELFDRIVECARRENTAL NS NS6.IZP
SELFDRIVECARRENTAL NS NS7.IZP
SELFDRIVECARRENTAL NS IZA.HOSTING.DIGIWEB.IE.
SELFDRIVECARRENTAL NS NS4.IZP
WORLDDATASOURCE NS NS01.DOMAINCONTROL
NS1.ANISHTECH A 118.99.98.50
NS2.BABYTROLTANISHLE A 174.78.35.253
NS2.LOUBOUTINSALES A 69.04.247.123
NS2.LAURENZAINS A 69.162.67.212
NS1.BUYOURSHOES A 69.56.247.123
NS2.BUYOURSHOES A 69.56.247.123
NS1.HOSTINGALWAYS A 74.52.251.165
NS2.HOSTINGALWAYS A 74.52.251.174
NS1.PETROTECHHAULING A 108.60.199.98
NS1.MAHALAHEATHER A 24.232.33.129
NS2.MAHALAHEATHER A 211.139.119.206
NS1.GENIXTECHNOLOGY A 182.50.129.6
NS2.GENIXTECHNOLOGY A 182.50.129.6
NS1.DEVABKK A 108.62.124.151
NS2.DEVABKK A 108.62.124.151
NS1.LAGISSOTTE A 108.62.124.152
NS1.CONTROLHELM5 A 208.100.8.40
NS2.CONTROLHELM5 A 208.100.8.41
NS2.LAGISSOTTE A 108.62.124.152
NS1.MKT-EVIDENYDIGITAL A 188.138.124.46
NS2.MKT-EVIDENYDIGITAL A 188.138.126.32
NS1.JOANNEHAIR A 108.62.124.153
NS1.TRACKTHEFOOD A 50.22.93.76
NS1.ATOUCHOFSPARKLE-YUMA A 209.202.252.20
NS2.ATOUCHOFSPARKLE-YUMA A 209.202.254.20
NS1.MANDINGOPICS A 64.251.6.3
NS1.MANDINGOPICS A 69.60.115.123
NS2.TRACKTHEFOOD A 50.22.93.77
CONVEYANCINGSYDNEYCITY A 65.39.205.54
OVERLORD.LIGHTENHEIM A 144.38.216.163
DRONE.LIGHTENHEIM A 144.38.216.164
NS1.MONSES A 173.231.56.170
NS2.MONSES A 173.231.56.171
NS2.JOANNEHAIR A 108.62.124.153
NS1.VOQE A 173.254.196.42
NS2.VOQE A 173.254.196.42
NS1.ARIELSTINE A 69.93.119.112
NS2.ARIELSTINE A 69.93.79.100
NS1.DATOBE A 173.231.56.170
NS2.DATOBE A 173.231.56.171
NS1.THECLACH A 108.62.124.154
NS2.THECLACH A 108.62.124.154
NS1.SCREW-PAYPAL A 173.201.24.59
NS2.SCREW-PAYPAL A 173.201.24.59
NS1.INDEXEDHEALTHINFO A 69.93.119.112
NS2.INDEXEDHEALTHINFO A 69.93.79.100
NS1.DIRECTORYPERFECT A 69.93.125.48
NS2.DIRECTORYPERFECT A 69.93.70.45
;End of file
The file contains this kind of data; from it, the awk is sorting out only the unique domain names. So even when I used your code (Corona688), it still takes time and load.
# 14  
Old 10-04-2011
Quote:
Originally Posted by anishkumarv
Ya, it's working fine, dude, but my problem was only the file size.
First you said it's memory, then CPU time, now file size -- which is your goal here?
Quote:
The file contains this kind of data; from it, the awk is sorting out only the unique domain names. So even when I used your code (Corona688), it still takes time and load.
Of course it takes time and load. 8 gigabytes of data isn't going to be sorted in a nanosecond.

I asked questions which could be used to further improve the code. Is BIZFILE actually needed for anything, now that you don't need to recalculate the database count? If not, leaving out { print $0 > BIZFILE } will avoid a lot of disk-writing and give some more boost.

I'm not quite following the logic in this awk script:
Code:
/^[^ ]+ IN NS/ && !_[$1]++{print $1; tot++}

Absolutely nothing in that domain file snippet of yours contains 'IN NS', so that ought to never match. It doesn't look like the first field is what you're actually interested in anyway. How does this work?

---------- Post updated at 10:06 AM ---------- Previous update was at 09:25 AM ----------

I've been trying to think of an awkless way for you; so far I'm stumped.

Building it in pure C means needing an associative array, i.e. I'm ending up just building a hardcoded implementation of awk. It'd have to be a really good associative array to get the necessary speed -- I bet awk's would be faster.

Building it with other shell commands means piping it through grep and cut before feeding it into a sort -u, and then afterwards, reprocessing the output again to get the record count -- either that, or doing a tee and wc -l. That's a 5-long pipe chain for 8GB of data -- in effect processing 40 gigs of data, not 8... That's not going to be more efficient.
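Roughly, that shell-only chain would be something like this (a sketch with hypothetical sample data standing in for the 8GB zone file):

```shell
# grep the NS records, cut the first field, sort -u to deduplicate,
# tee the domain list to a file, wc -l for the record count:
printf '%s\n' \
  'FOO IN NS NS1.EXAMPLE.' \
  'FOO IN NS NS2.EXAMPLE.' \
  'BAR IN NS NS1.EXAMPLE.' |
grep ' IN NS' | cut -d' ' -f1 | sort -u | tee domains.txt | wc -l
# domains.txt gets BAR and FOO; wc -l reports 2 for this sample
```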

I could build a C program that does the grep | cut for you, which would let you pipe it directly into sort -u | tee | wc -l. That's only a 4-long pipe chain... Unless you've got 4 cores, that's probably still not better than the script you have now.

awk's flexible enough to do everything in one shot, which is pretty tough to beat.