What is the most effective way to process a large logfile?
I am dealing with a very large firewall logfile (more than 10 GB). The logfile looks like this:

*snip*
Nov 9 10:12:01 testfirewall root: [ID 702911 local5.info]
Nov 9 10:12:01 testfirewall root: [ID 702911 local5.info] 0:00:11 accept testfw01-hme0 >hme0 proto: icmp; src: test001.example.net; dst: abc.dst.net; rule: 1; icmp-type: 8; icmp-code: 0; product: VPN-1 & Fire
*snip*

I don't need any line containing "icmp" or "snmp". Since many lines have no content (like the first line in the example, which has nothing after local5.info), I grep for "src". Then I keep only the lines whose 16th field neither starts with 192.12 or 192.34 nor contains "test". Then I print several fields, separated by tabs (\t) instead of spaces, and finally delete every ";" character.

My command is as follows:

egrep -vi "icmp|snmp" /logs/logfile | egrep -i "src" | awk '$16!~/(^192.(12|34)|.*test.*)/' | awk 'BEGIN {OFS="\t"} {print $1$2, $11,$10,$14,$16,$18,$20," ",$26}' | sed 's/;//g' > /tmp/logfile2

I don't think my way is efficient. Can anyone give me some suggestions on how to organize my command? Thank you!

Last edited by fedora; 11-13-2006 at 07:24 PM.
You want a single process and this does that. A perl or ksh solution might beat this by a little bit, provided that they carefully use only built-in commands and never invoke anything external. Perl and ksh compile the script while awk does not. And a custom C program can beat anything else.
Your 5-stage pipeline will not be even close to a single process. Even if you have 5 CPUs available that can be dedicated to the pipeline, all of that reading and writing to pipes is expensive. (Anything is expensive when you do it many millions of times.) And you probably do not have 5 CPUs available for the entire run. Without 5 dedicated CPUs you will need to context switch several million times as well.
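
As a rough sketch of the single-process idea (an assumption about its shape, not the script the reply refers to), the five stages of the original pipeline can be folded into one awk program. The function name filter_log is made up for illustration, and the dot in 192.12 is escaped here, which the original regex did not do.

```shell
# Hypothetical helper: one awk process doing the work of the 5-stage pipeline.
filter_log() {
    awk '
        BEGIN { OFS = "\t" }
        {
            lower = tolower($0)

            # Stage 1 (egrep -vi): drop lines mentioning icmp or snmp
            if (lower ~ /icmp|snmp/) next

            # Stage 2 (egrep -i): keep only lines that contain src
            if (index(lower, "src") == 0) next

            # Stage 3: drop lines whose 16th field starts with
            # 192.12 or 192.34, or contains test
            if ($16 ~ /^192\.(12|34)|test/) next

            # Stage 4: same fields as the original print, tab-separated
            out = $1 $2 OFS $11 OFS $10 OFS $14 OFS $16 OFS $18 OFS $20 OFS " " OFS $26

            # Stage 5 (sed): delete every semicolon from the output line
            gsub(/;/, "", out)
            print out
        }
    ' "$@"
}
```

It would be called as `filter_log /logs/logfile > /tmp/logfile2`. The cheap whole-line tests come first so that most unwanted lines are rejected before any field is looked at.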
Just an idea.
Why don't you split the file into smaller files of 1 GB each? Then use Pederarbo's awk script to go through each one of the split files. Since awk processes its input as a stream, there can be nothing faster for working on data than working on data streams.
After you are done cleaning the files, you could append them into a single file. Splitting the file is just an idea, and it might save time, because you would be handling small sets of data, each flowing in one continuous stream, rather than one large set of 10 GB.
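
The split-then-append idea above could be sketched roughly as follows. The function name process_in_chunks, the chunk size, and the placeholder filter are all assumptions for illustration; the real per-chunk command would be whatever awk script is being used. The sketch splits by line count rather than by bytes so that no log line is cut in half.

```shell
# Hypothetical helper: split a log by line count, filter each chunk,
# and append the results to one output file in the original order.
process_in_chunks() {
    input=$1
    output=$2
    lines_per_chunk=${3:-1000000}   # assumed default chunk size
    workdir=$(mktemp -d) || return 1

    # split -l cuts only at line boundaries, so no record is broken in two
    split -l "$lines_per_chunk" "$input" "$workdir/chunk." || return 1

    : > "$output"
    # split names chunks aa, ab, ... so the glob expands in original order
    for chunk in "$workdir"/chunk.*; do
        # Placeholder filter: drop lines mentioning icmp or snmp
        awk 'tolower($0) !~ /icmp|snmp/' "$chunk" >> "$output"
    done

    rm -rf "$workdir"
}
```

A call might look like `process_in_chunks /logs/logfile /tmp/logfile2 5000000`. The split step reads and writes the whole file once before any filtering starts, so it is worth timing this against a single pass over the unsplit file.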