Thank you for taking the time to look at this and provide input.
To start, I am not a linux/unix expert but I muddle through the best I can.
I am also in no way, shape, or form a programmer. Please keep that in mind as you read this script.
This script is designed to find all files in a given directory that begin with "asalog", find the lines containing a specific word, and then process those lines down to output just the needed information. The files are compressed and stored on remote ZFS storage. Copying all of them to the local system at once and then uncompressing them is not feasible due to storage limitations. The script works as designed, but it is very slow.
Please look over the code and suggest ways that I could improve its speed. The last run took 238 minutes to complete.
Due to access limitations I have to work within BASH, I do not have the option (nor the knowledge) to utilize perl, python, etc.
Any help is welcome as well as comments on the script as it sits. It has been cobbled together by remembering programming structure learned taking Turbo Pascal in high school (many years ago) and lots of google searches.
Last edited by garlandxj11; 04-12-2016 at 09:50 PM..
You have a comment in your script referring to ports2.txt, but there is no ports2.txt anywhere in your script. Do you care about having the file GC_ports.txt, or do you really just need the four GC_(tcp|udp)(in|out)bound.txt files that are created from it?
Assuming you have more than one compressed file that is less than a week old, using "-exec zcat {} +" will be faster than "-exec zcat {} \;". You could also replace all four awk scripts and the sed script with a single awk script. That would considerably reduce the time spent reading and writing data that should only need to be read once and written at most twice, instead of being read five times and written six or seven times.
But, I would guess (hard to make any sound judgements here with no samples of the data being processed) that the bulk of the time being spent in this script is in compressing and recompressing relatively large files for your archives. And if you just need the files that you are splitting out of GC_ports.txt, the time spent creating, compressing, and archiving that unneeded file could be significant.
Can you show us some sample uncompressed data that is being pushed through the pipeline by find ... -exec zcat {} \;? Figuring out exactly what that pipeline is doing without knowing where the slashes and colons are makes it hard to feel confident about suggesting ways to streamline your awk and sed scripts.
Even though echo is a shell built-in, invoking echo four times in a row instead of calling printf (another shell built-in) once is inefficient. (Depending on what operating system you're using, you could probably produce the same output with a single echo instead of a single printf, but I prefer printf since its formatting options are more portable.) Do you really want/need that many empty lines in the output produced by this script?
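For example (a sketch; I'm assuming the four echo calls each print an empty line):

```shell
# Four echo invocations versus a single printf; both emit four blank lines.
echo ""; echo ""; echo ""; echo ""
printf '\n\n\n\n'
```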
Sorry, Ports2.txt is incorrect. I pull data from two different archives, so I created one script to test functionality and then copied it into two separate script files, with a third script to run the other two. The script itself was running at the time, so I pulled the code from the original test file instead and missed that it was still referencing the test file.
Ports.sh only contains lines to run
vacavilleports.sh and goldcampports.sh
The code I posted was from a file named testports.sh, which holds the test code that was copied into vacavilleports.sh and goldcampports.sh; each of those was then modified to reference its proper archive locations and output files.
I did not know if opening the script file while it was running would impact it so I chose the safe route of opening the test version.
Here is sample output from
Here is some sample input
Here is the exact code from goldcampports.sh
In answer to your other questions,
1. No I do not care about having the GC_ports.txt file.
2. My only goal is to reach the output in those 4 files.
3. The blank lines are just for spacing. I did not spend much time researching the best way to print blank lines, as this script has minimal output, just enough to let the person who runs it know what it is doing. I am usually the one who runs it, but I built that in just in case someone else had to run it and was confused by the system not returning immediately to the prompt.
I hope the script is not too hard to follow, I am a network engineer not a programmer or a unix admin. This all is to assist a client in redoing their firewall.
The input is from the firewall log. We are looking for lines containing the word "built" and capturing the source IP, destination IP, destination port, and protocol of each connection. The four output files are dumped into four sheets in Excel so we can see which IPs are talking and which rules we need to build. When a previous company set up the firewall, they left any/any rules in place for internal traffic and only locked down the outside interface. So we have to figure out what rules we need to create before removing those any/any rules and causing massive connectivity issues.
Yes, there are many tools out there to do this for us, but this is all client owned hardware and they don't have those tools installed. So we are left with this.
The log files it pulls are in 1 hour intervals, so 24 files per day times 7 days = 168 compressed log files. I did try copying down the zipped files and then uncompressing them on the local machine but expanded they are almost 60 gig. (repetitive text compresses VERY VERY well)
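For what it's worth, pulling those four fields out of one log line can be done in a single awk pass. The sample line below is synthetic, and the assumption that the source follows "for" and the destination follows "to" (each as interface:ip/port) must be checked against the real logs:

```shell
# Hypothetical ASA "Built" line; the field layout is an assumption.
line='%ASA-6-302013: Built outbound TCP connection 7 for inside:10.1.1.10/51515 (10.1.1.10/51515) to outside:198.51.100.7/80 (198.51.100.7/80)'

echo "$line" | awk '/Built/ {
    for (j = 1; j < NF; j++) {            # locate the tokens after "for" and "to"
        if ($j == "for") src = $(j+1)
        if ($j == "to")  dst = $(j+1)
    }
    sub(/^[^:]*:/, "", src); sub(/^[^:]*:/, "", dst)   # drop the interface names
    split(src, s, "/"); split(dst, d, "/")             # split ip/port
    print s[1], d[1], d[2]                             # src IP, dst IP, dst port
}'
# prints: 10.1.1.10 198.51.100.7 80
```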
Thank you again for your suggestions and assistance.
Last edited by Scrutinizer; 04-13-2016 at 04:44 PM..
Reason: code tags
What is the difference in "+" instead of "\;" ? What about that would help with the speed? Sorry for my ignorance. I really am trying to learn as I go on this.
First off: no problem! There are no dumb questions, just dumb answers, so don't be shy.
The difference is that "\;" will call the command named in "-exec" once for each file found. For instance, let us suppose the current directory contains five files, "a", "b", "c", "d" and "e" (and nothing else). Then "find . -type f -exec rm {} \;" will delete all the files, but it will delete every file individually: in effect "rm ./a", then "rm ./b", and so on through "rm ./e".
But "rm" can take a list of files as well, and "find . -type f -exec rm {} +" would be the same as "rm ./a ./b ./c ./d ./e".
The difference seems small, but most of the time spent calling a command like "rm" goes into loading and starting the program, not into its actual work. Therefore, if the command is started once instead of five times, the speed gain is quite noticeable.
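You can watch the batching happen by substituting echo for rm so nothing is actually deleted (the directory and file names here are made up):

```shell
# Five invocations with \; (one line of output per file) versus one with +.
mkdir -p /tmp/exec_demo && cd /tmp/exec_demo && touch a b c d e
find . -type f -exec echo rm {} \;   # five lines, one per file
find . -type f -exec echo rm {} +    # one line listing all five files (order may vary)
```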
I was thinking of taking Scrutinizer's suggestions a step further: getting rid of the unneeded GC_ports.txt file completely and using just one awk script to produce the four desired output files (GC_tcpinbound.txt, GC_tcpoutbound.txt, GC_udpinbound.txt, and GC_updoutbound.txt). As Bakunin explained, "find ... -exec command {} +" instead of "find ... -exec command {} \;" reduces the number of times zcat is invoked. I added the -v option to zcat to get a visible indication that progress is being made while the script runs.
Please remove /var/network_logs/gc/archive/GC_ports.txt if that file is still present from an earlier run of your script. Then see if something more like:
runs a little faster for you.
I know that you said you wanted to use bash, but I generally find that ksh will run scripts like this a little faster. These shells use different output formats for the output from the times built-in utility, but should otherwise produce identical results for this script. (You may want to try both a few times with real data to see how much of a difference in speed there is between bash and ksh on your system.)
When run with InputDir and OutputDir set to "." and with six copies of a compressed version of the sample input you provided in post #3 in files named asalog_test1.Z through asalog_test6.Z, it produces the output file GC_updoutbound.txt containing:
and the compressed tar archive file GC_ports.tgz, and writes the following to standard output and standard error:
while it runs.
While your script from post #3 in this thread (using bash but converted to use files in the current directory) produces the output:
As I said before, I imagine that a good portion of the time in this script is spent decompressing the asalog* files and, depending on the sizes of your four output files, recompressing the data as it creates the compressed archive, but I'm hoping the reduced number of processes running and the reduced number of times the uncompressed data is read and written will make this noticeably faster when you're working with real data.
Note that you provided sample UDP outbound records as sample input data, and you showed sample output data for TCP inbound records. So, I'm not sure that I produced the correct output formats for inbound or outbound records (since the output format for inbound records is not the same as the output format for outbound records).