Need help improving my script.


 
# 1  
Old 04-12-2016

Thank you for taking the time to look at this and provide input.
To start, I am not a Linux/Unix expert, but I muddle through the best I can.
I am also in no way, shape, or form a programmer, so please keep that in mind as you read this script.

This script is designed to find all files in a given directory that begin with "asalog", find lines containing a specific word, and then process those lines to output just the needed information. The files are compressed and stored on remote ZFS storage. Copying all of the files down to the local system at once and then unzipping them is not feasible due to storage limitations. The script works as designed, but it is very slow at the task.

Please look over the code and suggest ways that I could improve its speed. The last run took 238 minutes to complete.

Due to access limitations I have to work within BASH, I do not have the option (nor the knowledge) to utilize perl, python, etc.

Any help is welcome, as well as comments on the script as it sits. It has been cobbled together from programming structure remembered from taking Turbo Pascal in high school (many years ago) and lots of Google searches.

Code:
echo Search started at:
date +"%m/%d/%Y %T"
# Displays the start up information and the start time

find /var/network_logs/gc/archive/asalog*  -mtime -7 -exec zcat {} \;  |  awk '/Built/&& !/10.10.120.145/{print $10, $11, $15, $18;}' | sed -e 's!/! !g' -e  's!:! !g' | awk '{if ($1 == "inbound") print $1, $2, $3, $4, $6, $7, $8; else if ($1 == "outbound") print $1, $2, $6, $7, $3, $4, $5;}' | awk '!seen[$0]++ {print}' >> /home/kenneth.cramer/asa/GC_ports.txt

# Finds all files that begin with the name asalog that were written in the last 7 days. It then reads the files line by line looking
# for any lines containing the word Built but not the 10.10.120.145 IP address and prints out the 10th, 11th, 15th and 18th fields of the line.
# It then looks for any "/" slashes or ":" colons in the four fields and replaces them with spaces.
# The script then prints the needed fields from the line and writes only unique lines to the output file.

echo
echo
echo
echo Sorting data into proper files.
# Displays that the script is now sorting the information

awk '{if ($1 == "inbound" && $2 == "TCP") print $2, $3, $4, $5, $6, $7 >> "/home/kenneth.cramer/asa/GC_tcpinbound.txt"; else if ($1 == "inbound" && $2 == "UDP") print $2, $3, $4, $5,  $6, $7 >> "/home/kenneth.cramer/asa/GC_udpinbound.txt"; else if ($1 == "outbound" && $2 == "TCP") print $2, $3, $4, $5, $6, $7 >> "/home/kenneth.cramer/asa/GC_tcpoutbound.txt"; else if ($1 == "outbound" && $2 == "UDP") print $2, $3, $4, $5, $6, $7 >> "/home/kenneth.cramer/asa/GC_udpoutbound.txt";}' /home/kenneth.cramer/asa/GC_ports.txt
# The script now reads the file ports2.txt and sorts the data into 4 files based on it finding "Inbound or Outbound" and "TCP or UDP" in the line.


echo
echo
echo
echo Compressing files for transport

tar -czvf /home/kenneth.cramer/asa/GC_ports.tgz /home/kenneth.cramer/asa/GC_*.txt
# Compresses the output files into a single file for transport off the machine.

echo Process completed for Gold Camp at:
date +"%m/%d/%Y %T"
echo
echo
times


# 2  
Old 04-12-2016
You have the comment:
Code:
# The script now reads the file ports2.txt and sorts the data into 4 files based on it finding "Inbound or Outbound" and "TCP or UDP" in the line.

but there is no ports2.txt anywhere in your script. Do you care about having the file GC_ports.txt, or do you really just need the four GC_(tcp|udp)(in|out)bound.txt files that are created from this file?

Assuming you have more than one compressed file that is less than a week old, using -exec zcat {} + will be faster than -exec zcat {} \;. You could also replace all four awk scripts and the sed script with a single awk script, which would considerably reduce the time spent reading and writing data that should only need to be read once and written at most twice, instead of being read five times and written six or seven times.

But, I would guess (hard to make any sound judgements here with no samples of the data being processed) that the bulk of the time being spent in this script is in compressing and recompressing relatively large files for your archives. And if you just need the files that you are splitting out of GC_ports.txt, the time spent creating, compressing, and archiving that unneeded file could be significant.

Can you show us some sample uncompressed data that is being pushed through the pipeline by find ... -exec zcat {} \;? Figuring out exactly what that pipeline is doing without knowing where the slashes and colons are makes it hard to feel confident about suggesting ways to streamline your awk and sed scripts.

Even though echo is a shell built-in, invoking echo four times in a row instead of calling printf (another shell built-in) once is inefficient. (Depending on what operating system you're using, you could probably produce the same output with a single echo instead of a single printf, but I prefer printf since its formatting options are more portable.) Do you really want/need that many empty lines in the output produced by this script?
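A minimal sketch of the printf alternative (the message text just mirrors what the posted script prints):

```shell
# One printf call replaces three empty echo calls plus one echo with text:
# three blank lines followed by the status message.
printf '\n\n\nSorting data into proper files.\n'
```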
# 3  
Old 04-12-2016
Sorry, ports2.txt is incorrect. I pull data from two different archives, so I created one script to test functionality and then copied it into two separate script files, with a third script to run the other two. The script itself was running, so I pulled the code from the original test file instead and missed that it still referenced the test file's output name.

Ports.sh only contains lines to run
vacavilleports.sh and goldcampports.sh

The code I posted was from a file named testports.sh, which was the test code copied into vacavilleports.sh and goldcampports.sh; each of those was then modified to reference its proper archive locations and output files.

I did not know if opening the script file while it was running would impact it so I chose the safe route of opening the test version.

Here is sample output from
Code:
GC_tcpinbound.txt
TCP internal 10.20.114.190 intmgmt 10.20.100.175 258
TCP internal 10.20.114.190 intmgmt 10.20.100.175 6455
TCP internal 10.20.114.190 intmgmt 10.20.100.175 1678
TCP internal 10.20.114.190 intmgmt 10.20.100.162 33923

Here is some sample input
Code:
Apr  5 19:00:02 Apr 05 2016 19:01:02: %ASA-6-302015: Built outbound UDP connection 731199055 for internal:10.20.114.120/53 (10.20.114.120/53) to intmgmt:10.20.100.48/53099 (10.20.100.48/53099)
Apr  5 19:00:02 Apr 05 2016 19:01:02: %ASA-6-302015: Built outbound UDP connection 731199056 for internal:10.20.114.120/53 (10.20.114.120/53) to intmgmt:10.20.100.48/43185 (10.20.100.48/43185)
Apr  5 19:00:02 Apr 05 2016 19:01:02: %ASA-6-302015: Built outbound UDP connection 731199057 for internal:10.20.114.120/53 (10.20.114.120/53) to intmgmt:10.20.100.48/42319 (10.20.100.48/42319)
Apr  5 19:00:02 Apr 05 2016 19:01:02: %ASA-6-302016: Teardown UDP connection 731198699 for outside:158.96.0.254/53 to internal:10.20.114.124/58504 duration 0:00:00 bytes 179
Apr  5 19:00:02 Apr 05 2016 19:01:02: %ASA-6-302015: Built outbound UDP connection 731199059 for internal:10.20.114.120/53 (10.20.114.120/53) to intmgmt:10.20.100.48/54069 (10.20.100.48/54069)




Here is the exact code from goldcampports.sh

Code:
echo Search started at:
date +"%m/%d/%Y %T"
# Displays the start up information and the start time

find /var/network_logs/gc/archive/asalog*  -mtime -7 -exec zcat {} \;  |  awk '/Built/&& !/10.10.120.145/{print $10, $11, $15, $18;}' | sed -e 's!/! !g' -e  's!:! !g' | awk '{if ($1 == "inbound") print $1, $2, $3, $4, $6, $7, $8; else if ($1 == "outbound") print $1, $2, $6, $7, $3, $4, $5;}' | awk '!seen[$0]++ {print}' >> /home/kenneth.cramer/asa/GC_ports.txt

# Finds all files that begin with the name asalog that were written in the last 7 days. It then reads the files line by line looking
# for any lines containing the word Built but not the 10.10.120.145 IP address and prints out the 10th, 11th, 15th and 18th fields of the line.
# It then looks for any "/" slashes or ":" colons in the four fields and replaces them with spaces.
# The script then prints the needed fields from the line and writes only unique lines to the output file.

echo
echo
echo
echo Sorting data into proper files.
# Displays that the script is now sorting the information

awk '{if ($1 == "inbound" && $2 == "TCP") print $2, $3, $4, $5, $6, $7 >> "/home/kenneth.cramer/asa/GC_tcpinbound.txt"; else if ($1 == "inbound" && $2 == "UDP") print $2, $3, $4, $5,  $6, $7 >> "/home/kenneth.cramer/asa/GC_udpinbound.txt"; else if ($1 == "outbound" && $2 == "TCP") print $2, $3, $4, $5, $6, $7 >> "/home/kenneth.cramer/asa/GC_tcpoutbound.txt"; else if ($1 == "outbound" && $2 == "UDP") print $2, $3, $4, $5, $6, $7 >> "/home/kenneth.cramer/asa/GC_udpoutbound.txt";}' /home/kenneth.cramer/asa/GC_ports.txt
# The script now reads the file ports2.txt and sorts the data into 4 files based on it finding "Inbound or Outbound" and "TCP or UDP" in the line.
DOH!!! This ^ should read: # The script now reads the file GC_ports.txt and sorts the data into 4 files based on it finding "Inbound or Outbound" and "TCP or UDP" in the line.

echo
echo
echo
echo Compressing files for transport

tar -czvf /home/kenneth.cramer/asa/GC_ports.tgz /home/kenneth.cramer/asa/GC_*.txt
# Compresses the output files into a single file for transport off the machine.

echo Process completed for Gold Camp at:
date +"%m/%d/%Y %T"
echo
echo
times


In answer to your other questions,

1. No I do not care about having the GC_ports.txt file.
2. My only goal is to reach the output in those 4 files.
3. The blank lines are just for spacing. I did not spend much time researching the best way to print blank lines, as this script has minimal output: just enough to let the person who runs it know what it is doing. I am mainly the person who runs it, but I built that in just in case someone else had to run it and got confused by the system not returning immediately to the prompt.

I hope the script is not too hard to follow; I am a network engineer, not a programmer or a Unix admin. This is all to assist a client in redoing their firewall.
The input is from the firewall log. We are looking for lines containing the word Built and capturing the source IP, destination IP, destination port, and protocol of the connections. The four output files are dumped into 4 sheets in Excel so we can see which IPs are talking and what rules we need to build. When a previous company set up the firewall, they left any/any rules in place for internal traffic and only locked down the outside interface. So we have to figure out what rules we need to create before removing those any/any rules and causing massive connectivity issues.

Yes, there are many tools out there to do this for us, but this is all client owned hardware and they don't have those tools installed. So we are left with this.

The log files it pulls are in 1-hour intervals, so 24 files per day times 7 days = 168 compressed log files. I did try copying down the zipped files and then uncompressing them on the local machine, but expanded they are almost 60 GB. (Repetitive text compresses VERY VERY well.)

Thank you again for your suggestions and assistance.

# 4  
Old 04-12-2016
I would suggest also using some of your vertical real estate, since that will greatly improve readability for future maintenance.

I expect Don's suggestion of using the + instead of \; will significantly increase processing speed.

An untested example of what a single awk might look like:

Code:
find /var/network_logs/gc/archive/asalog*  -mtime -7 -exec zcat {} +  |
awk '
  !/Built inbound|Built outbound/ || /10\.10\.120\.145/ {
    next
  }
  {
    $0=$10 FS $11 FS $15 FS $18                   # recalculate fields
    gsub("[/:]",FS)
    if ($1 == "inbound")
      $0=$1 FS $2 FS $3 FS $4 FS $6 FS $7 FS $8   # recalculate fields
    else if ($1 == "outbound")
      $0=$1 FS $2 FS $6 FS $7 FS $3 FS $4 FS $5   # recalculate fields
  } 
  !seen[$0]++
' >> /home/kenneth.cramer/asa/GC_ports.txt


# 5  
Old 04-13-2016
What is the difference between "+" and "\;"? What about that would help with the speed? Sorry for my ignorance; I really am trying to learn as I go on this.
# 6  
Old 04-13-2016
Quote:
Originally Posted by garlandxj11
What is the difference between "+" and "\;"? What about that would help with the speed? Sorry for my ignorance; I really am trying to learn as I go on this.
First off: no problem! There are no dumb questions, just dumb answers, so don't be shy.

The difference is that "\;" will call the command named in "-exec" once for each file found. For instance, let us suppose the current directory contains 5 files, "a", "b", "c", "d" and "e" (and nothing else):

Code:
find . -type f -exec rm {} \;

This will delete all the files, but it will delete each file individually. In fact, this is what will be executed:

Code:
rm a
rm b
rm c
rm d
rm e

But "rm" can take a list of files as well and this:

Code:
find . -type f -exec rm {} +

would be the same as

Code:
rm a b c d e

The difference seems small, but most of the time spent calling a command like "rm" goes into loading and starting the program, not executing it. Therefore, if the command is executed once instead of five times, the speed gain will be quite noticeable.
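You can see the batching for yourself with echo standing in for rm (a small sketch; the scratch directory is created just for the demonstration):

```shell
# Create five files in a scratch directory.
dir=$(mktemp -d)
cd "$dir"
touch a b c d e

# With \; echo runs once per file: five invocations, five output lines.
find . -type f -exec echo {} \; | wc -l

# With + the names are batched into one echo call: a single output line.
find . -type f -exec echo {} + | wc -l
```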

I hope this helps.

bakunin
# 7  
Old 04-13-2016
I was thinking of taking Scrutinizer's suggestions a step further: getting rid of the unneeded GC_ports.txt file completely and just using one awk script to produce the four desired output files (GC_tcpinbound.txt, GC_tcpoutbound.txt, GC_udpinbound.txt, and GC_udpoutbound.txt). As Bakunin explained, find -exec command + instead of find -exec command \; reduces the number of times zcat is invoked. I added the -v option to zcat to get a visible indication that progress is being made while the script runs.

Please remove /home/kenneth.cramer/asa/GC_ports.txt if that file is still present from an earlier run of your script (otherwise the GC_*.txt glob will sweep it into the new archive). Then see if something more like:
Code:
#!/bin/ksh
InputDir='/var/network_logs/gc/archive'
OutputDir='/home/kenneth.cramer/asa'

# Display the start time...
date +'Search started at: %m/%d/%Y %T%nProcessing asalog files...'

# Find and uncompress asalog* files that are less than a week old...
find "$InputDir"/asalog* -mtime -7 -exec zcat -v {} + |
awk -v OutputDir="$OutputDir" '
!/Built/ || /10.10.120.145/ {
	# Discard lines that do not contain "Built" and lines that contain
	# IP address 10.10.120.145.
	next
}
{	# Throw away unneeded data...
	$0 = $10 OFS $11 OFS $15 OFS $18
	# and change "/"s and ":"s to spaces (recomputing field boundaries).
	gsub("[/:]", " ")
}
$1 == "inbound" {
	# Process inbound records.
	if(seen[$1, $2, $3, $4, $6, $7, $8]++) {
		# Discard duplicates.
		next
	}
	# Following assumes we only have TCP and UDP inbound records.
	# Print to one of two inbound text files.
	print $2, $3, $4, $6, $7, $8 > (OutputDir "/GC_" \
	    (($2 == "TCP") ? "tcp" : "udp") "inbound.txt")
}
$1 == "outbound" {
	# Process outbound records.
	if(seen[$1, $2, $6, $7, $3, $4, $5]++) {
		# Discard duplicates.
		next
	}
	# Following assumes we only have TCP and UDP outbound records.
	# Print to one of two outbound text files.
	print $2, $6, $7, $3, $4, $5 > (OutputDir "/GC_" \
	    (($2 == "TCP") ? "tcp" : "udp") "outbound.txt")
}'

# Compress the output files into a single file for transport off the machine...
printf '\nCompressing files for transport...\n'

tar -czvf "$OutputDir/GC_ports.tgz" "$OutputDir"/GC_*.txt

# Print end time and statistics...
date +'%nProcess completed for Gold Camp at: %m/%d/%Y %T'
times

runs a little faster for you.

I know that you said you wanted to use bash, but I generally find that ksh will run scripts like this a little faster. These shells use different output formats for the output from the times built-in utility, but should otherwise produce identical results for this script. (You may want to try both a few times with real data to see how much of a difference in speed there is between bash and ksh on your system.)

When run with InputDir and OutputDir set to "." and with six copies of a compressed version of the sample input you provided in post #3 in files named asalog_test1.Z through asalog_test6.Z, it produces the output file GC_udpoutbound.txt containing:
Code:
UDP intmgmt 10.20.100.48 internal 10.20.114.120 53

and the compressed tar archive file GC_ports.tgz and writes the following to standard output and standard error output:
Code:
Search started at: 04/13/2016 07:37:58
Processing asalog files...
./asalog_test1.Z:	   43.4%
./asalog_test2.Z:	   43.4%
./asalog_test3.Z:	   43.4%
./asalog_test4.Z:	   43.4%
./asalog_test5.Z:	   43.4%
./asalog_test6.Z:	   43.4%

Compressing files for transport...
a ./GC_udpoutbound.txt

Process completed for Gold Camp at: 04/13/2016 07:37:58
user	0m0.00s
sys	0m0.00s

while it runs.

While your script from post #3 in this thread (using bash but converted to use files in the current directory) produces the output:
Code:
Search started at:
04/13/2016 07:39:54



Sorting data into proper files.



Compressing files for transport
a ./GC_ports.txt
a ./GC_udpoutbound.txt
Process completed for Gold Camp at:
04/13/2016 07:39:54


0m0.002s 0m0.017s
0m0.011s 0m0.014s

As I said before, I imagine that a good portion of the time in this script is spent decompressing the asalog* files and, depending on the sizes of your four output files, recompressing the data as it creates the compressed archive, but I'm hoping the reduced number of processes running and the reduced number of times the uncompressed data is read and written will make this noticeably faster when you're working with real data.

Note that you provided sample UDP outbound records as sample input data but showed sample output data for TCP inbound records. So, I'm not sure that I produced the correct output formats for inbound or outbound records (since the output format for inbound records is not the same as the output format for outbound records).

Hope this helps,
- Don