Daily script taking increasingly longer each day


 
# 1  
Old 03-05-2017

Hello,

I was wondering if anyone had an idea why a recurring script might take slightly longer each day.

The intent of the script is to run on a dedicated VM and connect to several thousand remote servers to assay their operating status by running a series of commands and storing the results locally. This script runs once daily.

On the VM, the script is run in parallel (using GNU parallel) to increase the rate at which the target servers are checked. The VM runs as many parallel connections as it can handle (resources maxed out), which equates to roughly 100 connections at any given time.
The first part of the script initiates an SSH connection and passes about 30 commands to the target server (basic commands such as top, cat ..., grep ..., etc). Results are stored in a local file.
The second part of the script initiates an SCP request and copies zero to many (max 30) gzip files to the local server based on their filename timestamp -- these files represent logs of the activity on the servers, where more active servers have larger/more gzipped log files.

I've set a limit of 140 seconds in GNU parallel: if steps 1 and 2 combined take longer than 140 seconds, the job is killed and we move on. This number is deliberately high, as the process shouldn't take longer than 100 seconds.
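For illustration, the invocation looks roughly like this (a minimal sketch with made-up script and file names, not my actual code):

Code:
# ~100 jobs at once; kill any job that runs past 140 seconds and move on
parallel --jobs 100 --timeout 140 ./check_host.sh {} :::: hosts.txt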

So far I've covered how the data is pulled. The data is stored in many txt files on the local VM, each named with the IP address and date of creation -- this is necessary for later aggregation and analysis by our monitoring tool. I've also had to hash the files into directories based on their IP structure, i.e.:
IP 1.2.3.4 would be in folder 1.2/1.2.3/1.2.3.4_sshresult_date.txt
This was done because the monitoring/aggregation tool had a difficult time reading from a single folder with such a large number of files. Once our monitoring tool reads the .txt files it also deletes them to prepare for the next day's run.
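Roughly, the hashing above works like this (a simplified sketch, not the production code):

Code:
# Derive the hashed directory for an IP like 1.2.3.4 (illustrative names)
ip=1.2.3.4
top=${ip%.*.*}      # 1.2
mid=${ip%.*}        # 1.2.3
mkdir -p "$top/$mid"
out="$top/$mid/${ip}_sshresult_$(date +%Y%m%d).txt"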

This is a big issue for me: when this process was first put in, it took 5 hours to complete. In the past 2 months that number has climbed to 8.5 hours, with no changes to the script. I've added some extra logging to the SSH and SCP components of the script, and each day I can see the average SSH time increase by 0.5 to 1 second. The same goes for the SCP execution.


What I've tried:
1) Reinitializing the monitoring/aggregating tool (which also sits actively on the VM) in case it was causing a memory leak or file-locking issues.
2) Rebooting the server occasionally to clear memory in case there existed a general memory leak from any source.

Some possible explanations that I've thought of or have been suggested to me:
1) inodes on the local VM may be out of whack due to the large number of files created and deleted each day. I've never had to deal with inode management, so I'm not sure if this is plausible, or how to deal with it if it is the issue (a quick check is sketched after this list).
2) Perhaps connecting each day by SSH and SCP is producing some effect on the remote servers that actually causes them to respond more slowly each day -- this would be the worst-case scenario. I don't see how initiating one SSH and one SCP session per day could affect the target server, but could this be possible if the target server is 'old/fragile'?
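For item 1, a quick sanity check (the path is a placeholder for wherever the result files land):

Code:
# Inode usage on the filesystem holding the result files;
# IUse% near 100% would mean inode exhaustion
df -i /data
# How many files does a run actually leave behind?
find /data -type f | wc -l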

Sorry for the long post, I just wanted to make sure I gave enough detail. Please let me know if you have any questions and I'll do my best to answer.

Thanks for any help provided! I'm relatively new to large scale bash scripting and I want this thing to run efficiently.
# 2  
Old 03-05-2017
Two comments

In performance tuning, "getting slower" means you check I/O first. A priori it sounds like a disk-efficiency problem. That steady daily increment indicates I/O is most likely the problem: increasingly bigger file sizes, more files, or bad directory-lookup performance.

Example: huge numbers of files in a single directory degrade lookup performance; how badly depends on the disk hardware and the filesystem. We had a poster here years back who could not understand why it took ls 90 seconds to locate a file in a directory holding a million emails. The workaround was (and is) to create a multi-branched directory tree with far fewer entries per directory. The find command has similar problems. The tell is a really large directory file size: ls -ld somedirectory. PS: directory files in most filesystems are not self-reorganizing -- they don't shrink -- so the smoking gun usually does not go away.
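To check for that symptom, something like this (the paths are placeholders):

Code:
# A directory entry has a file size of its own;
# multiple megabytes is the smoking gun
ls -ld /data/1.2/1.2.3
# Find any directory whose entry file has grown past 1 MB
find /data -type d -size +1M -exec ls -ld {} +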

Because this runs against lots of servers, you should use the very data you are collecting to see whether the problem is localized to a few servers or spread generally across all of them. It is not clear to me that you did this.

Parallel also obscures the granularity you need for process observation. One process that does something outrageous may not show itself right away. You may have to resort to running the time command on each process and looking for outliers. But first you must find at least one poster-child server that clearly runs slower today than it did a while back.
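For example, a thin per-host timing wrapper (a sketch; the script and file names are invented):

Code:
#!/bin/bash
# time_host.sh -- log how long each host takes (illustrative)
host=$1
start=$(date +%s)
./check_host.sh "$host"
rc=$?
end=$(date +%s)
# one line per host per run: date, host, seconds, exit status
printf '%s %s %d %d\n' "$(date +%F)" "$host" "$((end - start))" "$rc" >> durations.log
exit "$rc"

Sort durations.log on the seconds column day over day and the drifting hosts should fall out.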

Sounds like your next few Sundays (or whenever you can monopolize some servers) are spoken for.

# 3  
Old 03-05-2017
One more. In thinking about the system design, why do you not put the onus of processing on each remote?

Each remote has a crontab script that runs the checks at 2:00 AM or whenever. It then sends a status file via scp to your local system: 'come get your files, the run took 03:14:10 today' or 'I have a problem' -- whatever you need to see.

Your local code just checks once a minute to see who has sent files. At the end of the run, it checks that the required number of status files exists, or that one arrived from every required server.
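The local side could be as simple as this (a sketch; paths and names are invented):

Code:
# Poll once a minute for status files dropped by the remotes
while sleep 60; do
    for f in /incoming/*.status; do
        [ -e "$f" ] || continue          # glob matched nothing
        host=$(basename "$f" .status)
        echo "$(date '+%F %T') $host reported in" >> collector.log
        # ...fetch that host's gzip files here...
        mv "$f" /incoming/done/
    done
done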

Do not forget to include the monitoring box (Nagios or whatever) as part of the problem set. It could have problems, too.

Why would you do this? It is like instrumenting a code base deployed all over the place -- exactly what you need to start finding problems. We use a database to keep this kind of data, Oracle in our case. SQL is an extremely efficient and powerful tool for scanning datasets for almost anything. Oracle dumps a daily control file for our local monitoring script, because we have a lot of 'if today is Tuesday and I like bacon then do this' kinds of ill-conceived business rules about monitoring.
# 4  
Old 03-06-2017
Do you have shared resources like NFS?
Hundreds of parallel df commands can cause load on the NFS server and increase execution time.

Run an ssh job manually with "time", then stop your monitoring engine and run the ssh job again. Compare the execution times.
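Something like (user and host are placeholders):

Code:
# with the monitoring engine running
time ssh user@target 'uptime; df -k' > /tmp/probe1.out
# stop the monitoring engine, then repeat and compare
time ssh user@target 'uptime; df -k' > /tmp/probe2.out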
# 5  
Old 03-10-2017
Quote:
Originally Posted by jim mcnamara
Because this runs against lots of servers, you should use the very data you are collecting to see whether the problem is localized to a few servers or spread generally across all of them. It is not clear to me that you did this.

Parallel also obscures the granularity you need for process observation. One process that does something outrageous may not show itself right away. You may have to resort to running the time command on each process and looking for outliers. But first you must find at least one poster-child server that clearly runs slower today than it did a while back.
I looked at trends for specific servers over time and there is no incremental pattern on any single box -- it might be 20 seconds faster the second day, 40 seconds slower the third, etc. This variability is probably due to two factors: the general activity on the target box (if it's working hard it may respond more slowly), and the fact that the local process can be resource-constrained while parallel is maxing out connections.
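(For reference, this is roughly how I pulled the trends; a sketch assuming log lines of date, host, seconds:)

Code:
# average duration per day across all hosts (shows the global drift)
awk '{ sum[$1] += $3; n[$1]++ } END { for (d in sum) print d, sum[d]/n[d] }' durations.log | sort
# raw series for one host (no clear trend on any single box)
awk '$2 == "1.2.3.4" { print $1, $3 }' durations.log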

Quote:
Originally Posted by jim mcnamara
One more. In thinking about the system design, why do you not put the onus of processing on each remote?
I would love to, but I am restricted from writing anything to any of the remote servers. Another factor is that these servers are constantly flipping between up and down due to maintenance, so a crontab script would not run reliably.

Quote:
Originally Posted by jim mcnamara
Why would you do this?
The idea was to get ahead of potential problems or catch things that other monitoring tools weren't looking at. It is not the solution I wanted, but it's what someone higher up signed off on.

Quote:
Originally Posted by MadeInGermany
Do you have shared resources like NFS?
Hundreds of parallel df commands can cause load on the NFS server and increase execution time.

Run an ssh job manually with "time", then stop your monitoring engine and run the ssh job again. Compare the execution times.
Yes, actually. The SCP segment of the code downloads to an NFS partition... I will investigate this.
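(Before digging in, I'll confirm what is actually on NFS; the path is a placeholder:)

Code:
# confirm the download target really is NFS-mounted (GNU df)
df -T /data/results
# client-side NFS operation counters, before and after a run
nfsstat -c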


Thank you for your suggestions! I'll let you know how things go when I dig into this.
# 6  
Old 03-17-2017
Quote:
Originally Posted by Threeze

Quote:
Originally Posted by MadeInGermany
Do you have shared resources like NFS?
Hundreds of parallel df commands can cause load on the NFS server and increase execution time.

Run an ssh job manually with "time", then stop your monitoring engine and run the ssh job again. Compare the execution times.
Yes, actually. The SCP segment of the code downloads to an NFS partition... I will investigate this.
So I adjusted the script so that it does not use the NFS storage partition at all, and I'm seeing roughly equivalent execution speed (no improvement). This leads me to believe that NFS I/O was not the limiting factor here...
# 7  
Old 03-20-2017
My known_hosts file is surprisingly large. The way I wrote the script, it would have plateaued over time (don't ask), so my question is: could a very large known_hosts file slow down SSH connection time? I would guess it could take longer to validate the host key when initiating an SSH connection...
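I plan to test along these lines (a sketch; user and host are placeholders):

Code:
wc -l ~/.ssh/known_hosts
# normal connection
time ssh user@target true
# bypass known_hosts entirely -- TEST ONLY, this disables host-key checking
time ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no user@target true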