Quote:
Originally Posted by
jim mcnamara
Because this runs on lots of servers, you should use the very data you are getting to see if the problem is localized to a few servers, or generally spread across all of them. Not clear to me you did this.
Parallel is also obfuscating granularity for process observation. One process that does something outrageous may not show itself right away. You may have to resort to using the time command on each process and look for outliers. But first you must find at least one poster child server that clearly runs slower today compared with a while back.
I looked at specific server trends over time and there is no incremental pattern when looking at a single box -- it could be 20 seconds lower the second day, 40 seconds higher the third, etc. This variability is probably due to two factors: the general activity on the target box (if it's working hard it may respond more slowly), and the fact that the local process could be starved of resources when parallel maxes out connections.
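For anyone who wants to run the same kind of outlier check across servers, here is a minimal sketch in Python. It assumes the per-server run times have already been collected into a dictionary. The server names and timings are made up for illustration, and `find_outliers` and its `threshold` parameter are hypothetical names, not part of my actual script. It uses the median absolute deviation instead of the mean, so one very slow box does not skew the baseline.

```python
from statistics import median


def find_outliers(durations, threshold=3.0):
    """Return the entries whose run time is far from the median.

    durations: dict mapping a server name to its run time in seconds.
    threshold: how many median absolute deviations (MAD) away from the
               median a value must be before it is flagged.
    """
    values = list(durations.values())
    med = median(values)
    # MAD: the median of the absolute distances from the median.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values equal the median; flag anything that differs.
        return {name: v for name, v in durations.items() if v != med}
    return {
        name: v
        for name, v in durations.items()
        if abs(v - med) / mad > threshold
    }


# Made-up sample data: four boxes near 40-45 seconds, one at 130 seconds.
sample = {
    "web01": 42.0,
    "web02": 45.0,
    "web03": 40.0,
    "web04": 44.0,
    "db01": 130.0,
}
print(find_outliers(sample))  # {'db01': 130.0}
```

The same function could be pointed at one server's history (day mapped to seconds) to see whether a given day stands out from that box's usual range.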
Quote:
Originally Posted by
jim mcnamara
One more. In thinking about the system design, why do you not put the onus of processing on each remote?
I would love to, but I have a restriction of not being allowed to write anything to any of the remote servers. Another factor is that these servers are constantly flipping between up and down due to maintenance and a crontab script would not run reliably.
Quote:
Originally Posted by
jim mcnamara
Why would you do this?
The idea was to get ahead of potential problems or catch things that other monitoring tools weren't looking at. It is not the solution I wanted, but it's what someone higher up signed off on.
Quote:
Originally Posted by
MadeInGermany
Do you have shared resources like NFS?
Hundreds of parallel df can cause load in the NFS server, and increase execution time.
Run an ssh job manually with "time", stop your monitoring engine and run the ssh job again. Compare the execution times.
Yes, actually. The SCP segment of the code downloads to an NFS partition... I will investigate this.
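The comparison suggested above could be sketched roughly like this in Python: time the same job a few times with the monitoring engine running, stop the engine, time it again, and compare the medians. `time_command` is a hypothetical helper, and the host name and remote command in the comments are placeholders, not the real job.

```python
import subprocess
import time
from statistics import median


def time_command(cmd, runs=3):
    """Run cmd `runs` times and return each wall-clock duration in seconds.

    cmd is an argument list, as accepted by subprocess.run.
    Output is discarded; only the elapsed time is recorded.
    """
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,  # a failing job should still be timed
        )
        durations.append(time.perf_counter() - start)
    return durations


# Example usage (placeholder host and command, substitute the real ssh job):
#   with_engine = time_command(["ssh", "somehost", "df", "-k"])
#   ... stop the monitoring engine ...
#   without_engine = time_command(["ssh", "somehost", "df", "-k"])
#   print(median(with_engine), median(without_engine))
```

Taking the median of a few runs rather than a single run should smooth out the run-to-run variability mentioned earlier.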
Thank you for your suggestions!
I'll let you know how things go when I dig into this.