New Customer: Houston, we have a problem! Our webserver has strange delays and outages, every day at the same time. They last a few minutes and then go away. We internally checked the configuration of the webserver. It works ok most of the time, but then these strange delays happen.
Technician: What does your monitoring say about your virtual machine and the apache process?
Customer: There is no monitoring on the virtual machine.
Technician: <installs a monitoring client on the virtual machine and configures monitoring for the system>
Technician: <detects that the machine requests more memory than is physically available on the server>
Technician: <fixes the mysql service configuration, which was set to use way too much memory>
Technician: <detects that swap is not activated on the target system. enables swap to absorb temporary spikes in memory demand>
Technician: <detects that cpu load, cpu utilization and the number of apache workers all spike simultaneously during the time range in question>
Technician: What does your monitoring say about your virtualization host?
Customer: There is no monitoring on the virtualization host.
Technician: <installs a monitoring client onto the virtualization host and configures monitoring for the systems>
Technician: <detects that one of the cpu sensors shows a temperature of 90°C (~194°F). detects that multiple fans run at 0 rpm.>
Technician: Can you please check the temperature of the cpus and the operation of the cpu fans manually?
Customer: <goes to do the checks as advised>
Customer: One of the cpu fans was blocked. A cable was messed up in the CPU fan. I fixed that.
Technician: <notices the temperature immediately falling to 48°C (~118°F). The webserver delays stop occurring.>
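The host check that exposed the fault boils down to comparing sensor readings against limits. Here is a minimal sketch of that idea in Python. The function name, the input dictionaries and the 80°C / 1 rpm limits are assumptions made up for illustration; a real setup would take the readings from something like lm_sensors and the limits from the hardware vendor.

```python
def check_sensors(temps_c, fans_rpm, max_temp_c=80.0, min_fan_rpm=1):
    """Return alert strings for hot CPUs and stopped fans.

    temps_c and fans_rpm are hypothetical {sensor name: reading} mappings;
    the default limits are illustrative assumptions, not vendor values.
    """
    alerts = []
    for name, temp in sorted(temps_c.items()):
        if temp >= max_temp_c:
            alerts.append(f"{name}: {temp:.0f} C is at or above {max_temp_c:.0f} C")
    for name, rpm in sorted(fans_rpm.items()):
        if rpm < min_fan_rpm:
            alerts.append(f"{name}: fan at {rpm} rpm")
    return alerts
```

Fed with the readings from the story (one cpu at 90°C, one fan at 0 rpm), it reports both the hot cpu and the stopped fan. With the readings after the fix it returns an empty list.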
Lessons to learn:
Never run your infrastructure without a solid basic set of monitoring! Without it you're totally blind and may easily miss the asteroids directly in front of you. With proper monitoring in place, anybody who can tell yellow and red from green is able to notice problems when they come up. Don't try to fix problems if you don't have the whole picture that a basic set of monitoring data gives you.
One part is the obvious problems shown here. Another part is knowing what a situation looks like when it's normal, so that you recognize when the metrics are just different. Those differences may serve as valuable hints to what the problem may be (rising or higher than usual memory consumption, lower disk throughput, ...). If you only start collecting data when the trouble is already there, you have no performance baseline to compare with, which makes it harder to figure out the current problem.
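To make the baseline idea concrete, here is a minimal sketch in Python that flags a reading as unusual when it lies far away from previously collected samples. The function name and the three-standard-deviations rule are my own assumptions for illustration; monitoring tools offer their own, usually more refined, ways to do this.

```python
from statistics import mean, stdev


def deviates_from_baseline(baseline, current, n_sigmas=3.0):
    """Return True if current is more than n_sigmas standard deviations
    away from the mean of the baseline samples.

    baseline holds readings taken while the system behaved normally.
    The 3-sigma default is an illustrative assumption.
    """
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return current != mu
    return abs(current - mu) > n_sigmas * sigma
```

With a baseline of cpu temperatures around 41°C, a reading of 75°C is flagged while 42°C is not. Without the baseline samples there is nothing to compare against, which is exactly the problem described above.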
And there's no real magic involved. It's just a matter of having basic information from standard tools like free, df, vmstat, lm_sensors, smartmontools, ... available in a well-presented way that provides a useful overview of the situation. (People reading some threads of mine probably know that my personal preferred monitoring solution is check_mk, which has an open source version available. There are lots of great solutions available - pick your choice!)
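To show how little magic is needed, here is a rough Python sketch that reads the kind of information free and df report, using only the standard library. It assumes a Linux system where memory figures come from /proc/meminfo; the function names and the 10% threshold are made up for illustration, and a real monitoring agent does far more than this.

```python
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB; a few counters carry no unit.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_warnings(meminfo, min_available_fraction=0.10):
    """Return warnings for missing swap and low available memory.

    The 10% default is an illustrative assumption, not a recommendation.
    """
    warnings = []
    if meminfo.get("SwapTotal", 0) == 0:
        warnings.append("swap is not enabled")
    total = meminfo.get("MemTotal", 0)
    available = meminfo.get("MemAvailable", 0)
    if total and available / total < min_available_fraction:
        warnings.append(
            "less than {:.0%} of memory available".format(min_available_fraction)
        )
    return warnings


def disk_used_fraction(path="/"):
    """Return the used fraction of the filesystem holding path (like df)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total
```

On a Linux machine, passing the contents of /proc/meminfo to parse_meminfo and the result to memory_warnings would have pointed at the missing swap from the story. The presentation, history and alerting on top of such numbers are what a proper monitoring solution adds.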