Location: Asia Pacific, Cyberspace, in the Dark Dystopia
Posts: 19,118
Thanks Given: 2,351
Thanked 3,359 Times in 1,878 Posts
Nearly Random, Uncorrelated Server Load Average Spikes
I have been wrangling with a small problem on a Ubuntu server which runs a LAMP application.
This server runs fine, basically:
But around five to six times a day, the (1 minute) load average spikes to around 200 and quickly goes back down to normal, all in the time interval of less than one minute:
I have added sensors to the crontab processes, both for the LAMP application and the server crontabs and log all cron start and stop events in the database. This instrumentation of the server yields no fruit.
In addition, in the occasional spurious events that seem to happen at "near regular" intervals, I have had a terminal window open running top, mytop, ifstats and other command line tools trying to trap the process which is causing the load spurs. There has been nothing unusual and mysql is always the "leading contender" when the pikes occur.
There is no other process with a high CPU during the spikes, so that seems to indicate something related to mysql., obviously.
With only mysql showing up as the "leading contender", I added more instrumentation to various mysql processes in the application, and can find no event on the server or in the application which correlates to this spurious behavior.
I have correlated the spikes to (1) network interface i/o stats, (2) apache processes, (3) all crontab processes (in app and on server) and regardless of my skills at adding sensors to various processes, I cannot trap the spurious process.
I have running time series graphs, and logging in the database. When there is an incident on the graph, I go to the database log and look at all the sensor entries and can find no correlation.
There is no correlation to cron processes, network i/o, users on the server, bots, and backup processes.
I keep adding more and more instrumentation to every process on the server and in the app, but all that instrumentation looking for correlation to a server process, including the LAMP app, bears no fruit.
At first, I thought this problem was caused by bots hitting the web server; but there is no correlation to increased bot traffic or LAN interface I/O.
Then I thought the problem was caused by various cron entries in the LAMP application; but reconfiguring them, turning them on and off, adding instrumentation for start / stop times in a log, also bears no fruit.
I've been working this problem on and off for over a week and cannot find a single causal reason, no a single cause for any effect, which may be causing this spurious load average behavior.
It's only around five to six times a day for less than a minute at each event; but I want to find the cause and fix it. I'm not 100% convinced the issue is caused by a single process / issue.
I wonder if there is some underlying disk I/0 activity on the server causing these spikes? Could this be related to a potential disk error on the SSD drives? Could it be related to underlying disk raid activity?
If so, any ideas how to trap this?
My current "best guess" is that there is some underlying disk I/O activity causing this spurious issue, and that is why nothing at the application level correlates.
I have seen cases like that when you have enormous disc cache synchronising... all is fine when pure transactional mode ( the periodic sync has hmmm "little" to update ) but when it comes to batches running, it fills the cache before periodic sync and then the system has to wait the end of forced sync (file cache OS side...)
Other contention possible big file cache with big SGA: parsing both and not finding reduces perfs...
Location: Asia Pacific, Cyberspace, in the Dark Dystopia
Posts: 19,118
Thanks Given: 2,351
Thanked 3,359 Times in 1,878 Posts
Quote:
Originally Posted by vbe
I would check if at that time your server is doing something "heavy batch mod style" at the same time like backup or import/export of tables...
Yes, I know...
That is why I previously mentioned that I have checked all cron and cron-like processes which are all batch-like processes. The server only runs a LAMP application, so there are no other application processes of any consequence.
I think I mentioned all the things you mentioned already in an earlier post; but thanks for brainstorming... and coming up with ideas.
Honestly, I am already logging all batch-like processing at the application level, and there is no correlation, as I mentioned.
Here we go....
Preface:
..... so in a galaxy far, far, far away from commercial, data sharing corporations.....
For this project, I used the ESP-WROOM-32 as an MQTT (publish / subscribe) client which receives Linux server "load averages" as messages published as MQTT pub/sub messages.... (6 Replies)
Hi,
I am getting a high load average, around 7, once an hour. It last for about 4 minutes and makes things fairly unusable for this time.
How do I find out what is using this. Looking at top the only thing running at the time is md5sum.
I have looked at the crontab and there is nothing... (10 Replies)
Hi ,
I am using 48 CPU sunOS server at my work.
The application has facility to check the current load average before starting a new process to control the load.
Right now it is configured as 48. So it does mean that each CPU can take maximum one proces and no processe is waiting.
... (2 Replies)
Hello AlL,..
I want from experts to help me as my load average is increased and i dont know where is the problem !!
this is my top result :
root@a4s # top
top - 11:30:38 up 40 min, 1 user, load average: 3.06, 2.49, 4.66
Mem: 8168788k total, 2889596k used, 5279192k free, 47792k... (3 Replies)
Hi,
i have installed solaris 10 on t-5120 sparc enterprise.
I am little surprised to see load average of 2 or around on this OS.
when checked with ps command following process is using highest CPU. looks like it is running for long time and does not want to stop, but I do not know... (5 Replies)
Hello, Here is the output of top command. My understanding here is,
the load average 0.03 in last 1 min, 0.02 is in last 5 min, 0.00 is in last 15 min.
By seeing this load average, When can we say that, the system load averge is too high?
When can we say that, load average is medium/low??... (8 Replies)
Hello all, I have a question about load averages.
I've read the man pages for the uptime and w command for two or three different flavors of Unix (Red Hat, Tru64, Solaris). All of them agree that in the output of the 2 aforementioned commands, you are given the load average for the box, but... (3 Replies)
we have an unix system which has
load average normally about 20.
but while i am running a particular unix batch which performs heavy
operations on filesystem and database average load
reduces to 15.
how can we explain this situation?
while running that batch idle cpu time is about %60-65... (0 Replies)