Nearly Random, Uncorrelated Server Load Average Spikes


 
# 22  
Old 02-13-2020
Another thought I've had is this.

If you can afford to, you could stop cron from the command line and see if the spikes go away.

If you can't do that (because you need the cron scheduled processes to run regularly) but you know the footprint of the spike, you could briefly stop the cron process from the command line and then watch for a spike when you start cron again. It won't prove anything, but does it look similar?

At boot time all crontabs are read into and held in memory, and that is CPU intensive. The last-modified times of the crontabs are also cached. The periodic wake-up compares the last-modified times on disk with those held in memory. So if you break the rules and modify a crontab directly, a new job you insert won't run at all until cron runs its next integrity check. So I guess if you write a ditty that runs every 2 seconds and that you can monitor, and manually insert it into root's crontab, does it start running at the next spike?
This User Gave Thanks to hicksd8 For This Post:
# 23  
Old 02-13-2020
Set up metrics collection to files using cron, something along the lines of:

Code:
TM=$(date "+%Y%m%d_%H%M")
DDUMP=/some/dir
iostat -ctdxN 15 240 >> ${DDUMP}/iostat_${TM}.out &
mpstat 15 240 >> ${DDUMP}/mpstat_${TM}.out &
vmstat 15 240 >> ${DDUMP}/vmstat_${TM}.out

If cron cannot be used for this, use a while loop in a screen session with one-hour sleeps between invocations.
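
The while-loop fallback could be sketched roughly as follows. This is only a sketch: it assumes the same iostat, mpstat and vmstat commands as above are installed, it reuses the /some/dir placeholder, and the function names out_file, collect_once and collect_loop are hypothetical.

```shell
#!/bin/sh
# Sketch only. DDUMP reuses the placeholder directory from the post above;
# the function names out_file, collect_once and collect_loop are hypothetical.
DDUMP=${DDUMP:-/some/dir}

# Build an output file name such as /some/dir/vmstat_20200213_1200.out
out_file() {
    printf '%s/%s_%s.out\n' "$DDUMP" "$1" "$2"
}

# Start one hour of sampling (240 samples, 15 s apart), all in the background.
collect_once() {
    tm=$(date "+%Y%m%d_%H%M")
    iostat -ctdxN 15 240 >> "$(out_file iostat "$tm")" &
    mpstat 15 240 >> "$(out_file mpstat "$tm")" &
    vmstat 15 240 >> "$(out_file vmstat "$tm")" &
}

# Repeat forever, sleeping one hour between invocations.
collect_loop() {
    while true; do
        collect_once
        sleep 3600
    done
}
```

Sourced into a shell running inside screen, calling collect_loop would keep collecting until interrupted.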

There is also the option of installing the sysstat package, which after installation generates sar metric files at a fixed resolution (five minutes here; check your distribution's default) in /var/adm/sa (on Linux, typically /var/log/sa or /var/log/sysstat).
From those files you can easily draw graphs using software found online, ksar for instance.

With that data, perhaps a clearer cause could be found.

Hope that helps
Regards
Peasant.
This User Gave Thanks to Peasant For This Post:
# 24  
Old 02-13-2020
Thanks for the suggestion.

I was sitting at my desk when another spike occurred and there were no unusual or phantom processes popping up.

MySQL remained at the top of the CPU utilization, followed by apache2, so I'm starting to believe something is going on with MySQL which is causing the spikes.

Since most of the MySQL tables see far more reads than writes, and MyISAM is, according to what I have read, faster for such read-heavy tables, I have not changed any of the busy tables to InnoDB.

Maybe that is the issue?

But I am hesitant to experiment with changing MyISAM tables to InnoDB unless there is clear evidence that altering these tables will not create more problems (slowing the DB down) than the 5 or 6 one-minute spikes per day.
# 25  
Old 02-13-2020
Not sure if this applies here. One time I came across an excessive load pattern with no increase in CPU or I/O. It turned out there were a lot of processes in uninterruptible sleep state (D), which accounted for the excessive load number, since Linux defines load differently than other *nixes (it counts tasks in uninterruptible sleep as well as runnable ones). In our case this happened to be a problem with NFS traffic. We used ps's wchan option to find out more about the nature of the wait.
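
A minimal sketch of that kind of check, assuming the procps ps found on Linux (the function names filter_dstate and list_dstate are hypothetical):

```shell
#!/bin/sh
# Keep only rows whose process state (second column) starts with D,
# i.e. uninterruptible sleep. The header row (STAT) is dropped as a side effect.
filter_dstate() {
    awk '$2 ~ /^D/'
}

# List PID, state, kernel wait channel and command name for D-state processes.
list_dstate() {
    ps -eo pid,stat,wchan:32,comm | filter_dstate
}
```

Running list_dstate a few times during a spike would show whether many processes are stuck in D state and, via the wchan column, roughly what they are waiting on.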
This User Gave Thanks to Scrutinizer For This Post:
# 26  
Old 02-14-2020
Thanks Scrutinizer,

I'm feeling confident that the root cause is related to MySQL (maybe 80% confidence level, off the top of my head).

The reasons are as follows:
  • Instrumentation shows only the mysqld process running "at the top" during peak times.
  • Instrumentation shows there are no disk or other I/O errors.
  • Instrumentation shows there are no Linux server cron jobs running when the spikes occur.
  • Instrumentation and charts show no correlation to network I/O, number of users, bots or other network stats.

Perhaps more importantly, when the server has had a performance issue in the past, the root cause has always been related to MySQL performance.

So combining "what we know now" with "what we know from the past", the logical direction to investigate in detail is MySQL.

So, back to MySQL "basics" again, I have reenabled the MySQL slow query log and set the time "back up to 10 seconds" and will see if this traps the antagonist of this caper:

Code:
mysql> SET GLOBAL slow_query_log = 'ON';
mysql> SET GLOBAL long_query_time = 10;

Back to basics, as they say.

Since I am using MQTT for most of my instrumentation these days, I may write a parser to parse the output of:

Code:
mysqldumpslow /var/log/mysql/neo-mysql-slow.log

... and publish any interesting queries out to my MQTT instrumentation pub/sub bus I use now for consolidating these types of analytical tasks.
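
Such a parser might look roughly like the sketch below. It assumes mysqldumpslow prints a summary line of the form "Count: N  Time=Xs (Ys)  Lock=..." followed by the abstracted query on the next line; the function name slow_summary, the 10 second threshold and the topic server/mysql/slow are hypothetical.

```shell
#!/bin/sh
# Read mysqldumpslow output on stdin and print one line per query whose
# average time is at least $1 seconds: "count=N avg_s=X query=..."
slow_summary() {
    awk -v min="$1" '
        /^Count:/ {
            count = $2                  # number of occurrences
            t = $3                      # e.g. Time=11.00s (average time)
            sub(/^Time=/, "", t)
            sub(/s$/, "", t)
            keep = (t + 0 >= min + 0)   # remember whether the next query line is wanted
            next
        }
        keep && NF {
            sub(/^[ \t]+/, "")          # strip the indentation before the query text
            print "count=" count " avg_s=" t " query=" $0
            keep = 0
        }
    '
}
```

The output could then be piped to mosquitto_pub, which sends one message per input line when given -l, for example: mysqldumpslow /var/log/mysql/neo-mysql-slow.log | slow_summary 10 | /usr/bin/mosquitto_pub -t server/mysql/slow -l -q 1 -u user -P password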

For example, here is a logrotate configuration, which is just one example of how I am now experimenting with MQTT to log server events to the DB:

Code:
ubuntu# cat certbot
/var/log/letsencrypt/*.log {
    rotate 12
    weekly
    compress
    missingok
    postrotate
    /usr/bin/mosquitto_pub -t server/logrotate -m "letsencrypt.log" -q 1 -u user -P password
    endscript
}

And in the many PHP application cron and server-batch scripts as a part of the LAMP implementation:

Code:
// Publish a "start" marker before the job runs
$C_Start = '/usr/bin/mosquitto_pub -t forum/cron/cleanup2 -m "start" -q 1 -u user -P password';
$output = shell_exec($C_Start);

 // code

// Publish an "end" marker once the job has finished
$C_End = '/usr/bin/mosquitto_pub -t forum/cron/cleanup2 -m "end" -q 1 -u user -P password';
$output = shell_exec($C_End);

These days, I do nearly all my server instrumentation using MQTT because I use two MQTT analytical tools on two mobile phones, and one MQTT analysis tool on the desktop, and also use Node-RED on the server-side for analysis and visualization, and of course raw database searches in my instrumentation DB table (and also phpmyadmin, for quick looks).

As it stands now, I am a huge fan of MQTT for a variety of monitoring and instrumentation applications.

If the MySQL slow query analysis does not bear fruit, I will need to come up with a new analysis plan / instrumentation. Let's see what happens after a day or two with the MySQL slow query analysis.
# 27  
Old 02-14-2020
Update:

After adding more instrumentation, including Apache2 processes, Apache2 CPU and a questionable MySQL CPU graph, the first spike of the last half day occurred, and there is a correlation between the load spike and a sudden increase in Apache2 processes:

[Screenshot: screen-shot-2020-02-14-75112-pm.jpg]


But it is not clear what the cause is, since there is no strong correlation to users, guests or bot activity. There is, however, some potential correlation to bot activity:

[Screenshot: screen-shot-2020-02-14-75711-pm.jpg]


Which brings me back, full circle, suspecting rogue bot activity, again.... let's see what happens during the next spike.
# 28  
Old 02-14-2020
Looks like it was "bot related"

Timeline from my MQTT instrumentation logged in the DB:
  • 1581684184 Bot activity starts to peak
  • 1581684491 Apache processes and CPU% begin to spike
  • 1581684491 Load1 average spikes
  • 1581684511 Single mysql slow_query_log entry (coincidental?), 11+ second query:

    Code:
    use unixmanpages; SET timestamp=1581684511;select os, token, query, manid,formatted,MATCH(text) AGAINST ('Arduino Project with NB-IoT (3GPP) and LoRa / LoRaWAN' IN NATURAL LANGUAGE MODE) as score,strlen FROM neo_man_page_entry where strlen > 2000 AND strlen < 1000000 ORDER BY score DESC limit 3, 1;

  • 1581684542 Application PHP cron (LAMP process) kicks off an "Hourly Cleanup2" process (coincidental?)
  • 1581684606 One-minute load average now half of peak during spike and all in recovery mode.
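
To read the gaps between these events more easily, the Unix timestamps can be turned into offsets from the first one. A small sketch (the function name offsets is hypothetical):

```shell
#!/bin/sh
# Print each timestamp with its offset in seconds from the first argument.
offsets() {
    start=$1
    for ts in "$@"; do
        echo "$ts +$(( ts - start ))s"
    done
}

offsets 1581684184 1581684491 1581684511 1581684542 1581684606
```

This puts the Apache and load spike 307 seconds after bot activity started to peak, and the load average at half of its peak 115 seconds after that.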

So, if the next spike has a similar correlation to Apache2 processes and bots, I will build some "count which bots from who" instrumentation to see if we can find out "which bots are causing the problem"... but before I build instrumentation for that, let's see what happens during the next spike.
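
A first cut at that kind of count could be as simple as the sketch below. It assumes the Apache access log uses the combined log format, where the user agent is the sixth field when a line is split on double quotes; the function name count_agents and the log path in the usage note are assumptions.

```shell
#!/bin/sh
# Read access log lines on stdin and print "hits user-agent",
# most frequent user agent first.
count_agents() {
    awk -F'"' '
        { n[$6]++ }                         # $6 is the user agent in combined format
        END { for (a in n) print n[a], a }
    ' | sort -rn
}
```

For example, grep '14/Feb/2020:12:4' /var/log/apache2/access.log | count_agents | head would show the busiest user agents in the minutes around a spike, with the date pattern adjusted to the server's log time zone.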

As I recall, this could be an issue with any number of bots (if this is indeed the cause), including Chinese bots, Korean bots, etc. However, I have seen Bingbot also cause similar issues before.

Yea! I have that "warm feeling" which comes from closing in on solving a mystery!

But on the other hand, I am not sure if the spike in Apache2 processes is a cause or an effect. If the site slows down for some reason, the Apache2 processes can take longer to change state, so they could pile up as an effect and not a cause. Hopefully, I will get this all figured out soon.

Note: If I execute the "slow query" above, now, that query takes one second. So the SQL query above is more-than-likely a coincidental effect.
This User Gave Thanks to Neo For This Post: