Nearly Random, Uncorrelated Server Load Average Spikes

# 36  
Old 02-15-2020

Originally Posted by Neo

So, as a sanity check, I have disabled apache2 mod pagespeed (just now) to see if there is any effect at all.

<IfModule pagespeed_module>
    # Turn on mod_pagespeed. To completely disable mod_pagespeed, you
    # can set this to "off".
    ModPagespeed off


This is just a "shot in the dark" (disabling mod pagespeed), but at least we will know something. If the spikes continue, I will turn it back on, of course.
Did not help at all. Slowed the site down a bit and did not stop any spikes.

ModPagespeed on

# 37  
Old 02-15-2020

I have some old "cyberspace situational awareness" (CSA) PHP code I used for a visualization project a few years ago, which captures and stores detailed information on web session activity; this code has proven handy for identifying rogue bots in the past.

So, I have modified that code to capture and store detailed session information, including the number of hits per IP address, the user agent string, country code, etc., when the one-minute load average is above 20 and less than 50.

$theload = floatval(getLoadAvg());  // one-minute load average
if ($theload > 20.0 && $theload < 50.0) {
    // the old CSA code to parse web session activity and store the results in the DB
}

So, let's see what happens the next time we get a spike... this should be interesting.

mysql> describe neo_csa_session_manager;
| Field        | Type             | Null | Key | Default | Extra          |
| id           | int(11) unsigned | NO   | PRI | NULL    | auto_increment |
| user_id      | int(11)          | NO   | MUL | 0       |                |
| session_id   | varchar(255)     | NO   |     | NULL    |                |
| url          | text             | NO   |     | NULL    |                |
| ip_address   | varchar(45)      | NO   | MUL | NULL    |                |
| user_agent   | varchar(255)     | NO   |     | NULL    |                |
| bot_flag     | tinyint(1)       | NO   |     | 0       |                |
| robot_txt    | mediumint(6)     | NO   |     | 0       |                |
| sitemap      | mediumint(6)     | NO   |     | 0       |                |
| riskscore    | int(11)          | NO   |     | 0       |                |
| country_iso2 | varchar(2)       | NO   |     | UN      |                |
| country      | varchar(50)      | NO   |     | UNKNOWN |                |
| hitcount     | int(10) unsigned | NO   |     | 1       |                |
| firstseen    | bigint(11)       | NO   |     | NULL    |                |
| unixtime     | bigint(11)       | YES  |     | NULL    |                |
| longitude    | float            | NO   |     | 0       |                |
| latitude     | float            | NO   |     | 0       |                |
17 rows in set (0.00 sec)
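
For readers who want to experiment with the same idea, here is a minimal sketch of the "capture only inside a load window, aggregate by IP" logic. It is written in Python rather than the PHP used on the site, and it is not the CSA code itself: the `SessionRow` class and the `in_capture_window` and `record_hit` helpers are hypothetical names, only a few columns of the table are mirrored, and treating `unixtime` as a "last seen" time is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass
class SessionRow:
    """One row per IP address, mirroring a few columns of the table above."""
    ip_address: str
    user_agent: str
    hitcount: int
    firstseen: int   # Unix time of the first hit from this IP
    unixtime: int    # assumed here to be the time of the most recent hit


def in_capture_window(load1: float, low: float = 20.0, high: float = 50.0) -> bool:
    """True while the one-minute load average is strictly between low and high."""
    return low < load1 < high


def record_hit(table: dict, ip: str, user_agent: str, now: int) -> None:
    """Insert a row for an unseen IP, otherwise increment its hit count."""
    row = table.get(ip)
    if row is None:
        table[ip] = SessionRow(ip, user_agent, 1, now, now)
    else:
        row.hitcount += 1
        row.unixtime = now


if __name__ == "__main__":
    load1, _, _ = os.getloadavg()  # available on Unix-like systems only
    print("capturing" if in_capture_window(load1) else "idle")
```

Sorting the resulting rows by `hitcount` in descending order gives the kind of "hit count chart" discussed in the posts that follow.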

# 38  
Old 02-15-2020

Just noticed, after digging around in the DB logs from my MQTT instrumentation, that the last spike correlated with a jump in data transferred out of the network interface:

[Screenshot: screen-shot-2020-02-15-65207-pm.jpg]

Typical values are much lower (see below), so this would seem to support the "rogue bots hypothesis", currently the leading candidate to explain these periodic spikes:

[Screenshot: screen-shot-2020-02-15-65655-pm.jpg]

This is also the first "hard correlation" of a spike with network interface I/O stats, so let's see if my code in the post before this one will trap the next big spike.
# 39  
Old 02-15-2020

[Screenshot: screen-shot-2020-02-16-92316-am.jpg]

There were two spikes three hours apart; both were captured by my HTTP session logging program, which logs session details aggregated by IP address. In this case, the code starts logging (kicks off) when the one-minute load average exceeds 20 and stops when the same load average exceeds 50. So, in a spike we record a very short snapshot in time of the traffic (on the way up and on the way down, but I may change this in the future to capture only on the way up).

The results were as follows:

In both spikes, there were at least four Chinese IP addresses present at the top of the "hit count" chart (the DB table):

All four of these IP addresses were present during the 4AM and 7AM (Bangkok Time) spikes, and all four identified with the same user agent string:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36

This suggests that these IP addresses (in China) are running the same bot software; it is only an indication, but a fairly strong one.

However, there is no denying that my "trap the bots" code has identified four Chinese IP addresses running some bot software which is more-than-likely contributing to the cause of the spikes.

In addition, during the same two spikes spaced three hours apart (as mentioned), there was one US-based IP address running with a blank user agent string:

Keep in mind in this capture, the code only captured the session information when the one minute load average was above 20 and below 50, and there were two spikes spaced almost exactly three hours apart:

[Screenshot: screen-shot-2020-02-16-90255-am.jpg]

So, having recorded the events above, I have just now emptied that DB table and have "reset the trap" for the next spikes.

Now, turning our attention to my instrumentation log where I am using MQTT to log all application and system cron (batch) events (start and end times) as well as a number of system metrics, we see there is a correlation (during the first spike) at 1581800045 of a spike in traffic out of the network interface, along with correlating spikes in Apache2 processes and CPU.

[Screenshot: screen-shot-2020-02-16-90944-am.jpg]

Now looking at the second event (spike 2), there is a similar pattern, but of interest is that preceding both spikes is an hourly application cron function:

[Screenshot: screen-shot-2020-02-16-91546-am.jpg]

This seems to indicate that the cause of the spikes, in this case, is a combination of aggressive bot activity coinciding with an hourly cron / batch process.

To be more certain of this, I am going to change the time of the "update attachment view" cron process from kicking off on the 53 minute mark of every hour, to the 23 minute mark of every hour, and see if the times of the spikes shift in time as well.
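
The cron-shift test can also be checked with a little arithmetic. The sketch below is not part of the site's instrumentation; `minute_of_hour` and `minutes_after` are hypothetical helper names. It uses the one spike timestamp quoted earlier in this post (1581800045) and relies on Bangkok time being a whole-hour offset from UTC, so the minute past the hour is the same in both.

```python
def minute_of_hour(unix_ts: int) -> int:
    """Minute past the hour for a Unix timestamp (valid for whole-hour UTC offsets)."""
    return (unix_ts % 3600) // 60


def minutes_after(cron_minute: int, unix_ts: int) -> int:
    """Minutes elapsed since the most recent run of an hourly job at cron_minute."""
    return (minute_of_hour(unix_ts) - cron_minute) % 60


spike_ts = 1581800045  # timestamp quoted for the first spike

print(minute_of_hour(spike_ts))     # 54
print(minutes_after(53, spike_ts))  # 1, i.e. one minute after the :53 job starts
print(minutes_after(23, spike_ts))  # 31, the gap if the job had run at :23 instead
```

One data point proves little, but it shows what to look for: if spikes keep landing near minute 54 after the job moves to minute 23, the cron process is probably not the trigger.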
# 40  
Old 02-15-2020
Just got another spike exactly three hours after the last one, not correlated to any cron / batch process:

[Screenshot: screen-shot-2020-02-16-100552-am.jpg]

[Screenshot: screen-shot-2020-02-16-100408-am.jpg]

Chinese IPs:

[Screenshot: screen-shot-2020-02-16-100817-am.jpg]

Same Chinese bots, same IPs, same user agent strings.
# 41  
Old 02-15-2020
So, let's try this:

iptables -A INPUT -s -j DROP  #  rogue chinese bot
iptables -A INPUT -s -j DROP  #  rogue chinese bot

Empty the "trap" again and block two Chinese subnetworks with rogue, unidentified bot activity.

Honestly, this is starting to annoy me a lot: the possibility that these performance hits / spikes, and all the valuable "time in life" I am spending to find their cause, are related to rogue, unidentified bots from Chinese networks.

If this continues, I am going to start blocking Chinese networks at the /16 and /8 levels (entire networks).
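
To get a feel for how much wider a /16 or /8 block is than a /24, here is a small sketch using Python's standard `ipaddress` module. The `widen` helper is a hypothetical name, and 203.0.113.0/24 is a reserved documentation range used as a stand-in, not one of the networks actually blocked.

```python
import ipaddress


def widen(network: str, new_prefix: int):
    """Return the enclosing network with the given (shorter) prefix length."""
    net = ipaddress.ip_network(network)
    if new_prefix >= net.prefixlen:
        return net  # already at least this specific; nothing to widen
    return net.supernet(new_prefix=new_prefix)


for prefix in (24, 16, 8):
    net = widen("203.0.113.0/24", prefix)
    print(f"{net}: {net.num_addresses} addresses")
```

A /16 covers 65,536 addresses and a /8 covers 16,777,216, so each step up blocks far more hosts that have nothing to do with the bots.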

First, let's see if this is indeed the main source of these spikes. As we all know from situational awareness theory, the famous OODA loop by John Boyd has four steps:
  1. OBSERVE
  2. ORIENT
  3. DECIDE
  4. ACT

Already, we have enough information to ACT. But let's continue to OBSERVE.

The loop goes on ... and on ....

Please note that we cannot trust apache2 modules and other third-party software to automatically block IPs, because this can result in blocking the "good bots" which are important for search engine optimization and site traffic.

That means, if it is confirmed that these kinds of bots continue to be the cause of problems, then I will need to DECIDE how to deal with this situation moving forward. I think, at this point in time, I am going to continue to "trap and trace" before making a decision. However, it does seem, at this point, that rogue, unidentified bots from Chinese networks are causing performance issues and need to be "dealt with".

If anyone else has experienced similar issues and has an interesting potential solution to this problem, please reply and share your ideas.


PS: I may consider automating this, as follows:
  1. Capture network session activity when one minute load average exceeds a threshold (as I am doing now).
  2. Filter results captured in the DB based on "hitcount" and "country".
  3. If the "hitcount" exceeds a certain threshold and "country" is in an array of "countries known to have rogue bots",
  4. THEN BLOCK the ip_address/24
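
The four steps above could be sketched roughly as follows. This is only an illustration in Python, not working site code: `blocks_for` and `iptables_commands` are hypothetical names, the rows are dictionaries using the `ip_address`, `country_iso2` and `hitcount` columns from the table in post #37, the threshold and the "XX" / "YY" country codes are placeholders, IPv4 addresses are assumed, and the commands are only generated as strings for review, never executed.

```python
import ipaddress


def blocks_for(rows, hit_threshold, watched_countries):
    """Return the sorted, de-duplicated /24 networks of rows that trip both tests."""
    nets = set()
    for row in rows:
        if row["hitcount"] > hit_threshold and row["country_iso2"] in watched_countries:
            # strict=False lets a host address be masked down to its /24 network
            nets.add(ipaddress.ip_network(f'{row["ip_address"]}/24', strict=False))
    return sorted(nets)


def iptables_commands(nets):
    """Render DROP rules in the same form used earlier in this thread."""
    return [f"iptables -A INPUT -s {net} -j DROP" for net in nets]


if __name__ == "__main__":
    # Made-up rows using documentation address ranges
    captured = [
        {"ip_address": "198.51.100.23", "country_iso2": "XX", "hitcount": 500},
        {"ip_address": "198.51.100.77", "country_iso2": "XX", "hitcount": 300},
        {"ip_address": "192.0.2.5", "country_iso2": "YY", "hitcount": 900},
        {"ip_address": "203.0.113.9", "country_iso2": "XX", "hitcount": 3},
    ]
    for cmd in iptables_commands(blocks_for(captured, 100, {"XX"})):
        print(cmd)
```

Keeping a human review step between generating and applying the rules would help avoid the "good bots" problem mentioned above.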
# 42  
Old 02-16-2020

Experienced (and trapped) another spike from another Chinese IP address (which is at the top of the "hitcount" list during the spikes):

[Screenshot: screen-shot-2020-02-17-95106-am.jpg]


with the same user agent string as before:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36

[Screenshot: screen-shot-2020-02-17-94331-am.jpg]

Yesterday, as readers who follow this caper may recall, I blocked two Chinese /24 subnetworks:

iptables -A INPUT -s -j DROP  #  rogue chinese bot
iptables -A INPUT -s -j DROP  #  rogue chinese bot

Now we see rogue, unidentified bot activity from another address, more than likely in the same data center.

So, I will change the block to:

iptables -A INPUT -s -j DROP  #  rogue chinese bot
iptables -A INPUT -s -j DROP  #  rogue chinese bot

... let's see what they do next ... I am interested to learn if "they" are manually shifting servers or if this is an automatic response to the block.