Nearly Random, Uncorrelated Server Load Average Spikes


Login or Register for Dates, Times and to Reply

 
Thread Tools Search this Thread
Top Forums UNIX for Advanced & Expert Users Nearly Random, Uncorrelated Server Load Average Spikes
# 29  
This reminds me a strange issue we had going for months before I managed to prove it was Patrol related...
I never liked BMC Patrol as for me expensive and big CPU Hog... compared to much better vendors such as sysload...
What happened is that periodically something that should not happen does: all the Patrol system, Oracle etc requests fall at the same time... Could be the same here 2-3 bots with requests you see you believe innocent so dont bother when fall together can put high load... I would not make a fuss if it does not last long, but remember it can happen it could fall on a moment when the box is running at 100% and then its not quite the same
This User Gave Thanks to vbe For This Post:
# 30  
Exactly Victor,

It's not a big deal because the spikes are just for a minute 4 to 6 times a day; but the problem is when (potentially) all the "bad things" align all at once (bots, DB, system loads), and the one minute problem becomes a two or three minute problem (it's possible, of course).

I'm going to refine my instrumentation and see if I can figure it out. If so, great. Normally, I can solve most any system-level computer problem and with (more than) a bit of uncertainty in the new COVID-19 biohazard around these parts, I'm not so keen in going out with so many tourists here now (none of the foreign tourists are wearing masks, as I can see today, and there are a LOT of tourists now) ; so this little spike problem is keeping me busy inside, avoiding a potential virus from Chinese wildlife markets.
# 31  
Glad you figured it out already.

What's regarding instrumentation, I found it very helpful once I had basic monitoring data available for all my servers. I'm still sticking to the solution, I'm knowing and using for many years now(check_mk, open source of course), as it is easy to handle, flexible to extend and with thousands of check plugins ready at hand if needed and lots of features available if you need to do more. So you have the basic metrics of your equipment in reach.

Some examples of many basic graphs I which you get ready configured out of the box:

Network Interface Usage

Image

Memory/Swap Usage

Image

Filesystem grow and trending

Image

Image

So it's just a few clicks away to check and you'll get informed about all the basic stuff(disk full, memory full, cpu overloaded, network errors, ...), so you do not have to care for yourself in case of trouble and often you'll notice anomalies before it get's critical.

There maybe a lot of hot stuff out there like prometheus, netdata(demo), grafana(demo), ... but that far exceeds my needs and costs me too much - in terms of time and energy to get acquainted with - which I rather invest in other areas.
These 2 Users Gave Thanks to stomp For This Post:
# 32  
Those rrdtools based tools are quite efficient.
rrdtool is one of the gems in computing, much like MQTT Neo mentioned.

As for Prometheus, give it a shot, it's not that hard Smilie
But i think we already discussed that in past, as long as <insert monitoring> works for you, that is what is most important.

But sometimes drill down is necessary when problem is less obvious and cannot be deduced from graphs.

Regards
Peasant.
This User Gave Thanks to Peasant For This Post:
# 33  
Update:

Two more spikes over the course of the past 12 hours, none of which show any correlation to an increase the number of bots; however, that does not say anything about the velocity of bots hitting the site. However, there is no correlation (over the last 12 hours and with two spikes recorded) with an increased overall number of bots. In addition, there is no correlation to increase network I/0.

The only consistent correlation so far is:
  • The one minute load average spike.
  • The increase in the total number of apache2 processes.
  • The increase in the total percentage of apache2 CPU % recorded.

There is weak, but inconsistent, correlation with:
  • An increase in the number of bots on the site .
  • An increase in the overall MySQL CPU percent.

There is no correlation, weak or otherwise to:
  • An increase in network I/O.
  • Any application cron activity.
  • Any system cron activity.
  • Any system disk I/O errors or anomalies.

So, as a sanity check, I have disabled apache2 mod pagespeed (just now) to see if there is any effect at all.

Code:
<IfModule pagespeed_module>
    # Turn on mod_pagespeed. To completely disable mod_pagespeed, you
    # can set this to "off".
    ModPagespeed off

    ###
    ###

This is just a "shot in the dark" (disabling mod pagespeed), but at least we will know something. If the spikes continue, I will turn it back on, of course.
# 34  
Quote:
Originally Posted by stomp
Glad you figured it out already.
Not yet.

Last night did not confirm the "rogue bots are the cause" .... hypothesis (see above post). Two more spikes, no correlation to increase bot number or network I/O. But I'm still looking into it Smilie

Regarding instrumentation, I prefer to build my own, like I have done with Node-RED and MQTT.

I like instrumentation which works for me; and not instrumentation designed by others. Believe me, I have used many "others" packages in the past, over decades.

Web based packages which run on the server we are observing start having problems when the server itself is having problems, so I do not use them.

That is why I use MQTT, so the only additional load requirement of the server when under stress is to publish a short message to the network (off platform). Installing packages on the same server being tested, especially web-based programs resident on servers being monitored which are primarily web servers, is not a good way to build instrumentation, in my view (so I don't do it and only recommend it in the most simple case).

MQTT is ideal for this kind of instrumentation. MQTT is free. MQTT is very easy to operate and maintain; and MQTT permits a wide-variety of ways to store data (on any node running MQTT in the network) and visualize the data (MQTT supported apps, anywhere on the network).

So, I do not have an instrumentation problem. The issue I have is trying to decide, based on evidence and strong correlation, what to monitor.

At the moment, I am testing apache2 mod pagespeed (have turned it off, temporarily). I may turn off XCache later (after the disable mod pagespeed test, and see if that changes things.

I am also very happy with Node-RED. In fact, I am extremely impressed with it.

Let me close with saying that I use MQTT and Node-RED by choice and do want want any other packages (I have used many of them over the decades). I really like MQTT and Node-RED. These tools fit my style and work great for me. For others, please use any instrumentation and monitor tools what work for you and / or supported by your organization.
# 35  
Quote:
Originally Posted by Peasant
As for Prometheus, give it a shot, it's not that hard Smilie
But i think we already discussed that in past, as long as <insert monitoring> works for you, that is what is most important.

But sometimes drill down is necessary when problem is less obvious and cannot be deduced from graphs.

Regards
Peasant.
Prometheus uses HTTP on the same server where HTTP is the main application under observation. This is from the Prometheus docs:
  • a multi-dimensional data model with time series data identified by metric name and key/value pairs
  • PromQL, a flexible query language to leverage this dimensionality
  • no reliance on distributed storage; single server nodes are autonomous
  • time series collection happens via a pull model over HTTP
  • pushing time series is supported via an intermediary gateway
  • targets are discovered via service discovery or static configuration
  • multiple modes of graphing and dashboarding support

I'm not inclined to install an application which relies on HTTP at the data transport layer to monitor a LAMP application where HTTP and apache2 are at the core of the problem. However, for a different scenario, an HTTP-transport based monitoring system might be "just the ticket".

Thanks for the ideas, but I'm sticking with MQTT and Node-RED for the time being. MQTT is very lightweight on the server and it does not use HTTP; but instead uses a light-weight TCP connection, which I prefer, but that is just me.

I complete agree with you that:
  • As long as <insert monitoring> works for you, that is what is most important.
  • Sometimes drill down is necessary when problem is less obvious and cannot be deduced from graphs.

Except I would change your second comment to read more like:

For all but the most trivial problems, we need to drill down into the system. Time series graphs and charts are only indicators. The story continues past indicators for all but the most trivial issues we encounter .......

I'm pretty sure you agree. After all, if we only needed third party time series graphing tools to solve problems, we could hire 12 year old kids to manage our networks and servers , and we could do other tasks more fun and interesting tasks Smilie

But on the other hand, when we work in network and system management (whatever we are doing or where ever we are working), we always encounter new and very interesting problems and the more complex networks become, the more interesting the problems are.
Login or Register for Dates, Times and to Reply

Previous Thread | Next Thread
Thread Tools Search this Thread
Search this Thread:
Advanced Search

Test Your Knowledge in Computers #919
Difficulty: Medium
The Unix epoch is the time 00:00:00 EST on 1 January 1970.
True or False?

10 More Discussions You Might Find Interesting

1. Programming

ESP32 (ESP-WROOM-32) as an MQTT Client Subscribed to Linux Server Load Average Messages

Here we go.... Preface: ..... so in a galaxy far, far, far away from commercial, data sharing corporations..... For this project, I used the ESP-WROOM-32 as an MQTT (publish / subscribe) client which receives Linux server "load averages" as messages published as MQTT pub/sub messages.... (6 Replies)
Discussion started by: Neo
6 Replies

2. UNIX for Dummies Questions & Answers

Help with load average?

how load average is calculated and what exactly is it difference between cpu% and load average (9 Replies)
Discussion started by: robo
9 Replies

3. UNIX for Dummies Questions & Answers

Load average spikes once an hour

Hi, I am getting a high load average, around 7, once an hour. It last for about 4 minutes and makes things fairly unusable for this time. How do I find out what is using this. Looking at top the only thing running at the time is md5sum. I have looked at the crontab and there is nothing... (10 Replies)
Discussion started by: sm9ai
10 Replies

4. Solaris

Load Average and Lwps

NPROC USERNAME SWAP RSS MEMORY TIME CPU 320 oracle 23G 22G 69% 582:55:11 85% 47 root 148M 101M 0.3% 99:29:40 0.3% 53 rafmsdb 38M 60M 0.2% 0:46:17 0.1% 1 smmsp 1296K 5440K 0.0% 0:00:08 0.0% 7 daemon ... (2 Replies)
Discussion started by: snjksh
2 Replies

5. UNIX for Advanced & Expert Users

Load average in UNIX

Hi , I am using 48 CPU sunOS server at my work. The application has facility to check the current load average before starting a new process to control the load. Right now it is configured as 48. So it does mean that each CPU can take maximum one proces and no processe is waiting. ... (2 Replies)
Discussion started by: kumaran_5555
2 Replies

6. UNIX for Dummies Questions & Answers

Please Help me in my load average

Hello AlL,.. I want from experts to help me as my load average is increased and i dont know where is the problem !! this is my top result : root@a4s # top top - 11:30:38 up 40 min, 1 user, load average: 3.06, 2.49, 4.66 Mem: 8168788k total, 2889596k used, 5279192k free, 47792k... (3 Replies)
Discussion started by: black-code
3 Replies

7. Solaris

load average query.

Hi, i have installed solaris 10 on t-5120 sparc enterprise. I am little surprised to see load average of 2 or around on this OS. when checked with ps command following process is using highest CPU. looks like it is running for long time and does not want to stop, but I do not know... (5 Replies)
Discussion started by: upengan78
5 Replies

8. UNIX for Dummies Questions & Answers

top - Load average

Hello, Here is the output of top command. My understanding here is, the load average 0.03 in last 1 min, 0.02 is in last 5 min, 0.00 is in last 15 min. By seeing this load average, When can we say that, the system load averge is too high? When can we say that, load average is medium/low??... (8 Replies)
Discussion started by: govindts
8 Replies

9. UNIX for Dummies Questions & Answers

Load Average

Hello all, I have a question about load averages. I've read the man pages for the uptime and w command for two or three different flavors of Unix (Red Hat, Tru64, Solaris). All of them agree that in the output of the 2 aforementioned commands, you are given the load average for the box, but... (3 Replies)
Discussion started by: Heathe_Kyle
3 Replies

10. UNIX for Advanced & Expert Users

load average

we have an unix system which has load average normally about 20. but while i am running a particular unix batch which performs heavy operations on filesystem and database average load reduces to 15. how can we explain this situation? while running that batch idle cpu time is about %60-65... (0 Replies)
Discussion started by: gfhgfnhhn
0 Replies

Featured Tech Videos