
Infrastructure Monitoring Forum for Nagios, Zabbix and other network monitoring and management tools.

Event processing & machine learning in monitoring system

06-09-2013, pyalxx
Registered User, Kiev

For a couple of years I have been developing an IT infrastructure monitoring system in a research group at my university, and now we would like to apply some nontrivial methods in this area.
So I decided to ask experienced users about the subject. My questions are:
Does existing monitoring software let you deal with server failures efficiently, or do you use self-written tools? Do you use special approaches like event prediction or machine learning, and do you think they have a future in this area?
Thank you!
06-14-2013, DGPickett
Forum Advisor, Southern NJ, USA (Nord)
There is some of this sort of event prediction in network protocols, to detect defective or slow paths and avoid them, but servers are just supposed to run, not fail, predictably or not. The two flavors of handling this are: parallel, redundant, concurrent load division, where a dead server is detected and not sent any more load until it again responds to periodic tests; and recovery of requests sent to a dying server, which is mostly left to client retry, though some transactional middleware does requeue requests that do not run to final commit, so they are rerun on alternative servers. Of course, query services are easier to handle than churn, where you need to roll back everything on failure before you requeue. Some systems do not use transactions, but instead structure churn so it can be applied any number of times without duplicate side effects (history filtering, or believing the last of a given sequence number).
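That last idea, believing only the latest sequence number so a replayed request has no duplicate side effect, can be sketched roughly like this (my own illustration; the `apply_update` helper and store layout are hypothetical, not from any particular middleware):

```python
# Sketch of idempotent "believe the last of that seq. #" processing:
# each key's state carries the sequence number that produced it, so a
# retried/replayed update is simply dropped instead of applied twice.

def apply_update(store, key, seq, value):
    """Apply an update only if it is newer than what we already have."""
    current_seq, _ = store.get(key, (-1, None))
    if seq <= current_seq:        # duplicate or out-of-order replay
        return False              # no side effect the second time
    store[key] = (seq, value)
    return True

store = {}
apply_update(store, "acct-42", 1, 100)
apply_update(store, "acct-42", 2, 150)
apply_update(store, "acct-42", 1, 100)   # client retry replays seq 1: ignored
```

Because applying the same update twice is a no-op, a client can safely retry against an alternative server without checking whether the first attempt partially succeeded.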
09-25-2013, zaxxon
Forum Staff (code tag tagger), St. Gallen, Switzerland
I have had bad experiences with expensive tools in the past, where OS names were listed in the white papers and sales publications, even in the headers of scripts, which simply did not work or were a pain to get working.
Some monitoring solutions can't work out of the box, as some application requirements are far too special, so you often have a lot of coding, or at least configuration work, to do.

Some companies even charge insane prices for additional probes/modules/plugins/spies (whatever they call them) that are so badly programmed or so simplistic you could think they are making a bad joke.
I would always set up a detailed proof of concept, invite the company, and have things tested in detail before buying anything. The sales people often promise a lot, while the techs take the pain, or the hotline/support is pushed to the front to fend off the customer more or less.
Nagios, as a free tool for example, offers a lot of plugins that cover most things, but the plugins you can get for free range from very good to flawed. Again, sometimes you have to write stuff on your own, but you can offer it back in exchange, if allowed.
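Writing your own Nagios check is not much work: a plugin just prints one status line and exits with the standard return codes (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN). A minimal sketch, with illustrative disk-usage thresholds of my own choosing:

```python
#!/usr/bin/env python3
# Minimal Nagios-style check plugin sketch: one status line on stdout,
# standard plugin return codes (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN).
# The 80%/90% thresholds are example values, not a recommendation.

import shutil

WARN_PCT, CRIT_PCT = 80, 90

def check_disk(path="/"):
    """Return the Nagios exit code for disk usage at `path`."""
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total
    if used_pct >= CRIT_PCT:
        print(f"DISK CRITICAL - {used_pct:.1f}% used on {path}")
        return 2
    if used_pct >= WARN_PCT:
        print(f"DISK WARNING - {used_pct:.1f}% used on {path}")
        return 1
    print(f"DISK OK - {used_pct:.1f}% used on {path}")
    return 0

# As an actual plugin you would end with: sys.exit(check_disk())
rc = check_disk("/")
```

Nagios reads the exit code to decide the service state and shows the printed line in the web UI, so that is the whole contract a homegrown plugin has to satisfy.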

my 2 cents
09-25-2013, blackrageous
Registered User, Austin, Texas
This is a broad subject. Technology has never really been the issue in effectively monitoring an IT infrastructure. We've had the tools for over 20 years now, and the problem has always been effective use and implementation of those tools. It should start from the top with four things: a plan, a team and roles, the toolset, and processes to manage the infrastructure.

You raise the issue of non-trivial methods, which suggests you're more interested in technical mechanisms. In that case it's best to ask something more specific. The best area I can point you to is an emerging concept that is arguably steeped in virtualization: Reliability, Availability, and Serviceability (RAS). Computation is becoming non-stop, which means you can keep computing and service the machine at the same time. Hardware reliability is well defined, and there are predictive methods for handling it. In fact, every component, network, o/s... is well defined, so I don't really understand the "non-trivial methods" part. Whatever the specifics, monitoring in general should support the emerging concept of RAS. That term has mainly been associated with hardware, but I think the concept extends to the entire infrastructure. I would be interested to hear more about what you have been working on and what you're targeting.
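As one illustration of the kind of predictive method mentioned above (my own sketch, not taken from any product): fit a linear trend to recent samples of a health metric and estimate when it will cross a failure threshold, so the machine can be serviced before it fails.

```python
# Illustrative predictive check: least-squares linear trend over recent
# (hour, value) samples of a health metric, extrapolated to estimate
# hours remaining until a failure threshold is crossed.

def hours_until_threshold(samples, threshold):
    """samples: list of (hour, value) pairs. Returns estimated hours
    from the last sample until the trend crosses `threshold`, or None
    if the metric is not rising toward it."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        return None
    slope = num / den
    if slope <= 0:                       # flat or improving: no prediction
        return None
    intercept = mean_v - slope * mean_t
    cross_t = (threshold - intercept) / slope
    last_t, _ = samples[-1]
    remaining = cross_t - last_t
    return remaining if remaining > 0 else 0.0

# e.g. a disk temperature rising one degree per hour, threshold 60:
samples = [(0, 50), (1, 51), (2, 52), (3, 53)]
print(hours_until_threshold(samples, 60))   # -> 7.0
```

Real predictive maintenance uses richer models than a straight line, but the shape is the same: turn a monitored trend into a time-to-failure estimate and alert on that instead of on the raw value.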