Event processing & machine learning in monitoring system
For a couple of years I have been developing an IT infrastructure monitoring system in a research group at my university, and now we would like to apply some non-trivial methods in this area.
So I decided to ask experienced users about the subject. My questions are:
Does existing monitoring software give you the ability to deal with server failures efficiently, or do you use self-written tools? Do you use any special approaches, such as event prediction or machine learning, and do you think they have a future in this area?
There is some event prediction of this sort in network protocols, used to detect defective or slow paths so they can be avoided, but servers are simply supposed to run, not fail, predictably or not. The usual pattern is parallel, redundant, concurrent load division: a dead server is detected and is not sent any more load until it can respond to periodic tests again. Recovery of requests sent to a dying server comes in two flavors: mostly it is left to client retry, but some transactional middleware systems do requeue requests that did not run to final commit, so they are run on alternative servers. Of course, query services are easier to handle than churn (state-changing work), where you need to roll everything back on failure before you requeue. Some systems do not use transactions, but structure churn so it can be applied any number of times without duplicate side effects (history filtering, or "believe the last of that sequence number").
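The "believe the last sequence number" idea above can be sketched in a few lines: the server remembers the highest sequence number already applied per client, so a retried or requeued request that actually committed is filtered out rather than applied twice. This is a minimal illustrative sketch; all names and the toy balance state are assumptions, not part of any particular middleware.

```python
# Sketch of sequence-number idempotency filtering: a state-changing
# request is applied at most once per (client, sequence number), so a
# client can safely retry after a suspected server failure.

class IdempotentServer:
    def __init__(self):
        self.last_applied = {}  # client_id -> highest seq already applied
        self.balance = 0        # toy piece of server state

    def apply(self, client_id, seq, amount):
        """Apply a state-changing request exactly once per (client, seq)."""
        if seq <= self.last_applied.get(client_id, -1):
            return "duplicate-ignored"  # retry of an already-applied request
        self.balance += amount
        self.last_applied[client_id] = seq
        return "applied"

server = IdempotentServer()
server.apply("c1", 0, 100)   # applied
server.apply("c1", 0, 100)   # retry after a suspected failure: filtered out
server.apply("c1", 1, -30)   # next request goes through
print(server.balance)        # → 70, not 170
```

Because a filtered retry is harmless, the client's recovery logic can be as simple as "resend until acknowledged".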
I have had bad experiences with expensive tools in the past, where OS names were listed in the white papers and sales publications, even in the headers of scripts, which simply did not work or were a pain to get working.
Some monitoring solutions can't work out of the box, since some requirements (for particular applications, etc.) are far too specific, so you often end up with a lot of coding, or at least configuration work.
Some companies even charge insane prices for additional probes/modules/plugins/spies (whatever they call them) that are so badly programmed, or so simplistic, that you could think they are making a bad joke.
I would always set up a detailed Proof of Concept, invite the company, and have things tested in detail before buying anything. Sales often promise a lot while the techs take the pain, or the hotline/support is pushed to the front to more or less block off the customer.
Nagios, as a free tool, for example, offers a lot of plugins that cover most things, but the plugins you can get for free range from very good to flawed. Again, sometimes you have to write stuff on your own, but you can offer it in exchange, if allowed.
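Writing your own Nagios plugin is straightforward, because Nagios only looks at the exit code (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN) and the first line of output, optionally followed by performance data after a `|`. Here is a minimal sketch of a disk-space check; the thresholds and output wording are my own choices, not a standard plugin.

```python
#!/usr/bin/env python3
# Minimal Nagios-style plugin: checks free disk space on a path.
# Exit codes follow the Nagios plugin convention:
#   0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
import shutil


def check_disk(path="/", warn_pct=20.0, crit_pct=10.0):
    try:
        usage = shutil.disk_usage(path)
    except OSError as e:
        print(f"DISK UNKNOWN - {e}")
        return 3
    free_pct = 100.0 * usage.free / usage.total
    perfdata = f"free={free_pct:.1f}%;{warn_pct};{crit_pct}"
    if free_pct < crit_pct:
        print(f"DISK CRITICAL - {free_pct:.1f}% free|{perfdata}")
        return 2
    if free_pct < warn_pct:
        print(f"DISK WARNING - {free_pct:.1f}% free|{perfdata}")
        return 1
    print(f"DISK OK - {free_pct:.1f}% free|{perfdata}")
    return 0


status = check_disk("/")  # prints one status line; pass status to sys.exit()
```

In a real deployment the script would end with `sys.exit(check_disk(...))` so Nagios sees the exit code.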
This is a broad subject. Technology has never really been the obstacle to effectively monitoring an IT infrastructure. We've had the tools for over 20 years now; the problem has always been effective use and implementation of those tools. It should start from the top with four things: a plan, a team with defined roles, the toolset, and processes to manage the infrastructure.
You raise the issue of non-trivial methods, which suggests you're more interested in technical mechanisms. In that case it's best to ask something more specific. The best area I can point you to is an emerging concept that is arguably steeped in virtualization: Reliability, Availability, and Serviceability (RAS). Computation is becoming non-stop, which means you can keep computing and service the machine at the same time. Hardware reliability is well defined, and there are predictive methods for handling it. In fact, every component (network, OS, and so on) is well defined, so I don't really understand the "non-trivial methods" part. Whatever the specifics, monitoring in general should support the emerging concept of RAS. That term has mainly been associated with hardware, but I think the concept extends to the entire infrastructure. I would be interested to hear more about what you have been working on and what you're targeting.
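To make the "predictive methods" point concrete, one of the simplest predictive checks a monitoring system can run is trend extrapolation: fit a line to recent resource-usage samples and estimate when the resource will be exhausted, warning the operator well before failure. This is a minimal sketch on synthetic data; a real system would read samples from its metrics store.

```python
# Illustrative predictive check: least-squares linear extrapolation of
# disk usage to estimate the time at which the disk becomes full.
def time_to_full(samples, capacity):
    """samples: list of (timestamp_sec, used_bytes). Returns the predicted
    timestamp at which usage reaches capacity, or None if usage is flat
    or shrinking."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    cov = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0 or cov <= 0:
        return None                      # no growth trend: nothing to predict
    slope = cov / var                    # bytes per second
    intercept = mean_u - slope * mean_t
    return (capacity - intercept) / slope

# Synthetic example: 100 GB disk, 50 GB used at t=0, growing 1 GB/hour.
GB = 1024 ** 3
samples = [(h * 3600, (50 + h) * GB) for h in range(6)]
eta = time_to_full(samples, 100 * GB)
print(eta / 3600)   # → 50.0 (hours until the disk is predicted to be full)
```

Machine-learning approaches (anomaly detection, failure classification) generalize this idea, but even this kind of trivial extrapolation already turns reactive monitoring into something predictive.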