If you are a system administrator, an IT manager, or anyone else responsible for IT infrastructure, you should implement an enterprise-level monitoring solution.
The shell script you’ve written that does a ps -ef and sends you an email might do the basic job, but it doesn’t count as monitoring.
If you want to be proactive, have peace of mind, and sleep well at night, you should implement a robust system and network monitoring solution for your IT infrastructure.
Nagios took the number one spot in our Top 5 System Monitoring Tools.
I’ve used Nagios extensively for several years, and cannot live without it anymore. Knowing that all my systems (and services) are monitored by Nagios, which will notify me when something goes wrong, lets me sleep peacefully at night.
I’ll be launching my next eBook in a few weeks. Yes, it is about Nagios Core 3. For those who are new to Nagios, the eBook will walk you through installation, configuration, the Nagios web interface, and everything you need to know to set up Nagios Core 3 and start monitoring your systems. I’m very excited (despite a lot of sleepless nights spent creating the material) to release an eBook that will be helpful to those who are looking to implement Nagios Core in their enterprise.
Any monitoring solution you implement for your infrastructure should have the ability to do the following.
- Universal Monitoring – You don’t want to implement one monitoring solution for Linux servers, another for network equipment, another for hardware, etc. You need one monitoring solution to monitor almost all your IT systems and services. A good monitoring system should provide a framework for adding plugins that monitor various services and devices. For example, it should be able to monitor:
  - Operating systems: *nix, Windows, etc.
  - System resources: CPU, disk, swap, processes, etc.
  - Network equipment: switches, routers, VPNs, firewalls, etc.
- Efficient Alert Notifications – Assign individuals (or groups) to a system or service as owners. This puts the owners in control: they are notified and can take action before you get involved. The monitoring solution should provide:
  - The ability to send notifications using various methods: email, pager, SMS, IM, etc.
  - The ability to set warning and critical alert thresholds for the systems and services that are monitored.
  - Granular options to specify how often a system is checked, how many retries to attempt after a failure, how many failure notifications to send, which notification methods to use, etc.
- Web Dashboard – Should show the overall health, issues, and alerts for all the systems across the network, along with the ability to drill down to individual hosts (and services).
- Issue Escalation – Should provide the ability to notify managers when the owner of a system does not take action on an issue within a certain time period. For example, when a database crashes and the DBA doesn’t fix it within a reasonable time, the monitoring system should alert the manager about the issue.
- Distributed Monitoring and Scalability – Should be capable of monitoring thousands of servers and services without too much overhead. Should support distributed monitoring, with multiple monitoring systems across the enterprise that report to a central monitoring server.
- Reporting – Should generate various monitoring reports for administrators, for example availability, trending, and notification reports. Should provide daily, weekly, monthly, or custom date range analysis of various monitoring statistics.
- External Application Integration – Should provide a framework (or API) that external applications can use to update the current status of a monitored system or service. Should provide enough detail for external vendors to integrate their solutions with the monitoring software. The more extensible the framework is, the more vendors will provide solutions for it, and the more companies will use it, which in turn makes the software more robust.
- Open Source Solution – Since you’ll be exposing all your mission-critical systems to the monitoring software, you should make sure that you can trust it. Open source solutions are typically tested and reviewed by the community for potential security issues. Look at the track record of the software. How many years has it been on the market? The longer, the better. How many companies are using it? The more, the better.
- Community and Commercial Support – When you are implementing monitoring on a large scale (thousands of servers), you might want a solution that is officially supported and backed by a company. Several open source monitoring solutions are backed by a company that provides commercial support. Even if you don’t use the commercial support initially, you might want it when you expand your monitoring footprint.
- Easy to Learn and Use – This might be obvious to some of you, but you’ll be surprised how many people end up implementing a system that is very hard to learn and use. Don’t overlook this. The monitoring solution should be easy to implement and learn, as simple as that. You should not spend weeks trying to figure out how to get the software implemented and working successfully.
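
To make the alerting and escalation items in the list a little more concrete, here is a minimal Python sketch. It is not how Nagios (or any particular product) implements these features. The class, field names, email addresses, and default values are all hypothetical and chosen only for illustration. The one detail borrowed from real life is the set of status codes (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN), which follows the exit code convention used by Nagios plugins.

```python
from dataclasses import dataclass
from enum import IntEnum


class Status(IntEnum):
    # Same numbering as the Nagios plugin exit code convention.
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def classify(value, warn, crit):
    """Map a measurement (e.g. disk usage in percent) to a status."""
    if value >= crit:
        return Status.CRITICAL
    if value >= warn:
        return Status.WARNING
    return Status.OK


@dataclass
class ServiceCheck:
    # All fields and defaults below are illustrative assumptions.
    name: str
    owner: str
    manager: str
    max_attempts: int = 3         # failed checks in a row before alerting
    escalate_after_min: int = 60  # minutes before the manager is notified
    failures: int = 0

    def record(self, status):
        """Count consecutive failures; return True once the problem is confirmed."""
        if status == Status.OK:
            self.failures = 0
            return False
        self.failures += 1
        return self.failures >= self.max_attempts

    def recipients(self, minutes_unresolved):
        """Notify the owner first; add the manager after the escalation window."""
        if minutes_unresolved >= self.escalate_after_min:
            return [self.owner, self.manager]
        return [self.owner]


check = ServiceCheck(
    name="database",
    owner="dba@example.com",
    manager="it-manager@example.com",
)

# Three failed checks in a row confirm the problem (retries before alerting).
confirmed = False
for disk_usage in (95, 96, 97):
    confirmed = check.record(classify(disk_usage, warn=80, crit=90))

print(confirmed)             # True
print(check.recipients(10))  # ['dba@example.com']
print(check.recipients(90))  # ['dba@example.com', 'it-manager@example.com']
```

The point of the retry counter is to avoid paging the owner on a single transient failure, and the point of the escalation window is to give the owner a chance to act before the manager hears about it. A real monitoring system lets you tune both per host and per service.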
My upcoming eBook on Nagios Core 3 is structured and organized in an easy-to-understand way, to help you implement, configure, and manage Nagios Core 3 on your IT infrastructure. Nagios is extremely powerful monitoring software that does all of the above very well.
Apart from the 10 things mentioned above, is there anything else that, in your opinion, a monitoring solution should do (or have)?