Infrastructure monitoring is an essential process that allows you to evaluate, analyse, manage and report on both the availability and the performance of servers and applications. The aim is to run a health check on your infrastructure and to identify and analyse short- and long-term trends.
Other advantages of monitoring include acquiring the data necessary to evaluate infrastructure capacity, plan the expansion of resources and/or carry out audits.
There are two types of infrastructure monitoring: reactive and proactive.
- Reactive monitoring flags failures and errors once they occur. Although this means that the service has already degraded, it helps you to act quickly and keep the impact to a minimum.
- Proactive monitoring not only monitors system performance but also checks for anomalies and conditions that cause errors in order to prevent them.
In order to reliably monitor your infrastructure, you need to have specialized tools that fulfil four basic functions:
- Collect data reliably.
- Analyse the data in real time.
- Raise alerts when values deviate from defined parameters.
- Execute automated actions based on criteria defined by the infrastructure manager.
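The four functions above can be sketched as a small loop. This is only an illustration under stated assumptions, not a production design: the `Monitor` class, the `read_metric` callable, the threshold, the window size and the simulated readings are all hypothetical and chosen for the example.

```python
from collections import deque
from statistics import mean
from typing import Callable, Deque, List


class Monitor:
    """Hypothetical monitor covering the four functions: collect, analyse, alert, act."""

    def __init__(self, read_metric: Callable[[], float], threshold: float,
                 window: int, on_alert: Callable[[float], None]) -> None:
        self.read_metric = read_metric            # data collection source
        self.threshold = threshold                # limit defined by the infrastructure manager
        self.samples: Deque[float] = deque(maxlen=window)
        self.on_alert = on_alert                  # automated action to execute
        self.alerts: List[str] = []               # alert log

    def tick(self) -> bool:
        """Run one monitoring cycle; return True if an alert was raised."""
        self.samples.append(self.read_metric())   # 1. collect
        average = mean(self.samples)              # 2. analyse (rolling mean)
        window_full = len(self.samples) == self.samples.maxlen
        if window_full and average > self.threshold:
            # 3. alert
            self.alerts.append(f"average {average:.1f} exceeds {self.threshold}")
            # 4. execute the automated action
            self.on_alert(average)
            return True
        return False


# Simulated CPU readings in percent; a real collector would query the host.
readings = iter([40.0, 55.0, 92.0, 97.0, 99.0])
actions: List[float] = []
monitor = Monitor(lambda: next(readings), threshold=90.0, window=3,
                  on_alert=actions.append)
results = [monitor.tick() for _ in range(5)]
```

Averaging over a window rather than reacting to a single reading is one possible design choice: it avoids alerting on a brief spike, at the cost of a slightly slower reaction.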
Monitoring metrics
Before setting up a monitoring system, it is important to define precisely what you are going to measure. Although you can measure and collect data on a large number of parameters, it is advisable to identify the key metrics first in order to establish an effective automatic response and alert system.
We can classify the monitoring metrics into three large groups:
1. Server metrics, which are used to analyse the performance of individual machines, physical or virtual. Examples include CPU usage, memory usage, disk space and the number of running processes.
2. Application, process or service metrics, such as error rates, failures and restarts, latency and resource usage.
3. Network metrics, which evaluate the availability of the service and the connection between servers. You can evaluate performance by monitoring connectivity, packet loss, latency, and use of available bandwidth.
There are many other metrics that you can use, for example, to monitor the performance of groups of servers, physical or virtual, or external services of providers that may affect your own infrastructure.
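As an illustration of the first group, a few server metrics can be sampled with nothing but the Python standard library. This is a minimal sketch: the `server_snapshot` function and its key names are hypothetical, and memory usage is left out because the standard library has no portable call for it (a third-party library such as psutil is commonly used for that).

```python
import os
import shutil
from typing import Dict, Optional


def server_snapshot(path: str = "/") -> Dict[str, Optional[float]]:
    """Collect a few server metrics using only the standard library."""
    disk = shutil.disk_usage(path)
    cpu_count = float(os.cpu_count() or 1)
    snapshot: Dict[str, Optional[float]] = {
        # Share of the filesystem containing `path` that is in use.
        "disk_used_percent": round(100 * disk.used / disk.total, 1) if disk.total else 0.0,
        "cpu_count": cpu_count,
        # Filled in below when the platform exposes a load average.
        "load_per_cpu_1m": None,
    }
    # os.getloadavg() exists only on Unix-like systems and may raise OSError.
    if hasattr(os, "getloadavg"):
        try:
            load_1m, _, _ = os.getloadavg()
            snapshot["load_per_cpu_1m"] = round(load_1m / cpu_count, 2)
        except OSError:
            pass
    return snapshot


snapshot = server_snapshot()
```

Dividing the load average by the CPU count gives a rough saturation figure that can be compared across machines of different sizes.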
The most important metrics to monitor are usually:
- Network traffic: to identify possible congestion and demand variations. It is also useful as context to explain other metrics. For example, there is often a correlation between traffic volume and latency.
- Latency: to troubleshoot performance and network congestion issues and to locate the bottlenecks that degrade performance.
- Errors: the frequency and type of errors allow you to evaluate the health of components, applications, and services. It is useful to be able to discriminate by type of error to establish a granular alert system that only flags important errors.
- Saturation: this type of metric measures the use of resources. These metrics allow you to detect capacity problems; they also provide clues on possible optimization actions and incidents that have not been detected in other metrics, especially if you identify some kind of correlation.
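The four metrics above can be derived from the same raw data. The sketch below assumes a hypothetical list of request records for one time window and a hypothetical capacity figure in requests per second; it also treats HTTP-style status codes of 500 and above as errors, which is an assumption of the example rather than a rule.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    latency_ms: float
    status: int        # HTTP-style status code (assumption of this example)
    bytes_sent: int


def summarise(requests: List[Request], window_s: float,
              capacity_rps: float) -> Dict[str, float]:
    """Derive traffic, latency, error and saturation figures for one window."""
    count = len(requests)
    rate = count / window_s
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile: the value at position ceil(0.95 * n).
    p95 = latencies[math.ceil(0.95 * count) - 1] if count else 0.0
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "traffic_rps": rate,
        "traffic_bytes": float(sum(r.bytes_sent for r in requests)),
        "latency_p95_ms": p95,
        "error_rate": errors / count if count else 0.0,
        # Saturation as the fraction of the assumed capacity in use.
        "saturation": rate / capacity_rps,
    }


# Ten simulated requests in a 5-second window; two of them fail.
statuses = [200] * 8 + [500, 503]
sample = [Request(latency_ms=10.0 * (i + 1), status=s, bytes_sent=1000)
          for i, s in enumerate(statuses)]
summary = summarise(sample, window_s=5.0, capacity_rps=4.0)
```

Keeping the four figures in one summary makes it easier to spot the correlations mentioned above, for example latency rising together with traffic.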
Best practices and tips for effective IT infrastructure monitoring
- Although it is possible to create your own monitoring solution from scratch, it is usually better to use a specialized third-party service. This saves considerable time and cost, both in building the service and in maintaining it. Vendor solutions also tend to offer more features and a better user experience than a solution developed in-house.
- The monitoring solution must be flexible enough to adapt to your specific needs and allow for a reasonable degree of customization so that both the data and the alert management are useful and effective.
- Use a single tool. A single tool is much easier to manage, as unified data and metrics provide a complete overview of the performance of your systems.
- Create a granular alert system with a scale based on the severity of the incident. This speeds up and improves the way teams respond to problems or errors. It also helps you avoid alert overload, where the sheer volume of alerts reduces their effectiveness.
- Prioritize essential systems, and design a more sensitive monitoring and alert system for this infrastructure. In an ideal world you could monitor all the metrics of your entire infrastructure, but you will most likely have to prioritize according to available resources, the complexity of your infrastructure, the priority of each element and the usefulness of each metric.
- Regularly test your monitoring system to make sure both the data collection and the alert and escalation rules are working correctly.
- Document your monitoring settings so that other people on the team can understand your reasoning when setting up monitoring processes. This also makes it easier to periodically review and optimize your monitoring strategy.
- Ask your monitoring solution provider for help when you need it. Your provider has the experience and knowledge necessary to help you configure an infrastructure monitoring system adapted to your needs.
- Choose a monitoring solution that covers your entire infrastructure. With a single solution, you should be able to monitor virtual server instances on a cloud or IaaS platform, physical servers in a data centre, sensors and any other device that has a network connection. This will allow you to integrate your monitoring and alerts system into a single tool.
- Set up an automatic response system for incidents. In addition to providing flexible and granular alerts, the monitoring system should be able to execute commands and code automatically when a set of defined conditions is met. For example, if the established control limits are exceeded, scripts can carry out automated, unattended troubleshooting of services and connections.
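Two of the practices above, severity-based alerts and automatic responses, can be combined in a few lines. This is a sketch under stated assumptions: the `Severity` levels, the control limits in `LIMITS`, the `respond` function and the `rotate_logs` action are all hypothetical, and a real system would take them from the monitoring tool's own configuration.

```python
from enum import IntEnum
from typing import Callable, Dict, List, Tuple


class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


# Hypothetical control limits: metric name -> (warning limit, critical limit).
LIMITS: Dict[str, Tuple[float, float]] = {
    "disk_used_percent": (80.0, 95.0),
    "error_rate": (0.01, 0.05),
}


def classify(metric: str, value: float) -> Severity:
    """Map a metric value onto the severity scale."""
    warning, critical = LIMITS[metric]
    if value >= critical:
        return Severity.CRITICAL
    if value >= warning:
        return Severity.WARNING
    return Severity.INFO


def respond(metric: str, value: float,
            actions: Dict[str, Callable[[], None]], log: List[str]) -> Severity:
    """Log alerts from WARNING upwards and run the automated action on CRITICAL."""
    severity = classify(metric, value)
    if severity >= Severity.WARNING:
        log.append(f"{severity.name}: {metric}={value}")
    if severity == Severity.CRITICAL and metric in actions:
        actions[metric]()   # unattended response, e.g. a remediation script
    return severity


executed: List[str] = []
alert_log: List[str] = []
# Hypothetical remediation: a real action might call a script that frees disk space.
actions = {"disk_used_percent": lambda: executed.append("rotate_logs")}

levels = [
    respond("disk_used_percent", 97.0, actions, alert_log),
    respond("error_rate", 0.02, actions, alert_log),
    respond("error_rate", 0.001, actions, alert_log),
]
```

Only the critical reading triggers the automated action here; the warning is logged for a person to review, and the value inside its limits produces no alert at all, which helps keep alert volume down.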
Monitoring your infrastructure is key to anticipating incidents that impact the availability of your services. Monitoring also makes it easier to evaluate and analyse the capacity of your infrastructure and to identify opportunities for optimization and improvement.
That is why at Adam we offer our clients a complete monitoring tool for their IT infrastructure that allows them to control their infrastructure in real time, with a flexible and customizable system of alerts and command execution.
This article has been written by
Emilio Moreno
Cloud Solutions Architect - IaaS