Our Current Setup
With just shy of a thousand physical machines running 24/7/365, in addition to an increasing number of virtual machines and containers, we have a lot of moving parts to keep on top of. Some of the services are unique to our company and need special attention. Others are more standard, like webservers or databases.
As such we’ve cultivated an ever growing arsenal of monitoring solutions for various use cases:
- Icinga for the basics: checking servers, services, and triggering alerts,
- OpenTSDB with an HBase backend for deeper inspections and debugging purposes,
- Observium for the network devices over SNMP, and
- Graffix for some special aggregations.
What we need
The current monitoring model in Icinga is simple. Place a sensor, determine a threshold, send an alarm if the threshold is breached. This model is still viable and has served us well for years, but we’ve outgrown its capabilities and have become more and more restricted by its limitations.
For example, when Icinga reads a sensor (a script) and issues an alert, it does so at runtime and in isolation. This is fine most of the time, but we’ve reached a point where simple warnings can be too reductive and not very informative. It limits the flexibility and increases the difficulty of the analysis.
We need a more encompassing monitoring solution that can handle the complexities our system requires.
Ideally we’d like to have:
- a single, unified solution for all of our use cases
- a better view on the current state of the services and servers
- highly configurable sensors and triggers
- sensors inside containers with a minimal footprint
- configurable dashboards
- a metrics database for long term analysis
- hassle free configuration
Zabbix: an Overview
We found Zabbix to be a good fit and a pragmatic solution to our requirements. Zabbix is like a scientific research institute on its own. It collects metrics over specified periods of time, stores them in a database, persists history, and creates trends.
In contrast to the example above, Zabbix collects all the metrics into a database where the server can use multiple measurements to do a more detailed analysis and trigger an alarm more appropriate to a specific trend. Whereas Icinga might only check what percentage of a disk is in use at a given time, Zabbix could, for example, easily calculate the rate of increased disk usage over hours or whatever timescale we specify, including predicting when the disk will be full — all without jumping through hoops.
This different paradigm not only increases the depth of our knowledge but also speeds up the creation of new sensors.
In addition to an accurate, precise alert system, it also creates beautiful graphs for analysis.

We now have the flexibility to satisfy all the needs for daily operation and development. As the operations team, we are able to determine root problems within the system faster than before. The developers receive immediate feedback on how the deployment in staging affects the performance, and our business intelligence has access to much more relevant data about how the Hadoop cluster is behaving.
We can automate the process of introducing new servers into the monitoring solution with our provisioning solution by letting it install the Zabbix Agent and feeding it some metadata like:
- which operating system is running,
- which cluster it belongs to,
- the host group it should be part of, or
- which data center the new server is hosted in.
HostMetadata=:production:linux:hadoop:name-node:ops:
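As a rough illustration of how such a metadata string could be taken apart, here is a minimal Python sketch. The field order (environment, operating system, cluster, role, team) is only our reading of the example above, and `parse_host_metadata` is a hypothetical helper, not something Zabbix ships.

```python
from typing import NamedTuple


class HostMetadata(NamedTuple):
    environment: str
    os: str
    cluster: str
    role: str
    team: str


def parse_host_metadata(raw: str) -> HostMetadata:
    """Split a colon-delimited HostMetadata value into named fields.

    Assumes the hypothetical layout :environment:os:cluster:role:team:
    """
    # Leading and trailing colons produce empty strings, which we drop.
    fields = [part for part in raw.strip().split(":") if part]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {raw!r}")
    return HostMetadata(*fields)


meta = parse_host_metadata(":production:linux:hadoop:name-node:ops:")
print(meta.cluster, meta.role)
```

Wrapping every token in colons also makes simple substring matching safer: a rule looking for `:linux:` will not accidentally match a longer token that merely contains the word.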
This new paradigm comes with somewhat of a learning curve, but ultimately will save us a lot of time in the long run, improving the quality and speed of our monitoring.
There is more… 🙂
In-depth Analysis
As soon as we plugged Zabbix in with its default templates and fed it some of our clusters for data collection, we immediately found some things we could improve that we hadn’t seen before. This was directly due to the different view on the same data and saved us a lot of headaches in just the first days we fiddled around with Zabbix.
The data analysis possibilities in Zabbix are quite extensive, from simple Triggers like:
{Template Linux OS:system.cpu.util[,iowait].avg(5m)}>40
which generates an average from iowait over the last five minutes and triggers an alert if it is over 40, to predictive triggers:
{host:vfs.fs.size[/,free].timeleft(1h,,0)}<1h and
{host:vfs.fs.size[/,free].timeleft(1h,,0)}<>-1
which tries to estimate the time until the disk will be full.
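To give a feel for what such a forecast involves, here is a minimal Python sketch that fits a straight line through recent samples and extrapolates to a threshold. It is our own simplified illustration using a least-squares fit, not Zabbix's actual implementation; `time_left` and the sample data are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def time_left(samples: Sequence[Tuple[float, float]],
              threshold: float = 0.0) -> Optional[float]:
    """Estimate seconds until a linear trend through `samples` hits `threshold`.

    `samples` are (unix_timestamp, value) pairs, e.g. free space on a disk.
    Returns None if there is too little data or the fitted line does not
    reach the threshold in the future.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Ordinary least-squares slope and intercept.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope == 0:
        return None
    intercept = mean_v - slope * mean_t
    t_hit = (threshold - intercept) / slope
    remaining = t_hit - samples[-1][0]
    return remaining if remaining > 0 else None


# Free space shrinking by one unit per minute, starting at 100 units.
history = [(0, 100), (60, 99), (120, 98), (180, 97)]
print(time_left(history))  # roughly 5820 seconds, i.e. about 97 minutes
```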
Zabbix offers a lot of options; see "Supported trigger functions" in the Zabbix documentation. "Forecasting Trigger Functions" is also an interesting read about predictive functions.
These are only a few examples of how to tailor the triggers for your specific requirements.
Low-level Discovery
Ever wonder if it would be possible to magically have all your network interfaces and file-systems discovered in an instant? Or maybe you have an Elasticsearch cluster and need to discover all its nodes and indexes while also retrieving the same metrics from each of those? Well, for Zabbix that is not a problem.
We just set up some rules, make some prototype items, sit back, and relax. It does not matter how many Elasticsearch nodes you have or if the servers have different disks or different network cards.
Low-level discovery really lives up to its name. It goes deep into the operating system’s or application files that hold this information and sets up a prototype metric that triggers the creation of the individual metrics for each mounted file-system, network interface, or node in Elasticsearch.
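The mechanics can be pictured with a small Python sketch: a discovery rule returns JSON whose keys are macros such as `{#FSNAME}`, and each discovered row is substituted into an item prototype. The `{"data": [...]}` wrapper is the shape we are assuming here (newer Zabbix versions may accept a plain array, so check the documentation for your version), and both helper functions are hypothetical.

```python
import json
from typing import Dict, Iterable, List


def build_lld_payload(filesystems: Iterable[Dict[str, str]]) -> str:
    """Render discovered file systems as discovery JSON with {#MACRO} keys."""
    rows = [{"{#FSNAME}": fs["name"], "{#FSTYPE}": fs["type"]}
            for fs in filesystems]
    return json.dumps({"data": rows})


def expand_prototype(prototype: str, lld_json: str) -> List[str]:
    """Substitute each discovered row's macros into an item prototype key."""
    keys = []
    for row in json.loads(lld_json)["data"]:
        key = prototype
        for macro, value in row.items():
            key = key.replace(macro, value)
        keys.append(key)
    return keys


discovered = [{"name": "/", "type": "ext4"}, {"name": "/data", "type": "xfs"}]
payload = build_lld_payload(discovered)
item_keys = expand_prototype("vfs.fs.size[{#FSNAME},free]", payload)
print(item_keys)  # ['vfs.fs.size[/,free]', 'vfs.fs.size[/data,free]']
```

One prototype therefore yields one concrete item per file system, no matter how many disks a particular server happens to have.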

Easy Configuration and Management
With the newly available provisioning solutions, configuring each of our Zabbix agents (clients) for its specific metric collection on specific services is a piece of cake, and the Zabbix user interface provides us with all the tools necessary for nesting our hosts into host groups and grouping our metrics into templates for easy distribution across the host groups.
Want to monitor a specific service like HTTP availability? Not a problem. The Zabbix interface has many solutions for such scenarios. Or maybe a newly installed system needs to be monitored? After the server is provisioned, it actively presents itself to the Zabbix server which, with its Auto-Registration rules, specifies through the configuration metadata which group the new host should belong to and which template of items and triggers it should get.
LDAP and Role-based Configuration
The Zabbix user interface supports LDAP user authentication out of the box. Although we cannot directly import our LDAP groups and create user groups for the Zabbix interface automatically, the Zabbix API gives us the ability to script this task and automate the process.
Here is a straightforward example implementation: http://dnaeon.github.io/importing-ad-ldap-users-and-groups-into-zabbix
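Independently of the linked implementation, the core of such a script is a comparison between what LDAP says and what Zabbix already has. The following Python sketch only computes that plan from two in-memory dictionaries; `plan_group_sync` and the sample data are hypothetical, and applying the plan would go through API methods such as `usergroup.create` and `user.update`.

```python
from typing import Dict, List, Set, Tuple


def plan_group_sync(
    ldap_members: Dict[str, Set[str]],
    zabbix_members: Dict[str, Set[str]],
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Work out which groups to create and which users to add to which group."""
    groups_to_create = sorted(set(ldap_members) - set(zabbix_members))
    users_to_add = {}
    for group, members in ldap_members.items():
        # Users present in LDAP but missing from the Zabbix group.
        missing = members - zabbix_members.get(group, set())
        if missing:
            users_to_add[group] = sorted(missing)
    return groups_to_create, users_to_add


ldap = {"ops": {"alice", "bob"}, "dev": {"carol"}}
zabbix = {"ops": {"alice"}}
to_create, to_add = plan_group_sync(ldap, zabbix)
print(to_create)  # ['dev']
print(to_add)     # {'ops': ['bob'], 'dev': ['carol']}
```

Keeping the planning step separate from the API calls makes it easy to do a dry run before anything is changed.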
Network Monitoring / Network Map
Having an overview of our network is important, to say the very least. Zabbix provides not only information about which network devices we have and where they are, but also how they connect to one another and how busy the network is.
Zabbix uses the SNMP protocol to detect network devices like switches and their ports and also routers. As we already said, as soon as Zabbix discovered our switches with the default SNMP metric template, through low-level discovery we were offered a range of metrics related to network speed and port performance.
Container Monitoring
Zabbix does not come with a native implementation for container monitoring, but it sure has more than enough tools to let us get creative. Its ability to calculate or aggregate metrics allows us to come up with interesting formulae to solve problems.
Containers are ephemeral, so we can’t treat them like servers. If a container is restarted, the old one is gone forever and needs to be cleaned up. Using cAdvisor for Kubernetes and retrieving data in JSON (which Zabbix can easily parse) directly from its API, we can feed live information about our containers to Zabbix and create the appropriate template from the collection of metrics we desire.
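As a sketch of the kind of processing involved, the Python snippet below turns two consecutive samples into a CPU rate and a memory reading. The payload is a trimmed, made-up example modeled on what we understand cAdvisor's container stats to look like; the exact field names and the timestamp format (real timestamps carry fractional seconds) are assumptions to verify against your cAdvisor version.

```python
import json
from datetime import datetime

# Hypothetical, simplified payload; cumulative CPU usage is in nanoseconds.
SAMPLE = """
{
  "name": "/docker/abc123",
  "stats": [
    {"timestamp": "2017-01-01T12:00:00Z",
     "cpu": {"usage": {"total": 1000000000}},
     "memory": {"usage": 52428800}},
    {"timestamp": "2017-01-01T12:00:10Z",
     "cpu": {"usage": {"total": 3000000000}},
     "memory": {"usage": 62914560}}
  ]
}
"""


def parse_ts(value: str) -> datetime:
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def container_metrics(raw: str) -> dict:
    """Derive a CPU rate and the latest memory usage from the first and last sample."""
    info = json.loads(raw)
    first, last = info["stats"][0], info["stats"][-1]
    elapsed = (parse_ts(last["timestamp"]) - parse_ts(first["timestamp"])).total_seconds()
    if elapsed <= 0:
        raise ValueError("need at least two samples taken at different times")
    cpu_ns = last["cpu"]["usage"]["total"] - first["cpu"]["usage"]["total"]
    return {
        "container": info["name"],
        "cpu_cores_used": cpu_ns / 1e9 / elapsed,
        "memory_bytes": last["memory"]["usage"],
    }


print(container_metrics(SAMPLE))
```

The resulting flat dictionary is the sort of structure that is easy to hand over to Zabbix as individual item values.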
Service Availability
When thinking of services provided to internal or external customers, the focus is primarily on the business aspect of those services. A client does not actually care about CPUs, file systems, or any other underlying application issues. The only thing that matters is whether the offered service works within the expected parameters or not.
To check a business service in this manner, one just has to implement a few sensors that perform typical transactions against the very same interface a customer would access. These sensors feed a Zabbix IT services hierarchy that instantly shows which business service is impacted and which customers are affected, independently of the root cause.
There are different levels of services which need to be monitored to ensure a good customer experience. Zabbix covers the following:
- What are the IT services (business processes) offered to or used by customer(s)?
This could be, for instance, a Web User Interface, an API, or any kind of interface that is either used by a customer or within a customer’s business process. IT services are likely subject to a Service Level Agreement.
- What are the applications (or other IT services) each IT service is built of?
This could be web applications based on Java or any other application that sends/receives/uses business process related data.
- What are the respective dependencies between related applications (or IT services)?
This could mean the sequence of applications or services involved in (part of) a business process as well as any interconnection like data buses between applications.
- What are the physical or virtual platforms the applications are deployed on?
This could be any kind of server or container (Docker, VMs, …) as well as physical components, devices or appliances. Such platforms may also be based on each other and so may appear like applications too. But here an application is always related to representing or implementing business logic and processing.
- What are the auxiliary IT-/services, applications, platforms?
This could be a database, remote storage, file transfer service, scheduling system, etc.
Source and more Information: http://www.zabbix.org/wiki/Zabbix_IT_services_%E2%80%93_an_often_underrated_feature
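The way a problem bubbles up through such a hierarchy can be sketched in a few lines of Python. This is a hypothetical model, not Zabbix code: each leaf carries the state reported by a sensor, and each parent is in a problem state either if any child has a problem or, for redundant setups, only if all children do.

```python
from dataclasses import dataclass, field
from typing import List

OK, PROBLEM = 0, 1


@dataclass
class Service:
    name: str
    status: int = OK           # only meaningful for leaves fed by a sensor
    require_all: bool = False  # True: problem only if ALL children fail
    children: List["Service"] = field(default_factory=list)

    def effective_status(self) -> int:
        if not self.children:
            return self.status
        states = [child.effective_status() for child in self.children]
        failed = all if self.require_all else any
        return PROBLEM if failed(s == PROBLEM for s in states) else OK


db1, db2 = Service("db1", PROBLEM), Service("db2")
db_cluster = Service("database cluster", require_all=True, children=[db1, db2])
web_ui = Service("web user interface", children=[Service("api"), db_cluster])

print(web_ui.effective_status())  # 0: one database node down, cluster still serves
db2.status = PROBLEM
print(web_ui.effective_status())  # 1: the whole cluster is down, the UI is impacted
```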
Dependencies and Event Correlation
Alerting is a big part of monitoring solutions. We have to be alerted when a certain condition is met. In Zabbix you formulate these conditions in the triggers.
When we analyze a metric and set a threshold (trigger) for it, we are, in most cases, looking at a symptom, not the root problem. This makes precise alerting difficult: typically the person on call is woken up at 4:00 a.m. to look at a symptom and try to figure out where the problem might have come from. This is far from ideal.
The guys from Zabbix have definitely not overlooked this problem. With trigger dependencies we can, with a certain degree of flexibility, define dependencies between triggers.
For example, if a host is behind a router that is down, then the host is obviously also unreachable. Clearly we don’t want or need to receive two notifications about both being down. Or extending this to services, if a database cluster behind a webserver is overloaded, and the web interface is not responding, we might want to know first if something is wrong with the database.
If we work with an intricate system that depends on several applications like Aerospike, Kafka, Hadoop, Zookeeper, Storm, etc., we might want to correlate some of those triggers so we can easily identify the probable root of the problem even at 4:00 in the morning.
# Example for trigger dependencies
{mysql1:mysql.status[Questions].last()} > 5000
and
{mysql2:mysql.status[Questions].last()} > 5000
and
{mysql3:mysql.status[Questions].last()} > 5000
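The suppression idea behind trigger dependencies can also be sketched in Python. The function and trigger names below are hypothetical, and following dependencies transitively is our own simplification: an active trigger is only worth a notification if nothing it depends on is itself active.

```python
from typing import Dict, List, Set


def triggers_to_notify(active: Set[str],
                       depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return the active triggers whose dependencies are all quiet."""

    def blocked(name: str, seen: Set[str]) -> bool:
        for parent in depends_on.get(name, []):
            if parent in seen:
                continue  # guard against cycles in the dependency map
            if parent in active or blocked(parent, seen | {parent}):
                return True
        return False

    return {trigger for trigger in active if not blocked(trigger, {trigger})}


active = {"router down", "host unreachable", "web slow"}
depends_on = {
    "host unreachable": ["router down"],
    "web slow": ["db overloaded"],
}
print(sorted(triggers_to_notify(active, depends_on)))
# ['router down', 'web slow']
```

Here the unreachable host stays silent because its router is already reported, while the slow web interface is still reported because the database trigger it depends on is not firing.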
Another interesting aspect to look into is trigger-based event correlation which allows us to correlate separate problems reported by one trigger. Let’s say we’re monitoring a system log to check whether a service has stopped or started. The event correlation tags the trigger for “service has stopped” and also tags the trigger for “service has started.” We do not want to be notified if the service has stopped if it immediately starts again, but we do want to be notified if it stopped for a long period of time. This flexibility improves the alerting and makes sure everything runs smoothly and quietly until a real problem shows up.
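The stop/start logic can be illustrated with a short Python sketch. Zabbix expresses this through tags and action settings rather than code, so the function below only mirrors the reasoning; the event format, the `grace` period, and the service names are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str, str]  # (timestamp, service, "stopped" or "started")


def outages_to_report(events: Iterable[Event], now: float,
                      grace: float = 60.0) -> List[Tuple[str, float]]:
    """Pair each 'stopped' with the next 'started' for the same service.

    Only outages that lasted (or have lasted so far) longer than `grace`
    seconds are returned, as (service, stopped_at) pairs.
    """
    open_since: Dict[str, float] = {}
    reported: List[Tuple[str, float]] = []
    for ts, service, state in sorted(events):
        if state == "stopped":
            open_since.setdefault(service, ts)
        elif state == "started" and service in open_since:
            stopped_at = open_since.pop(service)
            if ts - stopped_at > grace:
                reported.append((service, stopped_at))
    # Services that are still down at evaluation time.
    for service, stopped_at in open_since.items():
        if now - stopped_at > grace:
            reported.append((service, stopped_at))
    return sorted(reported)


events = [(0, "nginx", "stopped"), (5, "nginx", "started"),
          (100, "kafka", "stopped")]
print(outages_to_report(events, now=200))  # [('kafka', 100)]
```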

Zabbix API
On top of all the goodies one may find in Zabbix, it also comes with an API that provides access to almost all functionality available in Zabbix. Its API opens up a lot of opportunities for even greater efficiency in monitoring by allowing us to easily integrate with any software able to make or accept external calls.
Integration with ticketing systems such as JIRA is possible, so triggers can tell JIRA to create a ticket for a specific department or person.
Another great use of the Zabbix API is when you want to add hundreds or thousands of devices for monitoring, applying very specific custom rules that would not be possible via the Zabbix web interface.
Integration with provisioning tools such as Saltstack could also be useful when adding, removing, or upgrading hardware or software.
Maintenance periods can be scripted and distributed among developers for a quick and effective way to set downtimes during testing of new functionality.
A final use case is the creation of host groups en masse. With simple API calls we can have a script which creates hundreds of these instead of having to click our way to certain arthritis!
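A minimal Python sketch of that last use case might look like the following. It builds one JSON-RPC `hostgroup.create` request per data center and cluster combination; the URL, the group naming scheme, and the token are hypothetical, and passing the token in an `auth` field reflects older API versions (newer ones may expect an Authorization header instead).

```python
import json
import urllib.request
from itertools import product
from typing import Dict, Iterable, List

API_URL = "https://zabbix.example.com/api_jsonrpc.php"  # hypothetical host


def hostgroup_requests(datacenters: Iterable[str], clusters: Iterable[str],
                       auth_token: str) -> List[Dict]:
    """Build one hostgroup.create request per data center / cluster pair."""
    names = [f"{dc}/{cluster}" for dc, cluster in product(datacenters, clusters)]
    return [
        {"jsonrpc": "2.0", "method": "hostgroup.create",
         "params": {"name": name}, "auth": auth_token, "id": i}
        for i, name in enumerate(names, start=1)
    ]


def send(request: Dict, url: str = API_URL) -> Dict:
    """POST a single request to the API (not called in this sketch)."""
    body = json.dumps(request).encode("utf-8")
    http_request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json-rpc"})
    with urllib.request.urlopen(http_request) as response:
        return json.loads(response.read().decode("utf-8"))


requests_to_send = hostgroup_requests(["dc1", "dc2"], ["hadoop", "kafka"], "TOKEN")
print([r["params"]["name"] for r in requests_to_send])
# ['dc1/hadoop', 'dc1/kafka', 'dc2/hadoop', 'dc2/kafka']
```

Looping over `requests_to_send` and calling `send` on each would then create the groups against a real server.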
Zabbix Community
Last but not least, we have access to an enormous, active, and thriving community. Being an open-source project gives users the ability to modify or create new functionality for any given purpose or use case. Not only that, but Zabbix LLC has its own community website where users can publish their creations, making it easier to find new modules, templates, scripts, etc.
Furthermore the Zabbix forum is very active and provides answers for common problems and interesting solutions for more specific ones.
Conclusion
There are many types of monitoring tools and many requirements a monitoring tool should fulfill. As a SaaS company that is fundamentally dependent on the functioning and availability of our infrastructure and services, it’s imperative to have an excellent monitoring solution in place to safeguard not only our business but also the businesses of our clients. We found that Zabbix is a solution that allows us to focus on improving our monitoring without having to worry about finding workarounds for the shortcomings of our old system.
And not to understate: it’s also a lot of fun 🙂
PS: We currently have an open position in our team: DEVOPS ENGINEER (W/M) IT OPERATIONS