For anyone managing Graylog in a production environment, stability is paramount. And quite often, the root of instability in Graylog lies within its search and storage engine: Elasticsearch. In high-volume logging scenarios, a robust Elasticsearch cluster isn't just a recommendation; it's a necessity if you want to keep your data analysis environment sane.
However, simply having an Elasticsearch cluster is no guarantee of immunity to problems. Failures can occur, performance bottlenecks can arise, and when they do, Graylog inevitably suffers. It was with this reality in mind that I embarked on a mission to create an effective monitoring system, directly in Grafana, to keep a close watch on the health of our cluster.
Understanding Our Architecture
Before diving into the monitoring setup, it's important to understand the environment we're working with. Our Graylog deployment is designed for high availability and resilience. Here is a diagram of our system architecture:
Our high-level Graylog architecture, showing the data flow from sources to long-term backup.
As the diagram illustrates, logs from various servers and services are sent to an NGINX load balancer. This load balancer then distributes the traffic across a three-node Graylog cluster. While Graylog uses a MongoDB replica set to store its configurations and metadata, the actual log data is processed and sent to a four-node Elasticsearch cluster for indexing and storage. Finally, our backup strategy involves taking regular snapshots of the Elasticsearch data to an NFS repository, which is then archived to tape for long-term retention.
The "Cluster Elastic Search" in this diagram is the critical component we need to monitor. Its health directly impacts our ability to ingest and search logs.
The Monitoring Blueprint: A Draft
With a clear understanding of the architecture, we began by drafting a blueprint for our Grafana dashboard. The goal was to have a single pane of glass for critical metrics, allowing us to quickly spot any warning signs. This initial sketch laid out the core components of our monitoring strategy.
The initial concept for our dashboard, focusing on status and disk health.
The plan was to start with two essential panels:
- Elastic Search Statuses: This panel is the immediate health check. Its background color tells a simple story: green if the node's health API returns an HTTP 200 status (the node is responsive), and red if it doesn't. This gives us an instant "is it alive?" status for each node, along with its current CPU utilization.
- Elastic Search Disk IO: This panel addresses the most common failure point for Elasticsearch: disk space. The background color here acts as a traffic light based on free disk space: green for healthy (>30% free), yellow as a warning (>10% and <=30% free), and red for critical (<10% free). The graphs themselves show the disk read/write rates, helping us spot I/O bottlenecks. (A sketch of the color logic behind both panels follows this list.)
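To make those color rules concrete, here is a small sketch of the decision logic behind both panels. It is plain Python rather than our actual Grafana or Zabbix configuration; the function names are made up for illustration and simply restate the thresholds above.

```python
# Sketch of the panel color rules described above. Illustrative Python only,
# not our Grafana or Zabbix configuration; the function names are hypothetical.

def status_color(http_code):
    """Elastic Search Statuses panel: green when the node's health API answers
    with HTTP 200, red for any other response (or no response at all)."""
    return "green" if http_code == 200 else "red"

def disk_color(free_pct):
    """Elastic Search Disk IO panel: traffic light based on free disk space."""
    if free_pct > 30:
        return "green"   # healthy
    if free_pct > 10:
        return "yellow"  # warning
    return "red"         # critical

# Quick self-check against the rules in the list above
assert status_color(200) == "green"
assert status_color(None) == "red"     # node did not respond at all
assert disk_color(45.0) == "green"
assert disk_color(25.0) == "yellow"    # >10% and <=30% free
assert disk_color(7.5) == "red"
```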
The data for all these metrics is collected by our trusted Zabbix server, which monitors both the underlying system infrastructure and the Elasticsearch API endpoints.
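To give an idea of where those numbers come from, the sketch below polls the node stats API, which reports each node's process CPU percentage and filesystem totals (from which the free-space percentage is derived). The hostname is a placeholder, and this is an illustration of the endpoint rather than our actual Zabbix item definitions.

```python
# Sketch: read per-node CPU and disk figures from the node stats API.
# Illustrative only; the hostname is a placeholder and this is not our
# actual Zabbix configuration.
import json
import urllib.request

with urllib.request.urlopen(
    "http://es-node-01:9200/_nodes/stats/process,fs", timeout=5
) as resp:
    stats = json.load(resp)

# Querying any single node returns stats for every node in the cluster.
for node in stats["nodes"].values():
    cpu_pct = node["process"]["cpu"]["percent"]
    fs_total = node["fs"]["total"]
    free_pct = 100.0 * fs_total["available_in_bytes"] / fs_total["total_in_bytes"]
    print(f"{node['name']}: cpu={cpu_pct}%  disk free={free_pct:.1f}%")
```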
The Final Result: An Actionable Dashboard
After translating our draft into a live dashboard in Grafana, we finally have the visibility we need. The dashboard provides a clear, immediate, and actionable overview of our Elasticsearch cluster's health, allowing us to move from being reactive to proactive.
Here is the completed dashboard in action, providing the insights described in our plan:
The final result for our dashboard, after some rework.
With this tool, we can now quickly correlate any Graylog instability with a specific issue in the Elasticsearch cluster—whether it's a non-responsive node or a disk that's rapidly filling up. This enhanced visibility is a crucial step in ensuring a stable and reliable log management platform.