Coroot v1.7: monitoring ClickHouse and Zookeeper with eBPF

At Coroot, we started using eBPF to give users insights into their system performance without needing them to change code or redeploy services. This approach not only makes setup easier but also ensures full visibility, even for third-party and legacy services. To truly achieve this, though, the tool needs to support a wide range of application protocols. Coroot has long supported popular ones like HTTP, gRPC, Postgres, MySQL, Redis, Memcached, MongoDB, Kafka, and Cassandra.

But we realized we were missing two important protocols: the ClickHouse native protocol and the ZooKeeper protocol. Based on our anonymous usage stats, ClickHouse is incredibly popular among our users, which is not surprising, since we use it ourselves at Coroot!

ClickHouse can be queried over several protocols, including HTTP and the Postgres and MySQL wire protocols. However, the most efficient client libraries rely on ClickHouse’s native protocol, making it crucial for us to support.

ZooKeeper, on the other hand, might seem like a niche tool that’s only directly used in large-scale infrastructures. In reality, it’s a key part of systems like Kafka and ClickHouse, helping ensure consistency in distributed setups. While Kafka can now run without ZooKeeper by using its own Raft-based consensus protocol (KRaft), ZooKeeper is still widely used in production. Similarly, ClickHouse often uses ClickHouse Keeper, a drop-in replacement for ZooKeeper that speaks the same protocol.

By supporting both the ClickHouse and ZooKeeper protocols, we can achieve full visibility into ClickHouse setups, even those using ClickHouse Keeper, and into Kafka clusters that rely on ZooKeeper.

Implementation

In Coroot’s agent code, we already capture various system calls to monitor service-to-service communications, even handling encrypted payloads by instrumenting OpenSSL and GoTLS functions. Protocol parsing in Coroot works in two stages:

  • Super-efficient protocol detection in eBPF programs
  • Deep protocol parsing in user-space code


This two-stage approach is necessary because eBPF programs have to remain lightweight to avoid impacting application performance. The eBPF programs hand events to user-space code through perf event maps, which are per-CPU ring buffers. In the worst-case scenario, if the user-space program can’t process the event stream quickly enough, some events are dropped. From my perspective, this is a great trade-off for observability: the agent may miss some data under load, but it does not hold up the applications it is monitoring.
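To make that trade-off concrete, here is a minimal Go sketch of a lossy bounded buffer. It is not the agent's code: the real buffers live in kernel memory and are filled by eBPF programs, while this model uses a buffered channel, and the names `Event` and `LossyBuffer` are made up for illustration. The property it shows is that the producer never waits, so a slow consumer costs dropped events rather than application latency.

```go
package main

import "fmt"

// Event stands in for a record emitted by the kernel side (hypothetical type).
type Event struct {
	ID int
}

// LossyBuffer is a fixed-capacity queue that never blocks the producer.
// This sketch is single-goroutine; Dropped is not safe for concurrent use.
type LossyBuffer struct {
	ch      chan Event
	Dropped int
}

func NewLossyBuffer(capacity int) *LossyBuffer {
	return &LossyBuffer{ch: make(chan Event, capacity)}
}

// Push enqueues e if there is room. Otherwise it counts the event as lost
// and returns immediately, so a slow consumer cannot stall the producer.
func (b *LossyBuffer) Push(e Event) bool {
	select {
	case b.ch <- e:
		return true
	default:
		b.Dropped++
		return false
	}
}

// Drain returns everything currently buffered without blocking.
func (b *LossyBuffer) Drain() []Event {
	var out []Event
	for {
		select {
		case e := <-b.ch:
			out = append(out, e)
		default:
			return out
		}
	}
}

func main() {
	buf := NewLossyBuffer(4)
	// The consumer is "too slow" here: nothing is read while 10 events arrive.
	for i := 0; i < 10; i++ {
		buf.Push(Event{ID: i})
	}
	got := buf.Drain()
	fmt.Printf("delivered=%d dropped=%d\n", len(got), buf.Dropped)
}
```

Counting the drops, as `Dropped` does here, is what lets a consumer at least report how much of the stream it missed.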

Now, about the ClickHouse and ZooKeeper protocols: neither is well documented. Here’s a quote from the ClickHouse documentation:

“The native protocol is used in the command-line client, for inter-server communication during distributed query processing, and also in other C++ programs. Unfortunately, native ClickHouse protocol does not have formal specification yet, but it can be reverse-engineered from ClickHouse source code (starting around here) and/or by intercepting and analyzing TCP traffic.”

So, we had to reverse-engineer both protocols by studying the ClickHouse and ZooKeeper source code.

It wasn’t particularly difficult, thanks to our previous experience working with various protocols. And, to my knowledge, Coroot is the first tool on the market to support instrumentation of these protocols with eBPF.
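As an illustration of what falls out of the ZooKeeper source, here is a rough Go sketch of the two stages applied to a ZooKeeper request. It relies on the framing defined in ZooKeeper's jute records: a 4-byte big-endian length, then a request header of `xid` and operation code, then a body that, for the operations listed, begins with the znode path. The function names are hypothetical, the sketch assumes a whole frame arrives in one buffer, it ignores the initial connect handshake, and it is not Coroot's actual parser.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// A subset of ZooKeeper operation codes (ZooDefs.OpCode in the ZooKeeper source).
var opNames = map[int32]string{
	1: "create", 2: "delete", 3: "exists", 4: "getData",
	5: "setData", 8: "getChildren", 11: "ping", 12: "getChildren2",
}

// ZkRequest holds the fields this sketch extracts from one request frame.
type ZkRequest struct {
	Xid  int32  // client-chosen request ID, echoed back in the reply
	Op   string // operation name
	Path string // znode path; empty for ping
}

// LooksLikeZkRequest is the cheap first stage: fixed offsets, no allocation.
// Frame layout: int32 length, int32 xid, int32 op code, then the body,
// all big-endian. Assumes the buffer holds exactly one whole frame.
func LooksLikeZkRequest(p []byte) bool {
	if len(p) < 12 {
		return false
	}
	if int64(binary.BigEndian.Uint32(p[0:4])) != int64(len(p)-4) {
		return false
	}
	_, known := opNames[int32(binary.BigEndian.Uint32(p[8:12]))]
	return known
}

// ParseZkRequest is the second stage: it decodes the header and, for the
// operations listed above, the path string that starts the body.
func ParseZkRequest(p []byte) (ZkRequest, error) {
	if !LooksLikeZkRequest(p) {
		return ZkRequest{}, errors.New("not a recognized ZooKeeper request")
	}
	op := int32(binary.BigEndian.Uint32(p[8:12]))
	req := ZkRequest{Xid: int32(binary.BigEndian.Uint32(p[4:8])), Op: opNames[op]}
	if op == 11 { // ping carries no body
		return req, nil
	}
	body := p[12:]
	if len(body) < 4 {
		return req, errors.New("truncated path length")
	}
	n := int64(binary.BigEndian.Uint32(body[0:4])) // strings: int32 length + bytes
	if n > int64(len(body)-4) {
		return req, errors.New("truncated path")
	}
	req.Path = string(body[4 : 4+int(n)])
	return req, nil
}

// encodeGetData builds a getData frame; it exists only to exercise the parser.
func encodeGetData(xid int32, path string, watch bool) []byte {
	frame := make([]byte, 16+len(path)+1)
	binary.BigEndian.PutUint32(frame[0:], uint32(len(frame)-4))
	binary.BigEndian.PutUint32(frame[4:], uint32(xid))
	binary.BigEndian.PutUint32(frame[8:], 4) // op code 4 = getData
	binary.BigEndian.PutUint32(frame[12:], uint32(len(path)))
	copy(frame[16:], path)
	if watch {
		frame[16+len(path)] = 1
	}
	return frame
}

func main() {
	frame := encodeGetData(42, "/clickhouse/task_queue", true)
	req, err := ParseZkRequest(frame)
	fmt.Println(req, err)
}
```

The first function only compares a few fixed-offset integers, which is the kind of check that fits in an eBPF program. Everything that needs variable-length reads stays in the second function.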

I’ve got to say, the simplicity and efficiency of the ClickHouse native protocol really stood out to me. It’s clear the ClickHouse team is obsessed with performance, and huge props to Alexey Milovidov for driving that culture!
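To give a feel for that simplicity, here is a small Go sketch that decodes the client's first packet, Hello. It is based on my reading of the ClickHouse source rather than on a formal specification: integers are written as unsigned LEB128 varints, strings as a varint length followed by raw bytes, and the Hello packet is a packet type of 0 followed by the client name, major and minor version, protocol revision, database, user, and password. The type and function names, and the version numbers in `main`, are illustrative only.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Hello holds the fields of the client's first packet on a connection.
type Hello struct {
	ClientName             string
	Major, Minor, Revision uint64
	Database, User         string
}

// reader walks a byte slice holding native-protocol primitives.
type reader struct {
	buf []byte
	pos int
}

// varUInt reads an unsigned LEB128 integer: 7 payload bits per byte,
// with the high bit set on every byte except the last.
func (r *reader) varUInt() (uint64, error) {
	v, n := binary.Uvarint(r.buf[r.pos:])
	if n <= 0 {
		return 0, errors.New("truncated or overlong varint")
	}
	r.pos += n
	return v, nil
}

// str reads a string: a varint byte length followed by the raw bytes.
func (r *reader) str() (string, error) {
	n, err := r.varUInt()
	if err != nil {
		return "", err
	}
	if n > uint64(len(r.buf)-r.pos) {
		return "", errors.New("truncated string")
	}
	s := string(r.buf[r.pos : r.pos+int(n)])
	r.pos += int(n)
	return s, nil
}

// ParseClientHello decodes a Hello packet and discards the password.
func ParseClientHello(p []byte) (Hello, error) {
	r := &reader{buf: p}
	var h Hello
	kind, err := r.varUInt()
	if err != nil {
		return h, err
	}
	if kind != 0 { // client packet type 0 is Hello
		return h, errors.New("not a Hello packet")
	}
	if h.ClientName, err = r.str(); err != nil {
		return h, err
	}
	for _, dst := range []*uint64{&h.Major, &h.Minor, &h.Revision} {
		if *dst, err = r.varUInt(); err != nil {
			return h, err
		}
	}
	for _, dst := range []*string{&h.Database, &h.User} {
		if *dst, err = r.str(); err != nil {
			return h, err
		}
	}
	_, err = r.str() // password: consumed but deliberately not kept
	return h, err
}

// appendVarUInt and appendStr build test input for the parser.
func appendVarUInt(b []byte, v uint64) []byte {
	var tmp [binary.MaxVarintLen64]byte
	return append(b, tmp[:binary.PutUvarint(tmp[:], v)]...)
}

func appendStr(b []byte, s string) []byte {
	return append(appendVarUInt(b, uint64(len(s))), s...)
}

func main() {
	p := appendStr(appendVarUInt(nil, 0), "ClickHouse client")
	for _, v := range []uint64{24, 8, 54468} { // example version numbers
		p = appendVarUInt(p, v)
	}
	for _, s := range []string{"default", "default", ""} {
		p = appendStr(p, s)
	}
	fmt.Println(ParseClientHello(p))
}
```

Two primitives cover the whole packet, and Go's standard `binary.Uvarint` already implements the same varint encoding, so the decoder needs no third-party code.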

Results

Now let’s look at how we can use the collected data to easily understand ClickHouse performance.

First, we can view all the ClickHouse clients along with the number of queries they’ve made and their latency:

[Figure: viewing ClickHouse clients]

With this data, Coroot automatically tracks Service Level Indicators (SLIs) for your ClickHouse cluster, keeping you informed if there are many failed queries or if latency increases.

To visualize the overall performance of any application, Coroot uses heatmaps. These are incredibly useful for understanding how latency and errors are distributed over time. Plus, you can zoom in on any area of the chart to identify the specific queries within that range.
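If heatmaps are new to you, the idea can be sketched in a few lines of Go: each request lands in a cell chosen by its time column and its latency row, failed requests go to a separate error row, and the count in each cell drives the color. This is only a model of the concept, not how Coroot computes its charts. The `Sample` and `Cell` types and the bucket bounds are assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

// Sample is one observed request (a hypothetical shape for this sketch).
type Sample struct {
	AtUnix  int64 // completion time, Unix seconds
	Latency time.Duration
	Failed  bool
}

// Cell addresses one square of the heatmap.
type Cell struct {
	Column int64 // start of the time column, Unix seconds
	Row    int   // latency bucket index, or ErrorRow
}

// ErrorRow is the row that collects failed requests.
const ErrorRow = -1

// bounds are the upper edges of the latency rows (illustrative values).
var bounds = []time.Duration{
	ms(5), ms(10), ms(25), ms(50), ms(100), ms(250), ms(500), ms(1000),
}

func ms(n int) time.Duration { return time.Duration(n) * time.Millisecond }

// row returns the index of the first bucket whose upper edge covers d,
// or len(bounds) for anything slower than the last edge.
func row(d time.Duration) int {
	for i, b := range bounds {
		if d <= b {
			return i
		}
	}
	return len(bounds)
}

// Heatmap counts samples per (time column, latency row) cell.
// stepSeconds is the column width and must be positive.
func Heatmap(samples []Sample, stepSeconds int64) map[Cell]int {
	cells := make(map[Cell]int)
	for _, s := range samples {
		c := Cell{Column: s.AtUnix / stepSeconds * stepSeconds, Row: row(s.Latency)}
		if s.Failed {
			c.Row = ErrorRow
		}
		cells[c]++
	}
	return cells
}

func main() {
	samples := []Sample{
		{AtUnix: 100, Latency: ms(3)},
		{AtUnix: 101, Latency: ms(4)},
		{AtUnix: 104, Latency: ms(700)},
		{AtUnix: 106, Latency: ms(1), Failed: true},
	}
	for cell, count := range Heatmap(samples, 5) {
		fmt.Printf("column=%d row=%d count=%d\n", cell.Column, cell.Row, count)
	}
}
```

Zooming in on a region of the chart then amounts to selecting a range of columns and rows and listing the requests that fell into those cells.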

Since ClickHouse uses its native protocol not only for client communication but also for server-to-server interactions, Coroot captures queries in both scenarios.

As engineers, we’re constantly learning because of the sheer variety of technologies out there. I didn’t know much about ZooKeeper initially, but implementing protocol capturing for it was a fun learning experience, especially seeing how ClickHouse communicates with ZooKeeper in action!

[Figure: Latency & Errors heatmap per second]

In my experience, this kind of insight turns services from “black boxes” into something we can understand, helps us learn more about distributed systems, and gives us confidence to quickly find the root cause when things go wrong. 

Conclusion

With Coroot’s support for the ClickHouse native protocol and ZooKeeper protocol, we’re taking observability to the next level for distributed systems. By providing deep insights into both client and server communication, Coroot helps you better understand your infrastructure, eliminate blind spots, and confidently troubleshoot issues when they arise. Observability shouldn’t be complicated, and with Coroot, it’s easier than ever to make your systems transparent and reliable.

Ready to make your entire system observable? Get started with Coroot Community Edition for free, or start a free trial of Coroot Enterprise Edition for advanced capabilities.
