I have a background in managing IT ops teams and building a cloud monitoring platform. Here are some observations based on that experience:
Monitoring software could do so much more if it only knew what engineers know. For instance, if a container's CPU time is limited by throttling, that container will run slower. Most engineers would recognize that these two events are related because they understand the physical meaning of the metrics. Current tools simply overlook the connection.
Another limitation of existing tools is their inability to see the whole picture. For example, it helps to understand that an application is a set of containers that work together. A metric describing an individual container tells you little on its own, whereas the same metric taken across all related containers can be quite revealing.
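To make the idea of an application-level metric concrete, here is a minimal sketch that rolls per-container CPU throttling time up to the application. The container names, the mapping, and the numbers are made up for illustration; this is not how Coroot implements it.

```python
def throttling_by_application(throttled_seconds, container_to_app):
    """Sum per-container throttled CPU time for each application."""
    totals = {}
    for container, seconds in throttled_seconds.items():
        # Containers without a known application go into an "unknown" bucket.
        app = container_to_app.get(container, "unknown")
        totals[app] = totals.get(app, 0.0) + seconds
    return totals


# Hypothetical sample data: throttled CPU seconds over some time window.
throttled = {"catalog-1": 0.5, "catalog-2": 1.5, "frontend-1": 0.25}
mapping = {
    "catalog-1": "product-catalog",
    "catalog-2": "product-catalog",
    "frontend-1": "frontend",
}

per_app = throttling_by_application(throttled, mapping)
```

Looking at `per_app` rather than at three separate container series shows at a glance which application is being throttled as a whole.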
I’ve seen a lot of engineers work on fixing issues. Interestingly enough, they all act practically the same. They check a similar set of hypotheses about a possible root cause, confirming or rejecting them one by one: compute resources, upstream services, databases, etc.
Of course, this list is not exhaustive since each application might have its own specific issues. Nevertheless, imagine if you were able to check for common pitfalls in just one second.
We’ve built Coroot on the belief that more than 80% of issues can be detected automatically. Coroot is a virtual assistant that audits your infrastructure just like an experienced engineer would. It:
In order to get a clear picture of a distributed system, Coroot makes a model of it. The first main step here is to group individual containers into services or applications.
Previously, meta-information for the grouping was only available in manifests of configuration management tools, such as Chef, Ansible, and Puppet. Nowadays, the widespread adoption of Kubernetes makes it easier to extract this data. However, it is not enough.
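As a rough illustration of the grouping step, the sketch below groups containers by the Kubernetes object that owns them. It assumes the owner kind and name have already been extracted into simple records; the `Container` type and the sample names are hypothetical. The rule for ReplicaSets relies on the fact that a ReplicaSet created by a Deployment is named `<deployment>-<hash>`, and it is a heuristic rather than Coroot's actual logic.

```python
from collections import defaultdict
from typing import NamedTuple


class Container(NamedTuple):
    name: str
    namespace: str
    owner_kind: str  # e.g. "ReplicaSet", "StatefulSet", "DaemonSet"
    owner_name: str


def application_key(c: Container) -> str:
    owner = c.owner_name
    if c.owner_kind == "ReplicaSet":
        # Heuristic: drop the trailing "-<hash>" to recover the Deployment name.
        owner = owner.rsplit("-", 1)[0]
    return f"{c.namespace}/{owner}"


def group_by_application(containers):
    groups = defaultdict(list)
    for c in containers:
        groups[application_key(c)].append(c.name)
    return dict(groups)


# Hypothetical sample records.
containers = [
    Container("product-catalog-7d9f8-abcde", "default", "ReplicaSet", "product-catalog-7d9f8"),
    Container("product-catalog-7d9f8-fghij", "default", "ReplicaSet", "product-catalog-7d9f8"),
    Container("db-0", "default", "StatefulSet", "db"),
]

apps = group_by_application(containers)
```

A standalone ReplicaSet whose own name contains a hyphen would be grouped incorrectly by this rule; a more careful version would follow the owner reference of the ReplicaSet itself.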
We also need to know how applications communicate with each other. Coroot uses network-level tracing to get a comprehensive map of all network connections in a cluster. This approach has a number of advantages over application-level tracing:
Network tracing is a part of our open-source node-agent which is compatible with Prometheus.
Below you can see a model of the product-catalog application. It has two running instances, and three applications communicate with it: kubelet, prometheus, and frontend. Product-catalog itself depends only on the pg-main-1 database.
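The model described above can be pictured as a small graph structure. The sketch below is an illustration only: the `Application` class, the `build_model` helper, and the instance names are assumptions, while the application names and connections come from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Application:
    name: str
    instances: list = field(default_factory=list)
    clients: set = field(default_factory=set)       # applications calling this one
    dependencies: set = field(default_factory=set)  # applications this one calls


def build_model(instances, connections):
    """Build application nodes from (app, instance) pairs and (src, dst) edges."""
    apps = {}

    def get(name):
        if name not in apps:
            apps[name] = Application(name)
        return apps[name]

    for app, instance in instances:
        get(app).instances.append(instance)
    for src, dst in connections:
        get(src).dependencies.add(dst)
        get(dst).clients.add(src)
    return apps


# Instance names are hypothetical placeholders.
instances = [
    ("product-catalog", "product-catalog-a"),
    ("product-catalog", "product-catalog-b"),
]
# Observed network connections, as (source application, destination application).
connections = [
    ("kubelet", "product-catalog"),
    ("prometheus", "product-catalog"),
    ("frontend", "product-catalog"),
    ("product-catalog", "pg-main-1"),
]

model = build_model(instances, connections)
catalog = model["product-catalog"]
```

With this structure, `catalog.clients` and `catalog.dependencies` give an inspection the upstream and downstream context it needs.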
A model like this is already pretty useful since it provides visibility into the architecture of the distributed system. At Coroot, we went further by using application models for root cause analysis.
Unfortunately, using thresholds for metrics is not always possible, since not all applications are built the same. For instance, an application with a response time objective of 10ms would be affected by even the slightest increase in network latency, so setting a low threshold for network latency seems reasonable. However, doing this may generate a large number of false positives for services that are resilient to such delays.
Using a model, such as the one described above, empowers Coroot to audit all subsystems within an application’s context. In other words, each inspection is aware of the instances, nodes, and services related to any particular application.
For example, the Storage inspection checks the correlation between database latency and its volumes’ performance:
The strong correlation seen here means the database latency has been affected by the storage I/O latency. If an application is meeting its SLOs in the short term, there is no need to perform such inspections at all. This eliminates false positives in root cause analysis.
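The logic of such an inspection can be sketched as follows. This is a simplified illustration, not Coroot's implementation: it uses a plain Pearson correlation, the 0.8 cutoff is an arbitrary assumption, and the latency series are made-up numbers.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def storage_inspection(slo_violated, db_latency_ms, io_latency_ms, min_corr=0.8):
    """Return None if skipped, otherwise whether storage looks like the cause."""
    if not slo_violated:
        return None  # the application is healthy, so nothing to explain
    return pearson(db_latency_ms, io_latency_ms) >= min_corr


# Hypothetical series sampled at the same timestamps.
db_latency_ms = [5, 6, 20, 35, 8]
io_latency_ms = [1, 1, 10, 18, 2]

skipped = storage_inspection(False, db_latency_ms, io_latency_ms)
suspect = storage_inspection(True, db_latency_ms, io_latency_ms)
```

Gating the check on an SLO violation is what keeps a coincidental correlation in a healthy application from being reported as an issue.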
The coolest thing is that this approach does not require any configuration other than the user-defined application SLOs.
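To give a sense of how little configuration that is, here is one way a latency SLO could be expressed and checked. The `LatencySLO` class, its fields, and the sample numbers are hypothetical and do not reflect Coroot's configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySLO:
    objective_ms: float  # a request counts as "fast" at or below this latency
    target_ratio: float  # required share of fast requests, between 0.0 and 1.0


def is_meeting_slo(slo, latencies_ms):
    if not latencies_ms:
        return True  # no traffic, nothing violated
    fast = sum(1 for latency in latencies_ms if latency <= slo.objective_ms)
    return fast / len(latencies_ms) >= slo.target_ratio


# Hypothetical objective: 75% of requests served within 10 ms.
slo = LatencySLO(objective_ms=10, target_ratio=0.75)

healthy = is_meeting_slo(slo, [5, 6, 7, 50])
degraded = is_meeting_slo(slo, [5, 6, 50, 60])
```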
Coroot aims to help in troubleshooting by providing you with a list of possible fixes and useful details on every detected issue.
Below is an example of the Storage inspection’s report:
As you can see, there are two primary ways to fix the issue: reduce the I/O load or increase the volume performance. However, Coroot goes a step further and collects all the details needed to troubleshoot:
As we work on Coroot, we are constantly asking ourselves: