Troubleshooting Microservice Architectures
DevOps Stage 2024
Speaker: Nikolay Sivko
About this Talk
How do you diagnose and troubleshoot issues in large-scale distributed systems with hundreds or thousands of interconnected services? This session tackles one of the most pressing challenges in modern platform and site reliability engineering: making sense of complex systems before problems reach users.
Nikolay Sivko, Coroot CEO, breaks down exactly what information you need to understand the health of each service, and how to design dashboards that give engineers an immediate, actionable view of system status.
A core focus is observability: how metrics, logs, traces, and profiles each serve a distinct role, and how to use them together to build a complete picture of what's happening inside your infrastructure. Rather than treating these signals as interchangeable, the session explores the unique perspective each one offers and when to reach for which.
The session also addresses incident analysis, one of the most time-consuming parts of incident response, and examines practical approaches to automating it. It weighs the trade-offs between fully automated and partially automated solutions, including where automation accelerates diagnosis and where human judgment remains essential.
If you're responsible for observability strategy, on-call reliability, or building internal developer platforms, this session offers a clear framework for turning raw telemetry into faster, more confident incident resolution.