At Coroot we are building a tool that helps engineers find the root cause of infrastructure outages. Such an audit requires accurate and detailed telemetry data for every application. At the low level, we want to know about process starts and exits, the opening and closing of TCP connections and listening sockets, TCP retransmits, file opens, OOM kills, etc.
There are many ways to obtain such data by reading procfs and sysfs or by running utilities like netstat and lsof. However, no matter how often we poll these subsystems, we will inevitably miss some events, e.g. short-lived processes and TCP connections. Using eBPF allows an agent to get real-time notifications from the kernel about all of these events.
eBPF allows us to insert small pieces of code almost anywhere in the kernel. Two common ways to insert our code are kprobe/kretprobe and tracepoint.
The kprobe way is flexible – you can ask the kernel to run a program every time a given kernel function is called. However, it is also fragile – kernel functions and their arguments can be changed, renamed, or deleted in a later kernel version.
That’s why we prefer using tracepoints – special pre-defined kernel hooks which represent events of interest. They are considered a “stable API”, so their details shouldn’t change from one kernel version to the next.
Our eBPF programs are observers: they trace each tracepoint call, read its arguments, and report events to userspace. When necessary, a program can store intermediate data in kernel-space eBPF maps. A special map type, BPF_MAP_TYPE_PERF_EVENT_ARRAY, is used to deliver events to userspace.
Let’s see what a simple eBPF program looks like – for example, tracing the start and exit of a process. Linux kernel provides handy tracepoints to track processes: task/task_newtask and sched/sched_process_exit.
The arguments format of a particular tracepoint can be found in the corresponding file in sysfs:
# cat /sys/kernel/debug/tracing/events/task/task_newtask/format
name: task_newtask
ID: 138
format:
	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;

	field:pid_t pid;	offset:8;	size:4;	signed:1;
	field:char comm[16];	offset:12;	size:16;	signed:1;
	field:unsigned long clone_flags;	offset:32;	size:8;	signed:0;
	field:short oom_score_adj;	offset:40;	size:2;	signed:1;

print fmt: "pid=%d comm=%s clone_flags=%lx oom_score_adj=%hd", REC->pid, REC->comm, REC->clone_flags, REC->oom_score_adj
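Rather than hard-coding offsets by hand, the field:/offset: lines can also be parsed programmatically. Here is a minimal Go sketch (a hypothetical helper, not part of the agent) that extracts field offsets from such a format dump:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// A fragment of the format dump printed by
// /sys/kernel/debug/tracing/events/task/task_newtask/format.
const format = `field:int common_pid; offset:4; size:4; signed:1;
field:pid_t pid; offset:8; size:4; signed:1;
field:char comm[16]; offset:12; size:16; signed:1;
field:unsigned long clone_flags; offset:32; size:8; signed:0;`

// Matches the field name (last identifier before ";") and its offset.
var fieldRe = regexp.MustCompile(`field:.*?(\w+)(?:\[\d+\])?;\s+offset:(\d+);`)

// parseOffsets maps each field name to its byte offset within the tracepoint record.
func parseOffsets(dump string) map[string]int {
	offsets := map[string]int{}
	for _, m := range fieldRe.FindAllStringSubmatch(dump, -1) {
		off, _ := strconv.Atoi(m[2])
		offsets[m[1]] = off
	}
	return offsets
}

func main() {
	offsets := parseOffsets(format)
	fmt.Println(offsets["pid"], offsets["comm"], offsets["clone_flags"]) // 8 12 32
}
```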
The first four fields are a part of a tracepoint’s call context and are of no interest. Below we see how the trace_event_raw_task_newtask structure can be declared:
#define TASK_COMM_LEN 16

struct trace_event_raw_task_newtask {
    __u64 unused;  // the common_* context fields
    __u32 pid;
    char comm[TASK_COMM_LEN];
    long unsigned int clone_flags;
};
We don’t have to define the entire structure, only the needed fields (pid, clone_flags) and “paddings” (unused, comm) in between.
Here is a handler for the task_newtask tracepoint:
#define CLONE_THREAD 0x00010000

// event type and reason values (illustrative)
#define EVENT_TYPE_PROCESS_START 1
#define EVENT_TYPE_PROCESS_EXIT  2
#define EVENT_REASON_OOM_KILL    1

struct proc_event {
    __u32 type;    // EVENT_TYPE_PROCESS_START | EVENT_TYPE_PROCESS_EXIT
    __u32 pid;
    __u32 reason;  // 0 | EVENT_REASON_OOM_KILL
};

struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(key_size, sizeof(int));
    __uint(value_size, sizeof(int));
} proc_events SEC(".maps");

SEC("tracepoint/task/task_newtask")
int task_newtask(struct trace_event_raw_task_newtask *args)
{
    if (args->clone_flags & CLONE_THREAD) { // ignoring threads
        return 0;
    }
    struct proc_event e = {
        .type = EVENT_TYPE_PROCESS_START,
        .pid = args->pid,
    };
    bpf_perf_event_output(args, &proc_events, BPF_F_CURRENT_CPU, &e, sizeof(e));
    return 0;
}
The “task” in the task_newtask tracepoint name covers both processes and threads. Thread creation can be filtered out by checking clone_flags for CLONE_THREAD.
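On the userspace side, each perf record carries the raw bytes of a proc_event. A hedged Go sketch of decoding such a record (the field order mirrors the C struct above; the little-endian layout is an assumption matching an x86-64 target, and the actual agent uses its own types):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// procEvent mirrors struct proc_event from the eBPF program: three __u32 fields.
type procEvent struct {
	Type   uint32
	PID    uint32
	Reason uint32
}

// decodeProcEvent parses the 12 raw bytes delivered via the perf event array.
func decodeProcEvent(raw []byte) (procEvent, error) {
	var e procEvent
	err := binary.Read(bytes.NewReader(raw), binary.LittleEndian, &e)
	return e, err
}

func main() {
	// 12 bytes as the kernel side would emit them for a
	// process-start event (type 1) of pid 1234 (0x04d2).
	raw := []byte{1, 0, 0, 0, 0xd2, 0x04, 0, 0, 0, 0, 0, 0}
	e, err := decodeProcEvent(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("type=%d pid=%d reason=%d\n", e.Type, e.PID, e.Reason) // type=1 pid=1234 reason=0
}
```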
The next program is a bit more complicated because we want to detect whether a process was killed by the Out-of-Memory killer. This example shows how to share data between tracepoints.
We can record the kernel's intent to kill a process by handling the oom/mark_victim tracepoint:
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
    __uint(max_entries, 10240);
} oom_info SEC(".maps");

struct trace_event_raw_mark_victim {
    __u64 unused;
    int pid;
};

SEC("tracepoint/oom/mark_victim")
int oom_mark_victim(struct trace_event_raw_mark_victim *args)
{
    __u32 pid = args->pid;
    bpf_map_update_elem(&oom_info, &pid, &pid, BPF_ANY);
    return 0;
}
The oom_mark_victim handler doesn’t send any events. It just saves the pid into the oom_info hash map (BPF_MAP_TYPE_HASH).
In the sched_process_exit handler, we enrich the EVENT_TYPE_PROCESS_EXIT event with the EVENT_REASON_OOM_KILL reason if we have seen the pid in the oom_info map.
struct trace_event_raw_sched_process_template {
    __u64 unused;
    char comm[TASK_COMM_LEN];
    __u32 pid;
};

SEC("tracepoint/sched/sched_process_exit")
int sched_process_exit(struct trace_event_raw_sched_process_template *args)
{
    __u64 id = bpf_get_current_pid_tgid();
    if (id >> 32 != (__u32)id) { // ignoring threads
        return 0;
    }
    struct proc_event e = {
        .type = EVENT_TYPE_PROCESS_EXIT,
        .pid = args->pid,
    };
    if (bpf_map_lookup_elem(&oom_info, &e.pid)) {
        e.reason = EVENT_REASON_OOM_KILL;
        bpf_map_delete_elem(&oom_info, &e.pid);
    }
    bpf_perf_event_output(args, &proc_events, BPF_F_CURRENT_CPU, &e, sizeof(e));
    return 0;
}
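The thread check above works because bpf_get_current_pid_tgid packs the thread group id into the upper 32 bits and the thread id into the lower 32 bits, so the two halves differ only for non-main threads. The same logic, sketched in Go for illustration:

```go
package main

import "fmt"

// isThread reports whether a pid_tgid value, packed the way
// bpf_get_current_pid_tgid returns it (tgid << 32 | pid),
// belongs to a non-main thread.
func isThread(pidTgid uint64) bool {
	tgid := uint32(pidTgid >> 32) // thread group id (the "process" pid)
	pid := uint32(pidTgid)        // this thread's id
	return tgid != pid
}

func main() {
	mainThread := uint64(1234)<<32 | 1234   // tgid == pid: the main thread
	workerThread := uint64(1234)<<32 | 1235 // tgid != pid: a worker thread
	fmt.Println(isThread(mainThread), isThread(workerThread)) // false true
}
```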
Since we are developing a tool that will run on many nodes, it should have as few dependencies as possible. This is the main reason why we cannot use the BCC (BPF Compiler Collection) approach, where an eBPF program is compiled on the target machine: that requires a compiler, libraries, and kernel headers to be installed on every node.
The main challenge in delivering a pre-compiled eBPF program is that it has to support multiple kernel versions. We aimed to support kernel versions starting from 4.16, because that covers most currently supported Linux distributions. This kernel requirement also allows us to use tracepoints as the more stable API, as opposed to kprobes.
In fact, tracepoint arguments can still change from one kernel version to another. For example, the arguments of the tcp_retransmit_skb tracepoint changed over time: the state field was added in kernel 4.20, and the family field in 5.12.
To handle these changes we build a separate eBPF program for each structure variant:
struct trace_event_raw_tcp_event_sk_skb {
    __u64 unused;
    void *skbaddr;
    void *skaddr;
#if __KERNEL >= 420
    int state;
#endif
    __u16 sport;
    __u16 dport;
#if __KERNEL >= 512
    __u16 family;
#endif
    __u8 saddr[4];
    __u8 daddr[4];
    __u8 saddr_v6[16];
    __u8 daddr_v6[16];
};
clang -g -O2 -target bpf -D__TARGET_ARCH_x86 -D__KERNEL=416 -c ebpf.c -o ebpf416.o && llvm-strip --strip-debug ebpf416.o
clang -g -O2 -target bpf -D__TARGET_ARCH_x86 -D__KERNEL=420 -c ebpf.c -o ebpf420.o && llvm-strip --strip-debug ebpf420.o
clang -g -O2 -target bpf -D__TARGET_ARCH_x86 -D__KERNEL=512 -c ebpf.c -o ebpf512.o && llvm-strip --strip-debug ebpf512.o
Node-agent loads the specific ebpf*.o ELF file corresponding to the kernel version on a particular node.
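The selection logic can be sketched in Go: parse the running kernel release (as reported by uname -r) and pick the newest compiled variant whose minimum version the kernel satisfies. The variant table and function below are illustrative assumptions matching the three builds above, not the agent's actual code:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// variants lists the minimum kernel version of each compiled eBPF object,
// newest first (matching the -D__KERNEL build flags above).
var variants = []struct {
	major, minor int
	object       string
}{
	{5, 12, "ebpf512.o"},
	{4, 20, "ebpf420.o"},
	{4, 16, "ebpf416.o"},
}

// pickVariant chooses an object file for a kernel release like "5.4.0-97-generic".
func pickVariant(release string) (string, bool) {
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return "", false
	}
	major, _ := strconv.Atoi(parts[0])
	minor, _ := strconv.Atoi(parts[1])
	for _, v := range variants {
		if major > v.major || (major == v.major && minor >= v.minor) {
			return v.object, true
		}
	}
	return "", false // kernel older than 4.16: unsupported
}

func main() {
	for _, r := range []string{"4.18.0-25-generic", "5.4.0-97-generic", "5.13.0-28-generic"} {
		obj, _ := pickVariant(r)
		fmt.Println(r, "->", obj)
	}
}
```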
The main disadvantage of this approach is the need to track changes in kernel structures.
BPF CO-RE (Compile Once – Run Everywhere) is a modern approach that allows an eBPF program to read kernel structures regardless of the actual memory layout of a struct.
Unfortunately, the requirement to support 4.* kernel versions doesn’t allow us to rely solely on CO-RE yet. Still, let’s see what the tcp_retransmit_skb handler could look like with it:
struct trace_event_raw_tcp_event_sk_skb {
    __u16 sport;
    __u16 dport;
    __u8 saddr_v6[16];
    __u8 daddr_v6[16];
};

SEC("tracepoint/tcp/tcp_retransmit_skb")
int tcp_retransmit_skb(struct trace_event_raw_tcp_event_sk_skb *args)
{
    struct tcp_event e = {
        .type = EVENT_TYPE_TCP_RETRANSMIT,
    };
    BPF_CORE_READ_INTO(&e.sport, args, sport);
    BPF_CORE_READ_INTO(&e.dport, args, dport);
    BPF_CORE_READ_INTO(&e.saddr, args, saddr_v6);
    BPF_CORE_READ_INTO(&e.daddr, args, daddr_v6);
    bpf_perf_event_output(args, &tcp_retransmit_events, BPF_F_CURRENT_CPU, &e, sizeof(e));
    return 0;
}
As seen here, we declared only the relevant fields. Under the hood, the eBPF program loader corrects kernel struct field offsets by using BTF (BPF Type Format). So, we no longer need to worry about any changes in underlying structures.
To ensure that the node-agent catches the needed events, we wrote a set of integration tests. The tests run in dedicated VMs, which allows us to test the code on different kernel versions and reliably reproduce events such as out-of-memory kills and packet loss.
For example, for testing tcp_retransmit_skb we use the tc tool:
c, err = net.DialTimeout("tcp", remoteAddr, 100*time.Millisecond)
require.NoError(t, err)
localAddr = c.LocalAddr().String()
nextIs(EventTypeConnectionOpen, localAddr, remoteAddr, pid)

require.NoError(t, exec.Command("tc", "qdisc", "add", "dev", "lo", "root", "netem", "loss", "100%").Run())
c.Write([]byte("hello"))
nextIs(EventTypeTCPRetransmit, localAddr, remoteAddr, 0)
Having such tests allows us to easily check our eBPF programs for portability. Currently, we are testing on several Ubuntu versions: 18.10, 20.04, 20.10, 21.10.
~/coroot-node-agent/ebpftracer$ make test
docker build -t ebpftracer .
...
docker cp 0dac5915c72737aa57ef81bc69b51ec486bb035c3e6f12a68a8040a550a167cc:/tmp/ebpf.go ./ebpf.go

vagrant ssh ubuntu1810 -c "uname -r && cd /tmp/src && sudo go test -p 1 -count 1 -v ./ebpftracer/..."
4.18.0-25-generic
--- PASS: TestProcessEvents (6.33s)
--- PASS: TestTcpEvents (3.86s)
--- PASS: TestFileEvents (6.60s)

vagrant ssh ubuntu2004 -c "uname -r && cd /tmp/src && sudo go test -p 1 -count 1 -v ./ebpftracer/..."
5.4.0-97-generic
--- PASS: TestProcessEvents (5.65s)
--- PASS: TestTcpEvents (3.80s)
--- PASS: TestFileEvents (6.69s)

vagrant ssh ubuntu2010 -c "uname -r && cd /tmp/src && sudo go test -p 1 -count 1 -v ./ebpftracer/..."
5.8.0-63-generic
--- PASS: TestProcessEvents (5.13s)
--- PASS: TestTcpEvents (3.65s)
--- PASS: TestFileEvents (6.57s)

vagrant ssh ubuntu2110 -c "uname -r && cd /tmp/src && sudo go test -p 1 -count 1 -v ./ebpftracer/..."
5.13.0-28-generic
--- PASS: TestProcessEvents (6.20s)
--- PASS: TestTcpEvents (4.48s)
--- PASS: TestFileEvents (7.01s)