Container metrics
Each container metric has the container_id label. This is a compound identifier whose
format depends on the container type, e.g.,
/docker/upbeat_borg, /k8s/namespace-1/pod-2/container-3, or /system.slice/nginx.service.
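The three example formats above can be told apart by their first path segment. The following is a minimal sketch, not the agent's own logic: the function name `classify_container_id` and the type names `kubernetes`, `docker`, and `systemd` are assumptions made for illustration, and only the formats shown above are handled.

```python
def classify_container_id(container_id: str) -> dict:
    """Split a container_id label value into its parts (illustrative only)."""
    parts = container_id.strip("/").split("/")
    # /k8s/<namespace>/<pod>/<container>
    if parts[0] == "k8s" and len(parts) == 4:
        return {"type": "kubernetes", "namespace": parts[1],
                "pod": parts[2], "container": parts[3]}
    # /docker/<container name>
    if parts[0] == "docker" and len(parts) == 2:
        return {"type": "docker", "name": parts[1]}
    # /system.slice/<unit>
    if parts[0].endswith(".slice"):
        return {"type": "systemd", "unit": parts[-1]}
    # Any other format is returned untouched.
    return {"type": "unknown", "raw": container_id}
```

For example, `classify_container_id("/k8s/namespace-1/pod-2/container-3")` yields the namespace, pod, and container names as separate fields.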
- CPU
- Memory
- Disk
- Network
- Application layer protocol metrics
- JVM
- .NET runtime
- Logs
- Node metrics
- Other
CPU
CPU limit of the container.
CPU cgroup, cpu.cfs_quota_us and cpu.cfs_period_us files.
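As a rough illustration of how the two files combine into a limit: the quota divided by the period gives the number of CPU cores the container may use, and a quota of -1 means no limit is set. The function name `cpu_limit_cores` is hypothetical, and this sketch assumes cgroup v1 semantics.

```python
from typing import Optional

def cpu_limit_cores(quota_us: int, period_us: int) -> Optional[float]:
    """Return the CPU limit in cores, or None if the container is unlimited."""
    # cgroup v1 reports cpu.cfs_quota_us = -1 when no quota is configured.
    if quota_us <= 0 or period_us <= 0:
        return None
    return quota_us / period_us
```

A quota of 50000 with the default period of 100000 corresponds to half a core.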
Total CPU time consumed by the container.
CPU accounting cgroup, cpuacct.usage file.
Total time the container has been throttled.
CPU cgroup, cpu.stat file.
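A minimal sketch of reading the throttled time from the contents of cpu.stat, assuming the cgroup v1 layout in which `throttled_time` is reported in nanoseconds. The function name `throttled_seconds` is hypothetical; the agent's actual parsing may differ.

```python
def throttled_seconds(cpu_stat_text: str) -> float:
    """Extract throttled_time from cgroup v1 cpu.stat and convert ns to seconds."""
    for line in cpu_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "throttled_time":
            return int(value) / 1e9
    # The key is absent, e.g. when CFS bandwidth control is not in use.
    return 0.0
```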
Total time the container has spent waiting for a CPU while runnable.
Delay accounting.
Memory
Memory limit of the container.
Memory cgroup, file memory.limit_in_bytes.
Amount of physical memory used by the container (doesn't include page cache).
Memory cgroup, file memory.stat.
Amount of page cache memory allocated by the container.
Memory cgroup, file memory.stat.
Total number of times the container has been terminated by the OOM killer.
eBPF: tracepoint/oom/mark_victim
Disk
Total time the container has spent waiting for I/O to complete.
Delay accounting.
Total capacity of the volume.
statfs()
mount_point - path in the mount namespace of the container
device - device name, e.g., vda, nvme1n1
Used capacity of the volume.
statfs()
mount_point, device, volume
Total number of reads or writes completed successfully by the container.
Blkio cgroup, blkio.throttle.io_serviced file.
mount_point, device, volume
Total number of bytes read from the disk or written to the disk by the container.
Blkio cgroup, blkio.throttle.io_service_bytes file.
mount_point, device, volume
Network
A TCP listen address of the container.
eBPF: tracepoint/sock/inet_sock_set_state, /proc/<pid>/net/tcp, /proc/<pid>/net/tcp6
listen_addr - ip:port
proxy - dockerd if the container publishes its ports through the docker daemon. Undefined for other cases.
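To illustrate how an ip:port listen address can be recovered from /proc/<pid>/net/tcp, here is a hedged sketch. It assumes an IPv4 table read on a little-endian host (where the address appears as byte-reversed hex) and relies on the kernel's state code 0A for listening sockets. The function name `listen_addrs` is hypothetical and this is not the agent's implementation.

```python
import socket

def listen_addrs(proc_net_tcp: str) -> list:
    """Return ip:port strings for sockets in the LISTEN state (IPv4 only)."""
    addrs = []
    for line in proc_net_tcp.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # fields[1] is local_address, fields[3] is the socket state.
        if len(fields) < 4 or fields[3] != "0A":
            continue
        hex_ip, hex_port = fields[1].split(":")
        # The address is stored byte-reversed on little-endian hosts.
        ip = socket.inet_ntoa(bytes.fromhex(hex_ip)[::-1])
        addrs.append(f"{ip}:{int(hex_port, 16)}")
    return addrs
```

The IPv6 table in /proc/<pid>/net/tcp6 uses a longer address encoding and is not covered by this sketch.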
Total number of successful TCP connects.
eBPF: tracepoint/sock/inet_sock_set_state
destination - IP:port of the connection destination
actual_destination - actual IP:port of the connection.
For example, the container was establishing a connection to port 80 of a Kubernetes service IP 10.96.1.1.
The destination address was translated by iptables to a pod IP, e.g., 10.40.1.5.
In this case actual_destination would be 10.40.1.5:80.
Total number of retransmitted TCP segments.
This metric is collected only for outbound TCP connections.
eBPF: tracepoint/tcp/tcp_retransmit_skb
destination, actual_destination
Total number of failed TCP connects to a particular endpoint.
The agent takes into account only TCP failures, so this metric doesn't reflect DNS errors.
eBPF: tracepoint/sock/inet_sock_set_state
destination - IP:port of the connection destination
Number of active outbound connections between the container and a particular endpoint.
eBPF: tracepoint/sock/inet_sock_set_state
destination, actual_destination
Round-trip time between the container and a remote IP.
The agent measures the round-trip time of an ICMP request sent to an IP address the container is currently working with.
destination_ip
Application layer protocol metrics
Total number of outbound HTTP requests made by the container
eBPF
destination_ip, actual_destination, status
Histogram of the response time for each outbound HTTP request
eBPF
destination_ip, actual_destination, le
Total number of outbound Postgres queries made by the container
eBPF
destination_ip, actual_destination, status
Histogram of the response time for each outbound Postgres query
eBPF
destination_ip, actual_destination, le
Total number of outbound Redis queries made by the container
eBPF
destination_ip, actual_destination, status
Histogram of the response time for each outbound Redis query
eBPF
destination_ip, actual_destination, le
Total number of outbound Memcached queries made by the container
eBPF
destination_ip, actual_destination, status
Histogram of the response time for each outbound Memcached query
eBPF
destination_ip, actual_destination, le
Total number of outbound MySQL queries made by the container
eBPF
destination_ip, actual_destination, status
Histogram of the response time for each outbound MySQL query
eBPF
destination_ip, actual_destination, le
Total number of outbound MongoDB queries made by the container
eBPF
destination_ip, actual_destination, status
Histogram of the response time for each outbound MongoDB query
eBPF
destination_ip, actual_destination, le
Total number of outbound Kafka requests made by the container
eBPF
destination_ip, actual_destination, status
Histogram of the response time for each outbound Kafka request
eBPF
destination_ip, actual_destination, le
Total number of outbound Cassandra queries made by the container
eBPF
destination_ip, actual_destination, status
Histogram of the response time for each outbound Cassandra query
eBPF
destination_ip, actual_destination, le
Total number of RabbitMQ messages produced or consumed by the container
eBPF
destination_ip, actual_destination, status, method
Total number of NATS messages produced or consumed by the container
eBPF
destination_ip, actual_destination, status, method
Total number of outbound Dubbo requests made by the container
eBPF
destination_ip, actual_destination, status
Histogram of the response time for each outbound Dubbo request
eBPF
destination_ip, actual_destination, le
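The le label on the histogram metrics above suggests Prometheus-style cumulative buckets, where each bucket counts the requests that completed within its upper bound. Under that assumption, a latency quantile can be estimated by linear interpolation inside the bucket that contains the target rank. The function name `estimate_quantile` and the input layout are hypothetical; this is a sketch of the general technique, not the query any particular backend runs.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return None  # no observations
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls beyond the last finite bound.
                return prev_bound
            # Interpolate linearly within the matching bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

The estimate is only as precise as the bucket boundaries, so a wide bucket gives a coarse result.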
JVM
Each JVM metric has the jvm label which refers to the main class or path to the .jar file.
Meta information about the JVM
hsperfdata
jvm, java_version
Total heap size in bytes
hsperfdata
jvm
Used heap size in bytes
hsperfdata
jvm
Time spent in the given JVM garbage collector in seconds
hsperfdata
jvm, gc
Time the application has been stopped for safepoint operations in seconds
hsperfdata
jvm
Time spent getting to safepoints in seconds
hsperfdata
jvm
.NET runtime
Each .NET runtime metric has the application label, which allows distinguishing multiple
applications in the same container.
Meta information about the Common Language Runtime (CLR)
.NET diagnostic port
application, runtime_version
The number of bytes allocated
.NET diagnostic port
application
The number of exceptions that have occurred
.NET diagnostic port
application
Total size of the heap generation in bytes
.NET diagnostic port
application, generation
The number of times GC has occurred for the generation
.NET diagnostic port
application, generation
The heap fragmentation
.NET diagnostic port
application
The number of times there was contention when trying to take the monitor's lock
.NET diagnostic port
application
The number of work items that have been processed in the ThreadPool
.NET diagnostic port
application
The number of work items that are currently queued to be processed in the ThreadPool
.NET diagnostic port
application
The number of thread pool threads that currently exist in the ThreadPool
.NET diagnostic port
application
Other
Meta information about the container.
dockerd / containerd
image
Number of times the container has been restarted.
eBPF: tracepoint/task/task_newtask and tracepoint/sched/sched_process_exit
Type of application running in the container (e.g., memcached, postgres, mysql).
/proc/<pid>/cmdline of the processes running in the container.
application_type
Logs
Number of messages grouped by the automatically extracted repeated patterns.
The container's log. The following logging methods are supported:
- stdout/stderr streams are captured by dockerd (json file driver)
- stdout/stderr streams are captured by containerd (CRI)
- A container sends its log to journald
- A container logs to a file in the /var/log directory
source - journald, stdout/stderr, or path to the file in the /var/log directory.
level - <unknown | debug | info | warning | error | critical>
pattern_hash - the ID of the automatically extracted repeated pattern
sample - a sample message of the group
Node metrics
Amount of CPU time spent in each mode.
/proc/stat
mode - <user | nice | system | idle | iowait | irq | softirq | steal>
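As an illustration of where the per-mode values come from, the aggregate `cpu` line of /proc/stat lists cumulative ticks in the same order as the modes above. The sketch below converts them to seconds, assuming a tick rate (USER_HZ) of 100, which is typical but not guaranteed. The names `MODES` and `cpu_seconds_by_mode` are hypothetical.

```python
MODES = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")

def cpu_seconds_by_mode(proc_stat: str, user_hz: int = 100) -> dict:
    """Parse the aggregate 'cpu' line of /proc/stat into seconds per mode."""
    for line in proc_stat.splitlines():
        fields = line.split()
        # The aggregate line is labelled exactly "cpu"; per-core lines are cpu0, cpu1, ...
        if fields and fields[0] == "cpu":
            ticks = [int(v) for v in fields[1:1 + len(MODES)]]
            return {mode: t / user_hz for mode, t in zip(MODES, ticks)}
    return {}
```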
Number of logical CPU cores.
/proc/stat
Total amount of physical memory.
/proc/meminfo
Amount of unassigned memory.
/proc/meminfo
An estimate of how much memory is available for allocations, without swapping.
Roughly speaking, this is the sum of the free memory and a part of the page cache that can be reclaimed.
You can learn more about how this estimate is calculated here.
/proc/meminfo
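The memory metrics above all come from /proc/meminfo, which reports most values in kB. A minimal parsing sketch follows; the function name `meminfo_bytes` is hypothetical, and treating the `kB` suffix as 1024 bytes is how the kernel formats this file.

```python
def meminfo_bytes(meminfo_text: str) -> dict:
    """Parse /proc/meminfo contents into a dict of field name -> bytes."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if not fields:
            continue
        value = int(fields[0])
        # Most fields carry a "kB" unit; counters such as HugePages_Total do not.
        if len(fields) > 1 and fields[1] == "kB":
            value *= 1024
        values[key] = value
    return values
```

With the parsed values, `MemAvailable` can be compared against `MemFree` to see how much of the estimate comes from reclaimable memory.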
Amount of memory used as page cache.
The memory used for page cache might be reclaimed on memory pressure. This can increase the number of disk reads.
/proc/meminfo
Total number of reads or writes completed successfully.
Every disk has a maximum number of IOPS it can serve. Below are reference values for different storage types:
Type | Max IOPS |
Amazon EBS sc1 | 250 |
Amazon EBS st1 | 500 |
Amazon EBS gp2/gp3 | 16,000 |
Amazon EBS io1/io2 | 64,000 |
Amazon EBS io2 Block Express | 256,000 |
HDD | 200 |
SATA SSD | 100,000 |
NVMe SSD | 10,000,000 |
/proc/diskstats
device
Total number of bytes read from the disk and written to the disk, respectively.
In addition to the maximum number of IOPS a disk can serve, there is a throughput limit. For example,
Type | Max throughput |
Amazon EBS sc1 | 250 MB/s |
Amazon EBS st1 | 500 MB/s |
Amazon EBS gp2 | 250 MB/s |
Amazon EBS gp3 | 1,000 MB/s |
Amazon EBS io1/io2 | 1,000 MB/s |
Amazon EBS io2 Block Express | 4,000 MB/s |
SATA | 600 MB/s |
SAS | 1,200 MB/s |
NVMe | 4,000 MB/s |
/proc/diskstats
device
Total number of seconds spent reading and writing, respectively, including queue wait.
To get the average I/O latency, divide the sum of these two by the number of I/O requests completed over the same interval.
Below is the reference average I/O latency for different storage types:
Type | Avg latency |
Amazon EBS gp2/gp3/io1/io2 | "single-digit millisecond" |
Amazon EBS io2 Block Express | "sub-millisecond" |
HDD | 2-4ms |
NVMe SSD | 0.1-0.3ms |
/proc/diskstats
device
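The average latency calculation described above can be sketched as follows. The inputs are the increases of the four counters (read time, write time, completed reads, completed writes) over the same interval; the function and parameter names are hypothetical.

```python
from typing import Optional

def avg_io_latency_seconds(read_time_delta: float, write_time_delta: float,
                           reads_delta: float, writes_delta: float) -> Optional[float]:
    """Average latency per I/O request over an interval, or None if the disk was idle."""
    ops = reads_delta + writes_delta
    if ops == 0:
        return None  # no completed requests, latency is undefined
    return (read_time_delta + write_time_delta) / ops
```

For example, 2 seconds of total I/O time spread over 400 completed requests gives an average of 5 ms, which can then be compared with the reference values in the table.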
Total number of seconds the disk spent doing I/O. It includes only service time, not queue wait.
E.g., if this metric increases by 60s over a one-minute interval, the disk was busy 100% of that interval.
/proc/diskstats
device
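The utilization reading described above is the increase of the I/O time counter divided by the length of the interval. A minimal sketch, with the hypothetical name `disk_utilization`:

```python
def disk_utilization(io_time_start: float, io_time_end: float,
                     interval_seconds: float) -> float:
    """Fraction of the interval during which the disk was busy (0.0 to 1.0)."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (io_time_end - io_time_start) / interval_seconds
```

A counter that grows from 100s to 160s over 60 seconds means the disk was busy for the whole interval, while growth to 130s means it was busy half of the time.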
Total number of bytes and packets received.
interface
Total number of bytes and packets transmitted.
interface
Status of the interface (0:down, 1:up).
interface
IP address assigned to the interface.
interface, ip
Uptime of the node in seconds.
Meta information about the node.
hostname, kernel_version, agent_version
Meta information about the cloud instance.
The agent detects the cloud provider using sysfs.
Then it uses cloud-specific metadata services to retrieve additional information about the instance (AWS, GCP, Azure).
For unsupported providers, you can use the --provider, --region, and --availability-zone
command line arguments of the agent to define the labels manually.
provider - < aws | gcp | azure >
instance_id
instance_type
instance_life_cycle - < on-demand | spot | preemptible > (always empty for Azure instances)
region
availability_zone
availability_zone_id - ID of the availability zone (AWS only)
local_ipv4
public_ipv4