<!--[metadata]>
+++
aliases = ["/engine/articles/run_metrics"]
title = "Runtime metrics"
description = "Measure the behavior of running containers"
keywords = ["docker, metrics, CPU, memory, disk, IO, run, runtime, stats"]
[menu.main]
parent = "engine_admin"
weight = 14
+++
<![end-metadata]-->

# Runtime metrics


## Docker stats

You can use the `docker stats` command to live stream a container's
runtime metrics. The command supports CPU, memory usage, memory limit,
network I/O, and block I/O metrics.

The following is a sample output from the `docker stats` command:

```bash
$ docker stats redis1 redis2

CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O
redis1              0.07%               796 KB / 64 MB        1.21%               788 B / 648 B       3.568 MB / 512 KB
redis2              0.07%               2.746 MB / 64 MB      4.29%               1.266 KB / 648 B    12.4 MB / 0 B
```

The [docker stats](../reference/commandline/stats.md) reference page has
more details about the `docker stats` command.

## Control groups

Linux Containers rely on [control groups](
https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt)
which not only track groups of processes, but also expose metrics about
CPU, memory, and block I/O usage. You can access those metrics and
obtain network usage metrics as well. This is relevant for "pure" LXC
containers, as well as for Docker containers.

Control groups are exposed through a pseudo-filesystem. In recent
distros, you should find this filesystem under `/sys/fs/cgroup`. Under
that directory, you will see multiple sub-directories, called `devices`,
`freezer`, `blkio`, etc.; each sub-directory actually corresponds to a different
cgroup hierarchy.

On older systems, the control groups might be mounted on `/cgroup`, without
distinct hierarchies.
In that case, instead of seeing the sub-directories,
you will see a bunch of files in that directory, and possibly some directories
corresponding to existing containers.

To figure out where your control groups are mounted, you can run:

```bash
$ grep cgroup /proc/mounts
```

## Enumerating cgroups

You can look into `/proc/cgroups` to see the different control group subsystems
known to the system, the hierarchy they belong to, and how many groups they contain.

You can also look at `/proc/<pid>/cgroup` to see which control groups a process
belongs to. The control group will be shown as a path relative to the root of
the hierarchy mountpoint; e.g., `/` means “this process has not been assigned to
a particular group”, while `/lxc/pumpkin` means that the process is likely to be
a member of a container named `pumpkin`.

## Finding the cgroup for a given container

For each container, one cgroup will be created in each hierarchy. On
older systems with older versions of the LXC userland tools, the name of
the cgroup will be the name of the container. With more recent versions
of the LXC tools, the cgroup will be `lxc/<container_name>`.

For Docker containers using cgroups, the cgroup is named after the full
ID (or long ID) of the container. If a container shows up as `ae836c95b4c3`
in `docker ps`, its long ID might be something like
`ae836c95b4c3c9e9179e0e91015512da89fdec91612f63cebae57df9a5444c79`. You can
look it up with `docker inspect` or `docker ps --no-trunc`.

Putting everything together to look at the memory metrics for a Docker
container, take a look at `/sys/fs/cgroup/memory/docker/<longid>/`.

## Metrics from cgroups: memory, CPU, block I/O

For each subsystem (memory, CPU, and block I/O), you will find one or
more pseudo-files containing statistics.

### Memory metrics: `memory.stat`

Memory metrics are found in the "memory" cgroup.
Note that the memory
control group adds a little overhead, because it does very fine-grained
accounting of the memory usage on your host. Therefore, many distros
chose to not enable it by default. Generally, to enable it, all you have
to do is to add some kernel command-line parameters:
`cgroup_enable=memory swapaccount=1`.

The metrics are in the pseudo-file `memory.stat`.
Here is what it will look like:

    cache 11492564992
    rss 1930993664
    mapped_file 306728960
    pgpgin 406632648
    pgpgout 403355412
    swap 0
    pgfault 728281223
    pgmajfault 1724
    inactive_anon 46608384
    active_anon 1884520448
    inactive_file 7003344896
    active_file 4489052160
    unevictable 32768
    hierarchical_memory_limit 9223372036854775807
    hierarchical_memsw_limit 9223372036854775807
    total_cache 11492564992
    total_rss 1930993664
    total_mapped_file 306728960
    total_pgpgin 406632648
    total_pgpgout 403355412
    total_swap 0
    total_pgfault 728281223
    total_pgmajfault 1724
    total_inactive_anon 46608384
    total_active_anon 1884520448
    total_inactive_file 7003344896
    total_active_file 4489052160
    total_unevictable 32768

The first half (without the `total_` prefix) contains statistics relevant
to the processes within the cgroup, excluding sub-cgroups. The second half
(with the `total_` prefix) includes sub-cgroups as well.

Some metrics are "gauges", i.e., values that can increase or decrease
(e.g., swap, the amount of swap space used by the members of the cgroup).
Some others are "counters", i.e., values that can only go up, because
they represent occurrences of a specific event (e.g., pgfault, which
indicates the number of page faults which happened since the creation of
the cgroup; this number can never decrease).
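
Because `memory.stat` holds one `name value` pair per line, as in the sample above, a single metric can be extracted with a few lines of shell. The following is only a minimal sketch: the helper names `memstat_get` and `memstat_mib` are hypothetical, and the conversion to MiB assumes the metric is a byte count (true for `cache`, `rss`, or `swap`, but not for event counters such as `pgfault`).

```bash
# memstat_get FILE KEY: print the value of one metric from a memory.stat file.
# Fails (non-zero exit status) if KEY is not present in FILE.
memstat_get() {
    awk -v key="$2" '$1 == key { print $2; found = 1 } END { exit !found }' "$1"
}

# memstat_mib FILE KEY: print the same value converted from bytes to whole MiB.
# Only meaningful for metrics expressed in bytes (an assumption of this sketch).
memstat_mib() {
    bytes=$(memstat_get "$1" "$2") || return 1
    echo $(( bytes / 1048576 ))
}
```

For example, `memstat_mib /sys/fs/cgroup/memory/docker/<longid>/memory.stat rss` would print the anonymous memory of that container in MiB.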

<style>table tr > td:first-child { white-space: nowrap;}</style>

Metric                                | Description
--------------------------------------|-----------------------------------------------------------
**cache**                             | The amount of memory used by the processes of this control group that can be associated precisely with a block on a block device. When you read from and write to files on disk, this amount will increase. This will be the case if you use "conventional" I/O (`open`, `read`, `write` syscalls) as well as mapped files (with `mmap`). It also accounts for the memory used by `tmpfs` mounts, though the reasons are unclear.
**rss**                               | The amount of memory that *doesn't* correspond to anything on disk: stacks, heaps, and anonymous memory maps.
**mapped_file**                       | Indicates the amount of memory mapped by the processes in the control group. It doesn't give you information about *how much* memory is used; it rather tells you *how* it is used.
**pgfault**, **pgmajfault**           | Indicate the number of times that a process of the cgroup triggered a "page fault" and a "major fault", respectively. A page fault happens when a process accesses a part of its virtual memory space which is nonexistent or protected. The former can happen if the process is buggy and tries to access an invalid address (it will then be sent a `SIGSEGV` signal, typically killing it with the famous `Segmentation fault` message). The latter can happen when the process reads from a memory zone which has been swapped out, or which corresponds to a mapped file: in that case, the kernel will load the page from disk, and let the CPU complete the memory access. It can also happen when the process writes to a copy-on-write memory zone: likewise, the kernel will preempt the process, duplicate the memory page, and resume the write operation on the process's own copy of the page. "Major" faults happen when the kernel actually has to read the data from disk. When it just has to duplicate an existing page, or allocate an empty page, it's a regular (or "minor") fault.
**swap**                              | The amount of swap currently used by the processes in this cgroup.
**active_anon**, **inactive_anon**    | The amount of *anonymous* memory that has been identified as *active* and *inactive*, respectively, by the kernel. "Anonymous" memory is the memory that is *not* linked to disk pages. In other words, that's the equivalent of the rss counter described above. In fact, the very definition of the rss counter is **active_anon** + **inactive_anon** - **tmpfs** (where tmpfs is the amount of memory used up by `tmpfs` filesystems mounted by this control group). Now, what's the difference between "active" and "inactive"? Pages are initially "active"; and at regular intervals, the kernel sweeps over the memory, and tags some pages as "inactive". Whenever they are accessed again, they are immediately retagged "active". When the kernel is almost out of memory, and time comes to swap out to disk, the kernel will swap "inactive" pages.
**active_file**, **inactive_file**    | Cache memory, with *active* and *inactive* similar to the *anon* memory above. The exact formula is **cache** = **active_file** + **inactive_file** + **tmpfs**. The exact rules used by the kernel to move memory pages between active and inactive sets are different from the ones used for anonymous memory, but the general principle is the same. Note that when the kernel needs to reclaim memory, it is cheaper to reclaim a clean (i.e., unmodified) page from this pool, since it can be reclaimed immediately (while anonymous pages and dirty/modified pages have to be written to disk first).
**unevictable**                       | The amount of memory that cannot be reclaimed; generally, it will account for memory that has been "locked" with `mlock`. It is often used by crypto frameworks to make sure that secret keys and other sensitive material never get swapped out to disk.
**memory_limit**, **memsw_limit**     | These are not really metrics, but a reminder of the limits applied to this cgroup (they appear in `memory.stat` as `hierarchical_memory_limit` and `hierarchical_memsw_limit`). The first one indicates the maximum amount of physical memory that can be used by the processes of this control group; the second one indicates the maximum amount of RAM+swap.

Accounting for memory in the page cache is very complex. If two
processes in different control groups both read the same file
(ultimately relying on the same blocks on disk), the corresponding
memory charge will be split between the control groups. It's nice, but
it also means that when a cgroup is terminated, it could increase the
memory usage of another cgroup, because they are not splitting the cost
anymore for those memory pages.

### CPU metrics: `cpuacct.stat`

Now that we've covered memory metrics, everything else will look very
simple in comparison. CPU metrics will be found in the
`cpuacct` controller.

For each container, you will find a pseudo-file `cpuacct.stat`,
containing the CPU usage accumulated by the processes of the container,
broken down between `user` and `system` time. If you're not familiar
with the distinction, `user` is the time during which the processes were
in direct control of the CPU (i.e., executing process code), and `system`
is the time during which the CPU was executing system calls on behalf of
those processes.

Those times are expressed in ticks of 1/100th of a second. Actually,
they are expressed in "user jiffies". There are `USER_HZ`
*"jiffies"* per second, and on x86 systems,
`USER_HZ` is 100. This used to map exactly to the
number of scheduler "ticks" per second; but with the advent of higher
frequency scheduling, as well as [tickless kernels](
http://lwn.net/Articles/549580/), the number of kernel ticks
wasn't relevant anymore. It stuck around anyway, mainly for legacy and
compatibility reasons.
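
To turn the jiffies described above into seconds, divide each counter by `USER_HZ`. The sketch below assumes that `cpuacct.stat` contains one `user <jiffies>` line and one `system <jiffies>` line, and that `getconf CLK_TCK` reports the `USER_HZ` value on your system; the helper name `cpuacct_seconds` is hypothetical.

```bash
# cpuacct_seconds FILE [HZ]: print "user" and "system" CPU time in seconds.
# HZ is the number of jiffies per second. When omitted, it is read with
# `getconf CLK_TCK`, which this sketch assumes to be equal to USER_HZ.
cpuacct_seconds() {
    hz=${2:-$(getconf CLK_TCK)}
    awk -v hz="$hz" '{ printf "%s %.2f\n", $1, $2 / hz }' "$1"
}
```

Since both values are counters, a CPU usage rate can be derived by sampling the file twice and dividing the difference by the elapsed time.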

### Block I/O metrics

Block I/O is accounted in the `blkio` controller.
Different metrics are scattered across different files. While you can
find in-depth details in the [blkio-controller](
https://www.kernel.org/doc/Documentation/cgroup-v1/blkio-controller.txt)
file in the kernel documentation, here is a short list of the most
relevant ones:


Metric                      | Description
----------------------------|-----------------------------------------------------------
**blkio.sectors**           | Contains the number of 512-byte sectors read and written by the processes that are members of the cgroup, device by device. Reads and writes are merged in a single counter.
**blkio.io_service_bytes**  | Indicates the number of bytes read and written by the cgroup. It has 4 counters per device, because for each device, it differentiates between synchronous vs. asynchronous I/O, and reads vs. writes.
**blkio.io_serviced**       | The number of I/O operations performed, regardless of their size. It also has 4 counters per device.
**blkio.io_queued**         | Indicates the number of I/O operations currently queued for this cgroup. In other words, if the cgroup isn't doing any I/O, this will be zero. Note that the opposite is not true. In other words, if there is no I/O queued, it does not mean that the cgroup is idle (I/O-wise). It could be doing purely synchronous reads on an otherwise quiescent device, which is therefore able to handle them immediately, without queuing. Also, while it is helpful to figure out which cgroup is putting stress on the I/O subsystem, keep in mind that it is a relative quantity. Even if a process group does not perform more I/O, its queue size can increase simply because the overall load on the device increases.

## Network metrics

Network metrics are not exposed directly by control groups. There is a
good explanation for that: network interfaces exist within the context
of *network namespaces*.
The kernel could probably accumulate metrics
about packets and bytes sent and received by a group of processes, but
those metrics wouldn't be very useful. You want per-interface metrics
(because traffic happening on the local `lo`
interface doesn't really count). But since processes in a single cgroup
can belong to multiple network namespaces, those metrics would be harder
to interpret: multiple network namespaces means multiple `lo`
interfaces, potentially multiple `eth0`
interfaces, etc.; so this is why there is no easy way to gather network
metrics with control groups.

Instead we can gather network metrics from other sources:

### IPtables

IPtables (or rather, the netfilter framework for which iptables is just
an interface) can do some serious accounting.

For instance, you can set up a rule to account for the outbound HTTP
traffic on a web server:

```bash
$ iptables -I OUTPUT -p tcp --sport 80
```

There is no `-j` or `-g` flag,
so the rule will just count matched packets and go to the following
rule.

Later, you can check the values of the counters, with:

```bash
$ iptables -nxvL OUTPUT
```

Technically, `-n` is not required, but it will
prevent iptables from doing DNS reverse lookups, which are probably
useless in this scenario.

Counters include packets and bytes. If you want to set up metrics for
container traffic like this, you could execute a `for`
loop to add two `iptables` rules per
container IP address (one in each direction), in the `FORWARD`
chain. This will only meter traffic going through the NAT
layer; you will also have to add traffic going through the userland
proxy.

Then, you will need to check those counters on a regular basis.
If you
happen to use `collectd`, there is a [nice plugin](https://collectd.org/wiki/index.php/Table_of_Plugins)
to automate iptables counters collection.

### Interface-level counters

Since each container has a virtual Ethernet interface, you might want to
directly check the TX and RX counters of this interface. You will notice
that each container is associated with a virtual Ethernet interface in
your host, with a name like `vethKk8Zqi`. Figuring
out which interface corresponds to which container is, unfortunately,
difficult.

But for now, the best way is to check the metrics *from within the
containers*. To accomplish this, you can run an executable from the host
environment within the network namespace of a container using **ip-netns
magic**.

The `ip netns exec` command will let you execute any
program (present in the host system) within any network namespace
visible to the current process. This means that your host will be able
to enter the network namespace of your containers, but your containers
won't be able to access the host, nor their sibling containers.
Containers will be able to “see” and affect their sub-containers,
though.

The exact format of the command is:

```bash
$ ip netns exec <nsname> <command...>
```

For example:

```bash
$ ip netns exec mycontainer netstat -i
```

`ip netns` finds the "mycontainer" container by
using namespaces pseudo-files. Each process belongs to one network
namespace, one PID namespace, one `mnt` namespace,
etc., and those namespaces are materialized under
`/proc/<pid>/ns/`. For example, the network
namespace of PID 42 is materialized by the pseudo-file
`/proc/42/ns/net`.

When you run `ip netns exec mycontainer ...`, it
expects `/var/run/netns/mycontainer` to be one of
those pseudo-files. (Symlinks are accepted.)
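
To see how these pseudo-files identify a namespace, you can compare them for two processes. The sketch below assumes a kernel where `/proc/<pid>/ns/net` is a symbolic link whose target looks like `net:[4026531956]`, so that two processes share a network namespace exactly when their link targets are identical. The helper names `netns_id` and `same_netns`, and the optional `PROCROOT` argument, are hypothetical and only there for illustration.

```bash
# netns_id PID [PROCROOT]: print the network namespace identifier of a process,
# i.e. the target of its ns/net link. PROCROOT defaults to /proc.
netns_id() {
    readlink "${2:-/proc}/$1/ns/net"
}

# same_netns PID1 PID2 [PROCROOT]: succeed only if both processes exist
# and their ns/net links point to the same namespace.
same_netns() {
    a=$(netns_id "$1" "$3") || return 1
    b=$(netns_id "$2" "$3") || return 1
    [ "$a" = "$b" ]
}
```

For example, `same_netns 1 42` would tell you whether PID 42 still lives in the same network namespace as PID 1 on the host.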

In other words, to execute a command within the network namespace of a
container, we need to:

- Find out the PID of any process within the container that we want to investigate;
- Create a symlink from `/var/run/netns/<somename>` to `/proc/<thepid>/ns/net`;
- Execute `ip netns exec <somename> ...`.

Please review [Enumerating cgroups](#enumerating-cgroups) to learn how to find
the cgroup of a process running in the container of which you want to
measure network usage. From there, you can examine the pseudo-file named
`tasks`, which contains the PIDs that are in the
control group (i.e., in the container). Pick any one of them.

Putting everything together, if the "short ID" of a container is held in
the environment variable `$CID`, then you can do this:

```bash
$ TASKS=/sys/fs/cgroup/devices/docker/$CID*/tasks
$ PID=$(head -n 1 $TASKS)
$ mkdir -p /var/run/netns
$ ln -sf /proc/$PID/ns/net /var/run/netns/$CID
$ ip netns exec $CID netstat -i
```

## Tips for high-performance metric collection

Note that running a new process each time you want to update metrics is
(relatively) expensive. If you want to collect metrics at high
resolutions, and/or over a large number of containers (think 1000
containers on a single host), you do not want to fork a new process each
time.

Here is how to collect metrics from a single process. You will have to
write your metric collector in C (or any language that lets you do
low-level system calls). You need to use a special system call,
`setns()`, which lets the current process enter any
arbitrary namespace. It requires, however, an open file descriptor to
the namespace pseudo-file (remember: that's the pseudo-file in
`/proc/<pid>/ns/net`).

However, there is a catch: you must not keep this file descriptor open.
If you do, when the last process of the control group exits, the
namespace will not be destroyed, and its network resources (like the
virtual interface of the container) will stay around forever (or until
you close that file descriptor).

The right approach would be to keep track of the first PID of each
container, and re-open the namespace pseudo-file each time.

## Collecting metrics when a container exits

Sometimes, you do not care about real-time metric collection, but when a
container exits, you want to know how much CPU, memory, etc. it has
used.

Docker makes this difficult because it relies on `lxc-start`, which
carefully cleans up after itself, but it is still possible. It is
usually easier to collect metrics at regular intervals (e.g., every
minute, with the collectd LXC plugin) and rely on that instead.

But, if you'd still like to gather the stats when a container stops,
here is how:

For each container, start a collection process, and move it to the
control groups that you want to monitor by writing its PID to the tasks
file of the cgroup. The collection process should periodically re-read
the tasks file to check if it's the last process of the control group.
(If you also want to collect network statistics as explained in the
previous section, you should also move the process to the appropriate
network namespace.)

When the container exits, `lxc-start` will try to
delete the control groups. It will fail, since the control group is
still in use; but that's fine. Your process should now detect that it is
the only one remaining in the group. Now is the right time to collect
all the metrics you need!

Finally, your process should move itself back to the root control group,
and remove the container control group. To remove a control group, just
`rmdir` its directory.
It's counter-intuitive to
`rmdir` a directory that still contains files; but
remember that this is a pseudo-filesystem, so usual rules don't apply.
After the cleanup is done, the collection process can exit safely.
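
The "am I the last process in the control group?" check described above can be sketched in shell as follows. This is an illustration under assumptions, not a complete collector: the helper names `is_last_task` and `wait_until_last` are hypothetical, and the `tasks` file is assumed to list one PID per line. The check deliberately uses only shell builtins, because any child process started by the collector (for example `cat` or `wc`) would inherit the control group and show up in the `tasks` file itself.

```bash
# is_last_task TASKS_FILE PID: succeed if PID is the only entry in the tasks file.
# Uses only builtins (read, test) so that no child process joins the cgroup
# while the file is being read.
is_last_task() {
    count=0
    while read -r task; do
        [ "$task" = "$2" ] || return 1
        count=$((count + 1))
    done < "$1"
    [ "$count" -eq 1 ]
}

# wait_until_last TASKS_FILE PID [INTERVAL]: poll the tasks file every INTERVAL
# seconds (default: 1) until PID is the only remaining member of the cgroup.
wait_until_last() {
    until is_last_task "$1" "$2"; do
        sleep "${3:-1}"
    done
}
```

A collection process that has already written its own PID (`$$`) to the `tasks` file could call `wait_until_last <path-to-tasks-file> $$`, then read the metrics it needs before moving itself back to the root control group.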