<!--[metadata]>
+++
title = "Runtime metrics"
description = "Measure the behavior of running containers"
keywords = ["docker, metrics, CPU, memory, disk, IO, run, runtime, stats"]
[menu.main]
parent = "smn_administrate"
weight = 4
+++
<![end-metadata]-->

# Runtime metrics

## Docker stats

You can use the `docker stats` command to live stream a container's
runtime metrics. The command supports CPU, memory usage, memory limit,
and network IO metrics.

The following is a sample output from the `docker stats` command:

    $ docker stats redis1 redis2
    CONTAINER  CPU %   MEM USAGE/LIMIT  MEM %   NET I/O
    redis1     0.07%   796 KB/64 MB     1.21%   788 B/648 B
    redis2     0.07%   2.746 MB/64 MB   4.29%   1.266 KB/648 B

The [docker stats](/reference/commandline/stats/) reference page has
more details about the `docker stats` command.

## Control groups

Linux Containers rely on [control groups](
https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt)
which not only track groups of processes, but also expose metrics about
CPU, memory, and block I/O usage. You can access those metrics and
obtain network usage metrics as well. This is relevant for "pure" LXC
containers, as well as for Docker containers.

Control groups are exposed through a pseudo-filesystem. In recent
distros, you should find this filesystem under `/sys/fs/cgroup`. Under
that directory, you will see multiple sub-directories, called `devices`,
`freezer`, `blkio`, etc.; each sub-directory actually corresponds to a
different cgroup hierarchy.

On older systems, the control groups might be mounted on `/cgroup`, without
distinct hierarchies. In that case, instead of seeing the sub-directories,
you will see a bunch of files in that directory, and possibly some directories
corresponding to existing containers.

To figure out where your control groups are mounted, you can run:

    $ grep cgroup /proc/mounts

## Enumerating cgroups

You can look into `/proc/cgroups` to see the different control group subsystems
known to the system, the hierarchy they belong to, and how many groups they contain.

You can also look at `/proc/<pid>/cgroup` to see which control groups a process
belongs to. The control group will be shown as a path relative to the root of
the hierarchy mountpoint; e.g., `/` means "this process has not been assigned to
a particular group", while `/lxc/pumpkin` means that the process is likely to be
a member of a container named `pumpkin`.

## Finding the cgroup for a given container

For each container, one cgroup will be created in each hierarchy. On
older systems with older versions of the LXC userland tools, the name of
the cgroup will be the name of the container. With more recent versions
of the LXC tools, the cgroup will be `lxc/<container_name>`.

For Docker containers using cgroups, the container name will be the full
ID or long ID of the container. If a container shows up as `ae836c95b4c3`
in `docker ps`, its long ID might be something like
`ae836c95b4c3c9e9179e0e91015512da89fdec91612f63cebae57df9a5444c79`. You can
look it up with `docker inspect` or `docker ps --no-trunc`.

Putting everything together, to look at the memory metrics for a Docker
container, take a look at `/sys/fs/cgroup/memory/lxc/<longid>/`.
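For example, here is a quick way to read the current memory usage of a
running container straight from its cgroup. This is a minimal sketch:
`mycontainer` is just a placeholder name, and the path assumes the LXC
driver layout shown above; with the native driver the cgroup typically
lives under `/sys/fs/cgroup/memory/docker/<longid>/` instead.

    $ CID=$(docker inspect --format '{{.Id}}' mycontainer)
    $ cat /sys/fs/cgroup/memory/lxc/$CID/memory.usage_in_bytes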
## Metrics from cgroups: memory, CPU, block I/O

For each subsystem (memory, CPU, and block I/O), you will find one or
more pseudo-files containing statistics.

### Memory metrics: `memory.stat`

Memory metrics are found in the "memory" cgroup. Note that the memory
control group adds a little overhead, because it does very fine-grained
accounting of the memory usage on your host. Therefore, many distros
choose not to enable it by default. Generally, to enable it, all you have
to do is to add some kernel command-line parameters:
`cgroup_enable=memory swapaccount=1`.

The metrics are in the pseudo-file `memory.stat`.
Here is what it will look like:

    cache 11492564992
    rss 1930993664
    mapped_file 306728960
    pgpgin 406632648
    pgpgout 403355412
    swap 0
    pgfault 728281223
    pgmajfault 1724
    inactive_anon 46608384
    active_anon 1884520448
    inactive_file 7003344896
    active_file 4489052160
    unevictable 32768
    hierarchical_memory_limit 9223372036854775807
    hierarchical_memsw_limit 9223372036854775807
    total_cache 11492564992
    total_rss 1930993664
    total_mapped_file 306728960
    total_pgpgin 406632648
    total_pgpgout 403355412
    total_swap 0
    total_pgfault 728281223
    total_pgmajfault 1724
    total_inactive_anon 46608384
    total_active_anon 1884520448
    total_inactive_file 7003344896
    total_active_file 4489052160
    total_unevictable 32768

The first half (without the `total_` prefix) contains statistics relevant
to the processes within the cgroup, excluding sub-cgroups. The second half
(with the `total_` prefix) includes sub-cgroups as well.

Some metrics are "gauges", i.e., values that can increase or decrease
(e.g., `swap`, the amount of swap space used by the members of the cgroup).
Some others are "counters", i.e., values that can only go up, because
they represent occurrences of a specific event (e.g., `pgfault`, which
indicates the number of page faults which happened since the creation of
the cgroup; this number can never decrease).

- **cache:**
  the amount of memory used by the processes of this control group
  that can be associated precisely with a block on a block device.
  When you read from and write to files on disk, this amount will
  increase. This will be the case if you use "conventional" I/O
  (`open`, `read`, `write` syscalls) as well as mapped files (with
  `mmap`). It also accounts for the memory used by `tmpfs` mounts,
  though the reasons are unclear.

- **rss:**
  the amount of memory that *doesn't* correspond to anything on disk:
  stacks, heaps, and anonymous memory maps.

- **mapped_file:**
  indicates the amount of memory mapped by the processes in the
  control group. It doesn't give you information about *how much*
  memory is used; it rather tells you *how* it is used.

- **pgfault and pgmajfault:**
  indicate the number of times that a process of the cgroup triggered
  a "page fault" and a "major fault", respectively. A page fault
  happens when a process accesses a part of its virtual memory space
  which is nonexistent or protected. The former can happen if the
  process is buggy and tries to access an invalid address (it will
  then be sent a `SIGSEGV` signal, typically killing it with the
  famous `Segmentation fault` message).
  The latter can happen when the process reads from a memory
  zone which has been swapped out, or which corresponds to a mapped
  file: in that case, the kernel will load the page from disk, and let
  the CPU complete the memory access. It can also happen when the
  process writes to a copy-on-write memory zone: likewise, the kernel
  will preempt the process, duplicate the memory page, and resume the
  write operation on the process's own copy of the page. "Major" faults
  happen when the kernel actually has to read the data from disk. When
  it just has to duplicate an existing page, or allocate an empty
  page, it's a regular (or "minor") fault.

- **swap:**
  the amount of swap currently used by the processes in this cgroup.

- **active_anon and inactive_anon:**
  the amount of *anonymous* memory that has been identified as
  respectively *active* and *inactive* by the kernel. "Anonymous"
  memory is the memory that is *not* linked to disk pages. In other
  words, that's the equivalent of the rss counter described above. In
  fact, the very definition of the rss counter is **active_anon** +
  **inactive_anon** - **tmpfs** (where tmpfs is the amount of memory
  used up by `tmpfs` filesystems mounted by this control group).
  Now, what's the difference between "active" and "inactive"?
  Pages are initially "active"; and at regular intervals,
  the kernel sweeps over the memory, and tags some pages as
  "inactive". Whenever they are accessed again, they are immediately
  retagged "active". When the kernel is almost out of memory, and time
  comes to swap out to disk, the kernel will swap "inactive" pages.

- **active_file and inactive_file:**
  cache memory, with *active* and *inactive* similar to the *anon*
  memory above. The exact formula is cache = **active_file** +
  **inactive_file** + **tmpfs**. The exact rules used by the kernel
  to move memory pages between active and inactive sets are different
  from the ones used for anonymous memory, but the general principle
  is the same. Note that when the kernel needs to reclaim memory, it
  is cheaper to reclaim a clean (i.e., non-modified) page from this
  pool, since it can be reclaimed immediately (while anonymous pages
  and dirty/modified pages have to be written to disk first).

- **unevictable:**
  the amount of memory that cannot be reclaimed; generally, it will
  account for memory that has been "locked" with `mlock`.
  It is often used by crypto frameworks to make sure that
  secret keys and other sensitive material never get swapped out to
  disk.

- **memory and memsw limits:**
  These are not really metrics, but a reminder of the limits applied
  to this cgroup. The first one indicates the maximum amount of
  physical memory that can be used by the processes of this control
  group; the second one indicates the maximum amount of RAM+swap.

Accounting for memory in the page cache is very complex. If two
processes in different control groups both read the same file
(ultimately relying on the same blocks on disk), the corresponding
memory charge will be split between the control groups. It's nice, but
it also means that when a cgroup is terminated, it could increase the
memory usage of another cgroup, because they are not splitting the cost
anymore for those memory pages.
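If you only care about one or two of these values, you don't have to
parse the whole file. For example (a minimal sketch, reusing the
`<longid>` placeholder and the sample numbers from above), you can
extract just the `cache` and `rss` counters with `awk`:

    $ awk '$1 == "cache" || $1 == "rss" { print $1, $2 }' /sys/fs/cgroup/memory/lxc/<longid>/memory.stat
    cache 11492564992
    rss 1930993664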
### CPU metrics: `cpuacct.stat`

Now that we've covered memory metrics, everything else will look very
simple in comparison. CPU metrics will be found in the
`cpuacct` controller.

For each container, you will find a pseudo-file `cpuacct.stat`,
containing the CPU usage accumulated by the processes of the container,
broken down between `user` and `system` time. If you're not familiar
with the distinction, `user` is the time during which the processes were
in direct control of the CPU (i.e., executing process code), and `system`
is the time during which the CPU was executing system calls on behalf of
those processes.

Those times are expressed in ticks of 1/100th of a second. Actually,
they are expressed in "user jiffies". There are `USER_HZ`
*"jiffies"* per second, and on x86 systems,
`USER_HZ` is 100. This used to map exactly to the
number of scheduler "ticks" per second; but with the advent of higher
frequency scheduling, as well as [tickless kernels](
http://lwn.net/Articles/549580/), the number of kernel ticks
wasn't relevant anymore. It stuck around anyway, mainly for legacy and
compatibility reasons.

### Block I/O metrics

Block I/O is accounted in the `blkio` controller.
Different metrics are scattered across different files. While you can
find in-depth details in the [blkio-controller](
https://www.kernel.org/doc/Documentation/cgroups/blkio-controller.txt)
file in the kernel documentation, here is a short list of the most
relevant ones:

- **blkio.sectors:**
  contains the number of 512-byte sectors read and written by the
  processes that are members of the cgroup, device by device. Reads and
  writes are merged in a single counter.

- **blkio.io_service_bytes:**
  indicates the number of bytes read and written by the cgroup. It has
  4 counters per device, because for each device, it differentiates
  between synchronous vs. asynchronous I/O, and reads vs. writes.

- **blkio.io_serviced:**
  the number of I/O operations performed, regardless of their size. It
  also has 4 counters per device.

- **blkio.io_queued:**
  indicates the number of I/O operations currently queued for this
  cgroup. In other words, if the cgroup isn't doing any I/O, this will
  be zero. Note that the opposite is not true. In other words, if
  there is no I/O queued, it does not mean that the cgroup is idle
  (I/O-wise). It could be doing purely synchronous reads on an
  otherwise quiescent device, which is therefore able to handle them
  immediately, without queuing. Also, while it is helpful to figure
  out which cgroup is putting stress on the I/O subsystem, keep in
  mind that it is a relative quantity. Even if a process group does
  not perform more I/O, its queue size can increase just because the
  device load increases because of other devices.

## Network metrics

Network metrics are not exposed directly by control groups. There is a
good explanation for that: network interfaces exist within the context
of *network namespaces*. The kernel could probably accumulate metrics
about packets and bytes sent and received by a group of processes, but
those metrics wouldn't be very useful. You want per-interface metrics
(because traffic happening on the local `lo`
interface doesn't really count).
But since processes in a single cgroup
can belong to multiple network namespaces, those metrics would be harder
to interpret: multiple network namespaces means multiple `lo`
interfaces, potentially multiple `eth0`
interfaces, etc.; so this is why there is no easy way to gather network
metrics with control groups.

Instead we can gather network metrics from other sources:

### IPtables

IPtables (or rather, the netfilter framework for which iptables is just
an interface) can do some serious accounting.

For instance, you can set up a rule to account for the outbound HTTP
traffic on a web server:

    $ iptables -I OUTPUT -p tcp --sport 80

There is no `-j` or `-g` flag,
so the rule will just count matched packets and go to the following
rule.

Later, you can check the values of the counters, with:

    $ iptables -nxvL OUTPUT

Technically, `-n` is not required, but it will
prevent iptables from doing DNS reverse lookups, which are probably
useless in this scenario.

Counters include packets and bytes. If you want to set up metrics for
container traffic like this, you could execute a `for`
loop to add two `iptables` rules per
container IP address (one in each direction), in the `FORWARD`
chain. This will only meter traffic going through the NAT
layer; you will also have to add traffic going through the userland
proxy.

Then, you will need to check those counters on a regular basis. If you
happen to use `collectd`, there is a [nice plugin](https://collectd.org/wiki/index.php/Plugin:IPTables)
to automate iptables counters collection.

### Interface-level counters

Since each container has a virtual Ethernet interface, you might want to
directly check the TX and RX counters of this interface. You will notice
that each container is associated with a virtual Ethernet interface in
your host, with a name like `vethKk8Zqi`. Figuring
out which interface corresponds to which container is, unfortunately,
difficult.

But for now, the best way is to check the metrics *from within the
containers*. To accomplish this, you can run an executable from the host
environment within the network namespace of a container using **ip-netns
magic**.

The `ip netns exec` command will let you execute any
program (present in the host system) within any network namespace
visible to the current process. This means that your host will be able
to enter the network namespace of your containers, but your containers
won't be able to access the host, nor their sibling containers.
Containers will be able to "see" and affect their sub-containers,
though.

The exact format of the command is:

    $ ip netns exec <nsname> <command...>

For example:

    $ ip netns exec mycontainer netstat -i

`ip netns` finds the "mycontainer" container by
using namespaces pseudo-files. Each process belongs to one network
namespace, one PID namespace, one `mnt` namespace,
etc., and those namespaces are materialized under
`/proc/<pid>/ns/`. For example, the network
namespace of PID 42 is materialized by the pseudo-file
`/proc/42/ns/net`.

When you run `ip netns exec mycontainer ...`, it
expects `/var/run/netns/mycontainer` to be one of
those pseudo-files. (Symlinks are accepted.)
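For instance, here is what such a pseudo-file looks like for PID 42
(echoing the example above). On recent kernels it appears as a special
symbolic link pointing at a namespace identifier; the date and inode
number shown here are just illustrative:

    $ ls -l /proc/42/ns/net
    lrwxrwxrwx 1 root root 0 Jan  1 12:00 /proc/42/ns/net -> net:[4026531956]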
In other words, to execute a command within the network namespace of a
container, we need to:

- Find out the PID of any process within the container that we want to investigate;
- Create a symlink from `/var/run/netns/<somename>` to `/proc/<thepid>/ns/net`;
- Execute `ip netns exec <somename> ...`.

Please review [*Enumerating cgroups*](#enumerating-cgroups) to learn how to find
the cgroup of a process running in the container of which you want to
measure network usage. From there, you can examine the pseudo-file named
`tasks`, which contains the PIDs that are in the
control group (i.e., in the container). Pick any one of them.

Putting everything together, if the "short ID" of a container is held in
the environment variable `$CID`, then you can do this:

    $ TASKS=/sys/fs/cgroup/devices/$CID*/tasks
    $ PID=$(head -n 1 $TASKS)
    $ mkdir -p /var/run/netns
    $ ln -sf /proc/$PID/ns/net /var/run/netns/$CID
    $ ip netns exec $CID netstat -i

## Tips for high-performance metric collection

Note that running a new process each time you want to update metrics is
(relatively) expensive. If you want to collect metrics at high
resolutions, and/or over a large number of containers (think 1000
containers on a single host), you do not want to fork a new process each
time.

Here is how to collect metrics from a single process. You will have to
write your metric collector in C (or any language that lets you do
low-level system calls). You need to use a special system call,
`setns()`, which lets the current process enter any
arbitrary namespace. It requires, however, an open file descriptor to
the namespace pseudo-file (remember: that's the pseudo-file in
`/proc/<pid>/ns/net`).

However, there is a catch: you must not keep this file descriptor open.
If you do, when the last process of the control group exits, the
namespace will not be destroyed, and its network resources (like the
virtual interface of the container) will stay around forever (or until
you close that file descriptor).

The right approach would be to keep track of the first PID of each
container, and re-open the namespace pseudo-file each time.

## Collecting metrics when a container exits

Sometimes, you do not care about real-time metric collection, but when a
container exits, you want to know how much CPU, memory, etc. it has
used.

Docker makes this difficult because it relies on `lxc-start`, which
carefully cleans up after itself, but it is still possible. It is
usually easier to collect metrics at regular intervals (e.g., every
minute, with the collectd LXC plugin) and rely on that instead.

But, if you'd still like to gather the stats when a container stops,
here is how:

For each container, start a collection process, and move it to the
control groups that you want to monitor by writing its PID to the tasks
file of the cgroup. The collection process should periodically re-read
the tasks file to check if it's the last process of the control group.
(If you also want to collect network statistics as explained in the
previous section, you should also move the process to the appropriate
network namespace.)

When the container exits, `lxc-start` will try to
delete the control groups. It will fail, since the control group is
still in use; but that's fine.
Your process should now detect that it is
the only one remaining in the group. Now is the right time to collect
all the metrics you need!

Finally, your process should move itself back to the root control group,
and remove the container control group. To remove a control group, just
`rmdir` its directory. It's counter-intuitive to
`rmdir` a directory that still contains files; but
remember that this is a pseudo-filesystem, so the usual rules don't apply.
After the cleanup is done, the collection process can exit safely.
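To make this more concrete, here is a rough shell sketch of such a
collection process for the memory cgroup only. The paths are
hypothetical: it assumes `$CID` holds the container's long ID and that
the memory hierarchy is mounted under `/sys/fs/cgroup/memory` with the
`lxc/` layout used earlier; a real collector would watch every hierarchy
it cares about and handle errors.

    CG=/sys/fs/cgroup/memory/lxc/$CID

    # Join the container's control group, then wait until
    # we are the last remaining process in it.
    echo $$ > $CG/tasks
    while [ $(wc -l < $CG/tasks) -gt 1 ]; do
        sleep 1
    done

    # Collect the metrics you need, e.g., the memory high-water mark.
    cat $CG/memory.max_usage_in_bytes

    # Move back to the root control group and remove the container's group.
    echo $$ > /sys/fs/cgroup/memory/tasks
    rmdir $CG

In practice, you would repeat this for each hierarchy you want to account
for (memory, cpuacct, blkio, ...), reading the relevant pseudo-files just
before removing each group.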