<!--[metadata]>
+++
aliases = ["/engine/articles/run_metrics"]
title = "Runtime metrics"
description = "Measure the behavior of running containers"
keywords = ["docker, metrics, CPU, memory, disk, IO, run, runtime, stats"]
[menu.main]
parent = "engine_admin"
weight = 4
+++
<![end-metadata]-->

# Runtime metrics

## Docker stats

You can use the `docker stats` command to live stream a container's
runtime metrics. The command supports CPU, memory usage, memory limit,
and network IO metrics.

The following is a sample output from the `docker stats` command:

    $ docker stats redis1 redis2
    CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O
    redis1              0.07%               796 KB / 64 MB        1.21%               788 B / 648 B       3.568 MB / 512 KB
    redis2              0.07%               2.746 MB / 64 MB      4.29%               1.266 KB / 648 B    12.4 MB / 0 B

The [docker stats](../reference/commandline/stats.md) reference page has
more details about the `docker stats` command.

## Control groups

Linux Containers rely on [control groups](https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt)
which not only track groups of processes, but also expose metrics about
CPU, memory, and block I/O usage. You can access those metrics and
obtain network usage metrics as well. This is relevant for "pure" LXC
containers, as well as for Docker containers.

Control groups are exposed through a pseudo-filesystem. In recent
distros, you should find this filesystem under `/sys/fs/cgroup`. Under
that directory, you will see multiple sub-directories, called
`devices`, `freezer`, `blkio`, etc.; each sub-directory actually
corresponds to a different cgroup hierarchy.

On older systems, the control groups might be mounted on `/cgroup`, without
distinct hierarchies. In that case, instead of seeing the sub-directories,
you will see a bunch of files in that directory, and possibly some directories
corresponding to existing containers.

To figure out where your control groups are mounted, you can run:

    $ grep cgroup /proc/mounts

## Enumerating cgroups

You can look into `/proc/cgroups` to see the different control group subsystems
known to the system, the hierarchy they belong to, and how many groups they contain.

You can also look at `/proc/<pid>/cgroup` to see which control groups a process
belongs to. The control group will be shown as a path relative to the root of
the hierarchy mountpoint; e.g., `/` means "this process has not been assigned to
a particular group", while `/lxc/pumpkin` means that the process is likely to be
a member of a container named `pumpkin`.

## Finding the cgroup for a given container

For each container, one cgroup will be created in each hierarchy. On
older systems with older versions of the LXC userland tools, the name of
the cgroup will be the name of the container. With more recent versions
of the LXC tools, the cgroup will be `lxc/<container_name>`.

For Docker containers using cgroups, the cgroup name will be the full
ID or long ID of the container. If a container shows up as `ae836c95b4c3`
in `docker ps`, its long ID might be something like
`ae836c95b4c3c9e9179e0e91015512da89fdec91612f63cebae57df9a5444c79`. You can
look it up with `docker inspect` or `docker ps --no-trunc`.

Putting everything together to look at the memory metrics for a Docker
container, take a look at `/sys/fs/cgroup/memory/docker/<longid>/`.
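For example, here is a minimal sketch (assuming a running container
named `redis1`, and cgroups mounted under `/sys/fs/cgroup`) that
resolves the long ID and lists that container's memory pseudo-files:

    $ CONTAINER_ID=$(docker inspect --format '{{.Id}}' redis1)
    $ ls /sys/fs/cgroup/memory/docker/$CONTAINER_ID/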
## Metrics from cgroups: memory, CPU, block I/O

For each subsystem (memory, CPU, and block I/O), you will find one or
more pseudo-files containing statistics.

### Memory metrics: `memory.stat`

Memory metrics are found in the "memory" cgroup. Note that the memory
control group adds a little overhead, because it does very fine-grained
accounting of the memory usage on your host. Therefore, many distros
choose not to enable it by default. Generally, to enable it, all you have
to do is to add some kernel command-line parameters:
`cgroup_enable=memory swapaccount=1`.

The metrics are in the pseudo-file `memory.stat`.
Here is what it will look like:

    cache 11492564992
    rss 1930993664
    mapped_file 306728960
    pgpgin 406632648
    pgpgout 403355412
    swap 0
    pgfault 728281223
    pgmajfault 1724
    inactive_anon 46608384
    active_anon 1884520448
    inactive_file 7003344896
    active_file 4489052160
    unevictable 32768
    hierarchical_memory_limit 9223372036854775807
    hierarchical_memsw_limit 9223372036854775807
    total_cache 11492564992
    total_rss 1930993664
    total_mapped_file 306728960
    total_pgpgin 406632648
    total_pgpgout 403355412
    total_swap 0
    total_pgfault 728281223
    total_pgmajfault 1724
    total_inactive_anon 46608384
    total_active_anon 1884520448
    total_inactive_file 7003344896
    total_active_file 4489052160
    total_unevictable 32768

The first half (without the `total_` prefix) contains statistics relevant
to the processes within the cgroup, excluding sub-cgroups. The second half
(with the `total_` prefix) includes sub-cgroups as well.

Some metrics are "gauges", i.e., values that can increase or decrease
(e.g., swap, the amount of swap space used by the members of the cgroup).
Some others are "counters", i.e., values that can only go up, because
they represent occurrences of a specific event (e.g., pgfault, which
indicates the number of page faults which happened since the creation of
the cgroup; this number can never decrease).
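For instance, here is a minimal sketch (reusing the `$CONTAINER_ID`
variable from the earlier example) that pulls a few of those values out
of a container's `memory.stat`:

    $ grep -E '^(cache|rss|swap) ' \
        /sys/fs/cgroup/memory/docker/$CONTAINER_ID/memory.stat

The fields break down as follows: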
- **cache:**
    the amount of memory used by the processes of this control group
    that can be associated precisely with a block on a block device.
    When you read from and write to files on disk, this amount will
    increase. This will be the case if you use "conventional" I/O
    (`open`, `read`, `write` syscalls) as well as mapped files (with
    `mmap`). It also accounts for the memory used by `tmpfs` mounts,
    though the reasons are unclear.

- **rss:**
    the amount of memory that *doesn't* correspond to anything on disk:
    stacks, heaps, and anonymous memory maps.

- **mapped_file:**
    indicates the amount of memory mapped by the processes in the
    control group. It doesn't give you information about *how much*
    memory is used; it rather tells you *how* it is used.

- **pgfault and pgmajfault:**
    indicate the number of times that a process of the cgroup triggered
    a "page fault" and a "major fault", respectively. A page fault
    happens when a process accesses a part of its virtual memory space
    which is nonexistent or protected. The former can happen if the
    process is buggy and tries to access an invalid address (it will
    then be sent a `SIGSEGV` signal, typically killing it with the
    famous `Segmentation fault` message). The latter can happen when
    the process reads from a memory zone which has been swapped out, or
    which corresponds to a mapped file: in that case, the kernel will
    load the page from disk, and let the CPU complete the memory
    access. It can also happen when the process writes to a
    copy-on-write memory zone: likewise, the kernel will preempt the
    process, duplicate the memory page, and resume the write operation
    on the process's own copy of the page. "Major" faults happen when
    the kernel actually has to read the data from disk. When it just
    has to duplicate an existing page, or allocate an empty page, it's
    a regular (or "minor") fault.

- **swap:**
    the amount of swap currently used by the processes in this cgroup.

- **active_anon and inactive_anon:**
    the amount of *anonymous* memory that has been identified as
    respectively *active* and *inactive* by the kernel. "Anonymous"
    memory is the memory that is *not* linked to disk pages. In other
    words, that's the equivalent of the rss counter described above. In
    fact, the very definition of the rss counter is **active_anon** +
    **inactive_anon** - **tmpfs** (where tmpfs is the amount of memory
    used up by `tmpfs` filesystems mounted by this control group). Now,
    what's the difference between "active" and "inactive"? Pages are
    initially "active"; and at regular intervals, the kernel sweeps
    over the memory, and tags some pages as "inactive". Whenever they
    are accessed again, they are immediately retagged "active". When
    the kernel is almost out of memory, and time comes to swap out to
    disk, the kernel will swap "inactive" pages.

- **active_file and inactive_file:**
    cache memory, with *active* and *inactive* similar to the *anon*
    memory above. The exact formula is cache = **active_file** +
    **inactive_file** + **tmpfs**. The exact rules used by the kernel
    to move memory pages between active and inactive sets are different
    from the ones used for anonymous memory, but the general principle
    is the same. Note that when the kernel needs to reclaim memory, it
    is cheaper to reclaim a clean (i.e., non-modified) page from this
    pool, since it can be reclaimed immediately (while anonymous pages
    and dirty/modified pages have to be written to disk first).

- **unevictable:**
    the amount of memory that cannot be reclaimed; generally, it will
    account for memory that has been "locked" with `mlock`. It is often
    used by crypto frameworks to make sure that secret keys and other
    sensitive material never gets swapped out to disk.

- **memory and memsw limits:**
    These are not really metrics, but a reminder of the limits applied
    to this cgroup. The first one indicates the maximum amount of
    physical memory that can be used by the processes of this control
    group; the second one indicates the maximum amount of RAM+swap.

Accounting for memory in the page cache is very complex. If two
processes in different control groups both read the same file
(ultimately relying on the same blocks on disk), the corresponding
memory charge will be split between the control groups. It's nice, but
it also means that when a cgroup is terminated, it could increase the
memory usage of another cgroup, because they are not splitting the cost
anymore for those memory pages.

### CPU metrics: `cpuacct.stat`

Now that we've covered memory metrics, everything else will look very
simple in comparison. CPU metrics will be found in the
`cpuacct` controller.

For each container, you will find a pseudo-file `cpuacct.stat`,
containing the CPU usage accumulated by the processes of the container,
broken down between `user` and `system` time. If you're not familiar
with the distinction, `user` is the time during which the processes were
in direct control of the CPU (i.e., executing process code), and `system`
is the time during which the CPU was executing system calls on behalf of
those processes.

Those times are expressed in ticks of 1/100th of a second. Actually,
they are expressed in "user jiffies". There are `USER_HZ` *"jiffies"*
per second, and on x86 systems, `USER_HZ` is 100. This used to map
exactly to the number of scheduler "ticks" per second; but with the
advent of higher frequency scheduling, as well as [tickless
kernels](http://lwn.net/Articles/549580/), the number of kernel ticks
wasn't relevant anymore. It stuck around anyway, mainly for legacy and
compatibility reasons.
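For example, here is a minimal sketch (reusing `$CONTAINER_ID`, and
assuming `USER_HZ` is 100 as on x86) that converts the accumulated
`user` time into whole seconds:

    $ USER_JIFFIES=$(awk '/^user/ {print $2}' \
        /sys/fs/cgroup/cpuacct/docker/$CONTAINER_ID/cpuacct.stat)
    $ echo "$((USER_JIFFIES / 100)) seconds of user CPU time"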
### Block I/O metrics

Block I/O is accounted in the `blkio` controller.
Different metrics are scattered across different files. While you can
find in-depth details in the [blkio-controller](https://www.kernel.org/doc/Documentation/cgroups/blkio-controller.txt)
file in the kernel documentation, here is a short list of the most
relevant ones:

- **blkio.sectors:**
    contains the number of 512-byte sectors read and written by the
    processes that are members of the cgroup, device by device. Reads
    and writes are merged in a single counter.

- **blkio.io_service_bytes:**
    indicates the number of bytes read and written by the cgroup. It
    has 4 counters per device, because for each device, it
    differentiates between synchronous vs. asynchronous I/O, and reads
    vs. writes.

- **blkio.io_serviced:**
    the number of I/O operations performed, regardless of their size.
    It also has 4 counters per device.

- **blkio.io_queued:**
    indicates the number of I/O operations currently queued for this
    cgroup. In other words, if the cgroup isn't doing any I/O, this
    will be zero. Note that the opposite is not true. In other words,
    if there is no I/O queued, it does not mean that the cgroup is idle
    (I/O-wise). It could be doing purely synchronous reads on an
    otherwise quiescent device, which is therefore able to handle them
    immediately, without queuing. Also, while it is helpful to figure
    out which cgroup is putting stress on the I/O subsystem, keep in
    mind that it is a relative quantity. Even if a process group does
    not perform more I/O, its queue size can increase just because the
    device load increases due to other devices.
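As a quick check, here is a minimal sketch (reusing `$CONTAINER_ID`,
and assuming the file contains `major:minor operation bytes` lines,
which depends on your kernel) that sums the read and write byte
counters across all devices:

    $ awk '$2 == "Read" || $2 == "Write" { total[$2] += $3 }
           END { for (op in total) print op, total[op], "bytes" }' \
        /sys/fs/cgroup/blkio/docker/$CONTAINER_ID/blkio.io_service_bytes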
## Network metrics

Network metrics are not exposed directly by control groups. There is a
good explanation for that: network interfaces exist within the context
of *network namespaces*. The kernel could probably accumulate metrics
about packets and bytes sent and received by a group of processes, but
those metrics wouldn't be very useful. You want per-interface metrics
(because traffic happening on the local `lo` interface doesn't really
count). But since processes in a single cgroup can belong to multiple
network namespaces, those metrics would be harder to interpret:
multiple network namespaces means multiple `lo` interfaces, potentially
multiple `eth0` interfaces, etc.; so this is why there is no easy way
to gather network metrics with control groups.

Instead, we can gather network metrics from other sources:

### IPtables

IPtables (or rather, the netfilter framework for which iptables is just
an interface) can do some serious accounting.

For instance, you can set up a rule to account for the outbound HTTP
traffic on a web server:

    $ iptables -I OUTPUT -p tcp --sport 80

There is no `-j` or `-g` flag, so the rule will just count matched
packets and go to the following rule.

Later, you can check the values of the counters, with:

    $ iptables -nxvL OUTPUT

Technically, `-n` is not required, but it will prevent iptables from
doing DNS reverse lookups, which are probably useless in this scenario.

Counters include packets and bytes. If you want to set up metrics for
container traffic like this, you could execute a `for` loop to add two
`iptables` rules per container IP address (one in each direction), in
the `FORWARD` chain. This will only meter traffic going through the NAT
layer; you will also have to add traffic going through the userland
proxy.
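Such a loop could look like the following minimal sketch, assuming
`$CONTAINERS` holds a list of container names and that `docker inspect`
is used to look up each container's IP address (both are assumptions,
not a fixed recipe):

    for NAME in $CONTAINERS; do
        IP=$(docker inspect --format '{{.NetworkSettings.IPAddress}}' $NAME)
        # No -j target: these rules only count packets, one per direction.
        iptables -I FORWARD -s $IP
        iptables -I FORWARD -d $IP
    done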
Then, you will need to check those counters on a regular basis. If you
happen to use `collectd`, there is a [nice plugin](https://collectd.org/wiki/index.php/Table_of_Plugins)
to automate iptables counters collection.

### Interface-level counters

Since each container has a virtual Ethernet interface, you might want to
directly check the TX and RX counters of this interface. You will notice
that each container is associated with a virtual Ethernet interface on
your host, with a name like `vethKk8Zqi`. Figuring out which interface
corresponds to which container is, unfortunately, difficult.

But for now, the best way is to check the metrics *from within the
containers*. To accomplish this, you can run an executable from the host
environment within the network namespace of a container using **ip-netns
magic**.

The `ip-netns exec` command will let you execute any program (present in
the host system) within any network namespace visible to the current
process. This means that your host will be able to enter the network
namespace of your containers, but your containers won't be able to
access the host, nor their sibling containers. Containers will be able
to "see" and affect their sub-containers, though.

The exact format of the command is:

    $ ip netns exec <nsname> <command...>

For example:

    $ ip netns exec mycontainer netstat -i

`ip netns` finds the "mycontainer" container by using namespaces
pseudo-files. Each process belongs to one network namespace, one PID
namespace, one `mnt` namespace, etc., and those namespaces are
materialized under `/proc/<pid>/ns/`. For example, the network
namespace of PID 42 is materialized by the pseudo-file
`/proc/42/ns/net`.

When you run `ip netns exec mycontainer ...`, it expects
`/var/run/netns/mycontainer` to be one of those pseudo-files. (Symlinks
are accepted.)

In other words, to execute a command within the network namespace of a
container, we need to:

- Find out the PID of any process within the container that we want to
  investigate;
- Create a symlink from `/var/run/netns/<somename>` to
  `/proc/<thepid>/ns/net`;
- Execute `ip netns exec <somename> ...`.

Please review [*Enumerating cgroups*](#enumerating-cgroups) to learn how
to find the cgroup of a process running in the container of which you
want to measure network usage. From there, you can examine the
pseudo-file named `tasks`, which contains the PIDs that are in the
control group (i.e., in the container). Pick any one of them.

Putting everything together, if the "short ID" of a container is held in
the environment variable `$CID`, then you can do this:

    $ TASKS=/sys/fs/cgroup/devices/docker/$CID*/tasks
    $ PID=$(head -n 1 $TASKS)
    $ mkdir -p /var/run/netns
    $ ln -sf /proc/$PID/ns/net /var/run/netns/$CID
    $ ip netns exec $CID netstat -i

## Tips for high-performance metric collection

Note that running a new process each time you want to update metrics is
(relatively) expensive. If you want to collect metrics at high
resolutions, and/or over a large number of containers (think 1000
containers on a single host), you do not want to fork a new process each
time.

Here is how to collect metrics from a single process. You will have to
write your metric collector in C (or any language that lets you do
low-level system calls). You need to use a special system call,
`setns()`, which lets the current process enter any arbitrary namespace.
It requires, however, an open file descriptor to the namespace
pseudo-file (remember: that's the pseudo-file in `/proc/<pid>/ns/net`).

However, there is a catch: you must not keep this file descriptor open.
If you do, when the last process of the control group exits, the
namespace will not be destroyed, and its network resources (like the
virtual interface of the container) will stay around forever (or until
you close that file descriptor).

The right approach would be to keep track of the first PID of each
container, and re-open the namespace pseudo-file each time.

## Collecting metrics when a container exits

Sometimes, you do not care about real-time metric collection, but when a
container exits, you want to know how much CPU, memory, etc. it has
used.

Docker makes this difficult because it relies on `lxc-start`, which
carefully cleans up after itself, but it is still possible. It is
usually easier to collect metrics at regular intervals (e.g., every
minute, with the collectd LXC plugin) and rely on that instead.

But, if you'd still like to gather the stats when a container stops,
here is how:

For each container, start a collection process, and move it to the
control groups that you want to monitor by writing its PID to the tasks
file of the cgroup. The collection process should periodically re-read
the tasks file to check if it's the last process of the control group.
(If you also want to collect network statistics as explained in the
previous section, you should also move the process to the appropriate
network namespace.)
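In shell, that collection step could look like this minimal sketch
(watching only the memory hierarchy, and reusing `$CONTAINER_ID`; a
real collector would snapshot the other hierarchies too):

    CG=/sys/fs/cgroup/memory/docker/$CONTAINER_ID
    # Join the container's cgroup, then wait until we are the last
    # process left in it.
    echo $$ > $CG/tasks
    while [ "$(wc -l < $CG/tasks)" -gt 1 ]; do
        sleep 1
    done
    cat $CG/memory.stat    # collect the final metrics here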
When the container exits, `lxc-start` will try to delete the control
groups. It will fail, since the control group is still in use; but
that's fine. Your process should now detect that it is the only one
remaining in the group. Now is the right time to collect all the
metrics you need!

Finally, your process should move itself back to the root control group,
and remove the container control group. To remove a control group, just
`rmdir` its directory. It's counter-intuitive to `rmdir` a directory
while it still contains files; but remember that this is a
pseudo-filesystem, so usual rules don't apply. After the cleanup is
done, the collection process can exit safely.
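Continuing the sketch above, that cleanup could look like this (again
only for the memory hierarchy, with `$CONTAINER_ID` as before):

    # Move ourselves back to the root control group...
    echo $$ > /sys/fs/cgroup/memory/tasks
    # ...then remove the now-empty container cgroup.
    rmdir /sys/fs/cgroup/memory/docker/$CONTAINER_ID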