<!--[metadata]>
+++
title = "Runtime metrics"
description = "Measure the behavior of running containers"
keywords = ["docker, metrics, CPU, memory, disk, IO, run, runtime"]
[menu.main]
parent = "smn_administrate"
weight = 4
+++
<![end-metadata]-->

# Runtime metrics

Linux Containers rely on [control groups](https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt)
which not only track groups of processes, but also expose metrics about
CPU, memory, and block I/O usage. You can access those metrics and
obtain network usage metrics as well. This is relevant for "pure" LXC
containers, as well as for Docker containers.

## Control groups

Control groups are exposed through a pseudo-filesystem. In recent
distros, you should find this filesystem under `/sys/fs/cgroup`. Under
that directory, you will see multiple sub-directories, called `devices`,
`freezer`, `blkio`, etc.; each sub-directory actually corresponds to a
different cgroup hierarchy.

On older systems, the control groups might be mounted on `/cgroup`, without
distinct hierarchies. In that case, instead of seeing the sub-directories,
you will see a bunch of files in that directory, and possibly some directories
corresponding to existing containers.

To figure out where your control groups are mounted, you can run:

    $ grep cgroup /proc/mounts

## Enumerating cgroups

You can look into `/proc/cgroups` to see the different control group subsystems
known to the system, the hierarchy they belong to, and how many groups they contain.

You can also look at `/proc/<pid>/cgroup` to see which control groups a process
belongs to. The control group will be shown as a path relative to the root of
the hierarchy mountpoint; e.g., `/` means "this process has not been assigned to
a particular group", while `/lxc/pumpkin` means that the process is likely to be
a member of a container named `pumpkin`.

## Finding the cgroup for a given container

For each container, one cgroup will be created in each hierarchy. On
older systems with older versions of the LXC userland tools, the name of
the cgroup will be the name of the container. With more recent versions
of the LXC tools, the cgroup will be `lxc/<container_name>`.

For Docker containers using cgroups, the container name will be the full
ID or long ID of the container. If a container shows up as `ae836c95b4c3`
in `docker ps`, its long ID might be something like
`ae836c95b4c3c9e9179e0e91015512da89fdec91612f63cebae57df9a5444c79`. You can
look it up with `docker inspect` or `docker ps --no-trunc`.

Putting everything together, to look at the memory metrics for a Docker
container, take a look at `/sys/fs/cgroup/memory/lxc/<longid>/`.

## Metrics from cgroups: memory, CPU, block I/O

For each subsystem (memory, CPU, and block I/O), you will find one or
more pseudo-files containing statistics.

### Memory metrics: `memory.stat`

Memory metrics are found in the "memory" cgroup. Note that the memory
control group adds a little overhead, because it does very fine-grained
accounting of the memory usage on your host. Therefore, many distros
choose not to enable it by default. Generally, to enable it, all you
have to do is to add some kernel command-line parameters:
`cgroup_enable=memory swapaccount=1`.
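
If you are not sure whether memory accounting is already enabled on your
host, you can check before digging into the metrics. A minimal sketch,
using only standard pseudo-files (the exact mount point can vary between
distros):

    $ grep memory /proc/cgroups      # the "enabled" column should show 1
    $ cat /proc/cmdline              # look for cgroup_enable=memory swapaccount=1
    $ ls /sys/fs/cgroup/memory       # the memory hierarchy should be mounted here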
The metrics are in the pseudo-file `memory.stat`.
Here is what it will look like:

    cache 11492564992
    rss 1930993664
    mapped_file 306728960
    pgpgin 406632648
    pgpgout 403355412
    swap 0
    pgfault 728281223
    pgmajfault 1724
    inactive_anon 46608384
    active_anon 1884520448
    inactive_file 7003344896
    active_file 4489052160
    unevictable 32768
    hierarchical_memory_limit 9223372036854775807
    hierarchical_memsw_limit 9223372036854775807
    total_cache 11492564992
    total_rss 1930993664
    total_mapped_file 306728960
    total_pgpgin 406632648
    total_pgpgout 403355412
    total_swap 0
    total_pgfault 728281223
    total_pgmajfault 1724
    total_inactive_anon 46608384
    total_active_anon 1884520448
    total_inactive_file 7003344896
    total_active_file 4489052160
    total_unevictable 32768

The first half (without the `total_` prefix) contains statistics relevant
to the processes within the cgroup, excluding sub-cgroups. The second half
(with the `total_` prefix) includes sub-cgroups as well.

Some metrics are "gauges", i.e., values that can increase or decrease
(e.g., `swap`, the amount of swap space used by the members of the cgroup).
Some others are "counters", i.e., values that can only go up, because
they represent occurrences of a specific event (e.g., `pgfault`, which
indicates the number of page faults which happened since the creation of
the cgroup; this number can never decrease).

- **cache:**
  the amount of memory used by the processes of this control group
  that can be associated precisely with a block on a block device.
  When you read from and write to files on disk, this amount will
  increase. This will be the case if you use "conventional" I/O
  (`open`, `read`, `write` syscalls) as well as mapped files (with
  `mmap`). It also accounts for the memory used by `tmpfs` mounts,
  though the reasons are unclear.

- **rss:**
  the amount of memory that *doesn't* correspond to anything on disk:
  stacks, heaps, and anonymous memory maps.

- **mapped_file:**
  indicates the amount of memory mapped by the processes in the
  control group. It doesn't give you information about *how much*
  memory is used; it rather tells you *how* it is used.

- **pgfault and pgmajfault:**
  indicate the number of times that a process of the cgroup triggered
  a "page fault" and a "major fault", respectively. A page fault
  happens when a process accesses a part of its virtual memory space
  which is nonexistent or protected. The former can happen if the
  process is buggy and tries to access an invalid address (it will
  then be sent a `SIGSEGV` signal, typically killing it with the
  famous `Segmentation fault` message). The latter can happen when
  the process reads from a memory zone which has been swapped out,
  or which corresponds to a mapped file: in that case, the kernel
  will load the page from disk, and let the CPU complete the memory
  access. It can also happen when the process writes to a
  copy-on-write memory zone: likewise, the kernel will preempt the
  process, duplicate the memory page, and resume the write operation
  on the process' own copy of the page. "Major" faults happen when
  the kernel actually has to read the data from disk. When it just
  has to duplicate an existing page, or allocate an empty page, it's
  a regular (or "minor") fault.

- **swap:**
  the amount of swap currently used by the processes in this cgroup.

- **active_anon and inactive_anon:**
  the amount of *anonymous* memory that has been identified as
  respectively *active* and *inactive* by the kernel. "Anonymous"
  memory is the memory that is *not* linked to disk pages. In other
  words, that's the equivalent of the rss counter described above. In
  fact, the very definition of the rss counter is **active_anon** +
  **inactive_anon** - **tmpfs** (where tmpfs is the amount of memory
  used up by `tmpfs` filesystems mounted by this control group). Now,
  what's the difference between "active" and "inactive"? Pages are
  initially "active"; and at regular intervals, the kernel sweeps
  over the memory, and tags some pages as "inactive". Whenever they
  are accessed again, they are immediately retagged "active". When
  the kernel is almost out of memory, and time comes to swap out to
  disk, the kernel will swap "inactive" pages.

- **active_file and inactive_file:**
  cache memory, with *active* and *inactive* similar to the *anon*
  memory above. The exact formula is cache = **active_file** +
  **inactive_file** + **tmpfs**. The exact rules used by the kernel
  to move memory pages between active and inactive sets are different
  from the ones used for anonymous memory, but the general principle
  is the same. Note that when the kernel needs to reclaim memory, it
  is cheaper to reclaim a clean (i.e., non-modified) page from this
  pool, since it can be reclaimed immediately (while anonymous pages
  and dirty/modified pages have to be written to disk first).

- **unevictable:**
  the amount of memory that cannot be reclaimed; generally, it will
  account for memory that has been "locked" with `mlock`. It is often
  used by crypto frameworks to make sure that secret keys and other
  sensitive material never get swapped out to disk.

- **memory and memsw limits:**
  These are not really metrics, but a reminder of the limits applied
  to this cgroup. The first one indicates the maximum amount of
  physical memory that can be used by the processes of this control
  group; the second one indicates the maximum amount of RAM+swap.

Accounting for memory in the page cache is very complex. If two
processes in different control groups both read the same file
(ultimately relying on the same blocks on disk), the corresponding
memory charge will be split between the control groups. It's nice, but
it also means that when a cgroup is terminated, it could increase the
memory usage of another cgroup, because they are not splitting the cost
anymore for those memory pages.
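
Putting this into practice, here is a minimal sketch for dumping these
metrics for one container from the host. It assumes the memory hierarchy
is mounted at `/sys/fs/cgroup/memory` and that the container's cgroup
lives under `lxc/<longid>` as described above (on some setups it will be
under `docker/<longid>` instead); `mycontainer` is a hypothetical
container name:

    $ CID=$(docker inspect --format '{{.Id}}' mycontainer)    # full container ID
    $ grep -E '^(cache|rss|swap) ' /sys/fs/cgroup/memory/lxc/$CID/memory.stat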
### CPU metrics: `cpuacct.stat`

Now that we've covered memory metrics, everything else will look very
simple in comparison. CPU metrics will be found in the
`cpuacct` controller.

For each container, you will find a pseudo-file `cpuacct.stat`,
containing the CPU usage accumulated by the processes of the container,
broken down between `user` and `system` time. If you're not familiar
with the distinction, `user` is the time during which the processes were
in direct control of the CPU (i.e., executing process code), and `system`
is the time during which the CPU was executing system calls on behalf of
those processes.

Those times are expressed in ticks of 1/100th of a second. Actually,
they are expressed in "user jiffies". There are `USER_HZ`
*"jiffies"* per second, and on x86 systems, `USER_HZ` is 100. This used
to map exactly to the number of scheduler "ticks" per second; but with
the advent of higher frequency scheduling, as well as
[tickless kernels](http://lwn.net/Articles/549580/), the number of
kernel ticks wasn't relevant anymore. It stuck around anyway, mainly for
legacy and compatibility reasons.
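
For example, reading the accumulated CPU time of a container boils down
to one more pseudo-file. This sketch makes the same assumptions about
paths and `$CID` as the memory example above, and the output below is
made up for illustration:

    $ cat /sys/fs/cgroup/cpuacct/lxc/$CID/cpuacct.stat
    user 11906
    system 1652

With `USER_HZ` at 100, this would mean that the container's processes
have used about 119 seconds of user time and about 16.5 seconds of
system time.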
### Block I/O metrics

Block I/O is accounted in the `blkio` controller.
Different metrics are scattered across different files. While you can
find in-depth details in the
[blkio-controller](https://www.kernel.org/doc/Documentation/cgroups/blkio-controller.txt)
file in the kernel documentation, here is a short list of the most
relevant ones:

- **blkio.sectors:**
  contains the number of 512-byte sectors read and written by the
  processes that are members of the cgroup, device by device. Reads
  and writes are merged in a single counter.

- **blkio.io_service_bytes:**
  indicates the number of bytes read and written by the cgroup. It has
  4 counters per device, because for each device, it differentiates
  between synchronous vs. asynchronous I/O, and reads vs. writes.

- **blkio.io_serviced:**
  the number of I/O operations performed, regardless of their size. It
  also has 4 counters per device.

- **blkio.io_queued:**
  indicates the number of I/O operations currently queued for this
  cgroup. In other words, if the cgroup isn't doing any I/O, this will
  be zero. Note that the opposite is not true. In other words, if
  there is no I/O queued, it does not mean that the cgroup is idle
  (I/O-wise). It could be doing purely synchronous reads on an
  otherwise quiescent device, which is therefore able to handle them
  immediately, without queuing. Also, while it is helpful to figure
  out which cgroup is putting stress on the I/O subsystem, keep in
  mind that it is a relative quantity. Even if a process group does
  not perform more I/O, its queue size can increase just because the
  device load increases because of other devices.
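
As a quick sketch, and under the same path and `$CID` assumptions as the
previous examples, you can dump the per-device counters for a container
like this:

    $ cat /sys/fs/cgroup/blkio/lxc/$CID/blkio.io_service_bytes   # bytes, per device
    $ cat /sys/fs/cgroup/blkio/lxc/$CID/blkio.io_serviced        # operations, per device

Each line starts with the major:minor numbers of a block device,
followed by the kind of counter (e.g., Read, Write, Sync, Async) and its
value.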
## Network metrics

Network metrics are not exposed directly by control groups. There is a
good explanation for that: network interfaces exist within the context
of *network namespaces*. The kernel could probably accumulate metrics
about packets and bytes sent and received by a group of processes, but
those metrics wouldn't be very useful. You want per-interface metrics
(because traffic happening on the local `lo` interface doesn't really
count). But since processes in a single cgroup can belong to multiple
network namespaces, those metrics would be harder to interpret: multiple
network namespaces means multiple `lo` interfaces, potentially multiple
`eth0` interfaces, etc.; so this is why there is no easy way to gather
network metrics with control groups.

Instead we can gather network metrics from other sources:

### IPtables

IPtables (or rather, the netfilter framework for which iptables is just
an interface) can do some serious accounting.

For instance, you can set up a rule to account for the outbound HTTP
traffic on a web server:

    $ iptables -I OUTPUT -p tcp --sport 80

There is no `-j` or `-g` flag, so the rule will just count matched
packets and go on to the following rule.

Later, you can check the values of the counters, with:

    $ iptables -nxvL OUTPUT

Technically, `-n` is not required, but it will prevent iptables from
doing DNS reverse lookups, which are probably useless in this scenario.

Counters include packets and bytes. If you want to set up metrics for
container traffic like this, you could execute a `for` loop to add two
`iptables` rules per container IP address (one in each direction), in
the `FORWARD` chain. This will only meter traffic going through the NAT
layer; you will also have to add traffic going through the userland
proxy.

Then, you will need to check those counters on a regular basis. If you
happen to use `collectd`, there is a
[nice plugin](https://collectd.org/wiki/index.php/Plugin:IPTables)
to automate iptables counters collection.
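
As an illustration, here is what those two per-container rules could
look like for a single container, assuming its IP address is held in
`$CIP` (obtained here with `docker inspect`; `mycontainer` is a
hypothetical container name):

    $ CIP=$(docker inspect --format '{{.NetworkSettings.IPAddress}}' mycontainer)
    $ iptables -I FORWARD -s $CIP          # counts traffic sent by the container
    $ iptables -I FORWARD -d $CIP          # counts traffic sent to the container
    $ iptables -nxvL FORWARD | grep $CIP

As above, the rules have no `-j` or `-g` flag, so they only count
matching packets and let them continue through the chain.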
### Interface-level counters

Since each container has a virtual Ethernet interface, you might want to
directly check the TX and RX counters of this interface. You will notice
that each container is associated with a virtual Ethernet interface in
your host, with a name like `vethKk8Zqi`. Figuring out which interface
corresponds to which container is, unfortunately, difficult.

But for now, the best way is to check the metrics *from within the
containers*. To accomplish this, you can run an executable from the host
environment within the network namespace of a container using **ip-netns
magic**.

The `ip-netns exec` command will let you execute any
program (present in the host system) within any network namespace
visible to the current process. This means that your host will be able
to enter the network namespace of your containers, but your containers
won't be able to access the host, nor their sibling containers.
Containers will be able to "see" and affect their sub-containers,
though.

The exact format of the command is:

    $ ip netns exec <nsname> <command...>

For example:

    $ ip netns exec mycontainer netstat -i

`ip netns` finds the "mycontainer" container by using namespaces
pseudo-files. Each process belongs to one network namespace, one PID
namespace, one `mnt` namespace, etc., and those namespaces are
materialized under `/proc/<pid>/ns/`. For example, the network namespace
of PID 42 is materialized by the pseudo-file `/proc/42/ns/net`.

When you run `ip netns exec mycontainer ...`, it expects
`/var/run/netns/mycontainer` to be one of those pseudo-files. (Symlinks
are accepted.)

In other words, to execute a command within the network namespace of a
container, we need to:

- Find out the PID of any process within the container that we want to investigate;
- Create a symlink from `/var/run/netns/<somename>` to `/proc/<thepid>/ns/net`;
- Execute `ip netns exec <somename> <command...>`.

Please review [*Enumerating cgroups*](#enumerating-cgroups) to learn how to find
the cgroup of a process running in the container of which you want to
measure network usage. From there, you can examine the pseudo-file named
`tasks`, which contains the PIDs that are in the control group (i.e., in
the container). Pick any one of them.

Putting everything together, if the "short ID" of a container is held in
the environment variable `$CID`, then you can do this:

    $ TASKS=/sys/fs/cgroup/devices/$CID*/tasks
    $ PID=$(head -n 1 $TASKS)
    $ mkdir -p /var/run/netns
    $ ln -sf /proc/$PID/ns/net /var/run/netns/$CID
    $ ip netns exec $CID netstat -i

## Tips for high-performance metric collection

Note that running a new process each time you want to update metrics is
(relatively) expensive. If you want to collect metrics at high
resolutions, and/or over a large number of containers (think 1000
containers on a single host), you do not want to fork a new process each
time.

Here is how to collect metrics from a single process. You will have to
write your metric collector in C (or any language that lets you do
low-level system calls). You need to use a special system call,
`setns()`, which lets the current process enter any arbitrary namespace.
It requires, however, an open file descriptor to the namespace
pseudo-file (remember: that's the pseudo-file in `/proc/<pid>/ns/net`).

However, there is a catch: you must not keep this file descriptor open.
If you do, when the last process of the control group exits, the
namespace will not be destroyed, and its network resources (like the
virtual interface of the container) will stay around forever (or until
you close that file descriptor).

The right approach would be to keep track of the first PID of each
container, and re-open the namespace pseudo-file each time.

## Collecting metrics when a container exits

Sometimes, you do not care about real-time metric collection, but when a
container exits, you want to know how much CPU, memory, etc. it has
used.

Docker makes this difficult because it relies on `lxc-start`, which
carefully cleans up after itself, but it is still possible. It is
usually easier to collect metrics at regular intervals (e.g., every
minute, with the collectd LXC plugin) and rely on that instead.

But, if you'd still like to gather the stats when a container stops,
here is how:

For each container, start a collection process, and move it to the
control groups that you want to monitor by writing its PID to the tasks
file of the cgroup. The collection process should periodically re-read
the tasks file to check if it's the last process of the control group.
(If you also want to collect network statistics as explained in the
previous section, you should also move the process to the appropriate
network namespace.)

When the container exits, `lxc-start` will try to delete the control
groups. It will fail, since the control group is still in use; but
that's fine. Your process should now detect that it is the only one
remaining in the group. Now is the right time to collect all the metrics
you need!

Finally, your process should move itself back to the root control group,
and remove the container control group. To remove a control group, just
`rmdir` its directory. It's counter-intuitive to `rmdir` a directory
while it still contains files; but remember that this is a
pseudo-filesystem, so usual rules don't apply. After the cleanup is
done, the collection process can exit safely.
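
To make that procedure more concrete, here is a rough sketch for the
memory cgroup only. It assumes the hierarchy is mounted at
`/sys/fs/cgroup/memory`, that the container's cgroup lives under `lxc/`
as earlier in this document, and that `$CID` holds the container ID; a
real collector would do the same for every hierarchy it cares about:

    $ CG=$(ls -d /sys/fs/cgroup/memory/lxc/$CID*)
    # join the container's control group
    $ echo $$ > $CG/tasks
    # wait until we are the last process left in the group
    $ while [ $(wc -l < $CG/tasks) -gt 1 ]; do sleep 1; done
    # the container has exited: collect the final metrics
    $ cat $CG/memory.stat
    # move back to the root control group, then remove the container's group
    $ echo $$ > /sys/fs/cgroup/memory/tasks
    $ rmdir $CG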