github.com/feiyang21687/docker@v1.5.0/docs/sources/articles/runmetrics.md

page_title: Runtime Metrics
page_description: Measure the behavior of running containers
page_keywords: docker, metrics, CPU, memory, disk, IO, run, runtime

# Runtime Metrics

Linux Containers rely on [control groups](
https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt)
which not only track groups of processes, but also expose metrics about
CPU, memory, and block I/O usage. You can access those metrics and
obtain network usage metrics as well. This is relevant for "pure" LXC
containers, as well as for Docker containers.

## Control Groups

Control groups are exposed through a pseudo-filesystem. In recent
distros, you should find this filesystem under `/sys/fs/cgroup`. Under
that directory, you will see multiple sub-directories, called `devices`,
`freezer`, `blkio`, etc.; each sub-directory actually corresponds to a different
cgroup hierarchy.

On older systems, the control groups might be mounted on `/cgroup`, without
distinct hierarchies. In that case, instead of seeing the sub-directories,
you will see a bunch of files in that directory, and possibly some directories
corresponding to existing containers.

To figure out where your control groups are mounted, you can run:

    $ grep cgroup /proc/mounts
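
If you would rather do the same lookup from a program, the sketch below shows one possible way to pick the cgroup mounts out of the contents of `/proc/mounts`. The function name `cgroup_mounts` and the sample text are made up for illustration; only the six-field layout of `/proc/mounts` is taken as given.

```python
def cgroup_mounts(mounts_text):
    """Map each cgroup mountpoint to the list of mount options shown for it."""
    found = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mountpoint, fstype, options, dump, pass
        if len(fields) >= 4 and fields[2] == "cgroup":
            found[fields[1]] = fields[3].split(",")
    return found


# Hypothetical excerpt of /proc/mounts, for illustration only.
sample_mounts = """tmpfs /sys/fs/cgroup tmpfs rw,mode=755 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,relatime,memory 0 0
cgroup /sys/fs/cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0
proc /proc proc rw 0 0
"""

for mountpoint, options in sorted(cgroup_mounts(sample_mounts).items()):
    print(mountpoint, options)
```

On a real host you would pass in `open("/proc/mounts").read()` instead of the sample. Unlike the plain `grep`, this filters on the filesystem type, so the `tmpfs` mounted at `/sys/fs/cgroup` itself is not reported.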

## Enumerating Cgroups

You can look into `/proc/cgroups` to see the different control group subsystems
known to the system, the hierarchy they belong to, and how many groups they contain.

You can also look at `/proc/<pid>/cgroup` to see which control groups a process
belongs to. The control group will be shown as a path relative to the root of
the hierarchy mountpoint; e.g., `/` means “this process has not been assigned to
a particular group”, while `/lxc/pumpkin` means that the process is likely to be
a member of a container named `pumpkin`.
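
As a rough illustration, the contents of `/proc/<pid>/cgroup` can be turned into a lookup table with a few lines of Python. Each line of that file has the form `hierarchy-ID:controller-list:path`. The function name `parse_proc_cgroup` and the sample text are assumptions made for this sketch.

```python
def parse_proc_cgroup(text):
    """Return {controller: path} from the contents of /proc/<pid>/cgroup."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at most twice, in case the cgroup path itself contains ':'.
        _hierarchy_id, controllers, path = line.split(":", 2)
        # Several controllers can share one hierarchy, e.g. "cpu,cpuacct".
        for controller in controllers.split(","):
            result[controller] = path
    return result


# Hypothetical contents, for illustration only.
sample_cgroup = "4:memory:/lxc/pumpkin\n3:cpu,cpuacct:/lxc/pumpkin\n2:devices:/\n"

groups = parse_proc_cgroup(sample_cgroup)
print(groups["memory"])   # /lxc/pumpkin
print(groups["devices"])  # /
```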

## Finding the Cgroup for a Given Container

For each container, one cgroup will be created in each hierarchy. On
older systems with older versions of the LXC userland tools, the name of
the cgroup will be the name of the container. With more recent versions
of the LXC tools, the cgroup will be `lxc/<container_name>`.

For Docker containers using cgroups, the cgroup is named after the full
ID (also called the long ID) of the container. If a container shows up as
`ae836c95b4c3` in `docker ps`, its long ID might be something like
`ae836c95b4c3c9e9179e0e91015512da89fdec91612f63cebae57df9a5444c79`. You can
look it up with `docker inspect` or `docker ps --no-trunc`.

Putting everything together to look at the memory metrics for a Docker
container, take a look at `/sys/fs/cgroup/memory/lxc/<longid>/`.

## Metrics from Cgroups: Memory, CPU, Block IO

For each subsystem (memory, CPU, and block I/O), you will find one or
more pseudo-files containing statistics.

### Memory Metrics: `memory.stat`

Memory metrics are found in the "memory" cgroup. Note that the memory
control group adds a little overhead, because it does very fine-grained
accounting of the memory usage on your host. Therefore, many distros
choose not to enable it by default. Generally, to enable it, all you have
to do is add some kernel command-line parameters:
`cgroup_enable=memory swapaccount=1`.

The metrics are in the pseudo-file `memory.stat`.
Here is what it will look like:

    cache 11492564992
    rss 1930993664
    mapped_file 306728960
    pgpgin 406632648
    pgpgout 403355412
    swap 0
    pgfault 728281223
    pgmajfault 1724
    inactive_anon 46608384
    active_anon 1884520448
    inactive_file 7003344896
    active_file 4489052160
    unevictable 32768
    hierarchical_memory_limit 9223372036854775807
    hierarchical_memsw_limit 9223372036854775807
    total_cache 11492564992
    total_rss 1930993664
    total_mapped_file 306728960
    total_pgpgin 406632648
    total_pgpgout 403355412
    total_swap 0
    total_pgfault 728281223
    total_pgmajfault 1724
    total_inactive_anon 46608384
    total_active_anon 1884520448
    total_inactive_file 7003344896
    total_active_file 4489052160
    total_unevictable 32768

The first half (without the `total_` prefix) contains statistics relevant
to the processes within the cgroup, excluding sub-cgroups. The second half
(with the `total_` prefix) includes sub-cgroups as well.

Some metrics are "gauges", i.e., values that can increase or decrease
(e.g., `swap`, the amount of swap space used by the members of the cgroup).
Some others are "counters", i.e., values that can only go up, because
they represent occurrences of a specific event (e.g., `pgfault`, which
indicates the number of page faults which happened since the creation of
the cgroup; this number can never decrease).
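
A minimal sketch of how a collector might read this file is shown below. It parses the `name value` pairs, separates the plain entries from the `total_` ones, and turns two samples of a counter into a rate (a gauge, by contrast, is reported as-is). The function names are hypothetical; the sample lines are taken from the listing above.

```python
def parse_memory_stat(text):
    """Parse the content of memory.stat into {name: int}."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def split_local_and_total(stats):
    """Separate this cgroup's own values from the total_ (sub-cgroups included) values."""
    prefix = "total_"
    local = {k: v for k, v in stats.items() if not k.startswith(prefix)}
    total = {k[len(prefix):]: v for k, v in stats.items() if k.startswith(prefix)}
    return local, total


def counter_rate(previous, current, seconds):
    """Events per second for a counter such as pgfault, given two samples."""
    return (current - previous) / seconds


sample_stat = """cache 11492564992
rss 1930993664
pgfault 728281223
total_cache 11492564992
total_rss 1930993664
total_pgfault 728281223
"""

local, total = split_local_and_total(parse_memory_stat(sample_stat))
print(local["rss"], total["rss"])
```

On a real host you would read the text from the `memory.stat` file of the container's cgroup rather than from a string.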

- **cache:**
  the amount of memory used by the processes of this control group
  that can be associated precisely with a block on a block device.
  When you read from and write to files on disk, this amount will
  increase. This will be the case if you use "conventional" I/O
  (`open`, `read`, `write` syscalls) as well as mapped files (with
  `mmap`). It also accounts for the memory used by `tmpfs` mounts,
  though the reasons are unclear.

- **rss:**
  the amount of memory that *doesn't* correspond to anything on disk:
  stacks, heaps, and anonymous memory maps.

- **mapped_file:**
  indicates the amount of memory mapped by the processes in the
  control group. It doesn't give you information about *how much*
  memory is used; it rather tells you *how* it is used.

- **pgfault and pgmajfault:**
  indicate the number of times that a process of the cgroup triggered
  a "page fault" and a "major fault", respectively. A page fault
  happens when a process accesses a part of its virtual memory space
  which is nonexistent or protected. The former can happen if the
  process is buggy and tries to access an invalid address (it will
  then be sent a `SIGSEGV` signal, typically killing it with the
  famous `Segmentation fault` message). The latter can happen when the
  process reads from a memory zone which has been swapped out, or
  which corresponds to a mapped file: in that case, the kernel will
  load the page from disk, and let the CPU complete the memory access.
  It can also happen when the process writes to a copy-on-write memory
  zone: likewise, the kernel will preempt the process, duplicate the
  memory page, and resume the write operation on the process's own
  copy of the page. "Major" faults happen when the kernel actually has
  to read the data from disk. When it just has to duplicate an
  existing page, or allocate an empty page, it's a regular (or
  "minor") fault.

- **swap:**
  the amount of swap currently used by the processes in this cgroup.

- **active_anon and inactive_anon:**
  the amount of *anonymous* memory that has been identified as
  respectively *active* and *inactive* by the kernel. "Anonymous"
  memory is the memory that is *not* linked to disk pages. In other
  words, that's the equivalent of the rss metric described above. In
  fact, the very definition of rss is **active_anon** +
  **inactive_anon** - **tmpfs** (where tmpfs is the amount of memory
  used up by `tmpfs` filesystems mounted by this control group). Now,
  what's the difference between "active" and "inactive"? Pages are
  initially "active"; and at regular intervals, the kernel sweeps over
  the memory, and tags some pages as "inactive". Whenever they are
  accessed again, they are immediately retagged "active". When the
  kernel is almost out of memory, and time comes to swap out to disk,
  the kernel will swap "inactive" pages.

- **active_file and inactive_file:**
  cache memory, with *active* and *inactive* similar to the *anon*
  memory above. The exact formula is cache = **active_file** +
  **inactive_file** + **tmpfs**. The exact rules used by the kernel
  to move memory pages between active and inactive sets are different
  from the ones used for anonymous memory, but the general principle
  is the same. Note that when the kernel needs to reclaim memory, it
  is cheaper to reclaim a clean (i.e., unmodified) page from this pool,
  since it can be reclaimed immediately (while anonymous pages and
  dirty/modified pages have to be written to disk first).

- **unevictable:**
  the amount of memory that cannot be reclaimed; generally, it will
  account for memory that has been "locked" with `mlock`.
  It is often used by crypto frameworks to make sure that
  secret keys and other sensitive material never get swapped out to
  disk.

- **memory and memsw limits:**
  These are not really metrics, but a reminder of the limits applied
  to this cgroup. The first one indicates the maximum amount of
  physical memory that can be used by the processes of this control
  group; the second one indicates the maximum amount of RAM+swap.

Accounting for memory in the page cache is very complex. If two
processes in different control groups both read the same file
(ultimately relying on the same blocks on disk), the corresponding
memory charge will be split between the control groups. It's nice, but
it also means that when a cgroup is terminated, it could increase the
memory usage of another cgroup, because they are not splitting the cost
anymore for those memory pages.

### CPU metrics: `cpuacct.stat`

Now that we've covered memory metrics, everything else will look very
simple in comparison. CPU metrics will be found in the
`cpuacct` controller.

For each container, you will find a pseudo-file `cpuacct.stat`,
containing the CPU usage accumulated by the processes of the container,
broken down between `user` and `system` time. If you're not familiar
with the distinction, `user` is the time during which the processes were
in direct control of the CPU (i.e., executing process code), and `system`
is the time during which the CPU was executing system calls on behalf of
those processes.

Those times are expressed in ticks of 1/100th of a second. Actually,
they are expressed in "user jiffies". There are `USER_HZ`
*"jiffies"* per second, and on x86 systems, `USER_HZ` is 100. This used
to map exactly to the number of scheduler "ticks" per second; but with
the advent of higher frequency scheduling, as well as [tickless kernels](
http://lwn.net/Articles/549580/), the number of kernel ticks
wasn't relevant anymore. `USER_HZ` stuck around anyway, mainly for legacy
and compatibility reasons.
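
To make the unit conversion concrete, here is a small sketch that turns the two lines of `cpuacct.stat` into seconds. The function name and the sample values are invented for the example. When no `user_hz` is given, it asks the system for `SC_CLK_TCK`, which is how userland normally obtains `USER_HZ`.

```python
import os


def parse_cpuacct_stat(text, user_hz=None):
    """Return {'user': seconds, 'system': seconds} from cpuacct.stat content."""
    if user_hz is None:
        # SC_CLK_TCK reports USER_HZ (typically 100).
        user_hz = os.sysconf("SC_CLK_TCK")
    seconds = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            seconds[parts[0]] = int(parts[1]) / user_hz
    return seconds


# Made-up values: 46011 user jiffies and 22324 system jiffies.
sample_cpuacct = "user 46011\nsystem 22324\n"
cpu_seconds = parse_cpuacct_stat(sample_cpuacct, user_hz=100)
print(cpu_seconds)  # {'user': 460.11, 'system': 223.24}
```

Since both values are counters, CPU usage over a period is the difference between two readings divided by the length of the period.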

### Block I/O metrics

Block I/O is accounted in the `blkio` controller.
Different metrics are scattered across different files. While you can
find in-depth details in the [blkio-controller](
https://www.kernel.org/doc/Documentation/cgroups/blkio-controller.txt)
file in the kernel documentation, here is a short list of the most
relevant ones:

- **blkio.sectors:**
  contains the number of 512-byte sectors read and written by the
  processes that are members of the cgroup, device by device. Reads
  and writes are merged in a single counter.

- **blkio.io_service_bytes:**
  indicates the number of bytes read and written by the cgroup. It has
  4 counters per device, because for each device, it differentiates
  between synchronous vs. asynchronous I/O, and reads vs. writes.

- **blkio.io_serviced:**
  the number of I/O operations performed, regardless of their size. It
  also has 4 counters per device.

- **blkio.io_queued:**
  indicates the number of I/O operations currently queued for this
  cgroup. In other words, if the cgroup isn't doing any I/O, this will
  be zero. Note that the opposite is not true. In other words, if
  there is no I/O queued, it does not mean that the cgroup is idle
  (I/O-wise). It could be doing purely synchronous reads on an
  otherwise quiescent device, which is therefore able to handle them
  immediately, without queuing. Also, while it is helpful to figure
  out which cgroup is putting stress on the I/O subsystem, keep in
  mind that it is a relative quantity. Even if a process group does
  not perform more I/O, its queue size can increase just because the
  device load increases because of I/O from other cgroups.
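
As an illustration, the per-device files such as `blkio.io_service_bytes` can be parsed as below. The sketch assumes lines of the form `major:minor Operation value`, followed by a final grand-total line with only two fields; the function name and the sample numbers are made up.

```python
def parse_blkio(text):
    """Parse a per-device blkio file into {device: {operation: value}}."""
    per_device = {}
    for line in text.splitlines():
        parts = line.split()
        # Per-device lines look like "8:0 Read 4096". The trailing
        # grand-total line ("Total 12288") has two fields and is skipped.
        if len(parts) == 3:
            device, operation, value = parts
            per_device.setdefault(device, {})[operation] = int(value)
    return per_device


# Made-up sample in the layout of blkio.io_service_bytes.
sample_blkio = """8:0 Read 4096
8:0 Write 8192
8:0 Sync 8192
8:0 Async 4096
8:0 Total 12288
Total 12288
"""

io_bytes = parse_blkio(sample_blkio)
print(io_bytes["8:0"]["Read"], io_bytes["8:0"]["Write"])  # 4096 8192
```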

## Network Metrics

Network metrics are not exposed directly by control groups. There is a
good explanation for that: network interfaces exist within the context
of *network namespaces*. The kernel could probably accumulate metrics
about packets and bytes sent and received by a group of processes, but
those metrics wouldn't be very useful. You want per-interface metrics
(because traffic happening on the local `lo`
interface doesn't really count). But since processes in a single cgroup
can belong to multiple network namespaces, those metrics would be harder
to interpret: multiple network namespaces means multiple `lo`
interfaces, potentially multiple `eth0` interfaces, etc. This is why
there is no easy way to gather network metrics with control groups.

Instead, we can gather network metrics from other sources:

### IPtables

IPtables (or rather, the netfilter framework for which iptables is just
an interface) can do some serious accounting.

For instance, you can set up a rule to account for the outbound HTTP
traffic on a web server:

    $ iptables -I OUTPUT -p tcp --sport 80

There is no `-j` or `-g` flag,
so the rule will just count matched packets and go to the following
rule.

Later, you can check the values of the counters, with:

    $ iptables -nxvL OUTPUT

Technically, `-n` is not required, but it will
prevent iptables from doing reverse DNS lookups, which are probably
useless in this scenario.

Counters include packets and bytes. If you want to set up metrics for
container traffic like this, you could execute a `for`
loop to add two `iptables` rules per
container IP address (one in each direction), in the `FORWARD`
chain. This will only meter traffic going through the NAT
layer; you will also have to add traffic going through the userland
proxy.
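
The loop could look roughly like the sketch below, which only builds the `iptables` argument lists (one rule matching the container as source, one as destination) and prints them. The function name and the IP addresses are hypothetical; how you obtain the list of container addresses is left out.

```python
def accounting_rules(container_ips, chain="FORWARD"):
    """Build iptables argument lists: two target-less counting rules per IP."""
    rules = []
    for ip in container_ips:
        # No -j/-g target: the rule only counts matching packets and bytes.
        rules.append(["iptables", "-I", chain, "-s", ip])  # sent by the container
        rules.append(["iptables", "-I", chain, "-d", ip])  # received by the container
    return rules


# Hypothetical container addresses, for illustration only.
for argv in accounting_rules(["172.17.0.2", "172.17.0.3"]):
    print(" ".join(argv))
```

To actually install the rules, each list could be handed to `subprocess.check_call()` by a process running as root.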

Then, you will need to check those counters on a regular basis. If you
happen to use `collectd`, there is a [nice plugin](https://collectd.org/wiki/index.php/Plugin:IPTables)
to automate iptables counters collection.

### Interface-level counters

Since each container has a virtual Ethernet interface, you might want to
directly check the TX and RX counters of this interface. You will notice
that each container is associated with a virtual Ethernet interface in
your host, with a name like `vethKk8Zqi`. Figuring
out which interface corresponds to which container is, unfortunately,
difficult.

For now, the best way is to check the metrics *from within the
containers*. To accomplish this, you can run an executable from the host
environment within the network namespace of a container using **ip-netns
magic**.

The `ip netns exec` command will let you execute any
program (present in the host system) within any network namespace
visible to the current process. This means that your host will be able
to enter the network namespace of your containers, but your containers
won't be able to access the host, nor their sibling containers.
Containers will be able to “see” and affect their sub-containers,
though.

The exact format of the command is:

    $ ip netns exec <nsname> <command...>

For example:

    $ ip netns exec mycontainer netstat -i

`ip netns` finds the "mycontainer" container by
using namespace pseudo-files. Each process belongs to one network
namespace, one PID namespace, one `mnt` namespace,
etc., and those namespaces are materialized under
`/proc/<pid>/ns/`. For example, the network
namespace of PID 42 is materialized by the pseudo-file
`/proc/42/ns/net`.

When you run `ip netns exec mycontainer ...`, it
expects `/var/run/netns/mycontainer` to be one of
those pseudo-files. (Symlinks are accepted.)

In other words, to execute a command within the network namespace of a
container, we need to:

- Find out the PID of any process within the container that we want to investigate;
- Create a symlink from `/var/run/netns/<somename>` to `/proc/<thepid>/ns/net`;
- Execute `ip netns exec <somename> ...`.

Please review [*Enumerating Cgroups*](#enumerating-cgroups) to learn how to find
the cgroup of a process running in the container whose network usage you
want to measure. From there, you can examine the pseudo-file named
`tasks`, which contains the PIDs that are in the
control group (i.e., in the container). Pick any one of them.

Putting everything together, if the "short ID" of a container is held in
the environment variable `$CID`, then you can do this:

    $ TASKS=/sys/fs/cgroup/devices/$CID*/tasks
    $ PID=$(head -n 1 $TASKS)
    $ mkdir -p /var/run/netns
    $ ln -sf /proc/$PID/ns/net /var/run/netns/$CID
    $ ip netns exec $CID netstat -i

## Tips for high-performance metric collection

Note that running a new process each time you want to update metrics is
(relatively) expensive. If you want to collect metrics at high
resolutions, and/or over a large number of containers (think 1000
containers on a single host), you do not want to fork a new process each
time.

Here is how to collect metrics from a single process. You will have to
write your metric collector in C (or any language that lets you do
low-level system calls). You need to use a special system call,
`setns()`, which lets the current process enter any
arbitrary namespace. It requires, however, an open file descriptor to
the namespace pseudo-file (remember: that's the pseudo-file in
`/proc/<pid>/ns/net`).

However, there is a catch: you must not keep this file descriptor open.
If you do, when the last process of the control group exits, the
namespace will not be destroyed, and its network resources (like the
virtual interface of the container) will stay around forever (or until
you close that file descriptor).

The right approach would be to keep track of the first PID of each
container, and re-open the namespace pseudo-file each time.
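
The sketch below illustrates the idea in Python by calling `setns()` through `ctypes`. It is only an outline under several assumptions: it must run as root on Linux, the collector is single-threaded, and the interface counters are read from `/proc/net/dev` once inside the namespace (the text above does not prescribe how to read them). The function names are hypothetical. Note how the descriptor is closed right after the call, so the collector never keeps the namespace alive.

```python
import ctypes
import os

CLONE_NEWNET = 0x40000000  # nstype value for network namespaces


def enter_netns_of(pid):
    """Move the calling thread into the network namespace of `pid`."""
    libc = ctypes.CDLL(None, use_errno=True)
    # Re-open the pseudo-file on every call and close it immediately.
    fd = os.open("/proc/%d/ns/net" % pid, os.O_RDONLY)
    try:
        if libc.setns(fd, CLONE_NEWNET) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)


def parse_net_dev(text):
    """Parse /proc/net/dev content into {interface: (rx_bytes, tx_bytes)}."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        name, _, values = line.partition(":")
        fields = values.split()
        if len(fields) >= 9:
            # Field 0 is received bytes, field 8 is transmitted bytes.
            counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters
```

A collector would call `enter_netns_of(pid)` for each container in turn, using the tracked PID, and pass the contents of `/proc/net/dev` to `parse_net_dev()` after each switch.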

## Collecting metrics when a container exits

Sometimes, you do not care about real-time metric collection, but when a
container exits, you want to know how much CPU, memory, etc. it has
used.

Docker makes this difficult because it relies on `lxc-start`, which
carefully cleans up after itself, but it is still possible. It is
usually easier to collect metrics at regular intervals (e.g., every
minute, with the collectd LXC plugin) and rely on that instead.

But, if you'd still like to gather the stats when a container stops,
here is how:

For each container, start a collection process, and move it to the
control groups that you want to monitor by writing its PID to the tasks
file of the cgroup. The collection process should periodically re-read
the tasks file to check if it's the last process of the control group.
(If you also want to collect network statistics as explained in the
previous section, you should also move the process to the appropriate
network namespace.)

When the container exits, `lxc-start` will try to
delete the control groups. It will fail, since the control group is
still in use; but that's fine. Your process should now detect that it is
the only one remaining in the group. Now is the right time to collect
all the metrics you need!

Finally, your process should move itself back to the root control group,
and remove the container control group. To remove a control group, just
`rmdir` its directory. It's counter-intuitive to
`rmdir` a directory while it still contains files; but
remember that this is a pseudo-filesystem, so usual rules don't apply.
After the cleanup is done, the collection process can exit safely.
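
The procedure above could be sketched as follows. This is an outline, not a tested collector: the function names, the `collect` callback, and the polling interval are all assumptions, and it presumes a single-threaded process with permission to write to the cgroup's `tasks` files.

```python
import os
import time


def only_remaining_task(tasks_text, my_pid):
    """True when the tasks file lists no PID other than the collector's own."""
    pids = {int(p) for p in tasks_text.split()}
    return pids == {my_pid}


def watch_and_collect(cgroup_dir, root_dir, collect, interval=1.0):
    """Join cgroup_dir, wait until alone in it, collect metrics, then clean up."""
    my_pid = os.getpid()
    tasks_file = os.path.join(cgroup_dir, "tasks")
    with open(tasks_file, "w") as f:
        f.write(str(my_pid))          # move this process into the container's cgroup
    while True:
        with open(tasks_file) as f:
            if only_remaining_task(f.read(), my_pid):
                break                 # the container's own processes are gone
        time.sleep(interval)
    metrics = collect(cgroup_dir)     # e.g. read memory.stat or cpuacct.stat
    with open(os.path.join(root_dir, "tasks"), "w") as f:
        f.write(str(my_pid))          # move back to the root control group
    os.rmdir(cgroup_dir)              # remove the now-empty control group
    return metrics
```

Here `cgroup_dir` would be the container's directory in one hierarchy and `root_dir` the mountpoint of that same hierarchy.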