
<!--[metadata]>
+++
aliases = ["/engine/articles/run_metrics"]
title = "Runtime metrics"
description = "Measure the behavior of running containers"
keywords = ["docker, metrics, CPU, memory, disk, IO, run, runtime, stats"]
[menu.main]
parent = "engine_admin"
weight = 4
+++
<![end-metadata]-->

# Runtime metrics


## Docker stats

You can use the `docker stats` command to live stream a container's
runtime metrics. The command supports CPU, memory usage, memory limit,
and network IO metrics.

The following is a sample output from the `docker stats` command:

    $ docker stats redis1 redis2
    CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O
    redis1              0.07%               796 KB / 64 MB        1.21%               788 B / 648 B       3.568 MB / 512 KB
    redis2              0.07%               2.746 MB / 64 MB      4.29%               1.266 KB / 648 B    12.4 MB / 0 B

The [docker stats](../reference/commandline/stats.md) reference page has
more details about the `docker stats` command.

## Control groups

Linux Containers rely on
[control groups](https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt)
which not only track groups of processes, but also expose metrics about
CPU, memory, and block I/O usage. You can access those metrics and
obtain network usage metrics as well. This is relevant for "pure" LXC
containers, as well as for Docker containers.

Control groups are exposed through a pseudo-filesystem. In recent
distros, you should find this filesystem under `/sys/fs/cgroup`. Under
that directory, you will see multiple sub-directories, called devices,
freezer, blkio, etc.; each sub-directory actually corresponds to a
different cgroup hierarchy.

On older systems, the control groups might be mounted on `/cgroup`, without
distinct hierarchies. In that case, instead of seeing the sub-directories,
you will see a bunch of files in that directory, and possibly some
directories corresponding to existing containers.

To figure out where your control groups are mounted, you can run:

    $ grep cgroup /proc/mounts

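On a recent distro using the layout described above, the output might
look something like this (the exact list of hierarchies varies from one
system to another):

    tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
    cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
    cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
    cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
    cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
    cgroup /sys/fs/cgroup/cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpuacct 0 0
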
## Enumerating cgroups

You can look into `/proc/cgroups` to see the different control group subsystems
known to the system, the hierarchy they belong to, and how many groups they contain.

You can also look at `/proc/<pid>/cgroup` to see which control groups a process
belongs to. The control group will be shown as a path relative to the root of
the hierarchy mountpoint; e.g., `/` means “this process has not been assigned into
a particular group”, while `/lxc/pumpkin` means that the process is likely to be
a member of a container named `pumpkin`.
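
For a quick check, you can inspect the cgroups of your current shell; on
the host it will typically show `/` (or a system-level group) for each
hierarchy, while the same file for a process running inside a container
will show the container's cgroup path:

    $ cat /proc/$$/cgroup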

## Finding the cgroup for a given container

For each container, one cgroup will be created in each hierarchy. On
older systems with older versions of the LXC userland tools, the name of
the cgroup will be the name of the container. With more recent versions
of the LXC tools, the cgroup will be `lxc/<container_name>`.

For Docker containers using cgroups, the container name will be the full
ID or long ID of the container. If a container shows up as `ae836c95b4c3`
in `docker ps`, its long ID might be something like
`ae836c95b4c3c9e9179e0e91015512da89fdec91612f63cebae57df9a5444c79`. You can
look it up with `docker inspect` or `docker ps --no-trunc`.
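
For example, to resolve the short ID shown by `docker ps` into the full ID:

    $ docker inspect --format '{{.Id}}' ae836c95b4c3
    ae836c95b4c3c9e9179e0e91015512da89fdec91612f63cebae57df9a5444c79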

Putting everything together to look at the memory metrics for a Docker
container, take a look at `/sys/fs/cgroup/memory/docker/<longid>/`.
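
For instance, assuming the cgroup layout described above, you can list the
memory accounting files for a container like this (the exact set of files
depends on your kernel configuration):

    $ CID=$(docker inspect --format '{{.Id}}' ae836c95b4c3)
    $ ls /sys/fs/cgroup/memory/docker/$CID/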

## Metrics from cgroups: memory, CPU, block I/O

For each subsystem (memory, CPU, and block I/O), you will find one or
more pseudo-files containing statistics.

### Memory metrics: `memory.stat`

Memory metrics are found in the "memory" cgroup. Note that the memory
control group adds a little overhead, because it does very fine-grained
accounting of the memory usage on your host. Therefore, many distros
choose not to enable it by default. Generally, to enable it, all you have
to do is add some kernel command-line parameters:
`cgroup_enable=memory swapaccount=1`.

The metrics are in the pseudo-file `memory.stat`.
Here is what it will look like:

    cache 11492564992
    rss 1930993664
    mapped_file 306728960
    pgpgin 406632648
    pgpgout 403355412
    swap 0
    pgfault 728281223
    pgmajfault 1724
    inactive_anon 46608384
    active_anon 1884520448
    inactive_file 7003344896
    active_file 4489052160
    unevictable 32768
    hierarchical_memory_limit 9223372036854775807
    hierarchical_memsw_limit 9223372036854775807
    total_cache 11492564992
    total_rss 1930993664
    total_mapped_file 306728960
    total_pgpgin 406632648
    total_pgpgout 403355412
    total_swap 0
    total_pgfault 728281223
    total_pgmajfault 1724
    total_inactive_anon 46608384
    total_active_anon 1884520448
    total_inactive_file 7003344896
    total_active_file 4489052160
    total_unevictable 32768

The first half (without the `total_` prefix) contains statistics relevant
to the processes within the cgroup, excluding sub-cgroups. The second half
(with the `total_` prefix) includes sub-cgroups as well.
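
For example, to pull just a few of these counters for a given container
(reusing the `docker/<longid>` path from the previous section):

    $ grep -E '^(cache|rss|swap) ' /sys/fs/cgroup/memory/docker/<longid>/memory.stat
    cache 11492564992
    rss 1930993664
    swap 0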

Some metrics are "gauges", i.e., values that can increase or decrease
(e.g., swap, the amount of swap space used by the members of the cgroup).
Some others are "counters", i.e., values that can only go up, because
they represent occurrences of a specific event (e.g., pgfault, which
indicates the number of page faults which happened since the creation of
the cgroup; this number can never decrease).


 - **cache:**  
   the amount of memory used by the processes of this control group
   that can be associated precisely with a block on a block device.
   When you read from and write to files on disk, this amount will
   increase. This will be the case if you use "conventional" I/O
   (`open`, `read`, `write` syscalls) as well as mapped files (with
   `mmap`). It also accounts for the memory used by `tmpfs` mounts,
   though the reasons are unclear.

 - **rss:**  
   the amount of memory that *doesn't* correspond to anything on disk:
   stacks, heaps, and anonymous memory maps.

 - **mapped_file:**  
   indicates the amount of memory mapped by the processes in the
   control group. It doesn't give you information about *how much*
   memory is used; it rather tells you *how* it is used.

 - **pgfault and pgmajfault:**  
   indicate the number of times that a process of the cgroup triggered
   a "page fault" and a "major fault", respectively. A page fault
   happens when a process accesses a part of its virtual memory space
   which is nonexistent or protected. The former can happen if the
   process is buggy and tries to access an invalid address (it will
   then be sent a `SIGSEGV` signal, typically killing it with the
   famous `Segmentation fault` message). The latter can happen when
   the process reads from a memory zone which has been swapped out,
   or which corresponds to a mapped file: in that case, the kernel
   will load the page from disk, and let the CPU complete the memory
   access. It can also happen when the process writes to a
   copy-on-write memory zone: likewise, the kernel will preempt the
   process, duplicate the memory page, and resume the write operation
   on the process's own copy of the page. "Major" faults happen when
   the kernel actually has to read the data from disk. When it just
   has to duplicate an existing page, or allocate an empty page, it's
   a regular (or "minor") fault.

 - **swap:**  
   the amount of swap currently used by the processes in this cgroup.

 - **active_anon and inactive_anon:**  
   the amount of *anonymous* memory that has been identified as
   respectively *active* and *inactive* by the kernel. "Anonymous"
   memory is the memory that is *not* linked to disk pages. In other
   words, that's the equivalent of the rss counter described above. In
   fact, the very definition of the rss counter is **active_anon** +
   **inactive_anon** - **tmpfs** (where tmpfs is the amount of memory
   used up by `tmpfs` filesystems mounted by this control group). Now,
   what's the difference between "active" and "inactive"? Pages are
   initially "active"; and at regular intervals, the kernel sweeps
   over the memory, and tags some pages as "inactive". Whenever they
   are accessed again, they are immediately retagged "active". When
   the kernel is almost out of memory, and time comes to swap out to
   disk, the kernel will swap "inactive" pages.

 - **active_file and inactive_file:**  
   cache memory, with *active* and *inactive* similar to the *anon*
   memory above. The exact formula is cache = **active_file** +
   **inactive_file** + **tmpfs**. The exact rules used by the kernel
   to move memory pages between active and inactive sets are different
   from the ones used for anonymous memory, but the general principle
   is the same. Note that when the kernel needs to reclaim memory, it
   is cheaper to reclaim a clean (i.e., unmodified) page from this
   pool, since it can be reclaimed immediately (while anonymous pages
   and dirty/modified pages have to be written to disk first).

 - **unevictable:**  
   the amount of memory that cannot be reclaimed; generally, it will
   account for memory that has been "locked" with `mlock`. It is often
   used by crypto frameworks to make sure that secret keys and other
   sensitive material never get swapped out to disk.

 - **memory and memsw limits:**  
   These are not really metrics, but a reminder of the limits applied
   to this cgroup. The first one indicates the maximum amount of
   physical memory that can be used by the processes of this control
   group; the second one indicates the maximum amount of RAM+swap.

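On cgroup v1, these limits are also exposed as separate pseudo-files in the
container's memory cgroup directory; for example (the `memsw` file is only
present when swap accounting is enabled, as described earlier):

    $ cat /sys/fs/cgroup/memory/docker/<longid>/memory.limit_in_bytes
    9223372036854775807
    $ cat /sys/fs/cgroup/memory/docker/<longid>/memory.memsw.limit_in_bytes
    9223372036854775807
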
Accounting for memory in the page cache is very complex. If two
processes in different control groups both read the same file
(ultimately relying on the same blocks on disk), the corresponding
memory charge will be split between the control groups. It's nice, but
it also means that when a cgroup is terminated, it could increase the
memory usage of another cgroup, because they are no longer splitting
the cost of those memory pages.

### CPU metrics: `cpuacct.stat`

Now that we've covered memory metrics, everything else will look very
simple in comparison. CPU metrics will be found in the
`cpuacct` controller.

For each container, you will find a pseudo-file `cpuacct.stat`,
containing the CPU usage accumulated by the processes of the container,
broken down between `user` and `system` time. If you're not familiar
with the distinction, `user` is the time during which the processes were
in direct control of the CPU (i.e., executing process code), and `system`
is the time during which the CPU was executing system calls on behalf of
those processes.

Those times are expressed in ticks of 1/100th of a second. Actually,
they are expressed in "user jiffies". There are `USER_HZ`
*"jiffies"* per second, and on x86 systems, `USER_HZ` is 100. This used
to map exactly to the number of scheduler "ticks" per second; but with
the advent of higher frequency scheduling, as well as
[tickless kernels](http://lwn.net/Articles/549580/), the number of
kernel ticks wasn't relevant anymore. It stuck around anyway, mainly
for legacy and compatibility reasons.
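
For example, reading this file for a container and converting the values
to seconds (assuming the `cpuacct` hierarchy is mounted under
`/sys/fs/cgroup/cpuacct`; on some systems it is combined with `cpu`, e.g.
`/sys/fs/cgroup/cpu,cpuacct`, and `USER_HZ` is 100 as on x86):

    $ cat /sys/fs/cgroup/cpuacct/docker/<longid>/cpuacct.stat
    user 2451
    system 966
    $ awk '{ printf "%s %.2f seconds\n", $1, $2 / 100 }' \
        /sys/fs/cgroup/cpuacct/docker/<longid>/cpuacct.stat
    user 24.51 seconds
    system 9.66 seconds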

### Block I/O metrics

Block I/O is accounted in the `blkio` controller.
Different metrics are scattered across different files. While you can
find in-depth details in the
[blkio-controller](https://www.kernel.org/doc/Documentation/cgroup-v1/blkio-controller.txt)
file in the kernel documentation, here is a short list of the most
relevant ones:


 - **blkio.sectors:**  
   contains the number of 512-byte sectors read and written by the
   processes that are members of the cgroup, device by device. Reads
   and writes are merged in a single counter.

 - **blkio.io_service_bytes:**  
   indicates the number of bytes read and written by the cgroup. It has
   4 counters per device, because for each device, it differentiates
   between synchronous vs. asynchronous I/O, and reads vs. writes.

 - **blkio.io_serviced:**  
   the number of I/O operations performed, regardless of their size. It
   also has 4 counters per device.

 - **blkio.io_queued:**  
   indicates the number of I/O operations currently queued for this
   cgroup. In other words, if the cgroup isn't doing any I/O, this will
   be zero. Note that the opposite is not true. In other words, if
   there is no I/O queued, it does not mean that the cgroup is idle
   (I/O-wise). It could be doing purely synchronous reads on an
   otherwise quiescent device, which is therefore able to handle them
   immediately, without queuing. Also, while it is helpful to figure
   out which cgroup is putting stress on the I/O subsystem, keep in
   mind that it is a relative quantity. Even if a process group does
   not perform more I/O, its queue size can increase just because the
   device load increases because of other devices.
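
For example, to see the per-device byte counters for a container (again
assuming the `docker/<longid>` cgroup layout used above), the output will
look something like this:

    $ cat /sys/fs/cgroup/blkio/docker/<longid>/blkio.io_service_bytes
    8:0 Read 8523776
    8:0 Write 2048000
    8:0 Sync 6566912
    8:0 Async 4004864
    8:0 Total 10571776
    Total 10571776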

## Network metrics

Network metrics are not exposed directly by control groups. There is a
good explanation for that: network interfaces exist within the context
of *network namespaces*. The kernel could probably accumulate metrics
about packets and bytes sent and received by a group of processes, but
those metrics wouldn't be very useful. You want per-interface metrics
(because traffic happening on the local `lo` interface doesn't really
count). But since processes in a single cgroup can belong to multiple
network namespaces, those metrics would be harder to interpret:
multiple network namespaces means multiple `lo` interfaces, potentially
multiple `eth0` interfaces, etc.; this is why there is no easy way to
gather network metrics with control groups.

Instead, we can gather network metrics from other sources:

### IPtables

IPtables (or rather, the netfilter framework for which iptables is just
an interface) can do some serious accounting.

For instance, you can set up a rule to account for the outbound HTTP
traffic on a web server:

    $ iptables -I OUTPUT -p tcp --sport 80

There is no `-j` or `-g` flag, so the rule will just count matched
packets and go to the following rule.

Later, you can check the values of the counters, with:

    $ iptables -nxvL OUTPUT

Technically, `-n` is not required, but it will prevent iptables from
doing DNS reverse lookups, which are probably useless in this scenario.

Counters include packets and bytes. If you want to set up metrics for
container traffic like this, you could execute a `for` loop to add two
`iptables` rules per container IP address (one in each direction), in
the `FORWARD` chain. This will only meter traffic going through the NAT
layer; you will also have to add traffic going through the userland
proxy.
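
A minimal sketch of such a loop might look like this (the container IP
addresses below are hypothetical; in practice you would look them up with
`docker inspect`):

    $ for ip in 172.17.0.2 172.17.0.3; do
    >   iptables -I FORWARD -s $ip   # counts traffic coming from the container
    >   iptables -I FORWARD -d $ip   # counts traffic going to the container
    > done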

Then, you will need to check those counters on a regular basis. If you
happen to use `collectd`, there is a [nice plugin](https://collectd.org/wiki/index.php/Table_of_Plugins)
to automate iptables counters collection.

### Interface-level counters

Since each container has a virtual Ethernet interface, you might want to
check the TX and RX counters of this interface directly. You will notice
that each container is associated with a virtual Ethernet interface on
your host, with a name like `vethKk8Zqi`. Figuring out which interface
corresponds to which container is, unfortunately, difficult.

But for now, the best way is to check the metrics *from within the
containers*. To accomplish this, you can run an executable from the host
environment within the network namespace of a container using **ip-netns
magic**.

The `ip-netns exec` command will let you execute any program (present in
the host system) within any network namespace visible to the current
process. This means that your host will be able to enter the network
namespace of your containers, but your containers won't be able to
access the host, nor their sibling containers. Containers will be able
to “see” and affect their sub-containers, though.

The exact format of the command is:

    $ ip netns exec <nsname> <command...>

For example:

    $ ip netns exec mycontainer netstat -i

`ip netns` finds the "mycontainer" container by using namespaces
pseudo-files. Each process belongs to one network namespace, one PID
namespace, one `mnt` namespace, etc., and those namespaces are
materialized under `/proc/<pid>/ns/`. For example, the network namespace
of PID 42 is materialized by the pseudo-file `/proc/42/ns/net`.

When you run `ip netns exec mycontainer ...`, it expects
`/var/run/netns/mycontainer` to be one of those pseudo-files. (Symlinks
are accepted.)

In other words, to execute a command within the network namespace of a
container, we need to:

- Find out the PID of any process within the container that we want to investigate;
- Create a symlink from `/var/run/netns/<somename>` to `/proc/<thepid>/ns/net`;
- Execute `ip netns exec <somename> ...`.

Please review [*Enumerating Cgroups*](#enumerating-cgroups) to learn how to find
the cgroup of a process running in the container of which you want to
measure network usage. From there, you can examine the pseudo-file named
`tasks`, which contains the PIDs that are in the control group (i.e., in
the container). Pick any one of them.

Putting everything together, if the "short ID" of a container is held in
the environment variable `$CID`, then you can do this:

    $ TASKS=/sys/fs/cgroup/devices/docker/$CID*/tasks
    $ PID=$(head -n 1 $TASKS)
    $ mkdir -p /var/run/netns
    $ ln -sf /proc/$PID/ns/net /var/run/netns/$CID
    $ ip netns exec $CID netstat -i

## Tips for high-performance metric collection

Note that running a new process each time you want to update metrics is
(relatively) expensive. If you want to collect metrics at high
resolutions, and/or over a large number of containers (think 1000
containers on a single host), you do not want to fork a new process each
time.

Here is how to collect metrics from a single process. You will have to
write your metric collector in C (or any language that lets you do
low-level system calls). You need to use a special system call,
`setns()`, which lets the current process enter any arbitrary namespace.
It requires, however, an open file descriptor to the namespace
pseudo-file (remember: that's the pseudo-file in `/proc/<pid>/ns/net`).

However, there is a catch: you must not keep this file descriptor open.
If you do, when the last process of the control group exits, the
namespace will not be destroyed, and its network resources (like the
virtual interface of the container) will stay around forever (or until
you close that file descriptor).

The right approach would be to keep track of the first PID of each
container, and re-open the namespace pseudo-file each time.

## Collecting metrics when a container exits

Sometimes, you do not care about real-time metric collection, but when a
container exits, you want to know how much CPU, memory, etc. it has
used.

Docker makes this difficult because it relies on `lxc-start`, which
carefully cleans up after itself, but it is still possible. It is
usually easier to collect metrics at regular intervals (e.g., every
minute, with the collectd LXC plugin) and rely on that instead.

But, if you'd still like to gather the stats when a container stops,
here is how:

For each container, start a collection process, and move it to the
control groups that you want to monitor by writing its PID to the tasks
file of the cgroup. The collection process should periodically re-read
the tasks file to check if it's the last process of the control group.
(If you also want to collect network statistics as explained in the
previous section, you should also move the process to the appropriate
network namespace.)
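
As a rough sketch, using the memory cgroup and the current shell as the
collection process (paths as in the previous sections):

    $ CGROUP=/sys/fs/cgroup/memory/docker/<longid>
    $ # move the collection process into the container's control group
    $ echo $$ > $CGROUP/tasks
    $ # re-read the tasks file periodically; when it only contains your own
    $ # PID, the container has exited and it is time to collect the metrics
    $ cat $CGROUP/tasks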

When the container exits, `lxc-start` will try to delete the control
groups. It will fail, since the control group is still in use; but
that's fine. Your process should now detect that it is the only one
remaining in the group. Now is the right time to collect all the metrics
you need!

Finally, your process should move itself back to the root control group,
and remove the container control group. To remove a control group, just
`rmdir` its directory. It's counter-intuitive to `rmdir` a directory
while it still contains files; but remember that this is a
pseudo-filesystem, so usual rules don't apply. After the cleanup is
done, the collection process can exit safely.
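
Continuing the sketch above, the cleanup step could look like this:

    $ # move back to the root control group of the memory hierarchy...
    $ echo $$ > /sys/fs/cgroup/memory/tasks
    $ # ...then remove the now-empty container control group
    $ rmdir $CGROUP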