github.com/portworx/docker@v1.12.1/docs/admin/runmetrics.md

<!--[metadata]>
+++
aliases = ["/engine/articles/run_metrics"]
title = "Runtime metrics"
description = "Measure the behavior of running containers"
keywords = ["docker, metrics, CPU, memory, disk, IO, run, runtime, stats"]
[menu.main]
parent = "engine_admin"
weight = 14
+++
<![end-metadata]-->

# Runtime metrics

## Docker stats

You can use the `docker stats` command to live-stream a container's
runtime metrics. The command supports CPU, memory usage, memory limit,
and network IO metrics.

The following is sample output from the `docker stats` command:

```bash
$ docker stats redis1 redis2

CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O
redis1              0.07%               796 KB / 64 MB        1.21%               788 B / 648 B       3.568 MB / 512 KB
redis2              0.07%               2.746 MB / 64 MB      4.29%               1.266 KB / 648 B    12.4 MB / 0 B
```

The [docker stats](../reference/commandline/stats.md) reference page has
more details about the `docker stats` command.

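If you only want a one-shot snapshot rather than a live stream, `docker stats` also accepts the `--no-stream` flag; with no container names given, it reports all running containers:

```bash
# Print a single sample for every running container, then exit.
$ docker stats --no-stream
```
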
## Control groups

Linux Containers rely on [control groups](
https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt)
which not only track groups of processes, but also expose metrics about
CPU, memory, and block I/O usage. You can access those metrics and
obtain network usage metrics as well. This is relevant for "pure" LXC
containers, as well as for Docker containers.

Control groups are exposed through a pseudo-filesystem. On recent
distros, you should find this filesystem under `/sys/fs/cgroup`. Under
that directory, you will see multiple sub-directories, named `devices`,
`freezer`, `blkio`, and so on; each sub-directory corresponds to a different
cgroup hierarchy.

On older systems, the control groups might be mounted on `/cgroup`, without
distinct hierarchies. In that case, instead of seeing the sub-directories,
you will see a bunch of files in that directory, and possibly some directories
corresponding to existing containers.

To figure out where your control groups are mounted, you can run:

```bash
$ grep cgroup /proc/mounts
```

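On a host where each hierarchy is mounted separately, the output contains one line per hierarchy, roughly like this (the exact list of mounts and mount options varies by distro, so treat it as an illustration only):

```bash
# Example output; mount options are illustrative and vary between distros.
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
...
```
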
## Enumerating cgroups

You can look into `/proc/cgroups` to see the different control group subsystems
known to the system, the hierarchy they belong to, and how many groups they contain.

You can also look at `/proc/<pid>/cgroup` to see which control groups a process
belongs to. The control group will be shown as a path relative to the root of
the hierarchy mountpoint; e.g., `/` means “this process has not been assigned to
a particular group”, while `/lxc/pumpkin` means that the process is likely to be
a member of a container named `pumpkin`. For a quick check of the current shell,
see the example below.

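For example, you can inspect the cgroups of your own shell; `$$` below is just the shell's PID, so the same commands work for any PID you substitute:

```bash
# List the subsystems known to this kernel, their hierarchy IDs,
# the number of cgroups in each hierarchy, and whether they are enabled.
$ cat /proc/cgroups

# Show which cgroup the current shell belongs to, in each hierarchy.
$ cat /proc/$$/cgroup
```
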
## Finding the cgroup for a given container

For each container, one cgroup will be created in each hierarchy. On
older systems with older versions of the LXC userland tools, the name of
the cgroup will be the name of the container. With more recent versions
of the LXC tools, the cgroup will be `lxc/<container_name>`.

For Docker containers using cgroups, the container name will be the full
ID or long ID of the container. If a container shows up as `ae836c95b4c3`
in `docker ps`, its long ID might be something like
`ae836c95b4c3c9e9179e0e91015512da89fdec91612f63cebae57df9a5444c79`. You can
look it up with `docker inspect` or `docker ps --no-trunc`.

Putting everything together, to look at the memory metrics for a Docker
container, take a look at `/sys/fs/cgroup/memory/docker/<longid>/`.

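For instance, here is one way to go from the short ID shown by `docker ps` to the container's memory cgroup directory (the short ID `ae836c95b4c3` is just the example from above):

```bash
# Resolve the short ID to the full 64-character ID...
$ LONG_ID=$(docker inspect --format '{{.Id}}' ae836c95b4c3)

# ...and list the memory control files for that container.
$ ls /sys/fs/cgroup/memory/docker/$LONG_ID/
```
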
## Metrics from cgroups: memory, CPU, block I/O

For each subsystem (memory, CPU, and block I/O), you will find one or
more pseudo-files containing statistics.

### Memory metrics: `memory.stat`

Memory metrics are found in the "memory" cgroup. Note that the memory
control group adds a little overhead, because it does very fine-grained
accounting of the memory usage on your host. Therefore, many distros
choose not to enable it by default. Generally, to enable it, all you have
to do is to add some kernel command-line parameters:
`cgroup_enable=memory swapaccount=1`.

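How you add those kernel parameters depends on your bootloader. As a rough sketch, on a GRUB 2 based distro it would look something like the following (file paths and the update command differ across distros, so check your distro's documentation first):

```bash
# Append the parameters to the kernel command line in /etc/default/grub, e.g.:
#   GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1"
$ sudo nano /etc/default/grub

# Regenerate the GRUB configuration, then reboot for the change to take effect.
$ sudo update-grub
$ sudo reboot
```
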
The metrics are in the pseudo-file `memory.stat`.
Here is what it will look like:

    cache 11492564992
    rss 1930993664
    mapped_file 306728960
    pgpgin 406632648
    pgpgout 403355412
    swap 0
    pgfault 728281223
    pgmajfault 1724
    inactive_anon 46608384
    active_anon 1884520448
    inactive_file 7003344896
    active_file 4489052160
    unevictable 32768
    hierarchical_memory_limit 9223372036854775807
    hierarchical_memsw_limit 9223372036854775807
    total_cache 11492564992
    total_rss 1930993664
    total_mapped_file 306728960
    total_pgpgin 406632648
    total_pgpgout 403355412
    total_swap 0
    total_pgfault 728281223
    total_pgmajfault 1724
    total_inactive_anon 46608384
    total_active_anon 1884520448
    total_inactive_file 7003344896
    total_active_file 4489052160
    total_unevictable 32768

The first half (without the `total_` prefix) contains statistics relevant
to the processes within the cgroup, excluding sub-cgroups. The second half
(with the `total_` prefix) includes sub-cgroups as well.

Some metrics are "gauges", i.e., values that can increase or decrease
(e.g., swap, the amount of swap space used by the members of the cgroup).
Some others are "counters", i.e., values that can only go up, because
they represent occurrences of a specific event (e.g., pgfault, which
indicates the number of page faults which happened since the creation of
the cgroup; this number can never decrease).

<style>table tr > td:first-child { white-space: nowrap;}</style>

Metric                                | Description
--------------------------------------|-----------------------------------------------------------
**cache**                             | The amount of memory used by the processes of this control group that can be associated precisely with a block on a block device. When you read from and write to files on disk, this amount will increase. This will be the case if you use "conventional" I/O (`open`, `read`, `write` syscalls) as well as mapped files (with `mmap`). It also accounts for the memory used by `tmpfs` mounts, though the reasons are unclear.
**rss**                               | The amount of memory that *doesn't* correspond to anything on disk: stacks, heaps, and anonymous memory maps.
**mapped_file**                       | Indicates the amount of memory mapped by the processes in the control group. It doesn't give you information about *how much* memory is used; it rather tells you *how* it is used.
**pgfault**, **pgmajfault**           | Indicate the number of times that a process of the cgroup triggered a "page fault" and a "major fault", respectively. A page fault happens when a process accesses a part of its virtual memory space which is nonexistent or protected. The former can happen if the process is buggy and tries to access an invalid address (it will then be sent a `SIGSEGV` signal, typically killing it with the famous `Segmentation fault` message). The latter can happen when the process reads from a memory zone which has been swapped out, or which corresponds to a mapped file: in that case, the kernel will load the page from disk, and let the CPU complete the memory access. It can also happen when the process writes to a copy-on-write memory zone: likewise, the kernel will preempt the process, duplicate the memory page, and resume the write operation on the process's own copy of the page. "Major" faults happen when the kernel actually has to read the data from disk. When it just has to duplicate an existing page, or allocate an empty page, it's a regular (or "minor") fault.
**swap**                              | The amount of swap currently used by the processes in this cgroup.
**active_anon**, **inactive_anon**    | The amount of *anonymous* memory that has been identified as respectively *active* and *inactive* by the kernel. "Anonymous" memory is the memory that is *not* linked to disk pages. In other words, that's the equivalent of the rss counter described above. In fact, the very definition of the rss counter is **active_anon** + **inactive_anon** - **tmpfs** (where tmpfs is the amount of memory used up by `tmpfs` filesystems mounted by this control group). Now, what's the difference between "active" and "inactive"? Pages are initially "active"; and at regular intervals, the kernel sweeps over the memory, and tags some pages as "inactive". Whenever they are accessed again, they are immediately retagged "active". When the kernel is almost out of memory, and time comes to swap out to disk, the kernel will swap "inactive" pages.
**active_file**, **inactive_file**    | Cache memory, with *active* and *inactive* similar to the *anon* memory above. The exact formula is **cache** = **active_file** + **inactive_file** + **tmpfs**. The exact rules used by the kernel to move memory pages between active and inactive sets are different from the ones used for anonymous memory, but the general principle is the same. Note that when the kernel needs to reclaim memory, it is cheaper to reclaim a clean (i.e., non-modified) page from this pool, since it can be reclaimed immediately (while anonymous pages and dirty/modified pages have to be written to disk first).
**unevictable**                       | The amount of memory that cannot be reclaimed; generally, it accounts for memory that has been "locked" with `mlock`. It is often used by crypto frameworks to make sure that secret keys and other sensitive material never get swapped out to disk.
**memory_limit**, **memsw_limit**     | These are not really metrics, but a reminder of the limits applied to this cgroup. The first one indicates the maximum amount of physical memory that can be used by the processes of this control group; the second one indicates the maximum amount of RAM+swap.

Accounting for memory in the page cache is very complex. If two
processes in different control groups both read the same file
(ultimately relying on the same blocks on disk), the corresponding
memory charge will be split between the control groups. It's nice, but
it also means that when a cgroup is terminated, it could increase the
memory usage of another cgroup, because they are not splitting the cost
anymore for those memory pages.

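To read a handful of these counters for a specific container, you can combine the path described earlier with a simple `grep`; this sketch assumes the container's long ID is already in `$LONG_ID`:

```bash
# Pull a few interesting counters out of the container's memory.stat.
$ grep -E '^(cache|rss|swap|unevictable) ' \
    /sys/fs/cgroup/memory/docker/$LONG_ID/memory.stat
```
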
### CPU metrics: `cpuacct.stat`

Now that we've covered memory metrics, everything else will look very
simple in comparison. CPU metrics are found in the
`cpuacct` controller.

For each container, you will find a pseudo-file `cpuacct.stat`,
containing the CPU usage accumulated by the processes of the container,
broken down between `user` and `system` time. If you're not familiar
with the distinction, `user` is the time during which the processes were
in direct control of the CPU (i.e., executing process code), and `system`
is the time during which the CPU was executing system calls on behalf of
those processes.

Those times are expressed in ticks of 1/100th of a second, also called
"user jiffies". There are `USER_HZ` *"jiffies"* per second, and on x86 systems,
`USER_HZ` is 100. This used to map exactly to the
number of scheduler "ticks" per second; but with the advent of higher
frequency scheduling, as well as [tickless kernels](
http://lwn.net/Articles/549580/), the number of kernel ticks
wasn't relevant anymore. It stuck around anyway, mainly for legacy and
compatibility reasons.

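So, to turn the raw numbers into seconds, you just divide by `USER_HZ`. A minimal sketch, again assuming the container's long ID is in `$LONG_ID`:

```bash
# Accumulated user and system CPU time for the container, in user jiffies.
$ cat /sys/fs/cgroup/cpuacct/docker/$LONG_ID/cpuacct.stat

# Number of user jiffies per second (USER_HZ); almost always 100.
$ getconf CLK_TCK
```
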
### Block I/O metrics

Block I/O is accounted in the `blkio` controller.
Different metrics are scattered across different files. While you can
find in-depth details in the [blkio-controller](
https://www.kernel.org/doc/Documentation/cgroup-v1/blkio-controller.txt)
file in the kernel documentation, here is a short list of the most
relevant ones:

Metric                      | Description
----------------------------|-----------------------------------------------------------
**blkio.sectors**           | Contains the number of 512-byte sectors read and written by the processes that are members of the cgroup, device by device. Reads and writes are merged in a single counter.
**blkio.io_service_bytes**  | Indicates the number of bytes read and written by the cgroup. It has 4 counters per device, because for each device, it differentiates between synchronous vs. asynchronous I/O, and reads vs. writes.
**blkio.io_serviced**       | The number of I/O operations performed, regardless of their size. It also has 4 counters per device.
**blkio.io_queued**         | Indicates the number of I/O operations currently queued for this cgroup. In other words, if the cgroup isn't doing any I/O, this will be zero. Note that the opposite is not true. In other words, if there is no I/O queued, it does not mean that the cgroup is idle (I/O-wise). It could be doing purely synchronous reads on an otherwise quiescent device, which is therefore able to handle them immediately, without queuing. Also, while it is helpful to figure out which cgroup is putting stress on the I/O subsystem, keep in mind that it is a relative quantity. Even if a process group does not perform more I/O, its queue size can increase just because the device load increases because of other devices.

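As with the other subsystems, these are plain-text pseudo-files that you can read directly. For example, again assuming the container's long ID is in `$LONG_ID`:

```bash
# Per-device read/write byte counters for the container.
$ cat /sys/fs/cgroup/blkio/docker/$LONG_ID/blkio.io_service_bytes

# Per-device counts of I/O operations.
$ cat /sys/fs/cgroup/blkio/docker/$LONG_ID/blkio.io_serviced
```
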
## Network metrics

Network metrics are not exposed directly by control groups. There is a
good explanation for that: network interfaces exist within the context
of *network namespaces*. The kernel could probably accumulate metrics
about packets and bytes sent and received by a group of processes, but
those metrics wouldn't be very useful. You want per-interface metrics
(because traffic happening on the local `lo`
interface doesn't really count). But since processes in a single cgroup
can belong to multiple network namespaces, those metrics would be harder
to interpret: multiple network namespaces means multiple `lo`
interfaces, potentially multiple `eth0`
interfaces, etc.; so this is why there is no easy way to gather network
metrics with control groups.

Instead, we can gather network metrics from other sources:

### IPtables

IPtables (or rather, the netfilter framework for which iptables is just
an interface) can do some serious accounting.

For instance, you can set up a rule to account for the outbound HTTP
traffic on a web server:

```bash
$ iptables -I OUTPUT -p tcp --sport 80
```

There is no `-j` or `-g` flag,
so the rule will just count matched packets and go to the following
rule.

Later, you can check the values of the counters, with:

```bash
$ iptables -nxvL OUTPUT
```

Technically, `-n` is not required, but it will
prevent iptables from doing DNS reverse lookups, which are probably
useless in this scenario.

Counters include packets and bytes. If you want to set up metrics for
container traffic like this, you could execute a `for`
loop to add two `iptables` rules per
container IP address (one in each direction), in the `FORWARD`
chain, as shown in the sketch below. This will only meter traffic going
through the NAT layer; you will also have to add traffic going through
the userland proxy.

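Here is a rough sketch of that idea for a single container. It assumes the container is on the default bridge network (so `docker inspect` exposes `.NetworkSettings.IPAddress`) and that `$CONTAINER` holds its name or ID:

```bash
# Look up the container's IP address on the default bridge.
$ IP=$(docker inspect --format '{{.NetworkSettings.IPAddress}}' $CONTAINER)

# Counting-only rules (no -j target): one per direction, in the FORWARD chain.
$ iptables -I FORWARD -s $IP
$ iptables -I FORWARD -d $IP

# Later, read back the per-rule packet and byte counters.
$ iptables -nxvL FORWARD | grep $IP
```
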
Then, you will need to check those counters on a regular basis. If you
happen to use `collectd`, there is a [nice plugin](https://collectd.org/wiki/index.php/Table_of_Plugins)
to automate the collection of iptables counters.

### Interface-level counters

Since each container has a virtual Ethernet interface, you might want to
directly check the TX and RX counters of this interface. You will notice
that each container is associated with a virtual Ethernet interface in
your host, with a name like `vethKk8Zqi`. Figuring
out which interface corresponds to which container is, unfortunately,
difficult.

For now, the best way is to check the metrics *from within the
containers*. To accomplish this, you can run an executable from the host
environment within the network namespace of a container using **ip-netns
magic**.

The `ip-netns exec` command will let you execute any
program (present in the host system) within any network namespace
visible to the current process. This means that your host will be able
to enter the network namespace of your containers, but your containers
won't be able to access the host, nor their sibling containers.
Containers will be able to “see” and affect their sub-containers,
though.

The exact format of the command is:

```bash
$ ip netns exec <nsname> <command...>
```

For example:

```bash
$ ip netns exec mycontainer netstat -i
```

`ip netns` finds the "mycontainer" container by
using namespace pseudo-files. Each process belongs to one network
namespace, one PID namespace, one `mnt` namespace,
etc., and those namespaces are materialized under
`/proc/<pid>/ns/`. For example, the network
namespace of PID 42 is materialized by the pseudo-file
`/proc/42/ns/net`.

When you run `ip netns exec mycontainer ...`, it
expects `/var/run/netns/mycontainer` to be one of
those pseudo-files. (Symlinks are accepted.)

In other words, to execute a command within the network namespace of a
container, we need to:

- Find out the PID of any process within the container that we want to investigate;
- Create a symlink from `/var/run/netns/<somename>` to `/proc/<thepid>/ns/net`;
- Execute `ip netns exec <somename> <command...>`.

Please review [Enumerating Cgroups](#enumerating-cgroups) to learn how to find
the cgroup of a process running in the container whose network usage you
want to measure. From there, you can examine the pseudo-file named
`tasks`, which contains the PIDs that are in the
control group (i.e., in the container). Pick any one of them.

Putting everything together, if the "short ID" of a container is held in
the environment variable `$CID`, then you can do this:

```bash
$ TASKS=/sys/fs/cgroup/devices/docker/$CID*/tasks
$ PID=$(head -n 1 $TASKS)
$ mkdir -p /var/run/netns
$ ln -sf /proc/$PID/ns/net /var/run/netns/$CID
$ ip netns exec $CID netstat -i
```

## Tips for high-performance metric collection

Note that running a new process each time you want to update metrics is
(relatively) expensive. If you want to collect metrics at high
resolution, and/or over a large number of containers (think 1000
containers on a single host), you do not want to fork a new process each
time.

Here is how to collect metrics from a single process. You will have to
write your metric collector in C (or any language that lets you do
low-level system calls). You need to use a special system call,
`setns()`, which lets the current process enter any
arbitrary namespace. It requires, however, an open file descriptor to
the namespace pseudo-file (remember: that's the pseudo-file in
`/proc/<pid>/ns/net`).

However, there is a catch: you must not keep this file descriptor open.
If you do, when the last process of the control group exits, the
namespace will not be destroyed, and its network resources (like the
virtual interface of the container) will stay around forever (or until
you close that file descriptor).

The right approach would be to keep track of the first PID of each
container, and re-open the namespace pseudo-file each time.

## Collecting metrics when a container exits

Sometimes, you do not care about real-time metric collection, but when a
container exits, you want to know how much CPU, memory, etc. it has
used.

Docker makes this difficult because it relies on `lxc-start`, which
carefully cleans up after itself, but it is still possible. It is
usually easier to collect metrics at regular intervals (e.g., every
minute, with the collectd LXC plugin) and rely on that instead.

But, if you'd still like to gather the stats when a container stops,
here is how:

For each container, start a collection process, and move it to the
control groups that you want to monitor by writing its PID to the tasks
file of the cgroup. The collection process should periodically re-read
the tasks file to check if it's the last process of the control group.
(If you also want to collect network statistics as explained in the
previous section, you should also move the process to the appropriate
network namespace.)

When the container exits, `lxc-start` will try to
delete the control groups. It will fail, since the control group is
still in use; but that's fine. Your process should now detect that it is
the only one remaining in the group. Now is the right time to collect
all the metrics you need!

Finally, your process should move itself back to the root control group,
and remove the container control group. To remove a control group, just
`rmdir` its directory. It's counter-intuitive to
`rmdir` a directory while it still contains files; but
remember that this is a pseudo-filesystem, so usual rules don't apply.
After the cleanup is done, the collection process can exit safely.
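
Here is a minimal shell sketch of that workflow for the memory cgroup only. It assumes the container's cgroup lives under `/sys/fs/cgroup/memory/docker/$LONG_ID` (as described earlier), that the script runs as root, and that the script itself plays the role of the collection process:

```bash
#!/usr/bin/env bash
# Assumes $LONG_ID holds the container's full ID (see "Finding the cgroup" above).
CG=/sys/fs/cgroup/memory/docker/$LONG_ID

# Move this shell into the container's memory cgroup.
echo $$ > "$CG/tasks"

# Wait until this shell is the only process left in the control group.
# mapfile reads the file in the current shell, so we don't fork extra
# processes into the cgroup while counting its members.
while : ; do
    mapfile -t TASKS < "$CG/tasks"
    [ "${#TASKS[@]}" -le 1 ] && break
    sleep 1
done

# The container has exited: collect the final metrics.
cat "$CG/memory.stat"

# Move back to the root control group, then remove the now-empty cgroup.
echo $$ > /sys/fs/cgroup/memory/tasks
rmdir "$CG"
```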