
<!--[metadata]>
+++
title = "Runtime metrics"
description = "Measure the behavior of running containers"
keywords = ["docker, metrics, CPU, memory, disk, IO, run, runtime, stats"]
[menu.main]
parent = "smn_administrate"
weight = 4
+++
<![end-metadata]-->
    11  
# Runtime metrics


## Docker stats

You can use the `docker stats` command to live stream a container's
runtime metrics. The command supports CPU, memory usage, memory limit,
and network IO metrics.
    20  
The following is sample output from the `docker stats` command:
    22  
    $ docker stats redis1 redis2
    CONTAINER           CPU %               MEM USAGE/LIMIT     MEM %               NET I/O
    redis1              0.07%               796 KB/64 MB        1.21%               788 B/648 B
    redis2              0.07%               2.746 MB/64 MB      4.29%               1.266 KB/648 B
    27  
    28  
The [docker stats](/reference/commandline/stats/) reference page has
more details about the `docker stats` command.
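
If you want to watch all running containers at once, you can pass the
output of `docker ps -q` to `docker stats`:

    $ docker stats $(docker ps -q)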
    31  
## Control groups
    33  
Linux Containers rely on [control groups](
https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt)
which not only track groups of processes, but also expose metrics about
CPU, memory, and block I/O usage. You can access those metrics and
obtain network usage metrics as well. This is relevant for "pure" LXC
containers, as well as for Docker containers.
    40  
Control groups are exposed through a pseudo-filesystem. In recent
distros, you should find this filesystem under `/sys/fs/cgroup`. Under
that directory, you will see multiple sub-directories, named `devices`,
`freezer`, `blkio`, and so on; each sub-directory corresponds to a
different cgroup hierarchy.
    46  
On older systems, the control groups might be mounted on `/cgroup`, without
distinct hierarchies. In that case, instead of seeing the sub-directories,
you will see a bunch of files in that directory, and possibly some directories
corresponding to existing containers.

To figure out where your control groups are mounted, you can run:

    $ grep cgroup /proc/mounts
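
The exact output depends on your distro; on a system with the
hierarchies mounted under `/sys/fs/cgroup`, it will look something like
this (illustrative sample):

    cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset 0 0
    cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpu 0 0
    cgroup /sys/fs/cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0
    cgroup /sys/fs/cgroup/memory cgroup rw,relatime,memory 0 0
    cgroup /sys/fs/cgroup/devices cgroup rw,relatime,devices 0 0
    cgroup /sys/fs/cgroup/blkio cgroup rw,relatime,blkio 0 0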
    55  
## Enumerating cgroups

You can look into `/proc/cgroups` to see the different control group subsystems
known to the system, the hierarchy they belong to, and how many groups they contain.

You can also look at `/proc/<pid>/cgroup` to see which control groups a process
belongs to. The control group will be shown as a path relative to the root of
the hierarchy mountpoint; e.g., `/` means “this process has not been assigned into
a particular group”, while `/lxc/pumpkin` means that the process is likely to be
a member of a container named `pumpkin`.
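
For example, if `$PID` is a process running inside a container named
`pumpkin`, the output might look like this (illustrative sample; the
subsystems and their order vary from system to system):

    $ cat /proc/$PID/cgroup
    6:blkio:/lxc/pumpkin
    5:freezer:/lxc/pumpkin
    4:devices:/lxc/pumpkin
    3:cpuacct:/lxc/pumpkin
    2:cpu:/lxc/pumpkin
    1:memory:/lxc/pumpkin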
    66  
## Finding the cgroup for a given container

For each container, one cgroup will be created in each hierarchy. On
older systems with older versions of the LXC userland tools, the name of
the cgroup will be the name of the container. With more recent versions
of the LXC tools, the cgroup will be `lxc/<container_name>`.
    73  
For Docker containers using cgroups, the container name will be the full
ID or long ID of the container. If a container shows up as `ae836c95b4c3`
in `docker ps`, its long ID might be something like
`ae836c95b4c3c9e9179e0e91015512da89fdec91612f63cebae57df9a5444c79`. You can
look it up with `docker inspect` or `docker ps --no-trunc`.

Putting everything together to look at the memory metrics for a Docker
container, take a look at `/sys/fs/cgroup/memory/lxc/<longid>/`.
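
For example, you can resolve the long ID with `docker inspect`, then
list that container's memory cgroup (this assumes the `lxc/` layout
described above):

    $ CID=$(docker inspect --format '{{.Id}}' ae836c95b4c3)
    $ ls /sys/fs/cgroup/memory/lxc/$CID/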
    82  
## Metrics from cgroups: memory, CPU, block I/O

For each subsystem (memory, CPU, and block I/O), you will find one or
more pseudo-files containing statistics.

### Memory metrics: `memory.stat`

Memory metrics are found in the "memory" cgroup. Note that the memory
control group adds a little overhead, because it does very fine-grained
accounting of the memory usage on your host. Therefore, many distros
choose not to enable it by default. Generally, to enable it, all you
have to do is add some kernel command-line parameters:
`cgroup_enable=memory swapaccount=1`.
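
On Debian or Ubuntu systems using GRUB2, for example, you would
typically add them to `GRUB_CMDLINE_LINUX` in `/etc/default/grub` and
regenerate the bootloader configuration (the exact file and command
depend on your distro):

    # in /etc/default/grub:
    GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1"

    $ sudo update-grub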
    96  
The metrics are in the pseudo-file `memory.stat`.
Here is what it will look like:

    cache 11492564992
    rss 1930993664
    mapped_file 306728960
    pgpgin 406632648
    pgpgout 403355412
    swap 0
    pgfault 728281223
    pgmajfault 1724
    inactive_anon 46608384
    active_anon 1884520448
    inactive_file 7003344896
    active_file 4489052160
    unevictable 32768
    hierarchical_memory_limit 9223372036854775807
    hierarchical_memsw_limit 9223372036854775807
    total_cache 11492564992
    total_rss 1930993664
    total_mapped_file 306728960
    total_pgpgin 406632648
    total_pgpgout 403355412
    total_swap 0
    total_pgfault 728281223
    total_pgmajfault 1724
    total_inactive_anon 46608384
    total_active_anon 1884520448
    total_inactive_file 7003344896
    total_active_file 4489052160
    total_unevictable 32768
   128  
The first half (without the `total_` prefix) contains statistics relevant
to the processes within the cgroup, excluding sub-cgroups. The second half
(with the `total_` prefix) includes sub-cgroups as well.
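
To read a single metric, you can simply `grep` the pseudo-file; for
example, assuming `$CID` still holds the long ID obtained above:

    $ grep ^rss /sys/fs/cgroup/memory/lxc/$CID/memory.stat
    rss 1930993664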
   132  
Some metrics are "gauges", i.e., values that can increase or decrease
(e.g., swap, the amount of swap space used by the members of the cgroup).
Some others are "counters", i.e., values that can only go up, because
they represent occurrences of a specific event (e.g., pgfault, which
indicates the number of page faults which happened since the creation of
the cgroup; this number can never decrease).
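
Since counters only ever increase, they are most useful when sampled at
intervals. Here is a minimal sketch that measures the page-fault rate of
a cgroup over ten seconds (reusing the `$CID` long ID and the `lxc/`
layout assumed above):

    $ STAT=/sys/fs/cgroup/memory/lxc/$CID/memory.stat
    $ BEFORE=$(awk '/^pgfault /{print $2}' $STAT)
    $ sleep 10
    $ AFTER=$(awk '/^pgfault /{print $2}' $STAT)
    $ echo "page faults per second: $(( (AFTER - BEFORE) / 10 ))"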
   139  
   140  
 - **cache:**  
   the amount of memory used by the processes of this control group
   that can be associated precisely with a block on a block device.
   When you read from and write to files on disk, this amount will
   increase. This will be the case if you use "conventional" I/O
   (`open`, `read`, `write` syscalls) as well as mapped files (with
   `mmap`). It also accounts for the memory used by `tmpfs` mounts,
   though the reasons are unclear.
   150  
 - **rss:**  
   the amount of memory that *doesn't* correspond to anything on disk:
   stacks, heaps, and anonymous memory maps.

 - **mapped_file:**  
   indicates the amount of memory mapped by the processes in the
   control group. It doesn't give you information about *how much*
   memory is used; it rather tells you *how* it is used.
   159  
 - **pgfault and pgmajfault:**  
   indicate the number of times that a process of the cgroup triggered
   a "page fault" and a "major fault", respectively. A page fault
   happens when a process accesses a part of its virtual memory space
   which is nonexistent or protected. The former can happen if the
   process is buggy and tries to access an invalid address (it will
   then be sent a `SIGSEGV` signal, typically killing it with the
   famous `Segmentation fault` message). The latter can happen when
   the process reads from a memory zone which has been swapped out,
   or which corresponds to a mapped file: in that case, the kernel
   will load the page from disk, and let the CPU complete the memory
   access. It can also happen when the process writes to a
   copy-on-write memory zone: likewise, the kernel will preempt the
   process, duplicate the memory page, and resume the write operation
   on the process's own copy of the page. "Major" faults happen when
   the kernel actually has to read the data from disk. When it just
   has to duplicate an existing page, or allocate an empty page, it's
   a regular (or "minor") fault.
   178  
 - **swap:**  
   the amount of swap currently used by the processes in this cgroup.
   181  
 - **active_anon and inactive_anon:**  
   the amount of *anonymous* memory that has been identified as
   respectively *active* and *inactive* by the kernel. "Anonymous"
   memory is the memory that is *not* linked to disk pages. In other
   words, that's the equivalent of the rss counter described above. In
   fact, the very definition of the rss counter is **active_anon** +
   **inactive_anon** - **tmpfs** (where tmpfs is the amount of memory
   used up by `tmpfs` filesystems mounted by this control group).
   Now, what's the difference between "active" and "inactive"? Pages
   are initially "active"; and at regular intervals, the kernel sweeps
   over the memory, and tags some pages as "inactive". Whenever they
   are accessed again, they are immediately retagged "active". When
   the kernel is almost out of memory, and time comes to swap out to
   disk, the kernel will swap "inactive" pages.
   196  
 - **active_file and inactive_file:**  
   cache memory, with *active* and *inactive* similar to the *anon*
   memory above. The exact formula is cache = **active_file** +
   **inactive_file** + **tmpfs**. The exact rules used by the kernel
   to move memory pages between active and inactive sets are different
   from the ones used for anonymous memory, but the general principle
   is the same. Note that when the kernel needs to reclaim memory, it
   is cheaper to reclaim a clean (i.e., not modified) page from this
   pool, since it can be reclaimed immediately (while anonymous pages
   and dirty/modified pages have to be written to disk first).
   207  
 - **unevictable:**  
   the amount of memory that cannot be reclaimed; generally, it will
   account for memory that has been "locked" with `mlock`. It is often
   used by crypto frameworks to make sure that secret keys and other
   sensitive material never get swapped out to disk.

 - **memory and memsw limits:**  
   These are not really metrics, but a reminder of the limits applied
   to this cgroup. The first one indicates the maximum amount of
   physical memory that can be used by the processes of this control
   group; the second one indicates the maximum amount of RAM+swap.
   220  
Accounting for memory in the page cache is very complex. If two
processes in different control groups both read the same file
(ultimately relying on the same blocks on disk), the corresponding
memory charge will be split between the control groups. It's nice, but
it also means that when a cgroup is terminated, it could increase the
memory usage of another cgroup, because they are not splitting the cost
anymore for those memory pages.
   228  
### CPU metrics: `cpuacct.stat`

Now that we've covered memory metrics, everything else will look very
simple in comparison. CPU metrics will be found in the
`cpuacct` controller.

For each container, you will find a pseudo-file `cpuacct.stat`,
containing the CPU usage accumulated by the processes of the container,
broken down between `user` and `system` time. If you're not familiar
with the distinction, `user` is the time during which the processes were
in direct control of the CPU (i.e., executing process code), and `system`
is the time during which the CPU was executing system calls on behalf of
those processes.
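
For example (the values are illustrative, and the path assumes the
cgroup layout described earlier):

    $ cat /sys/fs/cgroup/cpuacct/lxc/$CID/cpuacct.stat
    user 46011
    system 22842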
   242  
Those times are expressed in ticks of 1/100th of a second. Actually,
they are expressed in "user jiffies". There are `USER_HZ`
*"jiffies"* per second, and on x86 systems, `USER_HZ` is 100. This
used to map exactly to the number of scheduler "ticks" per second;
but with the advent of higher frequency scheduling, as well as
[tickless kernels](http://lwn.net/Articles/549580/), the number of
kernel ticks wasn't relevant anymore. It stuck around anyway, mainly
for legacy and compatibility reasons.
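
You can check the value of `USER_HZ` on your system with `getconf`:

    $ getconf CLK_TCK
    100

With `USER_HZ` at 100, the 46011 user jiffies in the sample above
correspond to 460.11 seconds of CPU time in user mode.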
   252  
### Block I/O metrics

Block I/O is accounted in the `blkio` controller.
Different metrics are scattered across different files. While you can
find in-depth details in the [blkio-controller](
https://www.kernel.org/doc/Documentation/cgroups/blkio-controller.txt)
file in the kernel documentation, here is a short list of the most
relevant ones:
   261  
   262  
 - **blkio.sectors:**  
   contains the number of 512-byte sectors read and written by the
   processes that are members of the cgroup, device by device. Reads
   and writes are merged in a single counter.
   267  
 - **blkio.io_service_bytes:**  
   indicates the number of bytes read and written by the cgroup. It has
   4 counters per device, because for each device, it differentiates
   between synchronous vs. asynchronous I/O, and reads vs. writes.

 - **blkio.io_serviced:**  
   the number of I/O operations performed, regardless of their size. It
   also has 4 counters per device.
   276  
 - **blkio.io_queued:**  
   indicates the number of I/O operations currently queued for this
   cgroup. In other words, if the cgroup isn't doing any I/O, this will
   be zero. Note that the opposite is not true: if there is no I/O
   queued, it does not mean that the cgroup is idle (I/O-wise). It
   could be doing purely synchronous reads on an otherwise quiescent
   device, which is therefore able to handle them immediately, without
   queuing. Also, while it is helpful to figure out which cgroup is
   putting stress on the I/O subsystem, keep in mind that it is a
   relative quantity. Even if a process group does not perform more
   I/O, its queue size can increase just because the device load
   increases because of other processes.
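
For example, reading the per-device byte counters (illustrative values;
the path assumes the cgroup layout described earlier):

    $ cat /sys/fs/cgroup/blkio/lxc/$CID/blkio.io_service_bytes
    8:0 Read 2080768
    8:0 Write 0
    8:0 Sync 2080768
    8:0 Async 0
    8:0 Total 2080768
    Total 2080768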
   289  
## Network metrics

Network metrics are not exposed directly by control groups. There is a
good explanation for that: network interfaces exist within the context
of *network namespaces*. The kernel could probably accumulate metrics
about packets and bytes sent and received by a group of processes, but
those metrics wouldn't be very useful. You want per-interface metrics
(because traffic happening on the local `lo` interface doesn't really
count). But since processes in a single cgroup can belong to multiple
network namespaces, those metrics would be harder to interpret:
multiple network namespaces means multiple `lo` interfaces, potentially
multiple `eth0` interfaces, etc. This is why there is no easy way to
gather network metrics with control groups.

Instead, we can gather network metrics from other sources:
   306  
### IPtables

IPtables (or rather, the netfilter framework for which iptables is just
an interface) can do some serious accounting.

For instance, you can set up a rule to account for the outbound HTTP
traffic on a web server:

    $ iptables -I OUTPUT -p tcp --sport 80

There is no `-j` or `-g` flag, so the rule will just count matched
packets and go to the following rule.
   320  
Later, you can check the values of the counters, with:

    $ iptables -nxvL OUTPUT

Technically, `-n` is not required, but it will prevent iptables from
doing DNS reverse lookups, which are probably useless in this scenario.

Counters include packets and bytes. If you want to set up metrics for
container traffic like this, you could execute a `for` loop to add two
`iptables` rules per container IP address (one in each direction), in
the `FORWARD` chain. This will only meter traffic going through the NAT
layer; you will also have to add traffic going through the userland
proxy.
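
A minimal sketch of such a loop, assuming each container's IP address
can be read from `docker inspect` (your network setup may differ):

    for CID in $(docker ps -q); do
        IP=$(docker inspect --format '{{.NetworkSettings.IPAddress}}' $CID)
        iptables -I FORWARD -s $IP   # counts outbound traffic from the container
        iptables -I FORWARD -d $IP   # counts inbound traffic to the container
    done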
   336  
Then, you will need to check those counters on a regular basis. If you
happen to use `collectd`, there is a [nice plugin](https://collectd.org/wiki/index.php/Plugin:IPTables)
to automate iptables counters collection.
   340  
### Interface-level counters

Since each container has a virtual Ethernet interface, you might want to
check directly the TX and RX counters of this interface. You will notice
that each container is associated with a virtual Ethernet interface in
your host, with a name like `vethKk8Zqi`. Figuring out which interface
corresponds to which container is, unfortunately, difficult.

For now, the best way is to check the metrics *from within the
containers*. To accomplish this, you can run an executable from the host
environment within the network namespace of a container using **ip-netns
magic**.
   354  
The `ip netns exec` command will let you execute any program (present in
the host system) within any network namespace visible to the current
process. This means that your host will be able to enter the network
namespace of your containers, but your containers won't be able to
access the host, nor their sibling containers. Containers will be able
to “see” and affect their sub-containers, though.

The exact format of the command is:

    $ ip netns exec <nsname> <command...>

For example:

    $ ip netns exec mycontainer netstat -i
   370  
`ip netns` finds the "mycontainer" container by using namespace
pseudo-files. Each process belongs to one network namespace, one PID
namespace, one `mnt` namespace, etc., and those namespaces are
materialized under `/proc/<pid>/ns/`. For example, the network
namespace of PID 42 is materialized by the pseudo-file
`/proc/42/ns/net`.
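
You can see those pseudo-files for yourself; on reasonably recent
kernels they appear as symlinks (the inode number and date shown here
are illustrative):

    $ ls -l /proc/42/ns/net
    lrwxrwxrwx 1 root root 0 Aug 15 12:00 /proc/42/ns/net -> net:[4026531956]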
   378  
When you run `ip netns exec mycontainer ...`, it expects
`/var/run/netns/mycontainer` to be one of those pseudo-files. (Symlinks
are accepted.)

In other words, to execute a command within the network namespace of a
container, we need to:

- Find out the PID of any process within the container that we want to investigate;
- Create a symlink from `/var/run/netns/<somename>` to `/proc/<thepid>/ns/net`;
- Execute `ip netns exec <somename> <command...>`.
   389  
Please review [*Enumerating Cgroups*](#enumerating-cgroups) to learn how to find
the cgroup of a process running in the container of which you want to
measure network usage. From there, you can examine the pseudo-file named
`tasks`, which contains the PIDs that are in the control group (i.e.,
in the container). Pick any one of them.

Putting everything together, if the "short ID" of a container is held in
the environment variable `$CID`, then you can do this:

    $ TASKS=/sys/fs/cgroup/devices/$CID*/tasks
    $ PID=$(head -n 1 $TASKS)
    $ mkdir -p /var/run/netns
    $ ln -sf /proc/$PID/ns/net /var/run/netns/$CID
    $ ip netns exec $CID netstat -i
   404  
## Tips for high-performance metric collection

Note that running a new process each time you want to update metrics is
(relatively) expensive. If you want to collect metrics at high
resolutions, and/or over a large number of containers (think 1000
containers on a single host), you do not want to fork a new process each
time.

Here is how to collect metrics from a single process. You will have to
write your metric collector in C (or any language that lets you do
low-level system calls). You need to use a special system call,
`setns()`, which lets the current process enter any arbitrary namespace.
It requires, however, an open file descriptor to the namespace
pseudo-file (remember: that's the pseudo-file in `/proc/<pid>/ns/net`).
   420  
However, there is a catch: you must not keep this file descriptor open.
If you do, when the last process of the control group exits, the
namespace will not be destroyed, and its network resources (like the
virtual interface of the container) will stay around forever (or until
you close that file descriptor).

The right approach would be to keep track of the first PID of each
container, and re-open the namespace pseudo-file each time.
   429  
## Collecting metrics when a container exits

Sometimes, you do not care about real time metric collection, but when a
container exits, you want to know how much CPU, memory, etc. it has
used.

Docker makes this difficult because it relies on `lxc-start`, which
carefully cleans up after itself, but it is still possible. It is
usually easier to collect metrics at regular intervals (e.g., every
minute, with the collectd LXC plugin) and rely on that instead.

But, if you'd still like to gather the stats when a container stops,
here is how:
   443  
For each container, start a collection process, and move it to the
control groups that you want to monitor by writing its PID to the tasks
file of the cgroup. The collection process should periodically re-read
the tasks file to check if it's the last process of the control group.
(If you also want to collect network statistics as explained in the
previous section, you should also move the process to the appropriate
network namespace.)

When the container exits, `lxc-start` will try to delete the control
groups. It will fail, since the control group is still in use; but
that's fine. Your process should now detect that it is the only one
remaining in the group. Now is the right time to collect all the
metrics you need!
   457  
Finally, your process should move itself back to the root control group,
and remove the container control group. To remove a control group, just
`rmdir` its directory. It's counter-intuitive to `rmdir` a directory
while it still contains files; but remember that this is a
pseudo-filesystem, so usual rules don't apply. After the cleanup is
done, the collection process can exit safely.
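
Here is a minimal bash sketch of that procedure for the memory cgroup
(the `lxc/` cgroup path is an assumption based on the layout described
earlier; a real collector would likely gather more than `memory.stat`):

    CGROUP=/sys/fs/cgroup/memory/lxc/$CID   # assumed path
    echo $$ > $CGROUP/tasks                 # join the container's cgroup
    # Count PIDs with a bash built-in (mapfile), so that we don't fork
    # helper processes into the cgroup while counting.
    while mapfile -t PIDS < $CGROUP/tasks; [ ${#PIDS[@]} -gt 1 ]; do
        sleep 1                             # wait until we are the last one left
    done
    cat $CGROUP/memory.stat                 # collect the final metrics
    echo $$ > /sys/fs/cgroup/memory/tasks   # move back to the root cgroup
    rmdir $CGROUP                           # remove the empty container cgroup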