
page_title: Runtime metrics
page_description: Measure the behavior of running containers
page_keywords: docker, metrics, CPU, memory, disk, IO, run, runtime

# Runtime metrics

Linux Containers rely on [control groups](
https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt)
which not only track groups of processes, but also expose metrics about
CPU, memory, and block I/O usage. You can access those metrics and
obtain network usage metrics as well. This is relevant for "pure" LXC
containers, as well as for Docker containers.

## Control groups

Control groups are exposed through a pseudo-filesystem. In recent
distros, you should find this filesystem under `/sys/fs/cgroup`. Under
that directory, you will see multiple sub-directories, named `devices`,
`freezer`, `blkio`, etc.; each sub-directory actually corresponds to a
different cgroup hierarchy.

On older systems, the control groups might be mounted on `/cgroup`,
without distinct hierarchies. In that case, instead of seeing the
sub-directories, you will see a bunch of files in that directory, and
possibly some directories corresponding to existing containers.

To figure out where your control groups are mounted, you can run:

    $ grep cgroup /proc/mounts
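
The exact output depends on your system, but on a recent distro with
split hierarchies it will look something like this (one line per
mounted hierarchy; abridged):

    cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset 0 0
    cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpu 0 0
    cgroup /sys/fs/cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0
    cgroup /sys/fs/cgroup/memory cgroup rw,relatime,memory 0 0
    cgroup /sys/fs/cgroup/devices cgroup rw,relatime,devices 0 0
    cgroup /sys/fs/cgroup/blkio cgroup rw,relatime,blkio 0 0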

## Enumerating cgroups

You can look into `/proc/cgroups` to see the different control group
subsystems known to the system, the hierarchy they belong to, and how
many groups they contain.

You can also look at `/proc/<pid>/cgroup` to see which control groups a
process belongs to. The control group will be shown as a path relative
to the root of the hierarchy mountpoint; e.g., `/` means “this process
has not been assigned into a particular group”, while `/lxc/pumpkin`
means that the process is likely to be a member of a container named
`pumpkin`.
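
Each line of that file shows a hierarchy ID, a subsystem, and the
cgroup path. For example, for a process that is a member of the
`pumpkin` container mentioned above, the output might look like this
(the hierarchy IDs vary between systems, and `$PID` is assumed to hold
the PID of such a process; abridged):

    $ cat /proc/$PID/cgroup
    4:memory:/lxc/pumpkin
    3:cpuacct:/lxc/pumpkin
    2:cpu:/lxc/pumpkin
    1:cpuset:/lxc/pumpkin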

## Finding the cgroup for a given container

For each container, one cgroup will be created in each hierarchy. On
older systems with older versions of the LXC userland tools, the name
of the cgroup will be the name of the container. With more recent
versions of the LXC tools, the cgroup will be `lxc/<container_name>`.

For Docker containers using cgroups, the container name will be the
full ID or long ID of the container. If a container shows up as
`ae836c95b4c3` in `docker ps`, its long ID might be something like
`ae836c95b4c3c9e9179e0e91015512da89fdec91612f63cebae57df9a5444c79`. You
can look it up with `docker inspect` or `docker ps --no-trunc`.

Putting everything together, to look at the memory metrics for a Docker
container, take a look at `/sys/fs/cgroup/memory/lxc/<longid>/`.
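
For example, the following sketch resolves the long ID from the short
ID shown by `docker ps`, and lists the memory pseudo-files of that
container (the short ID `ae836c95b4c3` is just the example above, and
the `lxc/` prefix assumes the layout described in this section):

    $ CONTAINER_ID=$(docker inspect --format '{{.Id}}' ae836c95b4c3)
    $ ls /sys/fs/cgroup/memory/lxc/$CONTAINER_ID/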

## Metrics from cgroups: memory, CPU, block I/O

For each subsystem (memory, CPU, and block I/O), you will find one or
more pseudo-files containing statistics.

### Memory metrics: `memory.stat`

Memory metrics are found in the "memory" cgroup. Note that the memory
control group adds a little overhead, because it does very fine-grained
accounting of the memory usage on your host. Therefore, many distros
choose not to enable it by default. Generally, to enable it, all you
have to do is to add some kernel command-line parameters:
`cgroup_enable=memory swapaccount=1`.

The metrics are in the pseudo-file `memory.stat`.
Here is what it will look like:

    cache 11492564992
    rss 1930993664
    mapped_file 306728960
    pgpgin 406632648
    pgpgout 403355412
    swap 0
    pgfault 728281223
    pgmajfault 1724
    inactive_anon 46608384
    active_anon 1884520448
    inactive_file 7003344896
    active_file 4489052160
    unevictable 32768
    hierarchical_memory_limit 9223372036854775807
    hierarchical_memsw_limit 9223372036854775807
    total_cache 11492564992
    total_rss 1930993664
    total_mapped_file 306728960
    total_pgpgin 406632648
    total_pgpgout 403355412
    total_swap 0
    total_pgfault 728281223
    total_pgmajfault 1724
    total_inactive_anon 46608384
    total_active_anon 1884520448
    total_inactive_file 7003344896
    total_active_file 4489052160
    total_unevictable 32768

The first half (without the `total_` prefix) contains statistics
relevant to the processes within the cgroup, excluding sub-cgroups. The
second half (with the `total_` prefix) includes sub-cgroups as well.

Some metrics are "gauges", i.e., values that can increase or decrease
(e.g., swap, the amount of swap space used by the members of the
cgroup). Some others are "counters", i.e., values that can only go up,
because they represent occurrences of a specific event (e.g., pgfault,
which indicates the number of page faults which happened since the
creation of the cgroup; this number can never decrease).

 - **cache:**  
   the amount of memory used by the processes of this control group
   that can be associated precisely with a block on a block device.
   When you read from and write to files on disk, this amount will
   increase. This will be the case if you use "conventional" I/O
   (`open`, `read`, `write` syscalls) as well as mapped files (with
   `mmap`). It also accounts for the memory used by `tmpfs` mounts,
   though the reasons are unclear.

 - **rss:**  
   the amount of memory that *doesn't* correspond to anything on disk:
   stacks, heaps, and anonymous memory maps.

 - **mapped_file:**  
   indicates the amount of memory mapped by the processes in the
   control group. It doesn't give you information about *how much*
   memory is used; it rather tells you *how* it is used.

 - **pgfault and pgmajfault:**  
   indicate the number of times that a process of the cgroup triggered
   a "page fault" and a "major fault", respectively. A page fault
   happens when a process accesses a part of its virtual memory space
   which is nonexistent or protected. The former can happen if the
   process is buggy and tries to access an invalid address (it will
   then be sent a `SIGSEGV` signal, typically killing it with the
   famous `Segmentation fault` message). The latter can happen when
   the process reads from a memory zone which has been swapped out, or
   which corresponds to a mapped file: in that case, the kernel will
   load the page from disk, and let the CPU complete the memory
   access. It can also happen when the process writes to a
   copy-on-write memory zone: likewise, the kernel will preempt the
   process, duplicate the memory page, and resume the write operation
   on the process' own copy of the page. "Major" faults happen when
   the kernel actually has to read the data from disk. When it just
   has to duplicate an existing page, or allocate an empty page, it's
   a regular (or "minor") fault.

 - **swap:**  
   the amount of swap currently used by the processes in this cgroup.

 - **active_anon and inactive_anon:**  
   the amount of *anonymous* memory that has been identified as
   respectively *active* and *inactive* by the kernel. "Anonymous"
   memory is the memory that is *not* linked to disk pages. In other
   words, that's the equivalent of the rss counter described above. In
   fact, the very definition of the rss counter is **active_anon** +
   **inactive_anon** - **tmpfs** (where tmpfs is the amount of memory
   used up by `tmpfs` filesystems mounted by this control group). Now,
   what's the difference between "active" and "inactive"? Pages are
   initially "active"; and at regular intervals, the kernel sweeps
   over the memory, and tags some pages as "inactive". Whenever they
   are accessed again, they are immediately retagged "active". When
   the kernel is almost out of memory, and time comes to swap out to
   disk, the kernel will swap "inactive" pages.

 - **active_file and inactive_file:**  
   cache memory, with *active* and *inactive* similar to the *anon*
   memory above. The exact formula is cache = **active_file** +
   **inactive_file** + **tmpfs**. The exact rules used by the kernel
   to move memory pages between active and inactive sets are different
   from the ones used for anonymous memory, but the general principle
   is the same. Note that when the kernel needs to reclaim memory, it
   is cheaper to reclaim a clean (i.e., non-modified) page from this
   pool, since it can be reclaimed immediately (while anonymous pages
   and dirty/modified pages have to be written to disk first).

 - **unevictable:**  
   the amount of memory that cannot be reclaimed; generally, it will
   account for memory that has been "locked" with `mlock`. It is often
   used by crypto frameworks to make sure that secret keys and other
   sensitive material never get swapped out to disk.

 - **memory and memsw limits:**  
   These are not really metrics, but a reminder of the limits applied
   to this cgroup. The first one indicates the maximum amount of
   physical memory that can be used by the processes of this control
   group; the second one indicates the maximum amount of RAM+swap. (An
   example of reading these limits follows this list.)
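
These limits are also exposed through dedicated pseudo-files. As an
illustrative sketch (assuming `$CONTAINER_ID` holds the long ID of the
container and the `lxc/` layout described earlier), you can read them
like this:

    $ cat /sys/fs/cgroup/memory/lxc/$CONTAINER_ID/memory.limit_in_bytes
    9223372036854775807
    $ cat /sys/fs/cgroup/memory/lxc/$CONTAINER_ID/memory.memsw.limit_in_bytes
    9223372036854775807

The value 9223372036854775807 (the largest signed 64-bit integer) means
that no limit is set, matching the `hierarchical_memory_limit` shown in
the sample output above.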

Accounting for memory in the page cache is very complex. If two
processes in different control groups both read the same file
(ultimately relying on the same blocks on disk), the corresponding
memory charge will be split between the control groups. It's nice, but
it also means that when a cgroup is terminated, it could increase the
memory usage of another cgroup, because they are not splitting the cost
anymore for those memory pages.

### CPU metrics: `cpuacct.stat`

Now that we've covered memory metrics, everything else will look very
simple in comparison. CPU metrics will be found in the
`cpuacct` controller.

For each container, you will find a pseudo-file `cpuacct.stat`,
containing the CPU usage accumulated by the processes of the container,
broken down between `user` and `system` time. If you're not familiar
with the distinction, `user` is the time during which the processes were
in direct control of the CPU (i.e., executing process code), and `system`
is the time during which the CPU was executing system calls on behalf of
those processes.
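
For example (the values are illustrative, and `$CONTAINER_ID` is
assumed to hold the container's long ID as before):

    $ cat /sys/fs/cgroup/cpuacct/lxc/$CONTAINER_ID/cpuacct.stat
    user 46011
    system 2969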

Those times are expressed in ticks of 1/100th of a second. Actually,
they are expressed in "user jiffies". There are `USER_HZ`
*"jiffies"* per second, and on x86 systems, `USER_HZ` is 100. This used
to map exactly to the number of scheduler "ticks" per second; but with
the advent of higher frequency scheduling, as well as [tickless
kernels](http://lwn.net/Articles/549580/), the number of kernel ticks
wasn't relevant anymore. It stuck around anyway, mainly for legacy and
compatibility reasons.
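
To convert those values to seconds, divide them by `USER_HZ`, which you
can query with `getconf` (on most Linux systems it will report 100):

    $ getconf CLK_TCK
    100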

### Block I/O metrics

Block I/O is accounted in the `blkio` controller.
Different metrics are scattered across different files. While you can
find in-depth details in the [blkio-controller](
https://www.kernel.org/doc/Documentation/cgroups/blkio-controller.txt)
file in the kernel documentation, here is a short list of the most
relevant ones (an example of reading them follows the list):

 - **blkio.sectors:**  
   contains the number of 512-byte sectors read and written by the
   processes that are members of the cgroup, device by device. Reads
   and writes are merged in a single counter.

 - **blkio.io_service_bytes:**  
   indicates the number of bytes read and written by the cgroup. It has
   4 counters per device, because for each device, it differentiates
   between synchronous vs. asynchronous I/O, and reads vs. writes.

 - **blkio.io_serviced:**  
   the number of I/O operations performed, regardless of their size. It
   also has 4 counters per device.

 - **blkio.io_queued:**  
   indicates the number of I/O operations currently queued for this
   cgroup. In other words, if the cgroup isn't doing any I/O, this will
   be zero. Note that the opposite is not true. In other words, if
   there is no I/O queued, it does not mean that the cgroup is idle
   (I/O-wise). It could be doing purely synchronous reads on an
   otherwise quiescent device, which is therefore able to handle them
   immediately, without queuing. Also, while it is helpful to figure
   out which cgroup is putting stress on the I/O subsystem, keep in
   mind that it is a relative quantity. Even if a process group does
   not perform more I/O, its queue size can increase just because the
   device load increases due to other processes.
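
For example, to see how many bytes a container has read and written on
each device (the `8:0` device number and the values are illustrative;
`$CONTAINER_ID` is assumed to hold the container's long ID as before):

    $ cat /sys/fs/cgroup/blkio/lxc/$CONTAINER_ID/blkio.io_service_bytes
    8:0 Read 8192
    8:0 Write 4096
    8:0 Sync 8192
    8:0 Async 4096
    8:0 Total 12288
    Total 12288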

## Network metrics

Network metrics are not exposed directly by control groups. There is a
good explanation for that: network interfaces exist within the context
of *network namespaces*. The kernel could probably accumulate metrics
about packets and bytes sent and received by a group of processes, but
those metrics wouldn't be very useful. You want per-interface metrics
(because traffic happening on the local `lo` interface doesn't really
count). But since processes in a single cgroup can belong to multiple
network namespaces, those metrics would be harder to interpret:
multiple network namespaces means multiple `lo` interfaces, potentially
multiple `eth0` interfaces, etc. This is why there is no easy way to
gather network metrics with control groups.

Instead, we can gather network metrics from other sources:

### IPtables

IPtables (or rather, the netfilter framework for which iptables is just
an interface) can do some serious accounting.

For instance, you can set up a rule to account for the outbound HTTP
traffic on a web server:

    $ iptables -I OUTPUT -p tcp --sport 80

There is no `-j` or `-g` flag, so the rule will just count matched
packets and continue to the following rule.

Later, you can check the values of the counters, with:

    $ iptables -nxvL OUTPUT

Technically, `-n` is not required, but it will prevent iptables from
doing DNS reverse lookups, which are probably useless in this scenario.
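
The output will look something like this (the counter values are
illustrative):

    Chain OUTPUT (policy ACCEPT 12416 packets, 2819204 bytes)
        pkts      bytes target     prot opt in     out     source               destination
        1583   117595            tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp spt:80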

Counters include packets and bytes. If you want to set up metrics for
container traffic like this, you could execute a `for` loop to add two
`iptables` rules per container IP address (one in each direction), in
the `FORWARD` chain; a sketch follows. This will only meter traffic
going through the NAT layer; you will also have to add traffic going
through the userland proxy.
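
Here is a minimal sketch of such a loop, assuming you want to meter all
currently running containers (`{{.NetworkSettings.IPAddress}}` reads
each container's IP address from `docker inspect`):

    $ for CID in $(docker ps -q); do
    >   IP=$(docker inspect --format '{{.NetworkSettings.IPAddress}}' $CID)
    >   iptables -I FORWARD -s $IP
    >   iptables -I FORWARD -d $IP
    > done

Like the rule above, these rules have no target, so they only count the
packets and bytes flowing to and from each container.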

Then, you will need to check those counters on a regular basis. If you
happen to use `collectd`, there is a [nice plugin](https://collectd.org/wiki/index.php/Plugin:IPTables)
to automate iptables counters collection.

### Interface-level counters

Since each container has a virtual Ethernet interface, you might want
to directly check the TX and RX counters of this interface. You will
notice that each container is associated with a virtual Ethernet
interface in your host, with a name like `vethKk8Zqi`. Figuring out
which interface corresponds to which container is, unfortunately,
difficult.

But for now, the best way is to check the metrics *from within the
containers*. To accomplish this, you can run an executable from the
host environment within the network namespace of a container using
**ip-netns magic**.

The `ip netns exec` command will let you execute any program (present
in the host system) within any network namespace visible to the current
process. This means that your host will be able to enter the network
namespace of your containers, but your containers won't be able to
access the host, nor their sibling containers. Containers will be able
to “see” and affect their sub-containers, though.

The exact format of the command is:

    $ ip netns exec <nsname> <command...>

For example:

    $ ip netns exec mycontainer netstat -i

`ip netns` finds the "mycontainer" container by using namespaces
pseudo-files. Each process belongs to one network namespace, one PID
namespace, one `mnt` namespace, etc., and those namespaces are
materialized under `/proc/<pid>/ns/`. For example, the network
namespace of PID 42 is materialized by the pseudo-file
`/proc/42/ns/net`.

When you run `ip netns exec mycontainer ...`, it expects
`/var/run/netns/mycontainer` to be one of those pseudo-files. (Symlinks
are accepted.)

In other words, to execute a command within the network namespace of a
container, we need to:

- Find out the PID of any process within the container that we want to
  investigate;
- Create a symlink from `/var/run/netns/<somename>` to
  `/proc/<thepid>/ns/net`;
- Execute `ip netns exec <somename> ...`.

Please review [*Enumerating Cgroups*](#enumerating-cgroups) to learn how
to find the cgroup of a process running in the container of which you
want to measure network usage. From there, you can examine the
pseudo-file named `tasks`, which contains the PIDs that are in the
control group (i.e., in the container). Pick any one of them.

Putting everything together, if the "short ID" of a container is held
in the environment variable `$CID`, then you can do this:

    $ TASKS=/sys/fs/cgroup/devices/$CID*/tasks
    $ PID=$(head -n 1 $TASKS)
    $ mkdir -p /var/run/netns
    $ ln -sf /proc/$PID/ns/net /var/run/netns/$CID
    $ ip netns exec $CID netstat -i

## Tips for high-performance metric collection

Note that running a new process each time you want to update metrics is
(relatively) expensive. If you want to collect metrics at high
resolutions, and/or over a large number of containers (think 1000
containers on a single host), you do not want to fork a new process
each time.

Here is how to collect metrics from a single process. You will have to
write your metric collector in C (or any language that lets you do
low-level system calls). You need to use a special system call,
`setns()`, which lets the current process enter any arbitrary
namespace. It requires, however, an open file descriptor to the
namespace pseudo-file (remember: that's the pseudo-file in
`/proc/<pid>/ns/net`).

However, there is a catch: you must not keep this file descriptor open.
If you do, when the last process of the control group exits, the
namespace will not be destroyed, and its network resources (like the
virtual interface of the container) will stay around forever (or until
you close that file descriptor).

The right approach would be to keep track of the first PID of each
container, and re-open the namespace pseudo-file each time.

## Collecting metrics when a container exits

Sometimes, you do not care about real-time metric collection, but when
a container exits, you want to know how much CPU, memory, etc. it has
used.

Docker makes this difficult because it relies on `lxc-start`, which
carefully cleans up after itself, but it is still possible. It is
usually easier to collect metrics at regular intervals (e.g., every
minute, with the collectd LXC plugin) and rely on that instead.

But, if you'd still like to gather the stats when a container stops,
here is how:

For each container, start a collection process, and move it to the
control groups that you want to monitor by writing its PID to the
`tasks` file of the cgroup; a sketch follows. The collection process
should periodically re-read the `tasks` file to check if it's the last
process of the control group. (If you also want to collect network
statistics as explained in the previous section, you should also move
the process to the appropriate network namespace.)
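
As a minimal sketch (assuming the memory cgroup and `$CONTAINER_ID` as
in the earlier examples), the collection process can join a container's
cgroup and check its membership like this:

    $ echo $$ > /sys/fs/cgroup/memory/lxc/$CONTAINER_ID/tasks
    $ cat /sys/fs/cgroup/memory/lxc/$CONTAINER_ID/tasks

Writing a PID to the `tasks` file moves that process into the control
group; reading the file back lists the processes that are currently in
it.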

When the container exits, `lxc-start` will try to delete the control
groups. It will fail, since the control group is still in use; but
that's fine. Your process should now detect that it is the only one
remaining in the group. Now is the right time to collect all the
metrics you need!

Finally, your process should move itself back to the root control
group, and remove the container control group. To remove a control
group, just `rmdir` its directory. It's counter-intuitive to `rmdir` a
directory while it still contains files; but remember that this is a
pseudo-filesystem, so usual rules don't apply. After the cleanup is
done, the collection process can exit safely.
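
For example (again assuming the memory cgroup and `$CONTAINER_ID` as
above), the final cleanup could look like this:

    $ echo $$ > /sys/fs/cgroup/memory/tasks
    $ rmdir /sys/fs/cgroup/memory/lxc/$CONTAINER_ID

Writing the PID to the `tasks` file at the root of the hierarchy moves
the process back to the root control group; the now-empty container
cgroup can then be removed.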