<!--[metadata]>
+++
title = "Runtime metrics"
description = "Measure the behavior of running containers"
keywords = ["docker, metrics, CPU, memory, disk, IO, run, runtime"]
[menu.main]
parent = "smn_administrate"
weight = 4
+++
<![end-metadata]-->

# Runtime metrics

Linux Containers rely on [control groups](
https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt)
which not only track groups of processes, but also expose metrics about
CPU, memory, and block I/O usage. You can access those metrics and
obtain network usage metrics as well. This is relevant for "pure" LXC
containers, as well as for Docker containers.

## Control groups

Control groups are exposed through a pseudo-filesystem. In recent
distros, you should find this filesystem under `/sys/fs/cgroup`. Under
that directory, you will see multiple sub-directories, called `devices`,
`freezer`, `blkio`, etc.; each sub-directory actually corresponds to a
different cgroup hierarchy.

On older systems, the control groups might be mounted on `/cgroup`, without
distinct hierarchies. In that case, instead of seeing the sub-directories,
you will see a bunch of files in that directory, and possibly some directories
corresponding to existing containers.

To figure out where your control groups are mounted, you can run:

    $ grep cgroup /proc/mounts

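On a recent distro, the output will look something like this (the exact
set of hierarchies and mount options varies between distributions, so
treat this as an illustration):

    cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset 0 0
    cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpu 0 0
    cgroup /sys/fs/cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0
    cgroup /sys/fs/cgroup/memory cgroup rw,relatime,memory 0 0
    cgroup /sys/fs/cgroup/devices cgroup rw,relatime,devices 0 0
    cgroup /sys/fs/cgroup/blkio cgroup rw,relatime,blkio 0 0
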
## Enumerating cgroups

You can look into `/proc/cgroups` to see the different control group subsystems
known to the system, the hierarchy they belong to, and how many groups they contain.

You can also look at `/proc/<pid>/cgroup` to see which control groups a process
belongs to. The control group will be shown as a path relative to the root of
the hierarchy mountpoint; e.g., `/` means “this process has not been assigned into
a particular group”, while `/lxc/pumpkin` means that the process is likely to be
a member of a container named `pumpkin`.

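For example, looking at the cgroups of your current shell might give
something like this (the hierarchy numbers and the grouping of subsystems
depend on your kernel and distro, so this is only illustrative):

    $ cat /proc/self/cgroup
    6:memory:/
    5:cpuacct:/
    4:cpu:/
    3:devices:/
    2:freezer:/
    1:cpuset:/

A process running inside the `pumpkin` container mentioned above would
show `/lxc/pumpkin` instead of `/` on each line.
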
## Finding the cgroup for a given container

For each container, one cgroup will be created in each hierarchy. On
older systems with older versions of the LXC userland tools, the name of
the cgroup will be the name of the container. With more recent versions
of the LXC tools, the cgroup will be `lxc/<container_name>`.

For Docker containers using cgroups, the container name will be the full
ID or long ID of the container. If a container shows up as `ae836c95b4c3`
in `docker ps`, its long ID might be something like
`ae836c95b4c3c9e9179e0e91015512da89fdec91612f63cebae57df9a5444c79`. You can
look it up with `docker inspect` or `docker ps --no-trunc`.

Putting everything together: to see the memory metrics for a Docker
container, take a look at `/sys/fs/cgroup/memory/lxc/<longid>/`.

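For example, starting from the short ID `ae836c95b4c3` used above, you
could resolve the long ID and list that directory like this (a sketch;
`LONG_ID` is just an illustrative variable name, and you may need to
adjust the `lxc/` prefix if your system lays out Docker cgroups
differently, for example under `docker/`):

    $ LONG_ID=$(docker inspect --format '{{.Id}}' ae836c95b4c3)
    $ ls /sys/fs/cgroup/memory/lxc/$LONG_ID/
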
## Metrics from cgroups: memory, CPU, block I/O

For each subsystem (memory, CPU, and block I/O), you will find one or
more pseudo-files containing statistics.

### Memory metrics: `memory.stat`

Memory metrics are found in the "memory" cgroup. Note that the memory
control group adds a little overhead, because it does very fine-grained
accounting of the memory usage on your host. Therefore, many distros
choose not to enable it by default. Generally, to enable it, all you have
to do is add some kernel command-line parameters:
`cgroup_enable=memory swapaccount=1`.

The metrics are in the pseudo-file `memory.stat`.
Here is what it will look like:

    cache 11492564992
    rss 1930993664
    mapped_file 306728960
    pgpgin 406632648
    pgpgout 403355412
    swap 0
    pgfault 728281223
    pgmajfault 1724
    inactive_anon 46608384
    active_anon 1884520448
    inactive_file 7003344896
    active_file 4489052160
    unevictable 32768
    hierarchical_memory_limit 9223372036854775807
    hierarchical_memsw_limit 9223372036854775807
    total_cache 11492564992
    total_rss 1930993664
    total_mapped_file 306728960
    total_pgpgin 406632648
    total_pgpgout 403355412
    total_swap 0
    total_pgfault 728281223
    total_pgmajfault 1724
    total_inactive_anon 46608384
    total_active_anon 1884520448
    total_inactive_file 7003344896
    total_active_file 4489052160
    total_unevictable 32768

The first half (without the `total_` prefix) contains statistics relevant
to the processes within the cgroup, excluding sub-cgroups. The second half
(with the `total_` prefix) includes sub-cgroups as well.

Some metrics are "gauges", i.e., values that can increase or decrease
(e.g., swap, the amount of swap space used by the members of the cgroup).
Some others are "counters", i.e., values that can only go up, because
they represent occurrences of a specific event (e.g., pgfault, which
indicates the number of page faults which happened since the creation of
the cgroup; this number can never decrease).

 - **cache:**  
   the amount of memory used by the processes of this control group
   that can be associated precisely with a block on a block device.
   When you read from and write to files on disk, this amount will
   increase. This will be the case if you use "conventional" I/O
   (`open`, `read`, `write` syscalls) as well as mapped files (with
   `mmap`). It also accounts for the memory used by `tmpfs` mounts,
   though the reasons are unclear.

 - **rss:**  
   the amount of memory that *doesn't* correspond to anything on disk:
   stacks, heaps, and anonymous memory maps.

 - **mapped_file:**  
   indicates the amount of memory mapped by the processes in the
   control group. It doesn't give you information about *how much*
   memory is used; it rather tells you *how* it is used.

 - **pgfault and pgmajfault:**  
   indicate the number of times that a process of the cgroup triggered
   a "page fault" and a "major fault", respectively. A page fault
   happens when a process accesses a part of its virtual memory space
   which is nonexistent or protected. The former can happen if the
   process is buggy and tries to access an invalid address (it will
   then be sent a `SIGSEGV` signal, typically killing it with the
   famous `Segmentation fault` message). The latter can happen when the
   process reads from a memory zone which has been swapped out, or
   which corresponds to a mapped file: in that case, the kernel will
   load the page from disk, and let the CPU complete the memory
   access. It can also happen when the process writes to a
   copy-on-write memory zone: likewise, the kernel will preempt the
   process, duplicate the memory page, and resume the write operation
   on the process's own copy of the page. "Major" faults happen when
   the kernel actually has to read the data from disk. When it just
   has to duplicate an existing page, or allocate an empty page, it's
   a regular (or "minor") fault.

 - **swap:**  
   the amount of swap currently used by the processes in this cgroup.

 - **active_anon and inactive_anon:**  
   the amount of *anonymous* memory that has been identified as
   respectively *active* and *inactive* by the kernel. "Anonymous"
   memory is the memory that is *not* linked to disk pages. In other
   words, that's the equivalent of the rss counter described above. In
   fact, the very definition of the rss counter is **active_anon** +
   **inactive_anon** - **tmpfs** (where tmpfs is the amount of memory
   used up by `tmpfs` filesystems mounted by this control group). Now,
   what's the difference between "active" and "inactive"? Pages are
   initially "active"; and at regular intervals, the kernel sweeps
   over the memory, and tags some pages as "inactive". Whenever they
   are accessed again, they are immediately retagged "active". When
   the kernel is almost out of memory, and time comes to swap out to
   disk, the kernel will swap "inactive" pages.

 - **active_file and inactive_file:**  
   cache memory, with *active* and *inactive* similar to the *anon*
   memory above. The exact formula is cache = **active_file** +
   **inactive_file** + **tmpfs**. The exact rules used by the kernel
   to move memory pages between active and inactive sets are different
   from the ones used for anonymous memory, but the general principle
   is the same. Note that when the kernel needs to reclaim memory, it
   is cheaper to reclaim a clean (i.e., unmodified) page from this pool,
   since it can be reclaimed immediately (while anonymous pages and
   dirty/modified pages have to be written to disk first).

 - **unevictable:**  
   the amount of memory that cannot be reclaimed; generally, it will
   account for memory that has been "locked" with `mlock`. It is often
   used by crypto frameworks to make sure that secret keys and other
   sensitive material never get swapped out to disk.

 - **memory and memsw limits:**  
   These are not really metrics, but a reminder of the limits applied
   to this cgroup. The first one indicates the maximum amount of
   physical memory that can be used by the processes of this control
   group; the second one indicates the maximum amount of RAM+swap.

Accounting for memory in the page cache is very complex. If two
processes in different control groups both read the same file
(ultimately relying on the same blocks on disk), the corresponding
memory charge will be split between the control groups. It's nice, but
it also means that when a cgroup is terminated, it could increase the
memory usage of another cgroup, because they are not splitting the cost
anymore for those memory pages.

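As a quick check from the shell, you can pull a single counter out of
`memory.stat`; this sketch reuses the hypothetical `$LONG_ID` variable
and the `lxc/` layout from the example earlier in this article:

    $ awk '$1 == "rss" {print $2}' /sys/fs/cgroup/memory/lxc/$LONG_ID/memory.stat
    1930993664
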
### CPU metrics: `cpuacct.stat`

Now that we've covered memory metrics, everything else will look very
simple in comparison. CPU metrics will be found in the
`cpuacct` controller.

For each container, you will find a pseudo-file `cpuacct.stat`,
containing the CPU usage accumulated by the processes of the container,
broken down between `user` and `system` time. If you're not familiar
with the distinction, `user` is the time during which the processes were
in direct control of the CPU (i.e., executing process code), and `system`
is the time during which the CPU was executing system calls on behalf of
those processes.

Those times are expressed in ticks of 1/100th of a second. Actually,
they are expressed in "user jiffies". There are `USER_HZ`
*"jiffies"* per second, and on x86 systems,
`USER_HZ` is 100. This used to map exactly to the
number of scheduler "ticks" per second; but with the advent of higher
frequency scheduling, as well as [tickless kernels](
http://lwn.net/Articles/549580/), the number of kernel ticks
wasn't relevant anymore. It stuck around anyway, mainly for legacy and
compatibility reasons.

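For example (again reusing the hypothetical `$LONG_ID` variable; the
numbers are made up), reading the file might look like this:

    $ cat /sys/fs/cgroup/cpuacct/lxc/$LONG_ID/cpuacct.stat
    user 46011
    system 22496

With `USER_HZ` at 100, this container has accumulated roughly 460 seconds
of user time and 225 seconds of system time.
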
### Block I/O metrics

Block I/O is accounted in the `blkio` controller.
Different metrics are scattered across different files. While you can
find in-depth details in the [blkio-controller](
https://www.kernel.org/doc/Documentation/cgroups/blkio-controller.txt)
file in the kernel documentation, here is a short list of the most
relevant ones:

 - **blkio.sectors:**  
   contains the number of 512-byte sectors read and written by the
   processes that are members of the cgroup, device by device. Reads
   and writes are merged in a single counter.

 - **blkio.io_service_bytes:**  
   indicates the number of bytes read and written by the cgroup. It has
   4 counters per device, because for each device, it differentiates
   between synchronous vs. asynchronous I/O, and reads vs. writes.

 - **blkio.io_serviced:**  
   the number of I/O operations performed, regardless of their size. It
   also has 4 counters per device.

 - **blkio.io_queued:**  
   indicates the number of I/O operations currently queued for this
   cgroup. In other words, if the cgroup isn't doing any I/O, this will
   be zero. Note that the opposite is not true. In other words, if
   there is no I/O queued, it does not mean that the cgroup is idle
   (I/O-wise). It could be doing purely synchronous reads on an
   otherwise quiescent device, which is therefore able to handle them
   immediately, without queuing. Also, while it is helpful to figure
   out which cgroup is putting stress on the I/O subsystem, keep in
   mind that it is a relative quantity. Even if a process group does
   not perform more I/O, its queue size can increase just because the
   device load increases due to other processes.

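For example, `blkio.io_service_bytes` breaks the byte counts down per
device and per operation type. An illustrative extract (not real data)
could look like this, where `8:0` is the major:minor number of the
device:

    $ cat /sys/fs/cgroup/blkio/lxc/$LONG_ID/blkio.io_service_bytes
    8:0 Read 1064960
    8:0 Write 318767104
    8:0 Sync 319811584
    8:0 Async 20480
    8:0 Total 319832064
    Total 319832064
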
## Network metrics

Network metrics are not exposed directly by control groups. There is a
good explanation for that: network interfaces exist within the context
of *network namespaces*. The kernel could probably accumulate metrics
about packets and bytes sent and received by a group of processes, but
those metrics wouldn't be very useful. You want per-interface metrics
(because traffic happening on the local `lo`
interface doesn't really count). But since processes in a single cgroup
can belong to multiple network namespaces, those metrics would be harder
to interpret: multiple network namespaces mean multiple `lo`
interfaces, potentially multiple `eth0`
interfaces, and so on. This is why there is no easy way to gather
network metrics with control groups.

Instead we can gather network metrics from other sources:

### IPtables

IPtables (or rather, the netfilter framework for which iptables is just
an interface) can do some serious accounting.

For instance, you can set up a rule to account for the outbound HTTP
traffic on a web server:

    $ iptables -I OUTPUT -p tcp --sport 80

There is no `-j` or `-g` flag,
so the rule will just count matched packets and go to the following
rule.

Later, you can check the values of the counters, with:

    $ iptables -nxvL OUTPUT

Technically, `-n` is not required, but it will
prevent iptables from doing DNS reverse lookups, which are probably
useless in this scenario.

Counters include packets and bytes. If you want to set up metrics for
container traffic like this, you could execute a `for`
loop to add two `iptables` rules per
container IP address (one in each direction), in the `FORWARD`
chain. This will only meter traffic going through the NAT
layer; you will also have to add traffic going through the userland
proxy.

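For example, assuming a container with the (hypothetical) IP address
172.17.0.2, the two accounting rules could look like this:

    $ iptables -I FORWARD -s 172.17.0.2
    $ iptables -I FORWARD -d 172.17.0.2

Like the OUTPUT rule above, these are counting-only rules: there is no
`-j` or `-g`, so matched packets are tallied and evaluation continues
with the next rule.
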
Then, you will need to check those counters on a regular basis. If you
happen to use `collectd`, there is a [nice plugin](https://collectd.org/wiki/index.php/Plugin:IPTables)
to automate iptables counters collection.

### Interface-level counters

Since each container has a virtual Ethernet interface, you might want to
check the TX and RX counters of this interface directly. You will notice
that each container is associated with a virtual Ethernet interface in
your host, with a name like `vethKk8Zqi`. Figuring
out which interface corresponds to which container is, unfortunately,
difficult.

But for now, the best way is to check the metrics *from within the
containers*. To accomplish this, you can run an executable from the host
environment within the network namespace of a container using **ip-netns
magic**.

The `ip netns exec` command will let you execute any
program (present in the host system) within any network namespace
visible to the current process. This means that your host will be able
to enter the network namespace of your containers, but your containers
won't be able to access the host, nor their sibling containers.
Containers will be able to “see” and affect their sub-containers,
though.

The exact format of the command is:

    $ ip netns exec <nsname> <command...>

For example:

    $ ip netns exec mycontainer netstat -i

`ip netns` finds the "mycontainer" container by
using namespace pseudo-files. Each process belongs to one network
namespace, one PID namespace, one `mnt` namespace,
etc., and those namespaces are materialized under
`/proc/<pid>/ns/`. For example, the network
namespace of PID 42 is materialized by the pseudo-file
`/proc/42/ns/net`.

When you run `ip netns exec mycontainer ...`, it
expects `/var/run/netns/mycontainer` to be one of
those pseudo-files. (Symlinks are accepted.)

In other words, to execute a command within the network namespace of a
container, we need to:

- Find out the PID of any process within the container that we want to investigate;
- Create a symlink from `/var/run/netns/<somename>` to `/proc/<thepid>/ns/net`;
- Execute `ip netns exec <somename> ...`.

Please review [*Enumerating cgroups*](#enumerating-cgroups) to learn how to find
the cgroup of a process running in the container whose network usage you
want to measure. From there, you can examine the pseudo-file named
`tasks`, which contains the PIDs that are in the
control group (i.e., in the container). Pick any one of them.

Putting everything together, if the "short ID" of a container is held in
the environment variable `$CID`, then you can do this:

    $ TASKS=/sys/fs/cgroup/devices/$CID*/tasks
    $ PID=$(head -n 1 $TASKS)
    $ mkdir -p /var/run/netns
    $ ln -sf /proc/$PID/ns/net /var/run/netns/$CID
    $ ip netns exec $CID netstat -i

## Tips for high-performance metric collection

Note that running a new process each time you want to update metrics is
(relatively) expensive. If you want to collect metrics at high
resolutions, and/or over a large number of containers (think 1000
containers on a single host), you do not want to fork a new process each
time.

Here is how to collect metrics from a single process. You will have to
write your metric collector in C (or any language that lets you do
low-level system calls). You need to use a special system call,
`setns()`, which lets the current process enter any
arbitrary namespace. It requires, however, an open file descriptor to
the namespace pseudo-file (remember: that's the pseudo-file in
`/proc/<pid>/ns/net`).

However, there is a catch: you must not keep this file descriptor open.
If you do, when the last process of the control group exits, the
namespace will not be destroyed, and its network resources (like the
virtual interface of the container) will stay around forever (or until
you close that file descriptor).

The right approach would be to keep track of the first PID of each
container, and re-open the namespace pseudo-file each time.

## Collecting metrics when a container exits

Sometimes, you do not care about real-time metric collection, but when a
container exits, you want to know how much CPU, memory, etc. it has
used.

Docker makes this difficult because it relies on `lxc-start`, which
carefully cleans up after itself, but it is still possible. It is
usually easier to collect metrics at regular intervals (e.g., every
minute, with the collectd LXC plugin) and rely on that instead.

But, if you'd still like to gather the stats when a container stops,
here is how:

For each container, start a collection process, and move it to the
control groups that you want to monitor by writing its PID to the `tasks`
file of the cgroup. The collection process should periodically re-read
the `tasks` file to check if it's the last process of the control group.
(If you also want to collect network statistics as explained in the
previous section, you should also move the process to the appropriate
network namespace.)

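For example, a collector written as a shell script could move itself into
the container's memory cgroup like this (a sketch reusing the hypothetical
`$LONG_ID` variable and the `lxc/` layout from earlier; `$$` is the
shell's own PID, and root privileges are required):

    $ echo $$ > /sys/fs/cgroup/memory/lxc/$LONG_ID/tasks

Reading that same `tasks` file back periodically tells the collector how
many processes are left in the group.
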
When the container exits, `lxc-start` will try to
delete the control groups. It will fail, since the control group is
still in use; but that's fine. Your process should now detect that it is
the only one remaining in the group. Now is the right time to collect
all the metrics you need!

Finally, your process should move itself back to the root control group,
and remove the container control group. To remove a control group, just
`rmdir` its directory. It's counter-intuitive to
`rmdir` a directory while it still contains files; but
remember that this is a pseudo-filesystem, so usual rules don't apply.
After the cleanup is done, the collection process can exit safely.
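
Expressed as shell commands, that cleanup could look like this (same
assumptions as in the earlier sketches):

    $ echo $$ > /sys/fs/cgroup/memory/tasks
    $ rmdir /sys/fs/cgroup/memory/lxc/$LONG_ID

Writing the collector's PID to the root `tasks` file moves it back to the
root control group; the now-empty container cgroup can then be removed
with `rmdir`.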