
.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    https://docs.cilium.io

.. _gs_debugging:

#########
Debugging
#########

Attaching a Debugger
--------------------

Cilium comes with a set of Makefile targets for quickly deploying development
builds to a local :ref:`Kind <gs_kind>` cluster. One of these targets is
``kind-debug-agent``, which generates a container image that wraps the Cilium
agent with a `Delve (dlv) <https://github.com/go-delve/delve>`_ invocation. This
causes the agent process to listen for connections from a debugger front-end on
port 2345.

To build and push a debug image to your local Kind cluster, run:

.. code-block:: shell-session

    $ make kind-debug-agent

.. note::

    The image is automatically pushed to the Kind nodes, but running Cilium
    Pods are not restarted. To do so, run:

    .. code-block:: shell-session

        $ kubectl delete pods -n kube-system -l app.kubernetes.io/name=cilium-agent

If your Kind cluster was set up using ``make kind``, it will automatically
be configured with the following port mappings:

- ``23401``: ``kind-control-plane-1``
- ``2340*``: Subsequent ``kind-control-plane-*`` nodes, if defined
- ``23411``: ``kind-worker-1``
- ``2341*``: Subsequent ``kind-worker-*`` nodes, if defined

The Delve listener supports multiple debugging protocols, so any IDEs or
debugger front-ends that understand either the `Debug Adapter Protocol
<https://microsoft.github.io/debug-adapter-protocol>`_ or Delve API v2 are
supported.

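
For example, with the port mappings above in place, the standalone Delve CLI
can attach directly to the agent running on the first worker node. This is a
minimal sketch; it assumes ``dlv`` is installed on the host and that the
default ``make kind`` mappings are used:

.. code-block:: shell-session

    $ # 23411 maps to kind-worker-1 (see the list above)
    $ dlv connect localhost:23411
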
~~~~~~~~~~~~~~~~~~
Visual Studio Code
~~~~~~~~~~~~~~~~~~

The Cilium repository contains a VS Code launch configuration
(``.vscode/launch.json``) that includes debug targets for the Kind control
plane, the first two ``kind-worker`` nodes and the :ref:`Cilium Operator
<cilium_operator_internals>`.

.. image:: _static/vscode-run-and-debug.png
    :align: center

|

The preceding screenshot is taken from the 'Run And Debug' section in VS Code.
The default shortcut to access this section is ``Shift+Ctrl+D``. Select a target
to attach to, start the debug session and set a breakpoint to halt the agent or
operator on a specific code statement. This only works for Go code; BPF C code
cannot be debugged this way.

See `the VS Code debugging guide <https://code.visualstudio.com/docs/editor/debugging>`_
for more details.

~~~~~~
Neovim
~~~~~~

The Cilium repository contains a `.nvim directory
<https://github.com/cilium/cilium/tree/main/.nvim>`_ containing a DAP
configuration as well as a README on how to configure ``nvim-dap``.

toFQDNs and DNS Debugging
-------------------------

The interactions of L3 toFQDNs and L7 DNS rules can be difficult to debug.
Unlike many other policy rules, these are resolved at runtime with unknown
data. Pods may create large numbers of IPs in the cache, or the IPs returned
may not be compatible with our datapath implementation. Sometimes we also just
have bugs.


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Isolating the source of toFQDNs issues
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

While there is no single common culprit when debugging, the DNS Proxy shares
the least code with other systems and so is likely the least audited component
in this chain. The cascading caching scheme is also complex in its behaviour.
Determining whether an issue is caused by the DNS components, the policy layer
or the datapath is often the first step when debugging toFQDNs related issues.
Generally, working top-down is easiest as the information needed to verify
low-level correctness can be collected in the initial debug invocations.


REFUSED vs NXDOMAIN responses
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The proxy uses REFUSED DNS responses to indicate a denied request. Some libc
implementations, notably musl which is common in Alpine Linux images, terminate
the whole DNS search in these cases. This often manifests as a connect error in
applications, as the libc lookup returns no data.
To work around this, denied responses can be configured to be NXDOMAIN by
setting ``--tofqdns-dns-reject-response-code=nameError`` on the command line.

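
In a Kubernetes installation, one way to set this flag is through the
``cilium-config`` ConfigMap, whose keys correspond to agent flags without the
leading dashes. The sketch below assumes Cilium is deployed in ``kube-system``
with the default ConfigMap and DaemonSet names; agents must be restarted to
pick up the change:

.. code-block:: shell-session

    $ # ConfigMap key mirrors the --tofqdns-dns-reject-response-code flag
    $ kubectl patch configmap cilium-config -n kube-system --type merge \
          -p '{"data":{"tofqdns-dns-reject-response-code":"nameError"}}'
    $ kubectl rollout restart daemonset/cilium -n kube-system
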

Monitor Events
~~~~~~~~~~~~~~

The DNS Proxy emits multiple L7 DNS monitor events: one for the request and
one for the response (if allowed). Often the L7 DNS rules are paired with L3
toFQDNs rules, and events relating to those rules are also relevant.

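
The monitor output below is filtered on the endpoint ID of the pod being
debugged (``3459`` in this example). One way to find that ID is to list the
endpoints on the relevant node and match on the pod's labels or IP address; a
minimal sketch, where the Cilium pod name and the ``xwing`` label are
illustrative:

.. code-block:: shell-session

    $ # the grep pattern matches the k8s:class=xwing label used below
    $ kubectl exec pod/cilium-sbp8v -n kube-system -- cilium-dbg endpoint list | grep xwing
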
.. Note::

    Be sure to run ``cilium-dbg monitor`` on the same node as the pod being debugged!

.. code-block:: shell-session

    $ kubectl exec pod/cilium-sbp8v -n kube-system -- cilium-dbg monitor --related-to 3459
    Listening for events on 4 CPUs with 64x4096 of shared memory
    Press Ctrl-C to quit
    level=info msg="Initializing dissection cache..." subsys=monitor

    -> Request dns from 3459 ([k8s:org=alliance k8s:io.kubernetes.pod.namespace=default k8s:io.cilium.k8s.policy.serviceaccount=default k8s:io.cilium.k8s.policy.cluster=default k8s:class=xwing]) to 0 ([k8s:io.cilium.k8s.policy.serviceaccount=kube-dns k8s:io.kubernetes.pod.namespace=kube-system k8s:k8s-app=kube-dns k8s:io.cilium.k8s.policy.cluster=default]), identity 323->15194, verdict Forwarded DNS Query: cilium.io. A
    -> endpoint 3459 flow 0xe6866e21 identity 15194->323 state reply ifindex lxc84b58cbdabfe orig-ip 10.60.1.115: 10.63.240.10:53 -> 10.60.0.182:42132 udp
    -> Response dns to 3459 ([k8s:org=alliance k8s:io.kubernetes.pod.namespace=default k8s:io.cilium.k8s.policy.serviceaccount=default k8s:io.cilium.k8s.policy.cluster=default k8s:class=xwing]) from 0 ([k8s:io.cilium.k8s.policy.cluster=default k8s:io.cilium.k8s.policy.serviceaccount=kube-dns k8s:io.kubernetes.pod.namespace=kube-system k8s:k8s-app=kube-dns]), identity 323->15194, verdict Forwarded DNS Query: cilium.io. A TTL: 486 Answer: '104.198.14.52'
    -> endpoint 3459 flow 0xe6866e21 identity 15194->323 state reply ifindex lxc84b58cbdabfe orig-ip 10.60.1.115: 10.63.240.10:53 -> 10.60.0.182:42132 udp
    Policy verdict log: flow 0x614e9723 local EP ID 3459, remote ID 16777217, proto 6, egress, action allow, match L3-Only, 10.60.0.182:41510 -> 104.198.14.52:80 tcp SYN

    -> stack flow 0x614e9723 identity 323->16777217 state new ifindex 0 orig-ip 0.0.0.0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp SYN
    -> 0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp SYN
    -> endpoint 3459 flow 0x7388921 identity 16777217->323 state reply ifindex lxc84b58cbdabfe orig-ip 104.198.14.52: 104.198.14.52:80 -> 10.60.0.182:41510 tcp SYN, ACK
    -> stack flow 0x614e9723 identity 323->16777217 state established ifindex 0 orig-ip 0.0.0.0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp ACK
    -> 0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp ACK
    -> stack flow 0x614e9723 identity 323->16777217 state established ifindex 0 orig-ip 0.0.0.0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp ACK
    -> 0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp ACK
    -> endpoint 3459 flow 0x7388921 identity 16777217->323 state reply ifindex lxc84b58cbdabfe orig-ip 104.198.14.52: 104.198.14.52:80 -> 10.60.0.182:41510 tcp ACK
    -> 0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp ACK
    -> stack flow 0x614e9723 identity 323->16777217 state established ifindex 0 orig-ip 0.0.0.0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp ACK, FIN
    -> 0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp ACK, FIN
    -> endpoint 3459 flow 0x7388921 identity 16777217->323 state reply ifindex lxc84b58cbdabfe orig-ip 104.198.14.52: 104.198.14.52:80 -> 10.60.0.182:41510 tcp ACK, FIN
    -> stack flow 0x614e9723 identity 323->16777217 state established ifindex 0 orig-ip 0.0.0.0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp ACK

The above is for a simple ``curl cilium.io`` in a pod. The L7 DNS request is
the first set of messages and the subsequent L3 connection is the HTTP
component. AAAA DNS lookups commonly happen but were removed to simplify the
example.

- If no L7 DNS requests appear, the proxy redirect is not in place. This may
  mean that the policy does not select this endpoint or there is an issue with
  the proxy redirection. Whether any redirects exist can be checked with
  ``cilium-dbg status --all-redirects`` (see the example after this list).
  In the past, a bug occurred with more permissive L3 rules overriding the
  proxy redirect, causing the proxy to never see the requests.
- If the L7 DNS request is blocked, with an explicit denied message, then the
  requests are not allowed by the proxy. This may be due to a typo in the
  network policy, or the matchPattern rule not allowing this domain. It may
  also be due to a bug in policy propagation to the DNS Proxy.
- If the DNS request is allowed, with an explicit message, and it should not
  be, this may be because a more general policy is in place that allows the
  request. ``matchPattern: "*"`` visibility policies are commonly in place and
  would supersede all other, more restrictive, policies.
  If no other policies are in place, incorrect allows may indicate a bug when
  passing policy information to the proxy. There is no way to dump the rules in
  the proxy, but a debug log is printed when a rule is added. Look for
  ``DNS Proxy updating matchNames in allowed list during UpdateRules``.
  The ``pkg/proxy/dns.go`` file contains the DNS proxy implementation.

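
For the redirect check mentioned in the first item above, the active proxy
redirects on a node can be listed directly; a minimal sketch, where the Cilium
pod name is an example:

.. code-block:: shell-session

    $ kubectl exec pod/cilium-sbp8v -n kube-system -- cilium-dbg status --all-redirects
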
If the L7 DNS behaviour seems correct, see the sections below to further
isolate the issue. Correct behaviour can be verified with
``cilium-dbg fqdn cache list``: the IPs in the response should appear in the
cache for the appropriate endpoint, and the lookup time is included in the
JSON output of the command.

.. code-block:: shell-session

    $ kubectl exec pod/cilium-sbp8v -n kube-system -- cilium-dbg fqdn cache list
    Endpoint   Source   FQDN         TTL    ExpirationTime             IPs
    3459       lookup   cilium.io.   3600   2020-04-21T15:04:27.146Z   104.198.14.52

As of Cilium 1.16, the ``ExpirationTime`` represents the next time that
the entry will be evaluated for staleness. If the entry ``Source`` is
``lookup``, then the entry will expire at that time. An equivalent entry with
source ``connection`` may be established when a ``lookup`` entry expires. If
the corresponding Endpoint continues to communicate to this domain via one of
the related IP addresses, then Cilium will continue to keep the ``connection``
entry alive. When the expiration time for a ``connection`` entry is reached,
the entry will be re-evaluated to determine whether it is still used by active
connections, and at that time may expire or be renewed with a new target
expiration time.

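
To see the lookup time and source for each entry, the same cache can be dumped
as JSON; this sketch assumes the ``-o json`` output flag behaves as it does for
other ``cilium-dbg`` list commands:

.. code-block:: shell-session

    $ kubectl exec pod/cilium-sbp8v -n kube-system -- cilium-dbg fqdn cache list -o json
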
DNS Proxy Errors
~~~~~~~~~~~~~~~~

REFUSED responses are returned when the proxy encounters an error during
processing. This can be confusing to debug, as that is also the response when a
DNS request is denied. An error log is always printed in these cases. Some of
these errors originate in callbacks provided by other packages via the daemon
in cilium-agent.

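
To check whether the agent on a node is hitting any of the errors listed
below, one quick approach is to search the agent logs for the corresponding
messages; a minimal sketch, where the Cilium pod name and the grep pattern are
examples:

.. code-block:: shell-session

    $ # "Rejecting DNS query" is the first message in the list below
    $ kubectl logs -n kube-system pod/cilium-sbp8v --timestamps | grep "Rejecting DNS query"
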
- ``Rejecting DNS query from endpoint due to error``: This is the "normal"
  policy-reject message. It is a debug log.
- ``cannot extract endpoint IP from DNS request``: The proxy cannot read the
  socket information to read the source endpoint IP. This could mean an
  issue with the datapath routing and information passing.
- ``cannot extract endpoint ID from DNS request``: The proxy cannot use the
  source endpoint IP to get the cilium-internal ID for that endpoint. This is
  different from the Security Identity. This could mean that cilium is not
  managing this endpoint and that something has gone awry. It could also mean a
  routing problem where a packet has arrived at the proxy incorrectly.
- ``cannot extract destination IP:port from DNS request``: The proxy cannot
  read the socket information of the original request to obtain the intended
  target IP:Port. This could mean an issue with the datapath routing and
  information passing.
- ``cannot find server ip in ipcache``: The proxy cannot resolve a Security
  Identity for the target IP of the DNS request. This should always succeed, as
  world catches all IPs not set by more specific entries. This can mean a
  broken ipcache BPF table.
- ``Rejecting DNS query from endpoint due to error``: While checking if the DNS
  request was allowed (based on Endpoint ID, destination IP:Port and the DNS
  query) an error occurred. These errors would come from the internal rule
  lookup in the proxy, the ``allowed`` field.
- ``Timeout waiting for response to forwarded proxied DNS lookup``: The proxy
  forwards requests 1:1 and does not cache. It applies a 10s timeout on
  responses to those requests, as the client will usually retry within this
  period. Bursts of these errors can happen if the DNS target server
  misbehaves and many pods see DNS timeouts. This isn't an actual problem with
  cilium or the proxy, although it can be caused by policy blocking the DNS
  target server if it is in-cluster.
- ``Timed out waiting for datapath updates of FQDN IP information; returning
  response``: When the proxy updates the DNS caches with response data, it
  needs to allow some time for that information to get into the datapath.
  Otherwise, pods would attempt to make the outbound connection (the thing that
  caused the DNS lookup) before the datapath is ready. Many stacks retry the
  SYN in such cases but some return an error and some apps further crash as a
  response. This delay is configurable by setting the
  ``--tofqdns-proxy-response-max-delay`` command line argument but defaults to
  100ms. It can be exceeded if the system is under load.

.. _isolating-source-toFQDNs-issues-identities-policy:

Identities and Policy
~~~~~~~~~~~~~~~~~~~~~

Once a DNS response has been passed back through the proxy and is placed in the
DNS cache, ``toFQDNs`` rules can begin using the IPs in the cache. There are
multiple layers of cache:

- A per-Endpoint ``DNSCache`` stores the lookups for this endpoint. It is
  restored on cilium startup with the endpoint. Limits are applied here for
  ``--tofqdns-endpoint-max-ip-per-hostname`` and TTLs are tracked. The
  ``--tofqdns-min-ttl`` is not used here.
- A per-Endpoint ``DNSZombieMapping`` list of IPs that have expired from the
  per-Endpoint cache but are waiting for the Connection Tracking GC to mark
  them in-use or not. This can take up to 12 hours to occur. This list is
  size-limited by ``--tofqdns-max-deferred-connection-deletes``.
- A global ``DNSCache`` where all endpoint and poller DNS data is collected. It
  does apply the ``--tofqdns-min-ttl`` value but not the
  ``--tofqdns-endpoint-max-ip-per-hostname`` value.

If an IP exists in the FQDN cache (check with ``cilium-dbg fqdn cache list``) then
``toFQDNs`` rules that select a domain name, either explicitly via
``matchName`` or via ``matchPattern``, should cause IPs for that domain to have
allocated Security Identities. These can be listed with:

.. code-block:: shell-session

    $ kubectl exec pod/cilium-sbp8v -n kube-system -- cilium-dbg identity list
    ID         LABELS
    1          reserved:host
    2          reserved:world
    3          reserved:unmanaged
    4          reserved:health
    5          reserved:init
    6          reserved:remote-node
    323        k8s:class=xwing
               k8s:io.cilium.k8s.policy.cluster=default
               k8s:io.cilium.k8s.policy.serviceaccount=default
               k8s:io.kubernetes.pod.namespace=default
               k8s:org=alliance
    ...
    16777217   fqdn:*
               reserved:world

Note that FQDN identities are allocated locally on the node and have a high bit
set, so they are often in the 16-million range. This is also the identity seen
in the monitor output above for the HTTP connection.

In cases where there is no matching identity for an IP in the FQDN cache, it may
simply be because no policy selects an associated domain. The policy system
represents each ``toFQDNs:`` rule with a ``FQDNSelector`` instance. These
receive updates from a global ``NameManager`` in the daemon.
They can be listed along with other selectors (roughly corresponding to any L3
rule):

.. code-block:: shell-session

    $ kubectl exec pod/cilium-sbp8v -n kube-system -- cilium-dbg policy selectors
    SELECTOR                                                                                                         USERS   IDENTITIES
    MatchName: , MatchPattern: *                                                                                     1       16777217
    &LabelSelector{MatchLabels:map[string]string{},MatchExpressions:[]LabelSelectorRequirement{},}                   2       1
                                                                                                                             2
                                                                                                                             3
                                                                                                                             4
                                                                                                                             5
                                                                                                                             6
                                                                                                                             323
                                                                                                                             6188
                                                                                                                             15194
                                                                                                                             18892
                                                                                                                             25379
                                                                                                                             29200
                                                                                                                             32255
                                                                                                                             33831
                                                                                                                             16777217
    &LabelSelector{MatchLabels:map[string]string{reserved.none: ,},MatchExpressions:[]LabelSelectorRequirement{},}   1

In this example, 16777217 is used by two selectors, one with
``matchPattern: "*"`` and another empty one. This is because of the policy in
use:

.. code-block:: yaml

    apiVersion: cilium.io/v2
    kind: CiliumNetworkPolicy
    metadata:
      name: "tofqdn-dns-visibility"
    spec:
      endpointSelector:
        matchLabels:
          any:org: alliance
      egress:
      - toPorts:
          - ports:
             - port: "53"
               protocol: ANY
            rules:
              dns:
                - matchPattern: "*"
      - toFQDNs:
          - matchPattern: "*"

The L7 DNS rule has an implicit L3 allow-all because it defines only L4 and L7
sections. This is the second selector in the list, and includes all possible L3
identities known in the system. In contrast, the first selector, which
corresponds to the ``toFQDNs: matchPattern: "*"`` rule, would list all
identities for IPs that came from the DNS Proxy.

Unintended DNS Policy Drops
~~~~~~~~~~~~~~~~~~~~~~~~~~~

``toFQDNs`` policy enforcement relies on the source pod performing a DNS query
before using an IP address returned in the DNS response. Sometimes pods may hold
on to a DNS response and start new connections to the same IP address at a later
time. This may trigger policy drops if the DNS response has expired, as requested
by the DNS server in the time-to-live (TTL) value in the response. When DNS is
used for service load balancing the advertised TTL value may be short (e.g., 60
seconds).

Cilium honors the TTL values returned by the DNS server by default, but you can
override them by setting a minimum TTL using the ``--tofqdns-min-ttl`` flag. This
setting overrides short TTLs and allows the pod to use the IP address in the DNS
response for a longer duration. Existing connections also keep the IP address as
allowed in the policy.

Any new connections opened by the pod using the same IP address without
performing a new DNS query after the (possibly extended) DNS TTL has expired are
dropped by Cilium policy enforcement. To allow pods to use the DNS response
after TTL expiry for new connections, a command line option
``--tofqdns-idle-connection-grace-period`` may be used to keep the IP address /
name mapping valid in the policy for an extended time after DNS TTL expiry. This
option takes effect only if the pod has opened at least one connection during
the DNS TTL period.

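
Both options are regular agent flags, so in a Kubernetes installation they can
be inspected and, if needed, overridden through the ``cilium-config``
ConfigMap. A minimal sketch, assuming the default ``kube-system`` deployment
(the ``30m`` grace period is an arbitrary example value):

.. code-block:: shell-session

    $ # Show the currently configured minimum TTL (empty if unset)
    $ kubectl get configmap cilium-config -n kube-system -o jsonpath='{.data.tofqdns-min-ttl}'
    $ # Set an idle connection grace period and restart the agents
    $ kubectl patch configmap cilium-config -n kube-system --type merge \
          -p '{"data":{"tofqdns-idle-connection-grace-period":"30m"}}'
    $ kubectl rollout restart daemonset/cilium -n kube-system
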
Datapath Plumbing
~~~~~~~~~~~~~~~~~

For a policy to be fully realized the datapath for an Endpoint must be updated.
In the case of a new DNS-source IP, the FQDN identity associated with it must
propagate from the selectors to the Endpoint specific policy. Unless a new
policy is being added, this often only involves updating the Policy Map of the
Endpoint with the new FQDN Identity of the IP. This can be verified:

.. code-block:: shell-session

    $ kubectl exec pod/cilium-sbp8v -n kube-system -- cilium-dbg bpf policy get 3459
    DIRECTION   LABELS (source:key[=value])   PORT/PROTO   PROXY PORT   BYTES   PACKETS
    Ingress     reserved:unknown              ANY          NONE         1367    7
    Ingress     reserved:host                 ANY          NONE         0       0
    Egress      reserved:unknown              53/TCP       36447        0       0
    Egress      reserved:unknown              53/UDP       36447        138     2
    Egress      fqdn:*                        ANY          NONE         477     6
                reserved:world

Note that the labels for identities are resolved here. This resolution can be
skipped, and there may be cases where it doesn't occur:

.. code-block:: shell-session

    $ kubectl exec pod/cilium-sbp8v -n kube-system -- cilium-dbg bpf policy get -n 3459
    DIRECTION   IDENTITY   PORT/PROTO   PROXY PORT   BYTES   PACKETS
    Ingress     0          ANY          NONE         1367    7
    Ingress     1          ANY          NONE         0       0
    Egress      0          53/TCP       36447        0       0
    Egress      0          53/UDP       36447        138     2
    Egress      16777217   ANY          NONE         477     6


L3 ``toFQDNs`` rules are egress only, so we would expect to see an ``Egress``
entry with Security Identity ``16777217``. The L7 rule, used to redirect to the
DNS Proxy, is also present with a populated ``PROXY PORT``. It has a 0
``IDENTITY`` as it is an L3 wildcard, i.e. the policy allows any peer on the
specified port.

An identity missing here can be an error in various places:

- Policy doesn't actually allow this Endpoint to connect. A sanity check is to
  use ``cilium-dbg endpoint list`` to see if cilium thinks it should have policy
  enforcement.
- Endpoint regeneration is slow and the Policy Map has not been updated yet.
  This can occur in cases where we have leaked IPs from the DNS cache (i.e.
  they were never deleted correctly) or when there are legitimately many IPs.
  It can also simply mean an overloaded node or even a deadlock within cilium.
- A more permissive policy has removed the need to include this identity. This
  is likely a bug, however, as the IP would still have an identity allocated
  and it would be included in the Policy Map. In the past, a similar bug
  occurred with the L7 redirect and that would stop this whole process at the
  beginning.

Mutexes / Locks and Data Races
------------------------------

.. Note::

    This section only applies to Golang code.

There are a few options available to debug Cilium data races and deadlocks.

To debug data races, Golang allows ``-race`` to be passed to the compiler to
compile Cilium with race detection. Additionally, the flag can be provided to
``go test`` to detect data races in a testing context.

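
For example, to run the unit tests of a single package with the race detector
enabled (the package path below is illustrative):

.. code-block:: shell-session

    $ go test -race ./pkg/fqdn/...
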
.. _compile-cilium-with-race-detection:

~~~~~~~~~~~~~~
Race detection
~~~~~~~~~~~~~~

To compile a Cilium binary with race detection, you can do:

.. code-block:: shell-session

    $ make RACE=1

.. Note::

    For building the Operator with race detection, you must also provide
    ``BASE_IMAGE`` which can be the ``cilium/cilium-runtime`` image from the
    root Dockerfile found in the Cilium repository.

To run integration tests with race detection, you can do:

.. code-block:: shell-session

    $ make RACE=1 integration-tests

~~~~~~~~~~~~~~~~~~
Deadlock detection
~~~~~~~~~~~~~~~~~~

Cilium can be compiled with the build tag ``lockdebug``, which provides a
seamless wrapper over the standard mutex types in Golang via the
`sasha-s/go-deadlock library <https://github.com/sasha-s/go-deadlock>`_. No
action is required, besides building the binary with this tag.

For example:

.. code-block:: shell-session

    $ make LOCKDEBUG=1
    $ # Deadlock detection during integration tests:
    $ make LOCKDEBUG=1 integration-tests

CPU Profiling and Memory Leaks
------------------------------

Cilium bundles ``gops``, a standard tool for Golang applications, which
provides the ability to collect CPU and memory profiles using ``pprof``.
Inspecting profiles can help identify CPU bottlenecks and memory leaks.

To capture a profile, take a :ref:`sysdump <sysdump>` of the cluster with the
Cilium CLI or, more directly, use the ``cilium-bugtool`` command that is
included in the Cilium image after enabling ``pprof`` in the Cilium ConfigMap:

.. code-block:: shell-session

    $ kubectl exec -ti -n kube-system <cilium-pod-name> -- cilium-bugtool --get-pprof --pprof-trace-seconds N
    $ kubectl cp -n kube-system <cilium-pod-name>:/tmp/cilium-bugtool-<time-generated-name>.tar ./cilium-pprof.tar
    $ tar xf ./cilium-pprof.tar

Be mindful that the profile window is the number of seconds passed to
``--pprof-trace-seconds``. Ensure that the number of seconds is long enough to
capture Cilium while it is exhibiting the problematic behavior to debug.

The tar archive contains 6 files:

.. code-block:: shell-session

    Permissions Size User  Date Modified Name
    .rw-r--r--   940 chris  6 Jul 14:04  gops-memstats-$(pidof-cilium-agent).md
    .rw-r--r--  211k chris  6 Jul 14:04  gops-stack-$(pidof-cilium-agent).md
    .rw-r--r--    58 chris  6 Jul 14:04  gops-stats-$(pidof-cilium-agent).md
    .rw-r--r--   212 chris  6 Jul 14:04  pprof-cpu
    .rw-r--r--  2.3M chris  6 Jul 14:04  pprof-heap
    .rw-r--r--   25k chris  6 Jul 14:04  pprof-trace

The files prefixed with ``pprof-`` are profiles. For more information on each
one, see `Julia Evans' blog`_ on ``pprof``.

To view the CPU or memory profile, execute one of the following commands:

.. code-block:: shell-session

    $ go tool pprof -http localhost:9090 pprof-cpu  # for CPU
    $ go tool pprof -http localhost:9090 pprof-heap # for memory

This opens a browser window for profile inspection.

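
If no browser is available, the same profiles can be summarized directly on the
command line:

.. code-block:: shell-session

    $ go tool pprof -top pprof-heap
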
.. _Julia Evans' blog: https://jvns.ca/blog/2017/09/24/profiling-go-with-pprof/