---
title: 'Using trace capabilities'
weight: 20
description: >
  Trace security capability checks.
---

![Screencast of the trace capabilities gadget](capabilities.gif)

The trace capabilities gadget allows us to see what capability security checks
are triggered by applications running in Kubernetes Pods.

Linux [capabilities](https://man7.org/linux/man-pages/man7/capabilities.7.html) allow for finer-grained
privilege control: they can grant specific root-like privileges to a process without giving it full
root access, and they can also be taken away from root processes. If a pod runs its programs
directly as root, we can lock it down further by dropping capabilities. Sometimes we also need to add
capabilities that are not granted by default. You can see the list of default and available
capabilities [in
Docker](https://docs.docker.com/engine/reference/run/#runtime-privilege-and-linux-capabilities).
This is especially useful when our pod runs as a non-root user (`runAsUser: ID`): we can grant a few
capabilities (think of it as being partly root) and still drop all unused ones to really lock it down.

### On Kubernetes

Here we have a small demo app which logs failures due to lacking capabilities.
Since none of the default capabilities is dropped, we have to find
out what non-default capability we have to add.

```bash
$ cat docs/examples/app-set-priority.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: set-priority
  labels:
    k8s-app: set-priority
spec:
  selector:
    matchLabels:
      name: set-priority
  template:
    metadata:
      labels:
        name: set-priority
    spec:
      containers:
      - name: set-priority
        image: busybox
        command: [ "sh", "-c", "while /bin/true ; do nice -n -20 echo ; sleep 5; done" ]

$ kubectl apply -f docs/examples/app-set-priority.yaml
deployment.apps/set-priority created
$ kubectl logs -lname=set-priority
nice: setpriority(-20): Permission denied
nice: setpriority(-20): Permission denied
```

We can see the error messages in the pod's log.
Let's use Inspektor Gadget to watch the capability checks:

```bash
$ kubectl gadget trace capabilities --selector name=set-priority
K8S.NODE         K8S.NAMESPACE  K8S.POD                 K8S.CONTAINER PID      COMM  SYSCALL      UID  CAP CAPNAME   AUDIT  VERDICT
minikube-docker  default        set-priorit…495c8-t88x8 set-priority  2711127  nice  setpriority  0    23  SYS_NICE  1      Deny
minikube-docker  default        set-priorit…495c8-t88x8 set-priority  2711260  nice  setpriority  0    23  SYS_NICE  1      Deny
minikube-docker  default        set-priorit…495c8-t88x8 set-priority  2711457  nice  setpriority  0    23  SYS_NICE  1      Deny
minikube-docker  default        set-priorit…495c8-t88x8 set-priority  2711619  nice  setpriority  0    23  SYS_NICE  1      Deny
minikube-docker  default        set-priorit…495c8-t88x8 set-priority  2711815  nice  setpriority  0    23  SYS_NICE  1      Deny
^C
Terminating...
```

We can leave the gadget with Ctrl-C.

The meaning of the columns is:

* `SYSCALL`: the system call that caused the capability to be exercised
* `CAP`: capability number
* `CAPNAME`: capability name in a human-friendly format
* `AUDIT`: whether the kernel should audit the security request or not
* `VERDICT`: whether the capability was present (`Allow`) or not (`Deny`)

In the output we see that the `SYS_NICE` capability got checked when `nice` was run,
and that the check was denied. We should add it to our pod template for `nice` to work.
We can also drop all other capabilities from the default list (see link above) since `nice`
did not use them:

```bash
$ cat docs/examples/app-set-priority-locked-down.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: set-priority
  labels:
    k8s-app: set-priority
spec:
  selector:
    matchLabels:
      name: set-priority
  template:
    metadata:
      labels:
        name: set-priority
    spec:
      containers:
      - name: set-priority
        image: busybox
        command: [ "sh", "-c", "while /bin/true ; do nice -n -20 echo ; sleep 5; done" ]
        securityContext:
          capabilities:
            add: ["SYS_NICE"]
            drop: [all]

```

Let's verify that our locked-down version works.

```bash
$ kubectl delete -f docs/examples/app-set-priority.yaml
deployment.apps "set-priority" deleted
$ kubectl apply -f docs/examples/app-set-priority-locked-down.yaml
deployment.apps/set-priority created
$ kubectl logs -lname=set-priority

```

The logs are clean, so everything works!

We can see the same checks but this time with the `Allow` verdict:

```bash
$ kubectl gadget trace capabilities --selector name=set-priority
K8S.NODE         K8S.NAMESPACE  K8S.POD                 K8S.CONTAINER PID      COMM  SYSCALL      UID  CAP CAPNAME   AUDIT  VERDICT
minikube-docker  default        set-priorit…66dff-nm5pt set-priority  2718069  nice  setpriority  0    23  SYS_NICE  1      Allow
minikube-docker  default        set-priorit…66dff-nm5pt set-priority  2718291  nice  setpriority  0    23  SYS_NICE  1      Allow
^C
Terminating...
```

You can now delete the deployment you created:

```bash
$ kubectl delete -f docs/examples/app-set-priority-locked-down.yaml
```

#### Interpreting advanced columns

Some columns are not displayed by default:

* `caps`: the effective capability bitfield of the process
* `capsnames`: same as `caps` in a human-friendly format
* `currentuserns`: the user namespace of the process
* `targetuserns`: the user namespace that the kernel used to test the
  capability.

They can be useful to understand advanced usage of capabilities.
Let's see two examples.

In the first example, we run `chroot` in a privileged pod:

```
$ kubectl run -ti --rm --restart=Never \
    --image busybox --privileged testcaps -- \
    chroot /
```

The trace, with the additional columns enabled, looks like this:

```
$ kubectl gadget trace capabilities \
    -o columns=comm,syscall,capName,verdict,targetuserns,currentuserns,caps,capsnames
COMM             SYSCALL                      CAPNAME            VERDICT TARGETUSERNS        CURRENTUSERNS       CAPS                 CAPSNAMES
chroot           chroot                       SYS_CHROOT         Allow   4026531837          4026531837          3fffffffff           chown,dac_override,dac_…
```

In this example, `targetuserns` and `currentuserns` are the same. This is
necessarily the case for `chroot` because the kernel tests the capability
against the current user namespace:

```
if (!ns_capable(current_user_ns(), CAP_SYS_CHROOT))
```

The effective capability bitfield is `3fffffffff`.
It can be decoded in this way:

```shell
$ capsh --decode=3fffffffff
0x0000003fffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
```
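
The decoding that `capsh` performs can also be sketched in a few lines of Python, which may help when post-processing the `caps` column in a script. This is a minimal sketch and not part of Inspektor Gadget: the function names `decode_caps` and `has_cap` are hypothetical, and the name table assumes the capability numbering from `linux/capability.h` (bit 0 is `chown`, bit 18 is `sys_chroot`, up to bit 40 for `checkpoint_restore`). For the `3fffffffff` value above, `decode_caps` returns 38 names and `has_cap` reports that `sys_chroot` is set.

```python
# Capability names indexed by bit number, following linux/capability.h.
CAP_NAMES = [
    "chown", "dac_override", "dac_read_search", "fowner", "fsetid",
    "kill", "setgid", "setuid", "setpcap", "linux_immutable",
    "net_bind_service", "net_broadcast", "net_admin", "net_raw", "ipc_lock",
    "ipc_owner", "sys_module", "sys_rawio", "sys_chroot", "sys_ptrace",
    "sys_pacct", "sys_admin", "sys_boot", "sys_nice", "sys_resource",
    "sys_time", "sys_tty_config", "mknod", "lease", "audit_write",
    "audit_control", "setfcap", "mac_override", "mac_admin", "syslog",
    "wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf",
    "checkpoint_restore",
]


def decode_caps(bitfield: int) -> list[str]:
    """Return the names of all capabilities whose bit is set, lowest bit first."""
    names = []
    bit = 0
    while bitfield >> bit:
        if (bitfield >> bit) & 1:
            # Fall back to a generic label for bits newer than this table.
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else f"cap_{bit}")
        bit += 1
    return names


def has_cap(bitfield: int, name: str) -> bool:
    """Tell whether the capability called `name` is set in the bitfield."""
    return bool((bitfield >> CAP_NAMES.index(name)) & 1)


if __name__ == "__main__":
    caps = int("3fffffffff", 16)  # value of the CAPS column, parsed as hex
    print(len(decode_caps(caps)), has_cap(caps, "sys_chroot"))
```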

The effective capability set includes `CAP_SYS_CHROOT`, and `targetuserns` and `currentuserns` are the same.
Hence the verdict `Allow`.

It is also possible to see the list of capabilities in JSON:

```
$ kubectl gadget trace capabilities -o json | jq .
{
  "node": "minikube-docker",
  "namespace": "default",
  "pod": "testcaps",
  "container": "testcaps",
  "timestamp": 1677087968732237745,
  "type": "normal",
  "mountnsid": 4026533307,
  "pid": 3277678,
  "comm": "chroot",
  "syscall": "chroot",
  "uid": 0,
  "gid": 0,
  "cap": 18,
  "capName": "SYS_CHROOT",
  "audit": 1,
  "verdict": "Allow",
  "insetid": false,
  "targetuserns": 4026531837,
  "currentuserns": 4026531837,
  "caps": 274877906943,
  "capsNames": [
    ...
    "sys_rawio",
    "sys_chroot",
    "sys_ptrace",
    ...
  ]
}
```
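
JSON events are convenient to post-process. As a rough sketch, the Python below collects the capabilities that were denied, grouped by namespace, pod and container. It assumes that `-o json` (without `jq`) emits one JSON object per line, which should be checked against the version in use. The function name `denied_capabilities` is hypothetical, and the two sample events are abridged: the first comes from the output above, the second is a made-up `Deny` event (including its pod name) modelled on the earlier `nice` trace.

```python
import json
from collections import defaultdict


def denied_capabilities(lines):
    """Map (namespace, pod, container) to the set of denied capability names."""
    denied = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines between events
        event = json.loads(line)
        if event.get("verdict") == "Deny":
            key = (event["namespace"], event["pod"], event["container"])
            denied[key].add(event["capName"])
    return dict(denied)


# Abridged sample events; the second one is hypothetical.
SAMPLE = """\
{"namespace": "default", "pod": "testcaps", "container": "testcaps", "comm": "chroot", "capName": "SYS_CHROOT", "verdict": "Allow"}
{"namespace": "default", "pod": "set-priority-demo", "container": "set-priority", "comm": "nice", "capName": "SYS_NICE", "verdict": "Deny"}
"""

if __name__ == "__main__":
    for key, caps in denied_capabilities(SAMPLE.splitlines()).items():
        print("/".join(key), sorted(caps))
```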

In the next example, we will create a new user namespace but without creating a new mount namespace.
We will then attempt to create a new mount:

```
$ kubectl run -ti --rm --restart=Never \
    --image busybox --privileged testcaps -- \
    /bin/unshare -Urf /bin/mount -t tmpfs tmpfs /tmp
```

Let's have a look at the event generated for the `mount` process:

```
$ kubectl gadget trace capabilities -o json | jq .
{
  "node": "minikube-docker",
  "namespace": "default",
  "pod": "testcaps",
  "container": "testcaps",
  "timestamp": 1677088257998618652,
  "type": "normal",
  "mountnsid": 4026533307,
  "pid": 3287538,
  "comm": "mount",
  "syscall": "mount",
  "uid": 0,
  "gid": 0,
  "cap": 21,
  "capName": "SYS_ADMIN",
  "audit": 1,
  "verdict": "Deny",
  "insetid": false,
  "targetuserns": 4026531837,
  "currentuserns": 4026533310,
  "caps": 2199023255551,
  "capsNames": [
    ...
    "sys_pacct",
    "sys_admin",
    "sys_boot",
    ...
  ]
}
```

The capability set includes `CAP_SYS_ADMIN`.
However, the verdict is `Deny`.

This can be explained by the interaction with user namespaces.
The target and current user namespaces are different. This makes a difference
because the kernel tests the capability against the user
namespace owning the mount namespace, that is, the parent user namespace:

```
if (!ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN) || ...
```
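
The user namespace rule can be illustrated with a small Python model. This is a simplified sketch of the walk the kernel performs in `cap_capable()` (`security/commoncap.c`), not kernel code: the `UserNS` and `Cred` classes are hypothetical, and details such as LSM hooks are ignored. A capability held in a child user namespace does not apply to its parent, so a task whose credentials live in namespace `4026533310` is denied when the target is `4026531837`, whatever its capability bits are.

```python
from dataclasses import dataclass
from typing import Optional

CAP_SYS_CHROOT = 18
CAP_SYS_ADMIN = 21


@dataclass(eq=False)  # compare namespaces by identity, as the kernel compares pointers
class UserNS:
    inode: int
    parent: Optional["UserNS"] = None
    owner_uid: int = 0  # euid of the creator, as seen in the parent namespace

    @property
    def level(self) -> int:
        """Depth in the namespace tree; the root namespace has level 0."""
        return 0 if self.parent is None else self.parent.level + 1


@dataclass
class Cred:
    user_ns: UserNS  # reported as currentuserns by the gadget
    euid: int
    effective: int   # effective capability bitfield (the caps column)


def ns_capable(cred: Cred, target: UserNS, cap: int) -> bool:
    """Simplified model: does `cred` hold `cap` with regard to `target`?"""
    ns = target
    while True:
        if ns is cred.user_ns:
            # Same namespace: only the effective bitfield matters.
            return bool((cred.effective >> cap) & 1)
        if ns.level <= cred.user_ns.level:
            # Target is not a descendant of the task's namespace: deny.
            return False
        if ns.parent is cred.user_ns and ns.owner_uid == cred.euid:
            # The creator of a child namespace holds all capabilities in it.
            return True
        ns = ns.parent


if __name__ == "__main__":
    parent = UserNS(4026531837)
    child = UserNS(4026533310, parent=parent)
    mount_task = Cred(user_ns=child, euid=0, effective=2199023255551)
    chroot_task = Cred(user_ns=parent, euid=0, effective=274877906943)
    print(ns_capable(mount_task, parent, CAP_SYS_ADMIN))    # Deny case
    print(ns_capable(chroot_task, parent, CAP_SYS_CHROOT))  # Allow case
```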

### With `ig`

Start `ig`:

```bash
$ ig trace capabilities -r docker -c test
RUNTIME.CONTAINERNAME  PID      COMM     SYSCALL  UID  CAP CAPNAME      AUDIT  VERDICT
```

Start the test container exercising the capabilities:

```bash
$ docker run -ti --rm --name=test --privileged busybox
/ # touch /aaa ; chown 1:1 /aaa ; chmod 400 /aaa
/ # chroot /
/ # mkdir /mnt ; mount -t tmpfs tmpfs /mnt
/ # export PPID=$$;/bin/unshare -i sh -c "/bin/nsenter -i -t $PPID echo OK"
OK
```

Observe the resulting trace:

```
RUNTIME.CONTAINERNAME  PID      COMM     SYSCALL  UID  CAP CAPNAME      AUDIT  VERDICT
test                   2609137  chown    chown    0    0   CHOWN        1      Allow
test                   2609137  chown    chown    0    0   CHOWN        1      Allow
test                   2609138  chmod    chmod    0    3   FOWNER       1      Allow
test                   2609138  chmod    chmod    0    4   FSETID       1      Allow
test                   2609138  chmod    chmod    0    4   FSETID       1      Allow
test                   2609694  chroot   chroot   0    18  SYS_CHROOT   1      Allow
test                   2610364  mount    mount    0    21  SYS_ADMIN    1      Allow
test                   2610364  mount    mount    0    21  SYS_ADMIN    1      Allow
test                   2633270  unshare  unshare  0    21  SYS_ADMIN    1      Allow
test                   2633270  nsenter  setns    0    21  SYS_ADMIN    1      Allow
test                   2633270  nsenter  setns    0    21  SYS_ADMIN    1      Allow
```
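
For longer traces it can help to summarize which capabilities each command exercised, for instance as a starting point for an `add` list once everything else is dropped. The following Python is a minimal sketch under stated assumptions: `TRACE` and `capabilities_by_command` are hypothetical names, the rows are a subset copied from the trace above, and splitting on whitespace only works while no column value contains spaces (the JSON output is a more robust input). A traced check does not prove that the capability is strictly required, so the result should be reviewed by hand.

```python
# A few rows copied from the trace above.
TRACE = """\
RUNTIME.CONTAINERNAME  PID      COMM     SYSCALL  UID  CAP CAPNAME      AUDIT  VERDICT
test                   2609137  chown    chown    0    0   CHOWN        1      Allow
test                   2609138  chmod    chmod    0    3   FOWNER       1      Allow
test                   2609138  chmod    chmod    0    4   FSETID       1      Allow
test                   2609694  chroot   chroot   0    18  SYS_CHROOT   1      Allow
test                   2610364  mount    mount    0    21  SYS_ADMIN    1      Allow
test                   2633270  nsenter  setns    0    21  SYS_ADMIN    1      Allow
"""


def capabilities_by_command(table: str) -> dict[str, set[str]]:
    """Group the CAPNAME column by the COMM column of a columns-style trace."""
    lines = [line for line in table.splitlines() if line.strip()]
    header = lines[0].split()
    result: dict[str, set[str]] = {}
    for line in lines[1:]:
        # Pair each header name with the value at the same position.
        row = dict(zip(header, line.split()))
        result.setdefault(row["COMM"], set()).add(row["CAPNAME"])
    return result


if __name__ == "__main__":
    by_comm = capabilities_by_command(TRACE)
    for comm, caps in by_comm.items():
        print(comm, sorted(caps))
    print("all:", sorted(set().union(*by_comm.values())))
```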