---
title: 'Using trace capabilities'
weight: 20
description: >
  Trace security capability checks.
---

The trace capabilities gadget allows us to see what capability security checks
are triggered by applications running in Kubernetes Pods.

Linux [capabilities](https://man7.org/linux/man-pages/man7/capabilities.7.html) allow for finer
privilege control because they can give root-like capabilities to processes without giving them full
root access. They can also be taken away from root processes. If a pod is directly executing
programs as root, we can further lock it down by taking capabilities away. Sometimes we need to add
capabilities which are not there by default. You can see the list of default and available
capabilities [in
Docker](https://docs.docker.com/engine/reference/run/#runtime-privilege-and-linux-capabilities).
Especially if our pod runs directly as a non-root user (`runAsUser: ID`), we can give it some more
capabilities (think of it as partly root) and still drop all unused capabilities to really lock it down.

### On Kubernetes

Here we have a small demo app which logs failures due to lacking capabilities.
Since none of the default capabilities is dropped, we have to find
out what non-default capability we have to add.

```bash
$ cat docs/examples/app-set-priority.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: set-priority
  labels:
    k8s-app: set-priority
spec:
  selector:
    matchLabels:
      name: set-priority
  template:
    metadata:
      labels:
        name: set-priority
    spec:
      containers:
      - name: set-priority
        image: busybox
        command: [ "sh", "-c", "while /bin/true ; do nice -n -20 echo ; sleep 5; done" ]

$ kubectl apply -f docs/examples/app-set-priority.yaml
deployment.apps/set-priority created
$ kubectl logs -lname=set-priority
nice: setpriority(-20): Permission denied
nice: setpriority(-20): Permission denied
```

We can see the error messages in the pod's log.
Let's use Inspektor Gadget to watch the capability checks:

```bash
$ kubectl gadget trace capabilities --selector name=set-priority
K8S.NODE         K8S.NAMESPACE  K8S.POD                  K8S.CONTAINER  PID      COMM  SYSCALL      UID  CAP  CAPNAME   AUDIT  VERDICT
minikube-docker  default        set-priorit…495c8-t88x8  set-priority   2711127  nice  setpriority  0    23   SYS_NICE  1      Deny
minikube-docker  default        set-priorit…495c8-t88x8  set-priority   2711260  nice  setpriority  0    23   SYS_NICE  1      Deny
minikube-docker  default        set-priorit…495c8-t88x8  set-priority   2711457  nice  setpriority  0    23   SYS_NICE  1      Deny
minikube-docker  default        set-priorit…495c8-t88x8  set-priority   2711619  nice  setpriority  0    23   SYS_NICE  1      Deny
minikube-docker  default        set-priorit…495c8-t88x8  set-priority   2711815  nice  setpriority  0    23   SYS_NICE  1      Deny
^C
Terminating...
```

We can leave the gadget with Ctrl-C.
In the output we see that the `SYS_NICE` capability got checked when `nice` was run.
We should probably add it to our pod template for `nice` to work.
We can also drop
all other capabilities from the default list (see link above) since `nice`
did not use them.

The meaning of the columns is:

* `SYSCALL`: the system call that caused the capability to be exercised
* `CAP`: capability number
* `CAPNAME`: capability name in a human friendly format
* `AUDIT`: whether the kernel should audit the security request or not
* `VERDICT`: whether the capability was present (`Allow`) or not (`Deny`)

The locked-down deployment adds `SYS_NICE` and drops everything else:

```bash
$ cat docs/examples/app-set-priority-locked-down.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: set-priority
  labels:
    k8s-app: set-priority
spec:
  selector:
    matchLabels:
      name: set-priority
  template:
    metadata:
      labels:
        name: set-priority
    spec:
      containers:
      - name: set-priority
        image: busybox
        command: [ "sh", "-c", "while /bin/true ; do nice -n -20 echo ; sleep 5; done" ]
        securityContext:
          capabilities:
            add: ["SYS_NICE"]
            drop: [all]

```

Let's verify that our locked-down version works.

```bash
$ kubectl delete -f docs/examples/app-set-priority.yaml
deployment.apps "set-priority" deleted
$ kubectl apply -f docs/examples/app-set-priority-locked-down.yaml
deployment.apps/set-priority created
$ kubectl logs -lname=set-priority

```

The logs are clean, so everything works!

We can see the same checks but this time with the `Allow` verdict:

```bash
$ kubectl gadget trace capabilities --selector name=set-priority
K8S.NODE         K8S.NAMESPACE  K8S.POD                  K8S.CONTAINER  PID      COMM  SYSCALL      UID  CAP  CAPNAME   AUDIT  VERDICT
minikube-docker  default        set-priorit…66dff-nm5pt  set-priority   2718069  nice  setpriority  0    23   SYS_NICE  1      Allow
minikube-docker  default        set-priorit…66dff-nm5pt  set-priority   2718291  nice  setpriority  0    23   SYS_NICE  1      Allow
^C
Terminating...
```

You can now delete the deployment you created:
```
$ kubectl delete -f docs/examples/app-set-priority-locked-down.yaml
```

#### Interpreting advanced columns

Some columns are not displayed by default:
* `caps`: the effective capability bitfield of the process
* `capsnames`: same as `caps` in a human friendly format
* `currentuserns`: the user namespace of the process
* `targetuserns`: the user namespace that the kernel used to test the
  capability.

They can be useful to understand advanced usage of capabilities.
Let's see two examples.

```
$ kubectl run -ti --rm --restart=Never \
    --image busybox --privileged testcaps -- \
    chroot /
```

```
$ kubectl gadget trace capabilities \
    -o columns=comm,syscall,capName,verdict,targetuserns,currentuserns,caps,capsnames
COMM    SYSCALL  CAPNAME     VERDICT  TARGETUSERNS  CURRENTUSERNS  CAPS        CAPSNAMES
chroot  chroot   SYS_CHROOT  Allow    4026531837    4026531837     3fffffffff  chown,dac_override,dac_…
```

In this example, `targetuserns` and `currentuserns` are the same. This is
necessarily the case for `chroot` because the kernel tests the capability
in this way:
```
if (!ns_capable(current_user_ns(), CAP_SYS_CHROOT))
```

The effective capability bitfield is "3fffffffff".
This can be decoded in this way:
```shell
$ capsh --decode=3fffffffff
0x0000003fffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
```

The effective capability set includes `CAP_SYS_CHROOT` and targetuserns and currentuserns are the same.
Hence the verdict "Allow".

It is also possible to see the list of capabilities in JSON:

```
$ kubectl gadget trace capabilities -o json | jq .
{
  "node": "minikube-docker",
  "namespace": "default",
  "pod": "testcaps",
  "container": "testcaps",
  "timestamp": 1677087968732237745,
  "type": "normal",
  "mountnsid": 4026533307,
  "pid": 3277678,
  "comm": "chroot",
  "syscall": "chroot",
  "uid": 0,
  "gid": 0,
  "cap": 18,
  "capName": "SYS_CHROOT",
  "audit": 1,
  "verdict": "Allow",
  "insetid": false,
  "targetuserns": 4026531837,
  "currentuserns": 4026531837,
  "caps": 274877906943,
  "capsNames": [
    ...
    "sys_rawio",
    "sys_chroot",
    "sys_ptrace",
    ...
  ]
}
```

In the next example, we will create a new user namespace but without creating a new mount namespace.
We will then attempt to create a new mount:

```
$ kubectl run -ti --rm --restart=Never \
    --image busybox --privileged testcaps -- \
    /bin/unshare -Urf /bin/mount -t tmpfs tmpfs /tmp
```

Let's have a look at the generated logs for the mount process:

```
$ kubectl gadget trace capabilities -o json | jq .
{
  "node": "minikube-docker",
  "namespace": "default",
  "pod": "testcaps",
  "container": "testcaps",
  "timestamp": 1677088257998618652,
  "type": "normal",
  "mountnsid": 4026533307,
  "pid": 3287538,
  "comm": "mount",
  "syscall": "mount",
  "uid": 0,
  "gid": 0,
  "cap": 21,
  "capName": "SYS_ADMIN",
  "audit": 1,
  "verdict": "Deny",
  "insetid": false,
  "targetuserns": 4026531837,
  "currentuserns": 4026533310,
  "caps": 2199023255551,
  "capsNames": [
    ...
    "sys_pacct",
    "sys_admin",
    "sys_boot",
    ...
  ]
}
```

The capability set includes `CAP_SYS_ADMIN`.
However, the verdict is "Deny".

This can be explained by the interaction with user namespaces.
The target and current user namespaces are different. This makes a difference
because the kernel tests the capability with regard to the user
namespace owning the mount namespace, that is, the parent user namespace:
```
if (!ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN) || ...
```

### With `ig`

Start `ig`:

```bash
$ ig trace capabilities -r docker -c test
RUNTIME.CONTAINERNAME  PID      COMM     SYSCALL  UID  CAP  CAPNAME     AUDIT  VERDICT
```

Start the test container exercising the capabilities:
```bash
$ docker run -ti --rm --name=test --privileged busybox
/ # touch /aaa ; chown 1:1 /aaa ; chmod 400 /aaa
/ # chroot /
/ # mkdir /mnt ; mount -t tmpfs tmpfs /mnt
/ # export PPID=$$;/bin/unshare -i sh -c "/bin/nsenter -i -t $PPID echo OK"
OK
```

Observe the resulting trace:

```
RUNTIME.CONTAINERNAME  PID      COMM     SYSCALL  UID  CAP  CAPNAME     AUDIT  VERDICT
test                   2609137  chown    chown    0    0    CHOWN       1      Allow
test                   2609137  chown    chown    0    0    CHOWN       1      Allow
test                   2609138  chmod    chmod    0    3    FOWNER      1      Allow
test                   2609138  chmod    chmod    0    4    FSETID      1      Allow
test                   2609138  chmod    chmod    0    4    FSETID      1      Allow
test                   2609694  chroot   chroot   0    18   SYS_CHROOT  1      Allow
test                   2610364  mount    mount    0    21   SYS_ADMIN   1      Allow
test                   2610364  mount    mount    0    21   SYS_ADMIN   1      Allow
test                   2633270  unshare  unshare  0    21   SYS_ADMIN   1      Allow
test                   2633270  nsenter  setns    0    21   SYS_ADMIN   1      Allow
test                   2633270  nsenter  setns    0    21   SYS_ADMIN   1      Allow
```
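
The `CAP` numbers in these traces and the `caps` bitfield described under "Interpreting advanced columns" use the same numbering: bit N of the bitfield corresponds to capability number N. When `capsh` is not installed, the mapping can be sketched in plain bash. The `decode_caps` helper below is hypothetical and not part of Inspektor Gadget; the name table is an assumption based on the numbering in the kernel's `linux/capability.h`, so it may lack capabilities added by newer kernels.

```shell
#!/usr/bin/env bash
# Sketch: decode a hexadecimal capability bitfield into capability names.
# Array index == capability number (assumed from linux/capability.h).
CAP_NAMES=(
  chown dac_override dac_read_search fowner fsetid kill setgid setuid
  setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw
  ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct
  sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod
  lease audit_write audit_control setfcap mac_override mac_admin syslog
  wake_alarm block_suspend audit_read perfmon bpf checkpoint_restore
)

# decode_caps is a hypothetical helper, not an Inspektor Gadget command.
# Usage: decode_caps <hex bitfield without the 0x prefix>
decode_caps() {
  local mask=$((16#$1))   # parse the argument as a base-16 integer
  local names=()
  local i
  for i in "${!CAP_NAMES[@]}"; do
    # Bit i set means capability number i is part of the set.
    if (( (mask >> i) & 1 )); then
      names+=("${CAP_NAMES[i]}")
    fi
  done
  local IFS=,             # join the collected names with commas
  echo "${names[*]}"
}

# Decode the bitfield shown in the chroot example.
decode_caps 3fffffffff
```

The JSON output reports `caps` as a decimal number, so it has to be converted to hexadecimal first, for example with `decode_caps "$(printf '%x' 274877906943)"`.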