# Experiments with nested containers

... in the repository directory on an Ubuntu 16.04 host.


## Run ctnr container inside privileged docker container
```
docker run -ti --rm --privileged \
	-v "$(pwd)/dist/bin/ctnr:/bin/ctnr" \
	-v "$(pwd)/image-policy-example.json:/etc/containers/policy.json" \
	alpine:3.7
> ctnr run -t --network=host docker://alpine:3.7
```


## Run ctnr container inside unprivileged user's privileged ctnr container
```
dist/bin/ctnr run -t --privileged \
	-v "$(pwd)/dist/bin/ctnr:/bin/ctnr" \
	-v "$(pwd)/image-policy-example.json:/etc/containers/policy.json" \
	--image-policy=image-policy-example.json \
	docker://alpine:3.7
> ctnr run -t --rootless --network=host docker://alpine:3.7
```


## Not working: Run ctnr container inside unprivileged docker container
```
docker run -ti --rm \
	-v "$(pwd)/dist/bin/ctnr:/bin/ctnr" \
	-v "$(pwd)/image-policy-example.json:/etc/containers/policy.json" \
	alpine:3.7
> ctnr run -ti --rootless --network=host docker://alpine:3.7
```
Error: Cannot change the process namespace ("running exec setns process for init caused \"exit status 34\"")
=> seccomp denies setns
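
Whether a seccomp filter is active in the outer container can be checked from inside it. The following is a minimal sketch that reads the `Seccomp:` field of `/proc/self/status`; the function name `seccomp_mode` and its optional file argument are hypothetical helpers, not part of ctnr or docker.

```shell
#!/bin/sh
# Print the seccomp mode of the current process in readable form.
# The optional argument (path to a status file) is an assumption made
# for testability; it defaults to /proc/self/status.
seccomp_mode() {
	status_file="${1:-/proc/self/status}"
	# The "Seccomp:" field holds 0 (disabled), 1 (strict) or 2 (filter).
	mode="$(awk '$1 == "Seccomp:" { print $2 }' "$status_file")"
	case "$mode" in
		0) echo "disabled" ;;
		1) echo "strict" ;;
		2) echo "filter" ;;
		*) echo "unknown" ;;
	esac
}
```

Running `seccomp_mode` inside the outer container shows only that a filter is installed, not which syscalls it denies.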

Adding a custom seccomp profile solves this problem, but the inner container then fails to mount `/proc` (see the error below the commands).
(TODO: use docker-default apparmor profile without `deny mount`, see https://github.com/moby/moby/blob/master/profiles/apparmor/template.go)
```
docker run -ti --rm --user="$(id -u):$(id -g)" \
	--security-opt apparmor=unconfined \
	--security-opt seccomp="$(pwd)/seccomp-container.json" \
	-v /sys/fs/cgroup:/sys/fs/cgroup:ro \
	-v "$HOME/.ctnr:/.ctnr" \
	-v "$(pwd)/dist/bin:/usr/local/bin" \
	-v "$(pwd)/image-policy-example.json:/etc/containers/policy.json" \
	debian:9 /bin/bash
$ ctnr --state-dir /tmp/ctnr run --verbose -ti -b test --update --rootless --no-new-keyring --no-pivot docker://alpine:3.8
```
Error: run process: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"rootfs_linux.go:58: mounting \\\"proc\\\" to rootfs \\\"/.ctnr/bundles/test/rootfs\\\" at \\\"/proc\\\" caused \\\"operation not permitted\\\"\""
=> proc cannot be mounted
=> See https://github.com/opencontainers/runc/issues/1658
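
According to the linked runc issue, the failing `proc` mount is related to paths below `/proc` that docker covers with other mounts. A rough sketch to list such mounts from inside the outer container, assuming the standard `/proc/self/mountinfo` layout; the function name `proc_overmounts` is hypothetical.

```shell
#!/bin/sh
# List mount points below /proc together with their filesystem type.
# The optional argument (path to a mountinfo file) is an assumption made
# for testability; it defaults to /proc/self/mountinfo.
proc_overmounts() {
	mountinfo="${1:-/proc/self/mountinfo}"
	awk '
		# Field 5 is the mount point; only consider mounts below /proc.
		index($5, "/proc/") == 1 {
			fstype = "?"
			# The filesystem type is the field after the "-" separator.
			for (i = 7; i < NF; i++) {
				if ($i == "-") { fstype = $(i + 1); break }
			}
			print $5, fstype
		}
	' "$mountinfo"
}
```

Any output of `proc_overmounts` in the outer container points at masked or read-only paths, which is the situation discussed in the issue above.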


## How to analyze container problems
- Run the parent container with the `CAP_SYS_PTRACE` capability and the child container
  under `strace -ff` to trace its system calls
- Run moby's `check-config` script _(requires kernel config to be mounted)_:  
```
apk update && apk add bash
wget -O /bin/chcfg https://raw.githubusercontent.com/moby/moby/master/contrib/check-config.sh
chmod +x /bin/chcfg && chcfg
```
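
The `strace -ff` output can get long. As a sketch, assuming the traces were written to per-process files with `strace -ff -o <prefix>`, the denied system calls can be summarized like this; the function name `denied_syscalls` is hypothetical.

```shell
#!/bin/sh
# Summarize syscalls that failed with EPERM or EACCES in strace output
# files, most frequent first. Expects files written by
# `strace -ff -o <prefix>` (one file per process, no PID column).
denied_syscalls() {
	# Matching lines look like:
	#   mount("proc", "/proc", "proc", 0, NULL) = -1 EPERM (Operation not permitted)
	cat "$@" \
		| grep -E '= -1 (EPERM|EACCES) ' \
		| sed -E 's/^([a-z0-9_]+)\(.*/\1/' \
		| sort | uniq -c | sort -rn \
		| awk '{ print $2, $1 }'
}
```

For example, after `strace -ff -o /tmp/trace ctnr run ...` in the child container, `denied_syscalls /tmp/trace.*` prints one line per syscall name followed by the number of denied calls.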


## Known errors and workarounds to run a container in another container

_Workarounds you would rather not use_  
_(also see https://github.com/opencontainers/runc/issues/1456)_  

- "running exec setns process for init caused \"exit status 34\""  
  -> inner container: add `--rootless` option (if that has no effect: add the `setns` syscall to the list of `SCMP_ACT_ALLOW` calls (TODO: which syscall exactly?))  
  -> {root} (outer container: add `--seccomp=unconfined` option)  
  -> add `--cap-add=SYS_ADMIN` to rootless outer container and `--rootless` to inner
- "mkdir /sys/fs/cgroup/cpuset/05dh[...]: permission denied"  
  -> inner container: add `--rootless` option  
  -> {ctnr} outer container: add `--mount-cgroup=rw` option
- "could not create session key: operation not permitted"  
  -> inner container: enable `--no-new-keyring` option  
  -> outer container: allow corresponding syscall in seccomp profile (dirty: set `--seccomp=unconfined`)
- "pivot_root operation not permitted"  
  -> inner container: enable `--no-pivot` option  
  -> outer container: seccomp: add the `pivot_root` syscall to the list of `SCMP_ACT_ALLOW` calls
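
Several of these workarounds depend on capabilities of the outer container, such as `SYS_ADMIN`. Whether a capability is actually effective can be checked from inside the container. This is a minimal sketch that decodes the `CapEff:` bitmask in `/proc/self/status`; the function name `has_capability` is hypothetical, and the bit numbers (21 for `CAP_SYS_ADMIN`, 19 for `CAP_SYS_PTRACE`) are taken from the kernel's capability list.

```shell
#!/bin/sh
# Succeed if the capability with the given bit number is in the effective
# set of the current process.
# Usage: has_capability <bit> [status-file]
# The optional status-file argument is an assumption made for testability;
# it defaults to /proc/self/status.
has_capability() {
	bit="$1"
	status_file="${2:-/proc/self/status}"
	# CapEff is a hexadecimal bitmask without a 0x prefix.
	cap_eff="$(awk '$1 == "CapEff:" { print $2 }' "$status_file")"
	[ -n "$cap_eff" ] || return 2
	[ $(( (0x$cap_eff >> $bit) & 1 )) -eq 1 ]
}
```

For example, `has_capability 21 && echo "CAP_SYS_ADMIN available"` can be run in the outer container before starting the inner one.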

*Note regarding cgroups*:
The cgroup hierarchy can be mounted into a container using `--mount-cgroups=rw`.
Currently this is a security vulnerability since all cgroups are mounted writable.
With kernel >=4.6 it is possible to make only the process' own cgroups writable
(see https://github.com/opencontainers/runc/issues/225).
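
To see which cgroups the current process belongs to, and therefore which paths would have to be writable if only the process' own cgroups were exposed, `/proc/self/cgroup` can be inspected. A small sketch; the function name `own_cgroups` is hypothetical and the output format is chosen for illustration only.

```shell
#!/bin/sh
# Print "<controllers> <path>" for every cgroup the current process is in.
# The optional argument (path to a cgroup file) is an assumption made for
# testability; it defaults to /proc/self/cgroup.
own_cgroups() {
	cgroup_file="${1:-/proc/self/cgroup}"
	# Each line has the form hierarchy-ID:controller-list:path.
	# The last variable receives the rest of the line, so colons inside
	# the path are preserved.
	while IFS=: read -r _id controllers path; do
		# cgroup v2 entries have an empty controller list.
		echo "${controllers:-unified} ${path}"
	done < "$cgroup_file"
}
```

Calling `own_cgroups` without arguments inside a container lists the cgroup paths of the calling shell.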


## Summary so far
Containers can be run in privileged containers but nesting them in unprivileged containers is still problematic.
Docker's sane default seccomp and apparmor profiles deny syscalls that are required to run a container.
The seccomp profile denies `setns` and a few other syscalls. The apparmor profile denies `mount`.
Unfortunately the nested container still doesn't run when a custom seccomp profile is provided and apparmor is disabled
(or, better, replaced by a custom profile that allows `mount`): `/proc` cannot be mounted because it is masked by docker
(see https://github.com/opencontainers/runc/issues/1658,
https://lists.linuxfoundation.org/pipermail/containers/2018-April/038864.html
and https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1533642.html).