# Running Stable Diffusion on GPU with gVisor

gVisor is [starting to support GPU][gVisor GPU support] workloads. This post
showcases running the [Stable Diffusion] generative model from [Stability AI] to
generate images using a GPU from within gVisor. Both the
[Automatic1111 Stable Diffusion web UI][automatic1111/stable-diffusion-webui]
and the [PyTorch] code used by Stable Diffusion were run entirely within gVisor
while being able to leverage the NVIDIA GPU.

![A sandboxed GPU](/assets/images/2023-06-20-sandboxed-gpu.png "A sandboxed GPU.")
<span class="attribution">**Sand**boxing a GPU. Generated with Stable Diffusion
v1.5.<br/>This picture gets a lot deeper once you realize that GPUs are made out
of sand.</span>

--------------------------------------------------------------------------------

## Disclaimer

As of this writing (2023-06), [gVisor's GPU support][gVisor GPU support] is not
generalized. Only some PyTorch workloads have been tested on NVIDIA T4, L4,
A100, and H100 GPUs, using the specific driver versions `525.60.13` and
`525.105.17`. Contributions are welcome to expand this set to support other GPUs
and driver versions!

Additionally, while gVisor does its best to sandbox the workload, interacting
with the GPU inherently requires running code on GPU hardware, where isolation
is enforced by the GPU driver and hardware itself rather than gVisor. More to
come soon on the value of the protection gVisor provides for GPU workloads.

In a few months, gVisor's GPU support will have broadened and become easier to
use, such that it will not be constrained to the specific sets of versions used
here. In the meantime, this blog post stands as an example of what's possible
today with gVisor's GPU support.

![Various space suit helmets](/assets/images/2023-06-20-spacesuit-helmets.png "Various space suit helmets."){:width="100%"}
<span class="attribution">**A collection of astronaut helmets in various styles**.<br/>Other than the helmet in the center, each helmet was generated using Stable Diffusion v1.5.</span>

## Why even do this?

The recent explosion of machine learning models has led to a large number of new
open-source projects. Much like it is good practice to be careful about running
new software downloaded from the Internet, it is good practice to run new
open-source projects in a sandbox. For projects like the
[Automatic1111 Stable Diffusion web UI][automatic1111/stable-diffusion-webui],
which automatically download various models, components, and
[extensions][Stable Diffusion Web UI extensions] from external repositories as
the user enables them in the web UI, this principle applies all the more.

Additionally, within the machine learning space, tooling for packaging and
distributing models is still nascent. While some models (including Stable
Diffusion) are packaged using the more secure [safetensors] format, **the
majority of models available online today are distributed using the
[Pickle format], which can execute arbitrary Python code** upon deserialization.
As such, even when using trustworthy software, using Pickle-formatted models may
still be risky (**Edited 2024-04-04:
[this exact vulnerability vector was found in Hugging Face's Inference API](https://www.wiz.io/blog/wiz-and-hugging-face-address-risks-to-ai-infrastructure)**).
gVisor provides a layer of protection around this process which helps protect
the host machine.

Third, **machine learning applications are typically not I/O heavy**, which
means they tend not to experience a significant performance overhead when run in
gVisor. Uploading code to the GPU does not require a significant number of
system calls, and most communication to/from the GPU happens over shared memory,
where gVisor imposes no overhead. Therefore, the question is not so much "why
should I run this GPU workload in gVisor?" but rather "why not?".

![Cool astronauts don't look at explosions](/assets/images/2023-06-20-turbo.png "Cool astronauts don't look at explosions.")
<span class="attribution">**Cool astronauts don't look at explosions**.
Generated using Stable Diffusion v1.5.</span>

Lastly, running GPU workloads in gVisor is pretty cool.

## Setup

We use a Debian virtual machine on GCE. The machine needs to have a GPU and to
have sufficient RAM and disk space to handle Stable Diffusion and its large
model files. The following command creates a VM with 4 vCPUs, 15GiB of RAM, 64GB
of disk space, and an NVIDIA T4 GPU, running Debian 11 (bullseye). Since this is
just an experiment, the VM is set to self-destruct after 6 hours.

```shell
$ gcloud compute instances create stable-diffusion-testing \
    --zone=us-central1-a \
    --machine-type=n1-standard-4 \
    --max-run-duration=6h \
    --instance-termination-action=DELETE \
    --maintenance-policy=TERMINATE \
    --accelerator=count=1,type=nvidia-tesla-t4 \
    --create-disk=auto-delete=yes,boot=yes,device-name=stable-diffusion-testing,image=projects/debian-cloud/global/images/debian-11-bullseye-v20230509,mode=rw,size=64
$ gcloud compute ssh --zone=us-central1-a stable-diffusion-testing
```

All further commands in this post are performed while SSH'd into the VM. We
first need to install the specific NVIDIA driver version that gVisor is
currently compatible with.

```shell
$ sudo apt-get update && sudo apt-get -y upgrade
$ sudo apt-get install -y build-essential linux-headers-$(uname -r)
$ DRIVER_VERSION=525.60.13
$ curl -fsSL -O "https://us.download.nvidia.com/tesla/$DRIVER_VERSION/NVIDIA-Linux-x86_64-$DRIVER_VERSION.run"
$ sudo sh NVIDIA-Linux-x86_64-$DRIVER_VERSION.run
```

<!--
The above in a single line, for convenience:
DRIVER_VERSION=525.60.13; sudo apt-get update && sudo apt-get -y upgrade && sudo apt-get install -y build-essential linux-headers-$(uname -r) && curl -fsSL -O "https://us.download.nvidia.com/tesla/$DRIVER_VERSION/NVIDIA-Linux-x86_64-$DRIVER_VERSION.run" && sudo sh NVIDIA-Linux-x86_64-$DRIVER_VERSION.run
-->

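Because gVisor's GPU support is currently tied to the driver versions listed in
the disclaimer, it can be worth confirming the installed version before going
further. The following is a minimal sketch, not part of the original setup: the
helper names `nvrm_version` and `is_supported_driver` are hypothetical, and the
parsing assumes the `NVRM version` line format that
`/proc/driver/nvidia/version` prints for this driver.

```shell
#!/bin/sh
# Driver versions this post reports as tested with gVisor (as of 2023-06).
SUPPORTED_DRIVERS="525.60.13 525.105.17"

# Hypothetical helper: reads the contents of /proc/driver/nvidia/version on
# stdin and prints the version number found on the "NVRM version" line.
nvrm_version() {
  sed -n 's/^NVRM version:.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p'
}

# Hypothetical helper: succeeds if the version given as $1 is in
# SUPPORTED_DRIVERS, fails otherwise.
is_supported_driver() {
  for v in $SUPPORTED_DRIVERS; do
    [ "$v" = "$1" ] && return 0
  done
  return 1
}
```

On the VM, this could be used as
`is_supported_driver "$(nvrm_version < /proc/driver/nvidia/version)" || echo "untested driver version"`.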
Next, we install Docker, per [its instructions][Docker installation on Debian].

```shell
$ sudo apt-get install -y ca-certificates curl gnupg
$ sudo install -m 0755 -d /etc/apt/keyrings
$ curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg --dearmor --batch --yes -o /etc/apt/keyrings/docker.gpg
$ sudo chmod a+r /etc/apt/keyrings/docker.gpg
$ echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/debian $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
$ sudo apt-get update && sudo apt-get install -y docker-ce docker-ce-cli
```

<!--
The above in a single line, for convenience:
sudo apt-get install -y ca-certificates curl gnupg && sudo install -m 0755 -d /etc/apt/keyrings && curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg --dearmor --batch --yes -o /etc/apt/keyrings/docker.gpg && sudo chmod a+r /etc/apt/keyrings/docker.gpg && echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/debian $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null && sudo apt-get update && sudo apt-get install -y docker-ce docker-ce-cli
-->

We will also need the [NVIDIA container toolkit], which enables use of GPUs with
Docker. Per its
[installation instructions][NVIDIA container toolkit installation]:

```shell
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
$ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
```

Of course, we also need to [install gVisor][gVisor setup] itself.

```shell
$ sudo apt-get install -y apt-transport-https ca-certificates curl gnupg
$ curl -fsSL https://gvisor.dev/archive.key | sudo gpg --dearmor -o /usr/share/keyrings/gvisor-archive-keyring.gpg
$ echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/gvisor-archive-keyring.gpg] https://storage.googleapis.com/gvisor/releases release main" | sudo tee /etc/apt/sources.list.d/gvisor.list > /dev/null
$ sudo apt-get update && sudo apt-get install -y runsc

# As gVisor does not yet enable GPU support by default, we need to set the flags
# that will enable it:
$ sudo runsc install -- --nvproxy=true --nvproxy-docker=true

$ sudo systemctl restart docker
```

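To double-check that the GPU flags were actually recorded, one option is to look
for them in the Docker daemon configuration. This is a sketch under an
assumption: that `runsc install` registers the runtime and its arguments in
`/etc/docker/daemon.json`, which is the usual location but may differ on other
setups. The helper name `has_nvproxy_flags` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the file at path $1 mentions both of
# the GPU-related flags passed to `runsc install` above.
# `-F` treats the pattern as a fixed string; `-e` is needed because the
# pattern itself starts with dashes.
has_nvproxy_flags() {
  grep -qF -e '--nvproxy=true' "$1" && grep -qF -e '--nvproxy-docker=true' "$1"
}
```

For example: `has_nvproxy_flags /etc/docker/daemon.json && echo "GPU flags present"`.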
Now, let's make sure everything works by running commands that involve more and
more of what we just set up.

```shell
# Check that the NVIDIA drivers are installed, with the right version, and with
# a supported GPU attached
$ sudo nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-6a96a2af-2271-5627-34c5-91dcb4f408aa)
$ sudo cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  525.60.13  Wed Nov 30 06:39:21 UTC 2022

# Check that Docker works.
$ sudo docker version
# [...]
Server: Docker Engine - Community
 Engine:
  Version:          24.0.2
# [...]

# Check that gVisor works.
$ sudo docker run --rm --runtime=runsc debian:latest dmesg | head -1
[    0.000000] Starting gVisor...

# Check that Docker GPU support (without gVisor) works.
$ sudo docker run --rm --gpus=all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-6a96a2af-2271-5627-34c5-91dcb4f408aa)

# Check that gVisor works with the GPU.
$ sudo docker run --rm --runtime=runsc --gpus=all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-6a96a2af-2271-5627-34c5-91dcb4f408aa)
```

We're all set! Now we can actually get Stable Diffusion running.

We used the following `Dockerfile` to run Stable Diffusion and its web UI within
a GPU-enabled Docker container.

```dockerfile
FROM python:3.10

# Set of dependencies that are needed to make this work.
RUN apt-get update && apt-get install -y git wget build-essential \
        nghttp2 libnghttp2-dev libssl-dev ffmpeg libsm6 libxext6
# Clone the project at the revision used for this test.
RUN git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git && \
    cd /stable-diffusion-webui && \
    git checkout baf6946e06249c5af9851c60171692c44ef633e0
# We don't want the build step to start the server.
RUN sed -i '/start()/d' /stable-diffusion-webui/launch.py
# Install some pip packages.
# Note that this command will run as part of the Docker build process,
# which is *not* sandboxed by gVisor.
RUN cd /stable-diffusion-webui && COMMANDLINE_ARGS=--skip-torch-cuda-test python launch.py
WORKDIR /stable-diffusion-webui
# This causes the web UI to use the Gradio service to create a public URL.
# Do not use this if you plan on leaving the container running long-term.
ENV COMMANDLINE_ARGS=--share
# Start the webui app.
CMD ["python", "webui.py"]
```

We build the image and create a container with it using the `docker`
command-line.

```shell
$ cat > Dockerfile
(... Paste the above contents...)
^D
$ sudo docker build --tag=sdui .
```

Finally, we can start the Stable Diffusion web UI. Note that it will take a long
time to start, as it has to download all the models from the Internet. To keep
this post simple, we didn't set up any kind of volume that would enable data
persistence, so it will do this every time the container starts.

```shell
$ sudo docker run --runtime=runsc --gpus=all --name=sdui --detach sdui

# Follow the logs:
$ sudo docker logs -f sdui
# [...]
Calculating sha256 for /stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors: Running on local URL:
Running on public URL: https://4446d982b4129a66d7.gradio.live

This share link expires in 72 hours.
# [...]
```

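As mentioned, this setup downloads the models again on every container start.
One possible way to avoid that, which was not tested as part of this post, is to
mount a Docker named volume over the models directory that appears in the logs.
The sketch below only prints the command so it can be reviewed before running
it; the volume name `sdui-models` and the helper `sdui_run_cmd` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not run) a variant of the `docker run`
# command above that mounts a named volume over the models directory.
# $1 is the volume name and defaults to "sdui-models".
sdui_run_cmd() {
  volume="${1:-sdui-models}"
  printf '%s\n' "sudo docker run --runtime=runsc --gpus=all --name=sdui --detach --volume=${volume}:/stable-diffusion-webui/models sdui"
}
```

Printing the command instead of executing it keeps the sketch free of side
effects; whether model downloads then persist across restarts under gVisor would
need to be verified.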
We're all set! Now we can browse to the Gradio URL shown in the logs and start
generating pictures, all within the secure confines of gVisor.

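Rather than scanning the logs by eye, the public URL can be pulled out with a
small filter. This is a minimal sketch that assumes the
`Running on public URL:` line format shown in the logs; the helper name
`gradio_url` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads container logs on stdin and prints the first
# URL that follows "Running on public URL:".
gradio_url() {
  sed -n 's#^Running on public URL: *\(https://[^ ]*\).*#\1#p' | head -n 1
}
```

For example: `sudo docker logs sdui 2>&1 | gradio_url`.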
![Stable Diffusion Web UI](/assets/images/2023-06-20-stable-diffusion-web-ui.png "Stable Diffusion UI."){:width="100%"}
<span class="attribution">**Stable Diffusion Web UI screenshot.** Inner image
generated with Stable Diffusion v1.5.</span>

Happy sandboxing!

![Astronaut showing thumbs up](/assets/images/2023-06-20-astronaut-thumbs-up.png "Astronaut showing thumbs up.")
<span class="attribution">**Happy sandboxing!** Generated with Stable Diffusion
v1.5.</span>

[gVisor GPU support]: https://github.com/google/gvisor/blob/master/g3doc/proposals/nvidia_driver_proxy.md
[Stable Diffusion]: https://stability.ai/blog/stable-diffusion-public-release
[Stability AI]: https://stability.ai/
[automatic1111/stable-diffusion-webui]: https://github.com/AUTOMATIC1111/stable-diffusion-webui
[Stable Diffusion Web UI extensions]: https://github.com/AUTOMATIC1111/stable-diffusion-webui-extensions/blob/master/index.json
[PyTorch]: https://pytorch.org/
[safetensors]: https://github.com/huggingface/safetensors
[Pickle format]: https://www.splunk.com/en_us/blog/security/paws-in-the-pickle-jar-risk-vulnerability-in-the-model-sharing-ecosystem.html
[Docker installation on Debian]: https://docs.docker.com/engine/install/debian/
[NVIDIA container toolkit]: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html
[NVIDIA container toolkit installation]: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
[gVisor setup]: https://gvisor.dev/docs/user_guide/install/