# gVisor Networking Security

In our
[first blog post](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/),
we covered some secure design principles and how they guided the architecture of
gVisor as a whole. In this post, we will cover how these principles guided the
networking architecture of gVisor, and the tradeoffs involved. In particular, we
will cover how these principles culminated in two networking modes, how they
work, and the properties of each.

## gVisor's security architecture in the context of networking

Linux networking is complicated. TCP is over 40 years old and has been
repeatedly extended over the years to keep up with the rapid pace of network
infrastructure improvements, all while maintaining compatibility. On top of
that, Linux networking has a fairly large API surface. Linux supports
[over 150 options](https://github.com/google/gvisor/blob/960f6a975b7e44c0efe8fd38c66b02017c4fe137/pkg/sentry/strace/socket.go#L476-L644)
for the most common socket types alone. In fact, the net subsystem is one of the
largest and fastest-growing in Linux at approximately 1.1 million lines of code.
For comparison, that is several times the size of the entire gVisor codebase.

At the same time, networking is increasingly important. The cloud era is
arguably about making everything a network service, and to make that work,
interconnect performance is critical. Adding networking support to gVisor was
difficult, not just due to the inherent complexity, but also because it has the
potential to significantly weaken gVisor's security model.

As outlined in the previous blog post, gVisor's
[secure design principles](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#design-principles)
are:

1.  Defense in Depth: each component of the software stack trusts each other
    component as little as possible.
1.  Least Privilege: each software component has only the permissions it needs
    to function, and no more.
1.  Attack Surface Reduction: limit the surface area of the host exposed to the
    sandbox.
1.  Secure by Default: the default choice for a user should be safe.

gVisor manifests these principles as a multi-layered system. An application
running in the sandbox interacts with the Sentry, a userspace kernel, which
mediates all interactions with the Host OS and beyond. The Sentry is written in
pure Go with minimal unsafe code, making it less vulnerable to buffer overflows
and related memory bugs that can lead to a variety of compromises including code
injection. It emulates Linux using only a minimal and audited set of Host OS
syscalls, which limits the Host OS attack surface exposed to the Sentry itself.
These restrictions are enforced by seccomp filters, which ensure that the Sentry
can use only the expected set of syscalls. The Sentry also runs as an
unprivileged user and in namespaces, which, along with the seccomp filters,
ensure that the Sentry runs with the Least Privilege required.

gVisor's multi-layered design provides Defense in Depth. The Sentry, which does
not trust the application because it may attack the Sentry and try to bypass it,
is the first layer. The sandbox that the Sentry runs in is the second layer. If
the Sentry were compromised, the attacker would still be in a highly restrictive
sandbox which they must also break out of in order to compromise the Host OS.

To enable networking functionality while preserving gVisor's security
properties, we implemented a
[userspace network stack](https://github.com/google/gvisor/tree/master/pkg/tcpip)
in the Sentry, which we creatively named Netstack. Netstack is also written in
Go, not only to avoid unsafe code in the network stack itself, but also to avoid
a complicated and unsafe Foreign Function Interface. Having its own integrated
network stack allows the Sentry to implement networking operations using up to
three Host OS syscalls to read and write packets. These syscalls permit only a
minimal set of operations, all of which are already allowed (either through the
same or a similar syscall). Moreover, because packets typically come from
off-host (e.g. the internet), the Host OS's packet processing code has received
a lot of scrutiny, hopefully resulting in a high degree of hardening.

![Figure 1](/assets/images/2020-04-02-networking-security-figure1.png "Network and gVisor.")
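To make the packet-level interface described above more concrete, here is a
minimal Go sketch of what handling a raw packet in userspace can look like:
bytes come in, and every field is read through bounds-checked slice operations.
This is not Netstack code. The names `ipv4Header`, `parseIPv4`, and
`samplePacket` are hypothetical, invented for this illustration, and only a
handful of IPv4 header fields are handled.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"net"
)

// ipv4Header holds a few fields of an IPv4 header. It is a hypothetical type
// for this sketch and is not part of Netstack.
type ipv4Header struct {
	HeaderLen int // header length in bytes
	TotalLen  int // total packet length in bytes
	TTL       uint8
	Protocol  uint8 // e.g. 6 for TCP, 17 for UDP
	Src, Dst  net.IP
}

const minIPv4HeaderLen = 20

var errBadPacket = errors.New("malformed IPv4 packet")

// parseIPv4 validates all lengths before reading any field, so a truncated or
// hostile packet yields an error rather than an out-of-bounds access.
func parseIPv4(b []byte) (ipv4Header, error) {
	if len(b) < minIPv4HeaderLen {
		return ipv4Header{}, errBadPacket
	}
	if version := b[0] >> 4; version != 4 {
		return ipv4Header{}, errBadPacket
	}
	hdrLen := int(b[0]&0x0f) * 4 // IHL is counted in 32-bit words
	totalLen := int(binary.BigEndian.Uint16(b[2:4]))
	if hdrLen < minIPv4HeaderLen || hdrLen > len(b) ||
		totalLen < hdrLen || totalLen > len(b) {
		return ipv4Header{}, errBadPacket
	}
	return ipv4Header{
		HeaderLen: hdrLen,
		TotalLen:  totalLen,
		TTL:       b[8],
		Protocol:  b[9],
		// Copy the addresses so the header does not alias the packet buffer.
		Src: net.IP(append([]byte(nil), b[12:16]...)),
		Dst: net.IP(append([]byte(nil), b[16:20]...)),
	}, nil
}

// samplePacket builds a 20-byte IPv4 header with no payload, for demonstration.
func samplePacket() []byte {
	b := make([]byte, minIPv4HeaderLen)
	b[0] = 0x45 // version 4, header length of 5 words
	binary.BigEndian.PutUint16(b[2:4], minIPv4HeaderLen)
	b[8] = 64 // TTL
	b[9] = 6  // TCP
	copy(b[12:16], []byte{10, 0, 0, 1})
	copy(b[16:20], []byte{10, 0, 0, 2})
	return b
}

func main() {
	pkt := samplePacket()
	h, err := parseIPv4(pkt)
	fmt.Println(h.Protocol, h.Src, h.Dst, err)

	// A truncated packet is rejected instead of being read past its end.
	_, err = parseIPv4(pkt[:10])
	fmt.Println(err)
}
```

Even if a length check like the ones above were missing, an out-of-range slice
access in Go would panic rather than silently read adjacent memory, which is the
class of bug the choice of language is meant to contain.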

## Writing a network stack

Netstack was written from scratch specifically for gVisor. Because Netstack was
designed and implemented to be modular, flexible and self-contained, there are
now several more projects using Netstack in creative and exciting ways. As we
discussed, a custom network stack has enabled a variety of security-related
goals which would not have been possible any other way. This came at a cost,
though. Network stacks are complex and writing a new one comes with many
challenges, mostly related to application compatibility and performance.

Compatibility issues typically come in two forms: missing features, and features
with behavior that differs from Linux (usually due to bugs). Both of these are
inevitable in an implementation of a complex system spanning many quickly
evolving and ambiguous standards. However, we have invested heavily in this
area, and the vast majority of applications have no issues using Netstack. For
example,
[we now support setting 34 different socket options](https://github.com/google/gvisor/blob/815df2959a76e4a19f5882e40402b9bbca9e70be/pkg/sentry/socket/netstack/netstack.go#L830-L1764)
versus
[only 7 in our initial git commit](https://github.com/google/gvisor/blob/d02b74a5dcfed4bfc8f2f8e545bca4d2afabb296/pkg/sentry/socket/epsocket/epsocket.go#L445-L702).
We are continuing to make good progress in this area.

Performance issues typically come from TCP behavior and packet processing speed.
To improve our TCP behavior, we are working on implementing the full set of TCP
RFCs. There are many RFCs which are significant to performance (e.g.
[RACK](https://tools.ietf.org/id/draft-ietf-tcpm-rack-03.html) and
[BBR](https://tools.ietf.org/html/draft-cardwell-iccrg-bbr-congestion-control-00))
that we have yet to implement. This mostly affects TCP performance with
non-ideal network conditions (e.g. cross-continent connections). Faster packet
processing mostly improves TCP performance when network conditions are very good
(e.g. within a datacenter). Our primary strategy here is to reduce interactions
with the Go runtime, specifically the garbage collector (GC) and scheduler. We
are currently optimizing buffer management to reduce the amount of garbage,
which will lower the GC cost. To reduce scheduler interactions, we are
re-architecting the TCP implementation to use fewer goroutines.

Performance today is good enough for most applications and we are making steady
improvements. For example, since May of 2019, we have improved the Netstack
runsc
[iperf3 download benchmark](https://github.com/google/gvisor/tree/master/test/benchmarks/network)
score by roughly 15% and upload score by around 10,000x. Current numbers are
about 17 Gbps download and about 8 Gbps upload, versus about 42 Gbps and 43
Gbps, respectively, for native Linux.
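As an illustration of the garbage-reduction strategy mentioned above, the
following Go sketch reuses packet buffers through the standard library's
`sync.Pool` instead of allocating a fresh slice for every packet. This is a
generic technique and not a description of how Netstack manages its buffers;
`bufPool`, `maxPacketSize`, and `handlePacket` are hypothetical names chosen for
this example.

```go
package main

import (
	"fmt"
	"sync"
)

// maxPacketSize is an assumed upper bound on packet size for this sketch.
const maxPacketSize = 1500

// bufPool hands out reusable scratch buffers. Storing *[]byte rather than
// []byte avoids an extra allocation each time a buffer is put back.
var bufPool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, maxPacketSize)
		return &b
	},
}

// handlePacket copies the payload into a pooled buffer, performs some stand-in
// work on it (a byte sum), and returns the buffer to the pool when done.
func handlePacket(payload []byte) (sum uint32, n int) {
	bp := bufPool.Get().(*[]byte)
	defer bufPool.Put(bp)

	n = copy(*bp, payload) // payloads larger than maxPacketSize are truncated
	for _, c := range (*bp)[:n] {
		sum += uint32(c)
	}
	return sum, n
}

func main() {
	sum, n := handlePacket([]byte{1, 2, 3, 4})
	fmt.Println(sum, n)
}
```

Note that `sync.Pool` may discard idle buffers at any time, so it reduces
allocation pressure without guaranteeing that a given buffer is reused.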

## An alternative

We also offer an alternative network mode: passthrough. This name can be
misleading as syscalls are never passed through from the app to the Host OS.
Instead, the passthrough mode implements networking in gVisor using the Host
OS's network stack. (This mode is called
[hostinet](https://github.com/google/gvisor/tree/master/pkg/sentry/socket/hostinet)
in the codebase.) Passthrough mode can improve performance for some use cases as
the Host OS's network stack has had an enormous number of person-years poured
into making it highly performant. However, there is a rather large downside to
using passthrough mode: it weakens gVisor's security model by increasing the
Host OS's Attack Surface. This is because using the Host OS's network stack
requires the Sentry to use the Host OS's
[Berkeley socket interface](https://en.wikipedia.org/wiki/Berkeley_sockets). The
Berkeley socket interface is a much larger API surface than the packet interface
that our network stack uses. When passthrough mode is in use, the Sentry is
allowed to use
[15 additional syscalls](https://github.com/google/gvisor/blob/b1576e533223e98ebe4bd1b82b04e3dcda8c4bf1/runsc/boot/filter/config.go#L312-L517).
Further, this set of syscalls includes some that allow the Sentry to create file
descriptors, something that
[we don't normally allow](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#sentry-host-os-interface)
as it opens up classes of file-based attacks.

Some networking features (most notably those behind
[ioctl](http://man7.org/linux/man-pages/man2/ioctl.2.html)) can't be
implemented on top of syscalls that we consider safe, so passthrough mode does
not support them. Because of this, we actually support fewer networking features
in passthrough mode than we do in Netstack, reducing application compatibility.
That's right: using our networking stack provides better overall application
compatibility than using our passthrough mode.

That said, gVisor with passthrough networking still provides a high level of
isolation. Applications cannot specify host syscall arguments directly, and the
Sentry's seccomp policy restricts its syscall use significantly more than a
general-purpose seccomp policy.
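The difference in attack surface between the two modes can be modeled as a
difference between two syscall allowlists. The Go sketch below is only a model
of the policy: real enforcement is done by seccomp filters in the host kernel,
not by application code. The syscall names are real Linux syscalls, but
`baseSyscalls` and `passthroughExtras` are small illustrative subsets assumed
for this example, not the contents of gVisor's filter configuration, and all
identifiers are hypothetical.

```go
package main

import "fmt"

type networkMode int

const (
	modeNetstack networkMode = iota
	modePassthrough
)

// Illustrative subsets only (an assumption of this sketch); the real filter
// lists are much longer and live in gVisor's seccomp configuration.
var (
	baseSyscalls      = []string{"read", "write", "futex", "epoll_pwait"}
	passthroughExtras = []string{"socket", "bind", "connect", "listen", "accept4"}
)

// allowlist is the set of host syscalls the sandbox process may invoke.
type allowlist map[string]struct{}

// newAllowlist builds the policy for a mode: passthrough gets everything the
// default mode gets, plus the socket-related extras.
func newAllowlist(mode networkMode) allowlist {
	a := allowlist{}
	for _, s := range baseSyscalls {
		a[s] = struct{}{}
	}
	if mode == modePassthrough {
		for _, s := range passthroughExtras {
			a[s] = struct{}{}
		}
	}
	return a
}

// permits reports whether the named syscall is in the allowlist.
func (a allowlist) permits(name string) bool {
	_, ok := a[name]
	return ok
}

func main() {
	netstack := newAllowlist(modeNetstack)
	passthrough := newAllowlist(modePassthrough)

	// socket(2) and accept4(2) both return new file descriptors.
	fmt.Println("netstack permits socket:", netstack.permits("socket"))
	fmt.Println("passthrough permits socket:", passthrough.permits("socket"))
	fmt.Println("additional syscalls:", len(passthrough)-len(netstack))
}
```

In this model, every entry in the extras list is host kernel code that becomes
reachable from a compromised Sentry only when passthrough mode is selected.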

## Secure by Default

The goal of the Secure by Default principle is to make it easy to securely
sandbox containers. Of course, disabling network access entirely is the most
secure option, but that is not practical for most applications. To make gVisor
Secure by Default, we have made Netstack the default networking mode in gVisor
as we believe that it provides significantly better isolation. For this reason
we strongly caution users against changing the default unless Netstack flat out
won't work for them. The passthrough mode option is still provided, but we want
users to make an informed decision when selecting it.

Another way in which gVisor makes it easy to securely sandbox containers is by
allowing applications to run unmodified, with no special configuration needed.
In order to do this, gVisor needs to support all of the features and syscalls
that applications use. Neither seccomp alone nor gVisor's passthrough mode can
do this, as applications commonly use syscalls that are too dangerous to include
in a secure policy. Even if this dream isn't fully realized today, gVisor's
architecture with Netstack makes it possible.

## Give Netstack a Try

If you haven't already, try running a workload in gVisor with Netstack. You can
find instructions on how to get started in our
[Quick Start](/docs/user_guide/quick_start/docker/). We want to hear about both
your successes and any issues you encounter. We welcome your contributions,
whether that be verbal feedback or code contributions, via our
[Gitter channel](https://gitter.im/gvisor/community),
[email list](https://groups.google.com/forum/#!forum/gvisor-users),
[issue tracker](https://gvisor.dev/issue/new), and
[GitHub repository](https://github.com/google/gvisor). Feel free to express
interest in an [open issue](https://gvisor.dev/issue/), or reach out if you
aren't sure where to start.