Source: Open-sourcing gVisor, a sandboxed container runtime from Google Cloud Platform
By Nicolas Lacasse, Software Engineer
Containers have revolutionized how we develop, package, and deploy applications. However, the system surface exposed to containers is broad enough that many security experts don’t recommend them for running untrusted or potentially malicious applications.
A growing desire to run more heterogenous and less trusted workloads has created a new interest in sandboxed containers—containers that help provide a secure isolation boundary between the host OS and the application running inside the container.
To that end, we’d like to introduce gVisor, a new kind of sandbox that helps provide secure isolation for containers, while being more lightweight than a virtual machine (VM). gVisor integrates with Docker and Kubernetes, making it simple and easy to run sandboxed containers in production environments.
Applications that run in traditional Linux containers access system resources in the same way that regular (non-containerized) applications do: by making system calls directly to the host kernel. The kernel runs in a privileged mode that allows it to interact with the necessary hardware and return results to the application.
With traditional containers, the kernel imposes some limits on the resources the application can access. These limits are implemented through the use of Linux cgroups and namespaces, but not all resources can be controlled via these mechanisms. Furthermore, even with these limits, the kernel still exposes a large surface area that malicious applications can attack directly.
Kernel features like seccomp filters can provide better isolation between the application and host kernel, but they require the user to create a predefined whitelist of system calls. In practice, it’s often difficult to know which system calls will be required by an application beforehand. Filters also provide little help when a vulnerability is discovered in a system call that your application requires.
One approach to improve container isolation is to run each container in its own virtual machine (VM). This gives each container its own “machine,” including kernel and virtualized devices, completely separate from the host. Even if there is a vulnerability in the guest, the hypervisor still isolates the host, as well as other applications/containers running on the host.
Running containers in distinct VMs provides great isolation, compatibility, and performance, but may also require a larger resource footprint.
Kata containers is an open-source project that uses stripped-down VMs to keep the resource footprint minimal and maximize performance for container isolation. Like gVisor, Kata contains an Open Container Initiative (OCI) runtime that is compatible with Docker and Kubernetes.
gVisor is more lightweight than a VM while maintaining a similar level of isolation. The core of gVisor is a kernel that runs as a normal, unprivileged process that supports most Linux system calls. This kernel is written in Go, which was chosen for its memory- and type-safety. Just like within a VM, an application running in a gVisor sandbox gets its own kernel and set of virtualized devices, distinct from the host and other sandboxes.
gVisor provides a strong isolation boundary by intercepting application system calls and acting as the guest kernel, all while running in user-space. Unlike a VM which requires a fixed set of resources on creation, gVisor can accommodate changing resources over time, as most normal Linux processes do. gVisor can be thought of as an extremely paravirtualized operating system with a flexible resource footprint and lower fixed cost than a full VM. However, this flexibility comes at the price of higher per-system call overhead and application compatibility—more on that below.
“Secure workloads are a priority for the industry. We are encouraged to see innovative approaches like gVisor and look forward to collaborating on specification clarifications and making improvements to joint technical components in order to bring additional security to the ecosystem.”
— Samuel Ortiz, member of the Kata Technical Steering Committee and Principal Engineer at Intel Corporation
“Hyper is encouraged to see gVisor’s novel approach to container isolation. The industry requires a robust ecosystem of secure container technologies, and we look forward to collaborating on gVisor to help bring secure containers into the mainstream.”
— Xu Wang, member of the Kata Technical Steering Committee and CTO at Hyper.sh
The gVisor runtime integrates seamlessly with Docker and Kubernetes though runsc (short for “run Sandboxed Container”), which conforms to the OCI runtime API.
The runsc runtime is interchangeable with runc, Docker’s default container runtime. Installation is simple; once installed it only takes a single additional flag to run a sandboxed container in Docker:
$ docker run --runtime=runsc hello-world $ docker run --runtime=runsc -p 3306:3306 mysql
In Kubernetes, most resource isolation occurs at the pod level, making the pod a natural fit for a gVisor sandbox boundary. The Kubernetes community is currently formalizing the sandbox pod API, but experimental support is available today.
gVisor implements a large part of the Linux system API (200 system calls and counting), but not all. Some system calls and arguments are not currently supported, as are some parts of the /proc and /sys filesystems. As a result, not all applications will run inside gVisor, but many will run just fine, including Node.js, Java 8, MySQL, Jenkins, Apache, Redis, MongoDB, and many more.
As developers, we want the best of both worlds: the ease of use and portability of containers, and the resource isolation of VMs. We think gVisor is a great step in that direction. Check out our repo on GitHub to find how to get started with gVisor and to learn more of the technical details behind it. And be sure to join our Google group to take part in the discussion!
If you’re at KubeCon in Copenhagen join us at our booth for a deep dive demo and discussion.
Also check out an interview with the gVisor PM to learn more.