Source: Virtual Trusted Platform Module for Shielded VMs: security in plaintext from Google Cloud
Today, we shared details of Shielded VMs, a suite of security tools and techniques that demonstrate that a VM hasn’t been compromised. As part of the launch, we used Shielded VM to create several of our curated Google Compute Engine instances and attached a virtual Trusted Platform Module 2.0 (TPM) device to them.
“Okay…,” we hear you asking, “what’s a TPM device and why should I care?” We’re glad you asked!
A TPM is a hardware, firmware, or virtual device that aids in securing machines in several ways: it can generate keys, use them for cryptographic operations (e.g., for symmetric and asymmetric key generation, signing, and decryption), and certify them based on its root Endorsement Key (which is in turn certified by the Google Public Root Certificate Authority). The TPM’s root keys and the keys that it generates can’t leave the TPM, thus gaining protection from compromised operating systems or highly privileged project admins. In addition, any private keys that you create on the TPM cannot be exported unless you explicitly configure them as such.
Here’s a high-level summary of a TPM’s main features:
Figure 1. The TPM’s key generation and certification and its system state capture interact heavily
The TPM’s most exciting (and most widely used) feature is its Platform Configuration Registers (PCRs), which provide a concise, append-only log of system state. Using the TPM’s keys, the TPM provides a signed attestation (known as a quote) of the PCR values. Remote servers can use this quote to verify the system’s state. Additionally, the TPM can seal secrets based on the contents of the PCRs, so that the secrets can only be accessed if the system state is valid.
A virtual TPM (vTPM), meanwhile, appears to the guest like a normal TPM device, and complies with the TPM 2.0 specification using FIPS 140-2 L1 certified cryptography. This means our vTPM should work identically to any existing TPMs you may be using with your operating systems. We’ve tested (and officially support) the vTPM with several instances of our Container-Optimized OS, as well as Windows Server 2012 R2, Windows Server 2016, and Ubuntu 18.04, with more to come during the beta.
The presence of TPMs makes it possible to perform any number of security tasks. For example, one kind of secret that is commonly sealed to TPM state is a drive decryption key. Sealing these keys makes it infeasible to decrypt a drive unless the operating system has booted correctly and is in a known-good state. This means that if a malicious attacker compromises your operating system, they will not be able to achieve persistence across reboots; any reboot of a compromised operating system (or firmware, or bootloader) results in a different state for the TPM, so it’s unable to unseal the disk decryption keys.
The TPM’s quote feature enables another common use case—remote attestation. If server A is communicating with server B, the servers may wish to validate that the other is in a known good state, in addition to verifying their authentication tokens.
Here is how such a system could work. (This is by no means the only way to implement an authentication system.)
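As one illustrative sketch of the idea (not the actual Shielded VM implementation): the verifying server sends a fresh nonce, the attesting server answers with a quote over its PCR digest plus that nonce, and the verifier checks the quote against the PCR values it expects. For brevity this sketch uses a symmetric MAC from Python’s standard library; a real TPM quote is a signature made with an asymmetric Attestation Key that chains up to the Endorsement Key.

```python
import hashlib
import hmac
import os

# Hypothetical attestation key for this sketch only; a real TPM signs
# quotes with an asymmetric key that never leaves the device.
AK = os.urandom(32)

def make_quote(pcr_digest: bytes, nonce: bytes) -> bytes:
    # The "quote": an authenticator over the PCR digest and the
    # verifier's nonce (the nonce prevents replay of old quotes).
    return hmac.new(AK, pcr_digest + nonce, hashlib.sha256).digest()

def verify_quote(quote: bytes, expected_pcr: bytes, nonce: bytes) -> bool:
    return hmac.compare_digest(quote, make_quote(expected_pcr, nonce))

# Server B challenges server A with a fresh nonce
nonce = os.urandom(16)
pcrs = hashlib.sha256(b"known-good boot measurements").digest()
quote = make_quote(pcrs, nonce)

assert verify_quote(quote, pcrs, nonce)                               # accepted
assert not verify_quote(quote, hashlib.sha256(b"evil").digest(), nonce)  # rejected
```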
A common question is how a vTPM PCR proves the integrity of the values it stores. Here’s an overview.
PCRs implement the append-only property because they can only be written via a TPM extend operation, which combines the current PCR value with the new value using a cryptographic hash h:

PCR_new = h(PCR_old || value)
Because of this construction, a PCR value uniquely identifies the entire chain of values that was used to generate it. To put it another way, if software extends the PCR with a series of values A, then B, then C, and the PCR value is X after these extend operations, it is cryptographically guaranteed that whenever the PCR contains value X, it has been extended with A, then B, then C (and nothing else).
Figure 2. An example of extend operations on an initially empty PCR. First, the string “hello” is extended into PCR0, then “goodbye”.
Figure 3. Example Python code that manually computes the hashes shown in Figure 2.
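The listing itself isn’t reproduced here, but the computation in Figure 2 can be sketched roughly as follows, assuming SHA-256 and a zero-initialized 32-byte PCR (as in measured boot, the event data is hashed first and the PCR is extended with that digest):

```python
import hashlib

def extend(pcr: bytes, event: bytes) -> bytes:
    # TPM extend: PCR_new = H(PCR_old || H(event_data))
    return hashlib.sha256(pcr + hashlib.sha256(event).digest()).digest()

pcr0 = b"\x00" * 32           # PCRs start out zero-initialized
pcr0 = extend(pcr0, b"hello")
pcr0 = extend(pcr0, b"goodbye")
print(pcr0.hex())             # final PCR0 value after both extends

# The chain is order-sensitive: extending "goodbye" then "hello"
# yields a different PCR value, so a PCR value uniquely identifies
# the whole sequence of extends that produced it.
other = extend(extend(b"\x00" * 32, b"goodbye"), b"hello")
assert other != pcr0
```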
TPMs provide another key feature, measured boot, which is the groundwork for all sorts of interesting and useful security capabilities.
Measured boot refers to the process by which the bootloader and operating system extend PCRs with measurements of the software or configuration that they load during the boot process. Critically, this measurement happens before any code from any newly loaded software executes, which means that an adversary can’t modify system state or tamper with the measuring process. Thus, if you trust the integrity of the first part of the firmware (called the Root of Trust for Measurement, or RTM), your PCR values will match your expectations if (and only if) the boot sequence was what you expected.
When combined with a seal operation, measured boot is very powerful: You can seal keys (for example, disk encryption keys) to PCR values so that you can only use them if the boot sequence matches what you expected. This improves security in the event that, for example, a malicious program overwrites part of your kernel with malware. When this happens, the kernel’s measurement changes and the disk decryption keys remain sealed.
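The seal behavior described above can be sketched with a toy model, assuming SHA-256 PCRs. This is only an illustration of the policy check; a real TPM keeps the sealed secret inside the device and enforces the PCR policy in hardware.

```python
import hashlib
import secrets

def extend(pcr: bytes, event: bytes) -> bytes:
    return hashlib.sha256(pcr + hashlib.sha256(event).digest()).digest()

class ToyTPM:
    """Toy model: secrets are bound to the PCR value at seal time."""

    def __init__(self) -> None:
        self.pcr = b"\x00" * 32
        self._sealed = {}  # PCR value at seal time -> secret

    def seal(self, secret: bytes) -> None:
        self._sealed[self.pcr] = secret

    def unseal(self) -> bytes:
        # Released only if the current PCR matches the seal-time value
        if self.pcr not in self._sealed:
            raise PermissionError("PCR state does not match sealed policy")
        return self._sealed[self.pcr]

tpm = ToyTPM()
tpm.pcr = extend(tpm.pcr, b"trusted bootloader")
tpm.pcr = extend(tpm.pcr, b"trusted kernel")

disk_key = secrets.token_bytes(32)
tpm.seal(disk_key)
assert tpm.unseal() == disk_key  # known-good boot: key is released

# A tampered kernel is measured before it runs, so the PCR differs
# and the disk key stays sealed.
tpm.pcr = extend(extend(b"\x00" * 32, b"trusted bootloader"), b"malware kernel")
try:
    tpm.unseal()
except PermissionError:
    pass  # key not released
```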
With measured boot, you can also obtain signed PCR values from the TPM and use them to prove to remote servers that the boot state is valid. In the future, we hope to integrate TPM measured boot with other Google Cloud services, to enable them to make trust decisions based upon whether the VM making a request booted to a known good state.
Here at Google, we take significant steps to ensure the integrity of code running on our servers (for more details, check out our Google Infrastructure Security Design Overview). In brief, these steps include verifying boot integrity by requiring cryptographic signatures over low-level components such as the BIOS, bootloader, kernel, and base operating system image. In addition, we implement systems that require any code running in production be reviewed and approved by an engineer other than its author.
Building on that, we’ve talked about our plan to use our Titan chips to enable first-instruction integrity on our production machines. This will allow us to establish a hardware root of trust that we can chain all the way to the vTPM — not something that TPMs typically do (RTMs are usually implemented in software that can in principle be compromised, although hopefully not very easily!).
The core of our TPM implementation comes from code published by IBM and extracted automatically from the full source code of the TPM 2.0 spec. This code is fuzzed extensively by members of the external community, including Chromium’s OSS-Fuzz. We also fuzz internally, using recorded traces of the TCG compliance test (version 2.1a) and Microsoft HLK tests. In addition, we run both of these tests continuously, to ensure that the vTPM continues to function correctly as we make changes to it.
We modified this TPM code to use the BoringCrypto (or BoringSSL) library for all of its cryptographic operations, rather than the TPM spec’s less audited and tested crypto code. BoringCrypto is hardened against side-channels and cryptographic attacks, is actively maintained by cryptographers, and has FIPS 140-2 L1 validation.
The data that vTPM stores is, of course, of the utmost sensitivity: it contains encryption keys for customer data and keys whose secrecy is essential to validating boot integrity. To protect this data, we encrypt it in transit between machines (for example, during a live migration), as well as at rest using the encryption capabilities of our Spanner database. Requests to access this data from Spanner also require an end user permission ticket, as described in the Google Infrastructure Security Design Overview.
And while the contents of the vTPM are stored in plaintext in host memory during VM execution, so is the entire VM memory space, so this does not present an increased risk. Standard CPU paging mechanisms protect the contents of the vTPM from access by the VM and by any malicious or compromised users with root access to it.
This is just the beginning of vTPM and GCP’s Shielded VM offering, and we’re planning to significantly expand it to new use cases in the future. Watch this space as we discuss other capabilities that Shielded VMs and vTPM make possible.
1. Assuming that the cryptographic hash algorithm used (typically SHA-256, but configurable to SHA-384 or SHA-512) is collision-resistant.
2. In TPM 2.0 you can use Extended Authorization policies for added flexibility: instead of sealing to particular PCR values, you seal based on a policy with an associated public key. When system administrators wish to update the OS or firmware on the computer, they must first distribute new signed policies to the TPMs. This allows you to keep the disk encrypted (and keep the encryption key in the TPM) across OS upgrades, which TPM 1.2 did not allow.
3. We’ve disabled certain tests based on known errata, as well as tests that look for behavior from older TPM 2.0 versions. In particular, the TCG test checks compliance with revision 1.16 of the TPM 2.0 spec, while we implement revision 1.38.