Source: Introducing advanced security options for Cloud Dataproc, now generally available from Google Cloud
Google Cloud Platform (GCP) offers security and governance products that help you meet your policy, regulatory, and business objectives. The controls and capabilities we offer are always expanding. We’re pleased to announce that we’ve expanded the security capabilities of Cloud Dataproc, our fully managed Hadoop and Spark service, by making Kerberos and Hadoop secure mode security configurations generally available.
Cloud Dataproc’s new security configurations give you the best of two worlds: access to modern, best-in-class security features and infrastructure, and the familiar controls you’ve already developed for your Hadoop and Spark environments.
Moving on-prem Hadoop clusters securely
With Kerberos and Hadoop secure mode, you can migrate your existing Hadoop security controls directly into the cloud without having to make changes to your security policies and procedures. You can now enable new tools in Cloud Dataproc, including:
Here’s a look at a common customer setup for Kerberos on Cloud Dataproc.
Standing on the shoulders of GCP security
Kerberos and Hadoop secure mode provides you parity with legacy Hadoop security platforms, making it easy to port your existing procedures and policies. However, you may find that even though you maintain existing security practices, the overall security posture of your Hadoop and Spark environments greatly improves with the migration to GCP.
This is because Cloud Dataproc and GCP take advantage of the same secure-by-design infrastructure, built-in protection, and global network that Google uses to protect your information, identities, applications, and devices. In addition, GCP and Cloud Dataproc offer additional security features that help protect your data. Some of the most commonly used GCP-specific security features used with Cloud Dataproc include:
Default at-rest encryption, where GCP encrypts customer data stored at rest by default, with no additional action required from you. We offer a continuum of encryption key management options, including a CMEK feature that lets you create, use, and revoke the key encryption key (KEK).
Stackdriver Monitoring provides visibility into the performance, uptime, and overall health of cloud-powered applications. Stackdriver collects and ingests metrics, events, and metadata from Cloud Dataproc clusters to bring you insights via dashboards and charts.
OS Login allows you to use Compute Engine IAM roles to manage SSH access to Cloud Dataproc instances. This is an alternative to manually managing instance access by adding and removing SSH keys in metadata.
VPC Service Controls allow you to define a security perimeter around Cloud Dataproc and the data stored in Cloud Storage buckets. Datasets can be constrained within a VPC to help mitigate data exfiltration risks. With VPC Service Controls, you can keep sensitive data private and still take advantage of the fully managed storage and data processing capabilities of GCP.
These features and many others are certified by third-party auditors. Cloud Dataproc certifications include the most widely recognized, internationally accepted independent security standards, including ISO for security controls, cloud security and privacy, as well as SOC 1, 2, and 3. These certifications help us meet the demands of industry standards such as HIPAA and PCI. We continue to expand our list of certifications globally to assist our customers with their compliance obligations.
End-to-end authorization with GCP Token Broker
As a typical cloud best practice, we recommend that the GCP service accounts associated with the virtual machines (or cloud infrastructure) access datasets on behalf of a user. Many Cloud Dataproc customers choose to provision small autoscaling clusters for each Cloud Dataproc user. This way, there is a clear audit log to see who was on which cluster when it accessed a Cloud Storage dataset.
However, we also hear that many enterprise customers would prefer to use multi-tenant clusters and have strict compliance requirements that dictate that access to GCP resources (Cloud Storage, BigQuery, Cloud Bigtable, etc.) must be attributable to the individual user who initiated the request. In addition, to meet compliance requirements, this should be done in a way that ensures no long-lived credentials are stored on client machines or worker nodes.
To meet these customer goals, Google Cloud created an open source GCP Token Broker. The GCP Token Broker enables end-to-end Kerberos security and Cloud IAM integration for Hadoop workloads on GCP. You can use this open source software to bridge the gap between Kerberos and Cloud IAM to allow users to log in with Kerberos and access GCP resources.
The following diagram illustrates the overall architecture for direct authentication.
For more on how the GCP Token Broker extends the functionality of the generally available Kerberos and Hadoop secure mode in Cloud Dataproc, check out the joint Google and Cloudera session from Google Cloud Next ’19: Building and Securing Data Lakes.
Getting started with secure mode
To get started with Kerberos and Hadoop secure mode, check “Enable Kerberos and Hadoop secure mode” in the Cloud Dataproc console, as shown here:
To securely exchange a secret key and administrator password, you will first need to create those files outside of the console and encrypt them using Cloud Key Management Service.
By default, Cloud Dataproc will turn on all the features of Hadoop secure mode, including in-flight encryption. Cloud Dataproc will auto-generate a self-signed certificate for the encryption, or you can upload your own.
Any default setting can be overwritten using a cluster property. For example, if you want to enable multi-tenant Cloud Dataproc but don’t have compliance requirements that warrant the performance penalty associated with in-transit encryption within a VPC, you can disable the in-transit encryption by setting the following Cloud Dataproc properties:
You can set these properties from gcloud or in the cluster properties page, as shown here:
A cross-realm trust option is also available if you want to rely on an external directory like Microsoft Active Directory.
For complete instructions on setting up different types of security configurations, check out Cloud Dataproc security configuration.