Source: Modern data warehousing with BigQuery: a Q&A with Engineering Director Jordan Tigani from Google Cloud
As more and more businesses look to the cloud to store and manage their data, an increasing number are embracing BigQuery as their serverless, highly scalable, enterprise data warehouse. Leading enterprises, agile startups, and planet-scale internet companies have all adopted BigQuery with equal ease, and customers from industries as diverse as retail, financial services, healthcare, and gaming are using it to uncover valuable insights from their data, all at an impressive price-performance ratio.
With analyst firm Forrester recently recognizing Google as a Leader in their Cloud Data Warehouse industry report, I sat down with Engineering Director Jordan Tigani to talk about the evolution of data warehousing, the current technology landscape, and how BigQuery fits in. Here’s a brief excerpt from our conversation.
Saptarshi: Jordan, you have witnessed BigQuery’s growth first-hand. What are some of the key engineering choices you and your team made to make BigQuery easy for our customers to adopt?
Jordan: We built BigQuery to serve as a cloud-native data warehouse. BigQuery’s serverless topology allows customers of any size to bring their data into the data warehouse and start analyzing their data using Standard SQL, without worrying about database operations and system engineering. Storage and compute are decoupled and can scale independently, on demand. This structure offers both immense flexibility and cost controls for our customers, because they don’t need to keep their expensive compute resources up and running all the time. This is very different from traditional node-based cloud data warehouse solutions or on-premises massively parallel processing (MPP) systems.
I think these are some of the fundamental reasons why we’re able to serve a large spectrum of customers that have varying amounts of expertise in managing large-scale data warehousing systems. Regardless of scale, they’re all able to adopt BigQuery.
Saptarshi: In your words, what does it mean to be a “serverless” data warehouse?
Jordan: Serverless is a simple but powerful concept when it comes to gigabyte- to petabyte-scale data analysis, though delivering it is a relatively hard engineering problem. BigQuery’s on-demand analysis engine is provisioned on the fly, based on the computational requirements of a specific query. There is no need for customers to define nodes or clusters. BigQuery also automatically manages query performance based on the volume of data it needs to process. This is a fundamentally different approach.
Saptarshi: We’ve heard from customers, partners, and industry analysts that they appreciate how quickly BigQuery handles gigabyte- to petabyte-scale data analysis. In the recent Forrester Wave for Cloud Data Warehouse, Google received a 5 out of 5 in the performance and scale criterion, the highest score possible among all vendors. What makes BigQuery’s analysis engine so fast?
Jordan: It’s really all about the infrastructure. Google’s infrastructure, particularly network infrastructure, allows BigQuery to scale without having to rely on caching layers. If you haven’t queried a table in a couple of weeks, it will perform just as well as the one that you query all day.
Saptarshi: In data warehousing, storage is often as important as the analysis engine. What’s unique about BigQuery storage, and what are some of its new capabilities?
Jordan: I come from a storage background, so I love hearing this question. We’ve added clustering recently, which allows users to query faster and less expensively by reading less data. Clustering allows you to quickly find a needle in a haystack, and you essentially pay the price of the needle, not the haystack. This feature makes data management easier, since you don’t need to split your tables up into small pieces to save time or money.
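To make the needle-in-a-haystack idea concrete, here is a toy Python model of block pruning. It is an illustration under stated assumptions, not BigQuery's actual storage implementation: the `Block` class, the `build_blocks` and `lookup` helpers, and the sample data are all hypothetical. The sketch sorts rows by a clustering key, records the minimum and maximum key of each block, and skips any block whose key range cannot contain the value being looked up.

```python
from dataclasses import dataclass
from typing import List, Tuple

Row = Tuple[str, int]  # (customer_id, amount); a hypothetical schema


@dataclass
class Block:
    """A run of rows plus the min/max clustering key it contains."""
    min_key: str
    max_key: str
    rows: List[Row]


def build_blocks(rows: List[Row], block_size: int) -> List[Block]:
    """Sort rows by the clustering key and cut them into fixed-size blocks."""
    ordered = sorted(rows, key=lambda r: r[0])
    blocks = []
    for start in range(0, len(ordered), block_size):
        chunk = ordered[start:start + block_size]
        blocks.append(Block(chunk[0][0], chunk[-1][0], chunk))
    return blocks


def lookup(blocks: List[Block], key: str) -> Tuple[List[Row], int]:
    """Return the matching rows and the number of rows actually scanned."""
    matches: List[Row] = []
    scanned = 0
    for block in blocks:
        # Skip blocks whose key range cannot contain the key.
        if block.min_key <= key <= block.max_key:
            scanned += len(block.rows)
            matches.extend(r for r in block.rows if r[0] == key)
    return matches, scanned


# 1,000 rows spread over 100 customers, in arbitrary arrival order.
rows = [(f"c{i % 100:03d}", i) for i in range(1000)]
blocks = build_blocks(rows, block_size=50)
matches, scanned = lookup(blocks, "c042")
print(f"matched {len(matches)} rows, scanned {scanned} of {len(rows)}")
```

In this toy model the lookup touches one 50-row block instead of all 1,000 rows, which is the sense in which you pay for the needle rather than the haystack.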
Saptarshi: A lot of the enterprises adopting BigQuery are migrating their data warehouse from traditional on-premises systems. Does this create a new set of requirements for your team, and how is your team addressing them?
Jordan: The ability to handle migrations from on-prem systems is critical for enterprise adoption. BigQuery supports a Standard SQL dialect that is ANSI SQL:2011-compliant, which reduces the need to rewrite code and lets you take advantage of advanced SQL features. We provide free ODBC and JDBC drivers to ensure your current applications can interact with BigQuery’s powerful query engine. And BigQuery supports native integration with enterprise business intelligence (BI) tools such as Tableau and Looker, and ETL tools such as Informatica, Talend, and Stitch, which greatly reduces the change-management burden for enterprise customers.
Saptarshi: As I talk to more enterprises, I increasingly hear that data security and reliability are top of mind. How are we addressing these critical needs?
Jordan: Security is at the top of our agenda as well, and we’re heavily investing in making BigQuery a secure and mission-critical data warehouse for our customers. BigQuery eliminates the data operations burden by providing automatic data replication for disaster recovery and high availability of processing for no additional charge. BigQuery offers a 99.9% SLA and adheres to the Privacy Shield Principles. BigQuery also offers fine-grained identity and access-management controls to make it easier to maintain strong security. Plus, BigQuery data is always encrypted, both at rest and in transit.
Saptarshi: Traditional data warehouses were designed for the batch analytics paradigm. BigQuery supports both batch and streaming data inserts. What are our plans to support real-time analytics with BigQuery?
Jordan: A lot of our customers are trying to get “fresh” data in their data warehouse and are not ready to wait for hours or days to get the latest business data. We find that this trend applies primarily to financial services, e-commerce, gaming, and media customers. BigQuery’s high-speed streaming insertion API provides a powerful foundation for real-time analytics. BigQuery allows customers to analyze what’s happening now by making the latest business data immediately available for analysis. For example, Zulily uses BigQuery to stream billions of events from their web applications and perform real-time analytics. We’re making it easy to insert streaming data from Cloud Dataflow directly into BigQuery, and customers with IoT, e-commerce, and mobile gaming applications are already leveraging this capability to analyze near real-time data inside BigQuery.
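As a rough illustration of what a streaming insert carries, the Python sketch below builds a request body in the shape used by BigQuery's `tabledata.insertAll` REST method, where each row has a `json` payload and an optional `insertId`. The interview does not describe this format, so treat it as an assumption to check against the API reference. The `build_insert_all_body` helper and the event fields are hypothetical, and the sketch only constructs the payload; it does not send anything.

```python
import json
import uuid
from typing import Dict, List


def build_insert_all_body(events: List[Dict]) -> Dict:
    """Wrap event dicts in an insertAll-style request body (assumed shape)."""
    return {
        "rows": [
            {
                # Reuse the event's own ID when present so a retried request
                # carries the same insertId; otherwise generate one.
                "insertId": event.get("event_id") or str(uuid.uuid4()),
                "json": event,
            }
            for event in events
        ]
    }


# Hypothetical click-stream events from a web application.
events = [
    {"event_id": "evt-1", "user": "u1", "action": "add_to_cart"},
    {"event_id": "evt-2", "user": "u2", "action": "checkout"},
]
body = build_insert_all_body(events)
payload = json.dumps(body)
print(payload)
```

Passing a stable identifier as `insertId` is a design choice in this sketch: it keeps a retried request from looking like a brand-new event.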
Saptarshi: In the enterprise, data warehouses have primarily been used for BI and reporting applications, but enterprises are increasingly undertaking AI initiatives. What are our plans with BigQuery to support the current and future machine learning-specific needs of our enterprise customers?
Jordan: Our partner engineering team has been working with leading BI partners to offer native integration with BigQuery. Data Studio, Google’s free BI and reporting tool, now has more than one million monthly users. But enterprises often expect to retrieve data from their traditional data warehouses for AI and ML projects, and in doing so, they create data silos. This silo effect was a challenge we addressed inside Google, and we knew it was a problem we had to solve for our customers as well. We have evolved BigQuery into a flexible, powerful foundation for machine learning and artificial intelligence. Besides bringing ML to your data with BigQuery ML, integrations with Cloud ML Engine and TensorFlow enable data scientists to train powerful models on structured data. Moreover, BigQuery’s ability to transform and analyze data helps you get your data in shape for machine learning, through ad hoc exploration and data preparation (often cleaning).
Saptarshi: We recently announced BigQuery GIS for geospatial analytics inside the data warehouse. Today, it’s unique to BigQuery. What inspired you and your team to build this functionality, and can you share some details on how BigQuery GIS works?
Jordan: Geospatial data is becoming a fundamental component of many customer applications, especially for customers operating in the retail, logistics, and energy sectors. Our customers are already storing geolocation data inside BigQuery tables. We wanted to make analysis on the appropriate data types easy. The BigQuery engineering team worked with the Google Earth Engine engineering team to introduce geospatial analytics capability directly inside BigQuery. BigQuery GIS (BETA) brings SQL support for the most commonly used GIS functions right into your data warehouse. With support for arbitrary points, lines, polygons, and multi-polygons in WKT and GeoJSON formats, data analysts can simplify geospatial analyses, visualize location-based data in new ways, or unlock entirely new lines of business with the power of BigQuery.
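For readers unfamiliar with the two formats mentioned here, this Python sketch shows how the same geometry looks in GeoJSON and in WKT. It is not part of BigQuery GIS: `geojson_to_wkt` is a hypothetical helper written for illustration, it handles only 2D points, lines, and polygons, and the coordinates are made-up sample values. Both formats put longitude before latitude.

```python
from typing import Dict


def geojson_to_wkt(geom: Dict) -> str:
    """Convert a simple 2D GeoJSON geometry to its WKT text form."""
    kind = geom["type"]
    coords = geom["coordinates"]
    if kind == "Point":
        return f"POINT({coords[0]} {coords[1]})"
    if kind == "LineString":
        return "LINESTRING(" + ", ".join(f"{x} {y}" for x, y in coords) + ")"
    if kind == "Polygon":
        # A polygon is a list of rings; each ring is a closed list of points.
        rings = ", ".join(
            "(" + ", ".join(f"{x} {y}" for x, y in ring) + ")"
            for ring in coords
        )
        return f"POLYGON({rings})"
    raise ValueError(f"unsupported geometry type: {kind}")


store = {"type": "Point", "coordinates": [-122.08, 37.42]}
delivery_zone = {
    "type": "Polygon",
    "coordinates": [[[0, 0], [0, 1], [1, 1], [0, 0]]],
}
print(geojson_to_wkt(store))          # POINT(-122.08 37.42)
print(geojson_to_wkt(delivery_zone))  # POLYGON((0 0, 0 1, 1 1, 0 0))
```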
Saptarshi: If we take a step back to think about data warehousing as a product category, what are some of the changes that you are anticipating in the coming few years?
Jordan: I don’t think that future data warehouses will look much like the data warehouses we have had in the past. Users will spend less time worrying about the shape or size of their data, and more time worrying about what questions they want to ask of their data. Deleting data in order to make it fit into the data warehouse will be a distant memory. I think that streaming sources will be critically important, and that will push people to think about their data in a different way.
The other big change that we’re likely to see is that the range of people who can gain value from a data warehouse will expand—it won’t just be the data analysts or data scientists, it will be the spreadsheet users or people in the C-suite who will be able to gain insights from their data without needing to know SQL or how the data is laid out. I’m looking forward to seeing that happen.
Saptarshi: Thank you, Jordan, for your time.