By Marcin Kołda, Senior Software Engineer at Ocado Technology
[Editor’s note: Today we hear from Google Cloud customer Ocado Technology, which created (and open sourced!) a program to give them at-a-glance insights about their data warehouse usage, by reading BigQuery metadata. Read on to learn about how they architected the tool, what kinds of questions it can answer—and whether it might be useful in your own environment.]
Here at Ocado Technology, we use a wide range of Google Cloud Platform (GCP) big data products for data-driven decisions and machine learning. Notably, we use Google BigQuery as the main storage solution for data analytics in the Ocado Smart Platform, our proprietary solution for operating online retail businesses.
Because BigQuery is so central to the Ocado platform, we wanted an easy way to get a bird’s eye view of the data stored in it. So we created GCP Census a tool that collects metadata about BigQuery tables and stores it back into BigQuery for analysis.
To have a better overview of all the data stored in BigQuery, we wanted to ask:
If you also need better visibility into your organization’s BigQuery usage, read on to learn about how we architected GCP Census and what it can do. Then go ahead and download it for your own use—we recently open sourced it!
We store petabytes of data in BigQuery, divided into multiple GCP projects and hundreds of thousands of tables. BigQuery has many useful features for enterprise cloud data warehouses, especially in terms of speed, scalability and reliability. One example is partitioned tables rather than daily tables, which we recently adopted for their numerous benefits. At the same time, partitioned tables increased the complexity and scale of our BigQuery environment, and BigQuery offers limited ways of analysing metadata:
These constraints inspired us to build an additional layer to give us a bird’s eye view of our data.
The resulting tool, GCP Census, is a Google App Engine app written in Python that regularly collects metadata about BigQuery tables and stores it in BigQuery.
Here’s how it works:
GCP Census is highly scalable as it can easily scan millions of table/partitions. It’s also easy to set up: before GCP Census scans the resources to which it has IAM access, it automatically creates the relevant tables and views. Finally, it’s a secure cloud-native solution with App Engine Firewall and fine-grained access control, plus App Engine’s scalability and reliability!
There are several benefits to using GCP Census. Now you can get answers to all the questions by querying BigQuery from the UI or the API.
You can find below a few examples of how you can query GCP Census metadata.
SELECT sum(numBytes) FROM
SELECT count(*) FROM `YOUR-PROJECT-ID.bigquery_views.partition_metadata_v1_0`
SELECT projectId, datasetId, sum(numBytes) as totalNumBytes
GROUP BY projectId, datasetId ORDER BY totalNumBytes DESC LIMIT 100
SELECT projectId, datasetId, tableId, numBytes
ORDER BY numBytes DESC LIMIT 100
SELECT projectId, datasetId, tableId, partitionId, numBytes
ORDER BY numBytes DESC LIMIT 100
Optionally, you can create a Data Studio dashboard based on the metadata. We used Data Studio because of the ease and simplicity in creating dashboards with the BigQuery connector. Splitting data by project, dataset or label and diving into the storage costs is now a breeze, and we have multiple Data Studio dashboards that help us quickly dive into the largest project, dataset or table.
Below you can find a screenshot with one of our dashboards (all real data has been redacted).
With GCP Census, we’ve learned some of the characteristics of our data; for example, we now know which data is modified daily or which historical partitions have been modified recently. We were also able to identify potential cost optimization areas—huge temporary tables that no one uses but that were incurring significant storage costs. All in all, we’ve learned a lot about our operations, and saved a bunch of money!
You can find the source code for GCP Census at Github at https://github.com/ocadotechnology/gcp-census, plus the required steps needed for installation and setup. We look forward to your ideas and contributions!