Source: Geolocation with BigQuery: De-identify 76 million IP addresses in 20 seconds from Google Cloud
BigQuery is Google Cloud’s serverless data warehouse designed for scalability and fast performance. Using it lets you explore large datasets to find new and meaningful insights. To comply with current policies and regulations, you might need to de-identify the IP addresses of your users when analyzing datasets that contain personal data. For example, under GDPR, an IP address might be considered PII or personal data.
We published our first approach to de-identifying IP addresses four years ago—GeoIP geolocation with Google BigQuery—and it’s time for an update that includes the best and latest BigQuery features, like using the latest SQL standards, dealing with nested data, and handling joins much faster.
Replacing collected IP addresses with a coarse location is one method to help reduce risk—and BigQuery is ready to help. Let’s see how.
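For this example of how you can easily de-identify IP addresses, let’s use a dataset of Wikipedia edits as the source of IP addresses and the GeoLite2 tables as the source of locations.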
Let’s go straight into the query. Use the code below to replace IP addresses with the generic location.
Here’s the list of countries where users are making edits to Wikipedia, followed by the query to use:
Query complete (20.9 seconds elapsed, 1.14 GB processed)
These are the top cities where users are making edits to Wikipedia, collected from 2001 to 2010, followed by the query to use:
These new queries are compliant with the latest SQL standards, enabling a few new tricks that we'll review here.
The downloadable GeoLite2 tables are not based on ranges anymore. Now they use proper IP networks, such as "126.96.36.199/22".
Using BigQuery, we parsed these into binary IP addresses with integer masks. We also did some pre-processing of the GeoLite2 tables, combining the networks and locations into a single table, and adding the parsed network columns, as shown here:
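To make the parsing step concrete, here is a minimal Python sketch of the same idea: splitting a network written in CIDR notation into a binary address and an integer mask. It is an illustration only, not the BigQuery pre-processing we ran; the function name `parse_network` and the variable names `network_bin` and `mask` are assumptions made for this example.

```python
import ipaddress


def parse_network(cidr):
    """Split a CIDR string into (binary network address, integer mask length)."""
    # strict=False zeroes any host bits instead of raising ValueError,
    # so "126.96.36.199/22" is read as the network 126.96.36.0/22.
    net = ipaddress.ip_network(cidr, strict=False)
    return net.network_address.packed, net.prefixlen


network_bin, mask = parse_network("126.96.36.199/22")
print(network_bin.hex(), mask)  # prints "7e602400 22"
```

Storing the network as raw bytes plus a mask length means a lookup can be an exact comparison on both values, which is what makes the join described later possible.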
To find one IP address within this table, like "188.8.131.52", something like this might work:
But that doesn't work. We need to apply the correct mask:
And that gets an answer: this IP address seems to live in Antarctica.
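The reason the mask matters can be sketched in a few lines of Python. This is a hedged illustration of the logic, not the BigQuery query itself: `mask_ip` is a hypothetical helper, and the /22 network below is a made-up example row, not an entry taken from GeoLite2.

```python
import ipaddress


def mask_ip(ip, mask):
    """Keep the top `mask` bits of an IPv4 address and zero the rest."""
    addr = int(ipaddress.IPv4Address(ip))
    netmask = (0xFFFFFFFF << (32 - mask)) & 0xFFFFFFFF
    return (addr & netmask).to_bytes(4, "big")


# Made-up network row for illustration: (binary network address, mask length).
toy_network = (bytes([188, 8, 128, 0]), 22)
raw_ip = ipaddress.IPv4Address("188.8.131.52").packed

# Comparing the raw address to the network address never matches...
print(raw_ip == toy_network[0])  # False
# ...but the address truncated to the network's mask length does.
print(mask_ip("188.8.131.52", toy_network[1]) == toy_network[0])  # True
```

A host address only equals its network address after the host bits are cleared, so the comparison has to be done with the mask length that belongs to the candidate network.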
That looked easy enough, but we need a few more steps to figure out the right mask and joins between the GeoLite2 table (more than 3 million rows) and a massive source of IP addresses.
And that's what the next line in the main query does:
This is basically applying a CROSS JOIN with all the possible masks (numbers between 9 and 32) and using these to mask the source IP addresses. And then comes the really neat part: BigQuery manages to handle the correct JOIN in a massively fast way:
BigQuery here picks up only one of the masked IPs: the one where the masked IP and the network with that given mask match. If we dig deeper, we'll find in the execution details tab that BigQuery did an "INNER HASH JOIN EACH WITH EACH ON", which requires a lot of shuffling resources, while still not requiring a full CROSS JOIN between two massive tables.
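The same strategy can be sketched outside BigQuery. The Python below is a rough analogy under stated assumptions, not a description of BigQuery's internals: a dictionary keyed by (binary network, mask) stands in for the hash join, and the loop over masks 9 to 32 stands in for the CROSS JOIN. The helper names `mask_ip`, `build_index`, and `locate`, and the toy rows and location labels, are all invented for this example.

```python
import ipaddress


def mask_ip(ip, mask):
    """Keep the top `mask` bits of an IPv4 address and zero the rest."""
    addr = int(ipaddress.IPv4Address(ip))
    netmask = (0xFFFFFFFF << (32 - mask)) & 0xFFFFFFFF
    return (addr & netmask).to_bytes(4, "big")


def build_index(rows):
    """Index toy (cidr, location) rows by (binary network, mask length)."""
    index = {}
    for cidr, location in rows:
        net = ipaddress.ip_network(cidr, strict=False)
        index[(net.network_address.packed, net.prefixlen)] = location
    return index


def locate(ip, index):
    """Try every mask from 9 to 32 and return the first exact key match."""
    for mask in range(9, 33):
        # One masked candidate per mask; an equality lookup replaces
        # any range comparison against the network table.
        hit = index.get((mask_ip(ip, mask), mask))
        if hit is not None:
            return hit
    return None


# Toy rows, not real GeoLite2 data.
index = build_index([
    ("188.8.128.0/22", "Toy Location A"),
    ("10.0.0.0/9", "Toy Location B"),
])
print(locate("188.8.131.52", index))  # Toy Location A
print(locate("192.0.2.1", index))     # None
```

Each source IP produces 24 candidates, but every candidate is resolved with an equality lookup on the pair of values, so the cost grows with the number of IPs rather than with the product of the two tables. The sketch assumes the networks do not overlap, so at most one candidate finds a match.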
This is how BigQuery can help you to replace IP addresses with coarse locations and also provide aggregations of individual rows. This is just one technique that can help you reduce the risk of handling your data. GCP provides several other tools, including Cloud Data Loss Prevention (DLP), that can help you scan and de-identify data. You now have several options to explore and use datasets that let you comply with regulations. What interesting ways are you using de-identified data? Let us know.