Source: Google Cloud Firewall Rules Logging: How and why you should use it from Google Cloud
Google Cloud Platform (GCP) firewall rules are a great tool for securing applications. Firewall rules are customizable software-defined networking constructs that let you allow or deny traffic to and from your virtual machine (VM) instances. To secure applications and respond to modern threats, firewall rules require monitoring and adjustment over time. GCP Firewall Rules Logging, which Google made generally available in February 2019, is a feature that allows network administrators to monitor, verify, and analyze the effects of firewall rules in Google Cloud. In this blog post (the first of many on this topic), we’ll discuss the basics of Firewall Rules Logging, then look at an example of how to use it to identify mislabeled VMs and refine firewall rules with minimal traffic interruption.
Firewall Rules Logging provides visibility to help you better understand the effectiveness of rules in troubleshooting scenarios. It helps answer common questions, like:
Unlike VPC flow logs, firewall rules logs are not sampled. Every connection is logged, subject to some limits (please refer to the Appendix for details). The Firewall Rules log format can be found here.
Additionally, network administrators have the option to export firewall logs to Google Cloud Storage for long-term log retention, to BigQuery for in-depth analysis using standard SQL, or to Pub/Sub to integrate with popular security information and event management (SIEM) software, such as Splunk, for detecting and alerting on traffic abnormalities and threats in near real time.
For reference, GCP firewall rules are software-defined constructs with the following properties:
In practice, there are many uses for Firewall Rules Logging. One common use case is to help identify mislabeled VM instances. Let’s walk through this scenario in more detail.
Scenario: Mislabeled VM instances
ACME, Inc. is migrating on-prem applications to Google Cloud. Their network admins implemented a shared VPC to centrally manage the entire company’s networking infrastructure. Dozens of multi-tiered applications, run by multiple engineering teams, are deployed in each GCP region. User-facing proxies in the DMZ talk to the web servers, which communicate with the application servers, which in turn talk to the database layer.
This diagram represents one of many regions, each of which has multiple subnets. This region, US-EAST1, includes:
Here are the traffic directions and firewall rules in place for this region:
Proxy can access web servers in acme-web-use1:
firewall rule: acme-web-use1-allow-ingress-acme-dmz-use1-proxy
Web servers can access app servers in acme-app-use1:
firewall rule: acme-app-use1-allow-ingress-acme-web-use1-webserver
App servers can access database servers in acme-db-use1:
firewall rule: acme-db-use1-allow-ingress-acme-app-use1-appsvr
This setup has granular partitions of network space that categorize compute resources by application function. This makes it possible for firewall rules to control the network space and lock the network infrastructure to comply with the least-privilege principle.
For large organizations with thousands of VMs provisioned by dozens of application teams, we use service accounts or network tags to group the VMs, in conjunction with subnet CIDR ranges to define firewall rules. For simplicity, we use the network tags to demonstrate each use case.
It’s not unusual for large enterprises to have hundreds of firewall rules due to the scale of the infrastructure and the complexity of network traffic patterns. With so much going on, it’s understandable that application teams sometimes mislabel VMs when they migrate them from on-prem to the cloud or scale up applications after migration. The consequences of mislabeling range from an application outage to a security breach. The same problem can also arise if we (mis)use service accounts.
Going back to our example, an ACME application team mislabeled one of their new VMs as “appsrv” when it should actually be “appsvr”. As a result, the web server’s requests to access app servers are denied.
To identify the mislabeling and mitigate the impact quickly, we can enable Firewall Rules Logging on all the firewall rules with the gcloud command-line tool.
Next, we export the logs to BigQuery for further analysis with the following steps:
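As a minimal sketch of that step, the Python below builds one `gcloud compute firewall-rules update <rule> --enable-logging` invocation per rule and prints it, rather than running anything. The helper name `enable_logging_commands` is ours for illustration, and the rule list is simply the three rules named earlier; in a shared VPC setup you would likely also pass the host project.

```python
import shlex


def enable_logging_commands(rule_names, project=None):
    """Return one gcloud argument list per firewall rule (nothing is executed)."""
    commands = []
    for name in rule_names:
        argv = ["gcloud", "compute", "firewall-rules", "update", name,
                "--enable-logging"]
        if project:
            # Optional: target a specific (for example, shared VPC host) project.
            argv.append(f"--project={project}")
        commands.append(argv)
    return commands


# The three rules from the ACME example.
rules = [
    "acme-web-use1-allow-ingress-acme-dmz-use1-proxy",
    "acme-app-use1-allow-ingress-acme-web-use1-webserver",
    "acme-db-use1-allow-ingress-acme-app-use1-appsvr",
]

commands = enable_logging_commands(rules)
for argv in commands:
    # Print a copy-pasteable command line; pass argv to subprocess.run to execute.
    print(shlex.join(argv))
```

Printing the commands first makes it easy to review exactly which rules will start generating logs before committing to the change.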
Create a BigQuery dataset to store the Firewall Rule Log entries:
Create the BigQuery sink to export the firewall rules logs:
You can also export the logs of individual firewall rules by adding a filter on jsonPayload.rule_details.reference. Here is a sample filter:
When the BigQuery sink is created, GCP automatically generates a service account. The service account follows the naming convention of “p<project_number>-[0–9]*@gcp-sa-logging.iam.gserviceaccount.com” and is used to write the log entries to the BigQuery table. We need to grant this service account the Data Editor role on the BigQuery dataset in the following three steps:
Once the logs are populated in the BigQuery dataset, you can run queries to determine which rule denies which traffic, as follows:
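To make the sink and filter concrete, here is a hedged Python sketch that assembles the log filter and the matching `gcloud logging sinks create` arguments. The project ID `acme-host-project`, network `acme-vpc`, dataset `firewall_logs`, and sink name `acme-firewall-sink` are hypothetical placeholders, not values from ACME's setup. The sketch assumes the firewall log name `compute.googleapis.com%2Ffirewall` and the `network:<name>/firewall:<rule>` reference format.

```python
def firewall_log_filter(project_id, network=None, rule=None):
    """Build a Cloud Logging filter for firewall logs, optionally for one rule."""
    log_name = f"projects/{project_id}/logs/compute.googleapis.com%2Ffirewall"
    clauses = [f'logName="{log_name}"']
    if network and rule:
        reference = f"network:{network}/firewall:{rule}"
        clauses.append(f'jsonPayload.rule_details.reference="{reference}"')
    return " AND ".join(clauses)


def sink_create_command(sink_name, project_id, dataset, log_filter):
    """Return the gcloud argument list that would create a BigQuery sink."""
    destination = (
        f"bigquery.googleapis.com/projects/{project_id}/datasets/{dataset}"
    )
    return ["gcloud", "logging", "sinks", "create", sink_name, destination,
            f"--log-filter={log_filter}"]


# Hypothetical names, for illustration only.
all_rules_filter = firewall_log_filter("acme-host-project")
one_rule_filter = firewall_log_filter(
    "acme-host-project",
    network="acme-vpc",
    rule="acme-app-use1-allow-ingress-acme-web-use1-webserver",
)
command = sink_create_command(
    "acme-firewall-sink", "acme-host-project", "firewall_logs", one_rule_filter
)
print(all_rules_filter)
print(one_rule_filter)
print(command)
```

Narrowing the filter to a single rule keeps the exported volume small when only one tier of the application is under investigation.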
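The same question, which rule denies what, can also be asked of log entries outside BigQuery. The Python sketch below is not the BigQuery query itself; it assumes the entries have been loaded as dictionaries that follow the firewall log `jsonPayload` layout (`disposition`, `connection`, `rule_details.reference`), and the IP addresses and the `acme-vpc` network name in the sample are made up.

```python
from collections import Counter


def denied_by_rule(entries):
    """Count DENIED connections per (rule, source IP, destination IP, port)."""
    counts = Counter()
    for entry in entries:
        payload = entry.get("jsonPayload", {})
        if payload.get("disposition") != "DENIED":
            continue
        conn = payload.get("connection", {})
        key = (
            payload.get("rule_details", {}).get("reference"),
            conn.get("src_ip"),
            conn.get("dest_ip"),
            conn.get("dest_port"),
        )
        counts[key] += 1
    return counts


def make_entry(disposition, rule, src_ip, dest_ip, dest_port):
    """Build a sample entry with only the fields this sketch reads."""
    return {"jsonPayload": {
        "disposition": disposition,
        "connection": {"src_ip": src_ip, "dest_ip": dest_ip,
                       "dest_port": dest_port, "protocol": 6},
        "rule_details": {"reference": f"network:acme-vpc/firewall:{rule}"},
    }}


# Made-up sample data: two denials on the deny-all rule, one allowed connection.
sample = [
    make_entry("DENIED", "acme-deny-all-ingress-internal", "10.1.2.10", "10.1.3.20", 8080),
    make_entry("DENIED", "acme-deny-all-ingress-internal", "10.1.2.10", "10.1.3.20", 8080),
    make_entry("ALLOWED", "acme-app-use1-allow-ingress-acme-web-use1-webserver",
               "10.1.2.10", "10.1.3.21", 8080),
]

denied = denied_by_rule(sample)
for (rule, src, dst, port), hits in denied.most_common():
    print(f"{rule}: {src} -> {dst}:{port} denied {hits} time(s)")
```

Grouping by rule and endpoint pair surfaces repeated denials between the same two hosts, which is the typical signature of a mislabeled VM.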
The BigQuery table will be loaded with the log entries from the logFileName filter, which, in this case, contains all the firewall rules logs. The BigQuery table schema maps directly to the log entry’s JSON format. To keep it simple, for our example we’ll use the log viewer to inspect the log entries.
Because the implied deny-ingress rule cannot be logged, we create an explicit “deny all” rule with priority 65534 to capture anything that would otherwise be denied silently.
The firewall rules for this scenario are:
As we can see in the viewer, “acme-deny-all-ingress-internal” is taking effect, and “acme-allow-all-ingress-internal” is disabled, so we can ignore it.
Below, we can see that the connection from websvr01 to the new appsvr02 (with the incorrect “appsrv” label) is denied.
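To see why a one-letter typo leads to a denial, here is a simplified Python model of ingress rule evaluation. It is a sketch, not GCP's implementation: it considers only network tags and priorities (no ports, protocols, or CIDR ranges), and the priority of 1000 on the allow rule is our assumption. The rule with the lowest priority number that matches wins, and at equal priority a deny rule beats an allow rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int                         # lower number = evaluated first
    action: str                           # "allow" or "deny"
    source_tags: frozenset = frozenset()  # empty = any source
    target_tags: frozenset = frozenset()  # empty = all instances


def evaluate_ingress(rules, src_tags, dst_tags):
    """Return the first matching rule, or None for the implied deny."""
    ordered = sorted(rules, key=lambda r: (r.priority, r.action != "deny"))
    for rule in ordered:
        source_ok = not rule.source_tags or bool(rule.source_tags & src_tags)
        target_ok = not rule.target_tags or bool(rule.target_tags & dst_tags)
        if source_ok and target_ok:
            return rule
    return None


rules = [
    Rule("acme-app-use1-allow-ingress-acme-web-use1-webserver", 1000, "allow",
         source_tags=frozenset({"webserver"}), target_tags=frozenset({"appsvr"})),
    Rule("acme-deny-all-ingress-internal", 65534, "deny"),
]

# appsvr02 carries the mistyped tag "appsrv", so the allow rule never matches.
mislabeled = evaluate_ingress(rules, {"webserver"}, {"appsrv"})
correct = evaluate_ingress(rules, {"webserver"}, {"appsvr"})
print(mislabeled.name, mislabeled.action)
print(correct.name, correct.action)
```

Because the mistyped tag matches no allow rule, the connection falls through to the logged deny-all rule, which is exactly what shows up in the viewer.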
While this approach works for this example, it presents two potential problems:
If we have a large amount of traffic, logging every rule will generate too much data for real-time analysis. In fact, one of our clients implemented this approach and ended up generating 5 TB of firewall logs per day.
Mislabeled VMs can cause traffic interruptions. The firewall is doing what it is designed to do, but nobody likes outages.
So, we need a better approach to address both of the issues above.
To resolve the potential issues mentioned above, we can create another ingress rule to allow all traffic at priority 65533 and turn it on for a short period of time whenever there are new deployments.
In this scenario, we don’t need to turn on logging for all the firewall rules. In fact, we could turn off most of it to save space.
Any connection that this rule allows and logs is a violator, because correctly labeled traffic matches a higher-priority rule first. We do not expect many violators, and the suspects are identified in real time.
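A short Python sketch of that idea follows: scan the log entries, keep only those allowed by the capture rule, and list the source and destination VMs involved. It assumes the capture rule is the “acme-allow-all-ingress-internal” rule seen earlier and that entries carry the `instance` and `remote_instance` fields of the firewall log format; the `acme-vpc` network name and the sample entries are invented for illustration.

```python
CAPTURE_RULE = "acme-allow-all-ingress-internal"


def find_violators(entries, capture_rule=CAPTURE_RULE):
    """Return sorted (source VM, destination VM) pairs allowed by the capture rule."""
    suffix = f"/firewall:{capture_rule}"
    violators = set()
    for entry in entries:
        payload = entry.get("jsonPayload", {})
        reference = payload.get("rule_details", {}).get("reference", "")
        if payload.get("disposition") == "ALLOWED" and reference.endswith(suffix):
            # For ingress logs, "instance" is the receiving VM and
            # "remote_instance" is the peer that opened the connection.
            src = payload.get("remote_instance", {}).get("vm_name")
            dst = payload.get("instance", {}).get("vm_name")
            violators.add((src, dst))
    return sorted(violators)


def make_entry(rule, src_vm, dst_vm):
    """Build a sample ALLOWED entry with only the fields this sketch reads."""
    return {"jsonPayload": {
        "disposition": "ALLOWED",
        "rule_details": {"reference": f"network:acme-vpc/firewall:{rule}"},
        "remote_instance": {"vm_name": src_vm},
        "instance": {"vm_name": dst_vm},
    }}


# Invented sample: one connection matched the intended rule, one fell through.
sample = [
    make_entry("acme-app-use1-allow-ingress-acme-web-use1-webserver",
               "websvr01", "appsvr01"),
    make_entry(CAPTURE_RULE, "websvr01", "appsvr02"),
    make_entry(CAPTURE_RULE, "websvr01", "appsvr02"),
]

violators = find_violators(sample)
for src, dst in violators:
    print(f"check the labels on {dst} (traffic from {src} hit the capture rule)")
```

Since only the capture rule needs logging, the list of violators stays short and each entry points directly at a VM whose tags need review.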
Now, we fix the label.
As we can see, the connection from websvr01 to appsvr02 now works fine.
After all the mislabels are fixed, we can turn off the allow-and-capture rule. Everyone is happy… until the next time new resources are added to the network.
With Firewall Rules Logging, we can refine our firewall rules by following a few best practices and identify undesired network traffic in near real time. In addition to Firewall Rules Logging, we’re always working on more tools and features to make managing firewall rules, and network security in general, easier.
Firewall quotas and limits