By Ben Scarlato, Software Engineer
When an application or service crashes, do you wonder what caused the crash? Do you wonder if the crash poses a security risk?
An important element in platform hardening is properly handling server process crashes. When a process crashes unexpectedly, it suggests there may be a security problem an attacker could exploit to compromise a service. Even highly reliable user-facing services can depend on internal server processes that crash. At Google, we collect crashes for analysis and automatically flag and analyze those with potential security implications.
Analyzing crashes is a widespread security practice — this is why, when you run Google Chrome, you’re asked if it’s okay to send data about crashes back to the company.
At Google Cloud Platform (GCP), we monitor for crashes in the processes that manage customer VMs and across our services, using standard processes to protect customer data in GCP.
There are many different security issues that can cause a crash. One well-known example is a use-after-free vulnerability, which occurs when a program continues to use a region of memory after it has been freed.
Most of the time, a use-after-free simply causes the program to crash. But if an attacker can control how the freed memory is reallocated, they may be able to exploit the vulnerability to gain arbitrary code execution.
Debugging a single crash can be difficult. But how do you handle debugging crashes when you have to manage thousands of server jobs?
To help secure a set of rapidly evolving products such as Google Compute Engine, Google App Engine, and the other services that comprise GCP, you need a way to automatically detect problems that can lead to crashes.
In Compute Engine’s early days, when we had a much smaller fleet of virtual machines running at any given time, it was feasible for security engineers to analyze crashes by hand.
We would load crash dumps into gdb and look at the thread that caused a crash. This provided detailed insight into the program state prior to a crash. For example, gdb allows you to see whether a program is executing from a region of memory marked executable. If it’s not, you may have a security issue.
Analyzing crashes in gdb worked well, but as Cloud grew to include more services and more users, it was no longer feasible for us to do as much of this analysis by hand.
We needed a way to automate checking crashes for use-after-free vulnerabilities and other security issues. That meant integrating with the systems used to collect crash data across Google, and running an initial set of signals against each crash to either flag it as a security problem to be fixed or mark it for further analysis.
Automating this triage was important, because crashes can occur for many reasons and may not pose a security threat. For instance, we expect to see many crashes just from routine stress testing. If, however, a security problem is found, we automatically file a bug detailing the specific issue and assigning it an exploitability rating.
Maintaining a platform with high security standards means going up against attackers who are always evolving, and we’re always working to improve in turn.
We’re continually improving our crash analysis to automatically detect more potential security problems, better determine the root cause of a crash and even identify required fixes.