By Sridhar Rajagopalan, Software Engineer
In the year-plus since Apigee joined the Google Cloud family, we’ve had the opportunity to deploy several of our services to Google Cloud Platform (GCP). Most recently, we completely moved Apigee Sense to GCP to use its advanced machine learning capabilities. Along the way, we also experienced some important performance improvements as judged by a drop in what we call “data litter.” In this post, we explain what data litter is, and our perspective on how various GCP services keep it at bay. Through this account, you may come to recognize your own application, and come to see data litter as an important metric to consider.
First, let’s take a look at Apigee Sense and its application characteristics. At its core, Apigee Sense protects APIs running on Apigee Edge from attacks and unwanted exploitation. Those attacks are usually performed by automated processes, or “bots,” which run without the permission of the API owner. Sense is built around a four-element “CAVA” cycle: collect, analyze, visualize and act. It enhances human vigilance with statistical machine learning algorithms.
We collect a lot of traffic data as a by-product of billions of API calls that pass through Apigee Edge daily. The output end of each of the four elements in the CAVA cycle is stored in a database system. Therefore, the costs, performance and scalability of data management and data analysis toolchains are of great interest to us.
When optimizing an analytics application, there are several things that demand particular attention: latency, quality, throughput and cost.
To this mix I like to add a fifth metric: “data litter,” which in many ways measures the interplay between the four traditional metrics. Fundamentally, all analytics systems are GIGO (garbage in / garbage out). That is, if the data entering the system is garbage, it doesn’t matter how quickly it is processed, how smart our algorithms are, or how much data we can process every second. The money we spend does matter, but only because of questions about the wisdom of continuing to spend it.
Generally speaking, there are three main sources of data litter in an analytical application like Apigee Sense.
Therefore, data litter is a holistic measure of the quality of the analysis system. It will be low only when the pipeline, analysis engine and target database are all well-tuned and constantly performing to expectations.
The easiest way to deal with the first kind of data litter is to slow down the pipeline by increasing latency. The easiest way to address the second kind is to throw money at the problem and run the analysis engine on a larger cluster. And the final problem is best addressed by adding more or bigger hardware to the database. Whichever path we take, we either increase latency and lose relevance, or lose money.
At Apigee, we track data litter with the data coverage metric, which is, roughly speaking, the complement of the fraction of data that gets dropped or otherwise doesn’t contribute to the analysis. When we moved the Sense analytics chain to GCP, the data coverage metric went from below 80% to roughly 99.8% for one of our toughest customer use cases. Put another way, our data litter decreased from over 20%, or one in five, to approximately 0.2%, or one in five hundred. That’s a decrease by a factor of approximately 100, or two orders of magnitude!
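The arithmetic behind those figures can be sketched in a few lines of Python. This is only an illustration: it assumes data litter is simply one minus data coverage, which is a simplification of the "roughly speaking" definition above, and the function name `data_litter` is hypothetical rather than anything from the Sense codebase.

```python
def data_litter(coverage: float) -> float:
    """Fraction of data that does not contribute to the analysis.

    Assumes litter = 1 - coverage, a simplification for illustration.
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    return 1.0 - coverage


# Coverage figures quoted in the post. The "before" value was below 80%,
# so 0.80 is an upper bound and the real improvement was at least this large.
litter_before = data_litter(0.80)    # about 0.20, one in five
litter_after = data_litter(0.998)    # about 0.002, one in five hundred
improvement = litter_before / litter_after

print(f"litter before: {litter_before:.3f}")
print(f"litter after:  {litter_after:.3f}")
print(f"improvement:   about {improvement:.0f}x")
```

Because the starting coverage was below 80%, the factor of 100 is a conservative estimate.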
The chart below shows the fraction of data available and used for decision making before and after our move to GCP. The chart shows the numbers for four different APIs, representing a subset of Sense customers.
These improvements were measured even as both the cost of the deployed system and its pipeline latency were reduced, while our throughput and algorithms stayed the same. Since the release a couple of months ago, these savings, along with the availability and performance benefits of the system, have persisted, while our customer base and the processed traffic have grown. So we’re getting more reliable answers more quickly than we did before and paying less than we did for almost the exact same use case. Wow!
There were two problems that accounted for the bulk of the data litter in the Sense pipeline. These were the elasticity of data processing and the scalability of the transactional store.
To alert customers of an attack as quickly as possible, we designed our system with adequate latency to avoid systematic data litter. In our environment, two features of GCP contributed most significantly to the reduction of unplanned data litter:
As part of this transition, we moved our analysis chain to Cloud Dataproc, which provided significantly more nimble and cost-controlled elasticity. Because the cost of the analytics pipeline represented our most significant constraint, we were able to size our processing capacity limits more aggressively. This gave us the additional elastic capacity needed to meet peak demands without increasing our cost.
We also moved our target database to BigQuery. BigQuery distributes and scales cost-effectively and without hiccups well beyond our needs, and indeed, beyond most reasonable IT budgets. This completely eliminated the back pressure issue from the end of the chain.
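To make the back pressure effect concrete, here is a toy model in Python. It is not the Sense pipeline: it assumes a bounded buffer in front of a sink that can absorb a fixed number of records per tick, and it counts any overflow as dropped. The function `simulate_coverage` and all of its parameters are hypothetical names chosen for this sketch.

```python
def simulate_coverage(arrivals, drain_per_tick, buffer_capacity):
    """Return the fraction of arriving records that reach the sink.

    Toy model: each tick the sink drains up to `drain_per_tick` records
    from a buffer holding at most `buffer_capacity` records; arrivals
    that do not fit in the buffer are dropped and become data litter.
    """
    buffered = 0
    delivered = 0
    dropped = 0
    for arriving in arrivals:
        # The sink takes what it can from the buffer first.
        drained = min(buffered, drain_per_tick)
        buffered -= drained
        delivered += drained
        # New records fill the remaining space; the overflow is dropped.
        accepted = min(arriving, buffer_capacity - buffered)
        buffered += accepted
        dropped += arriving - accepted
    # Records still buffered at the end are assumed to drain eventually.
    delivered += buffered
    total = delivered + dropped
    return delivered / total if total else 1.0


burst = [10, 10, 50, 10, 10]  # records per tick, with one traffic spike
constrained = simulate_coverage(burst, drain_per_tick=20, buffer_capacity=30)
roomy = simulate_coverage(burst, drain_per_tick=20, buffer_capacity=100)
print(f"constrained sink coverage: {constrained:.3f}")
print(f"roomy sink coverage:       {roomy:.3f}")
```

In this model a single spike is enough to drop records when the buffer is small. A sink that absorbs bursts without pushing back removes that source of litter, which matches the effect described above.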
Because two of the three sources of data litter are now gone, our team is able to focus on improving the timeliness of our analysis—ensuring that we move data from where it’s gathered through the analysis engine and make more intelligent and more relevant decisions with lower latency. This is what Sense was intended to do.
By moving Apigee Sense to GCP, we feel that we’ve taken back control of our destiny. I’m sure that our customers will notice the benefits not just in terms of a more reliable service, but also in the velocity with which we are able to ship new capabilities to them.