
Running TensorFlow inference workloads at scale with TensorRT 5 and NVIDIA T4 GPUs

2019-01-16 · admin · GoogleCloud · No comments

Source: Running TensorFlow inference workloads at scale with TensorRT 5 and NVIDIA T4 GPUs from Google Cloud

Today, we announced that Google Compute Engine now offers machine types with NVIDIA T4 GPUs, to accelerate a variety of cloud workloads, including high-performance computing, deep learning training and inference, broader machine learning (ML) workloads, data analytics, and graphics rendering.

In addition to its GPU hardware, NVIDIA also offers tools to help developers make the best use of their infrastructure. NVIDIA TensorRT is a cross-platform library for developing high-performance deep learning inference—the stage in the machine learning process where a trained model is used, typically in a live runtime environment, to recognize, process, and classify results. The library includes a deep learning inference optimizer (including data-type quantization), a model conversion process, and a runtime that delivers low latency and high throughput. TensorRT-based applications perform up to 40 times faster[1] than CPU-only platforms during inference. With TensorRT, you can optimize neural network models trained in most major frameworks, calibrate for lower precision with high accuracy, and finally deploy to a variety of environments. These might include hyperscale data centers, embedded systems, or automotive product platforms.

In this blog post, we’ll show you how to run deep learning inference on large-scale workloads with NVIDIA TensorRT 5 running on Compute Engine VMs configured with our Cloud Deep Learning VM image and NVIDIA T4 GPUs.

Overview

This tutorial shows you how to set up a multi-zone cluster for running an inference workload on an autoscaling group that scales to meet changing GPU utilization demands, and covers the following steps:

  • Preparing a model using a pre-trained graph (ResNet)

  • Benchmarking the inference speed for a model with different optimization modes

  • Converting a custom model to TensorRT format

  • Setting up a multi-zone cluster that is:

    • Built on Deep Learning VMs preinstalled with TensorFlow, TensorFlow Serving, and TensorRT 5.

    • Configured to auto-scale based on GPU utilization.

    • Configured for load-balancing.

    • Protected by a firewall.

  • Running an inference workload in the multi-zone cluster.

Here’s a high-level architectural perspective for this setup:

[Figure: high-level architectural perspective of the setup]

Preparing and optimizing the model with TensorRT

In this section, we will create a VM instance to run the model, and then download a model from the TensorFlow official models catalog.

Create a new Deep Learning Virtual Machine instance

Create the VM instance:
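
A minimal sketch of the command, assuming the Deep Learning VM image family tf-latest-gpu, a zone with T4 capacity, and a hypothetical instance name; adjust names and zone to your project:

  export IMAGE_FAMILY="tf-latest-gpu"    # Deep Learning VM image with TensorFlow preinstalled
  export ZONE="us-central1-b"            # any zone that offers T4 GPUs
  export INSTANCE_NAME="working-vm"      # hypothetical name

  gcloud compute instances create $INSTANCE_NAME \
    --zone=$ZONE \
    --image-family=$IMAGE_FAMILY \
    --image-project=deeplearning-platform-release \
    --machine-type=n1-standard-8 \
    --maintenance-policy=TERMINATE \
    --accelerator="type=nvidia-tesla-t4,count=1" \
    --metadata="install-nvidia-driver=True"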

If the command is successful, you will see a confirmation message with the new instance's name, zone, and status.

Notes:

  • You can create this instance in any available zone that supports T4 GPUs.

  • A single GPU is enough to compare the different TensorRT optimization modes.

Download a ResNet model pre-trained graph

This tutorial uses the ResNet model, which is trained on the ImageNet dataset and is available from the TensorFlow official models catalog. To download the ResNet model to your VM instance, run the following command:
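
A sketch of the download step; since the exact URL is not reproduced here, $FROZEN_GRAPH_URL stands in for the ResNet v2 frozen-graph file from the TensorFlow official models catalog, and the local filename is an assumption:

  cd $HOME
  # $FROZEN_GRAPH_URL is a placeholder for the frozen-graph URL given in the original post.
  curl -L -o resnetv2_imagenet_frozen_graph.pb "$FROZEN_GRAPH_URL"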

Verify model was downloaded correctly:
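​
For example, check that the file exists and has a plausible size (the filename matches the assumption above):

  ls -lh $HOME/resnetv2_imagenet_frozen_graph.pb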

Save the location of your ResNet model in the $WORKDIR variable:
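
For example, assuming the graph was downloaded to the home directory:

  export WORKDIR=$HOME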

Benchmarking the model

Leveraging fast linear algebra libraries and hand-tuned kernels, TensorRT can speed up inference workloads, but the most significant speed-up comes from the quantization process. Model quantization is the process by which you reduce the precision of the weights of a model. For example, if the initial weights of a model are FP32, you have the option to reduce the precision to FP16, INT8, or even INT4, with the goal of improving runtime performance. It's important to pick the right balance between speed (determined by the precision of the weights) and the accuracy of the model. Luckily, TensorFlow includes functionality that does exactly this, measuring accuracy vs. speed, or other metrics such as throughput, latency, node conversion rates, and total training time.

Note: This test is limited to image recognition models at the moment; however, it should not be too hard to implement a custom test based on this code.

Set up the ResNet model

To set up the model, run the following command:
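
A hedged sketch: one way to obtain the benchmark used below is to clone the tensorflow/models repository and install its Python dependencies (the exact setup steps in the original post may differ):

  cd $WORKDIR
  git clone https://github.com/tensorflow/models.git
  pip install --user -r models/official/requirements.txt   # assumed location of the requirements file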

This test requires a frozen graph from the ResNet model (the same one that we downloaded before), as well as arguments for the different quantization modes that we want to test.

The following command prepares the test for the execution:
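
For example, pointing the test at the frozen graph (the script location is an assumption):

  cd $WORKDIR/models/research/tensorrt   # assumed location of the TensorRT benchmark script
  export FROZEN_GRAPH=$HOME/resnetv2_imagenet_frozen_graph.pb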

Run the test
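
A hedged sketch of the benchmark invocation; the flag names mirror the quantization modes discussed in the notes below, but they are assumptions about the script's interface rather than the exact command from the original post:

  python tensorrt.py \
    --frozen_graph=$FROZEN_GRAPH \
    --native --fp32 --fp16 --int8 \
    --output_dir=$WORKDIR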

This command will take some time to finish.

Notes:

  • $WORKDIR is the directory in which you downloaded the ResNet model.

  • The --native, --fp32, --fp16, and --int8 arguments select the different quantization modes to test.

Review the results

When the test completes, you will see a comparison of the inference results for each optimization mode.

To see the full results, run the following command:
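
The output location depends on how the benchmark was invoked; assuming it wrote a log under $WORKDIR (a hypothetical path), something like:

  cat $WORKDIR/log.txt   # hypothetical log-file name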

[Figure: benchmark results on V100 (old)]

[Figure: benchmark results on V100]

[Figure: benchmark results on T4]

[Figure: benchmark results on P4]

From the above results, you can see that FP32 and FP16 produce identical prediction results. This means that if you are content working with TensorRT, you can definitely start using FP16 right away. INT8, on the other hand, shows slightly worse accuracy and requires understanding the accuracy-versus-performance tradeoffs for your models.

In addition, you can observe that when you run the model with TensorRT 5:

  • Using FP32 optimization improves throughput by 40% (440 vs 314). At the same time it decreases latency by ~30%, making it 0.28 ms instead of 0.40 ms.

  • Using FP16 optimization rather than the native TF graph increases the speed by 214% (from 314 to 988 fps). At the same time, latency decreases to 0.12 ms (almost a 3x decrease!).

  • Using INT8, the last result displayed above, we observed a speedup of 385% (from 314 to 1524) with the latency decreasing to 0.08 ms.

Notes:

  • The above results do not include the latency of image pre-processing or of HTTP requests. In production systems, raw inference speed may not be the bottleneck at all, and you will need to account for all of these factors to measure your end-to-end inference speed.

Now, let's pick a model to deploy, in this case the INT8 one.

Converting a custom model to TensorRT

Download and extract ResNet model

To convert a custom model to a TensorRT graph you will need a saved model. To download a saved INT8 ResNet model, run the following command:
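
A sketch of the download; since the exact URL is not reproduced here, $SAVED_MODEL_URL stands in for the SavedModel archive referenced above:

  cd $WORKDIR
  # $SAVED_MODEL_URL is a placeholder for the SavedModel archive URL given in the original post.
  curl -L -o resnet_savedmodel.tar.gz "$SAVED_MODEL_URL"
  tar -xzvf resnet_savedmodel.tar.gz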

Convert the model to a TensorRT graph with TFTools

Now we can convert this model to its corresponding TensorRT graph with a simple tool:
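
The post refers to a prepackaged conversion tool; purely as an illustration, the generic TF-TRT API in TensorFlow 1.x performs the same kind of conversion. The paths here are assumptions, and a real INT8 conversion additionally needs a calibration pass that the post's tool handles for you:

  # Generic TF-TRT conversion sketch (TensorFlow 1.x contrib API); not the exact tool from the post.
  import tensorflow.contrib.tensorrt as trt

  trt.create_inference_graph(
      input_graph_def=None,        # not needed when converting from a SavedModel
      outputs=None,
      max_batch_size=16,
      precision_mode="INT8",       # INT8 also requires a calibration pass, omitted here
      input_saved_model_dir="resnet_v2_savedmodel",          # assumed path of the downloaded SavedModel
      output_saved_model_dir="resnet_v2_int8_NCHW/00001")    # directory referenced later in the tutorial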

You now have an INT8 model in your $WORKDIR/resnet_v2_int8_NCHW/00001 directory.

To ensure that everything is set up properly, try running an inference test.

Upload the model to Cloud Storage

You’ll need to run this step so that the model can be served from the multi-zone cluster that we will set up in the next section. To upload the model, complete the following steps (a sketch of both follows the list):

1. Archive the model.

2. Upload the archive.
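
A sketch of both steps, assuming the converted model directory from above and a hypothetical Cloud Storage bucket name:

  cd $WORKDIR
  tar -zcvf model.tar.gz resnet_v2_int8_NCHW/
  # "your-bucket" is a hypothetical bucket name; replace with your own.
  gsutil cp model.tar.gz gs://your-bucket/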

If needed, you can obtain an INT8 precision variant of the frozen graph from Cloud Storage at this URL:

Setting up a multi-zone cluster

Create the cluster

Now that we have a model in Cloud Storage, let’s create a cluster.

Create an instance template

An instance template is a useful way to create new instances. Here’s how:
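
A hedged sketch, assuming the startup script start_agent_and_inf_server.sh (mentioned in the notes below) has been uploaded to a bucket of your choosing; the template name, machine type, and GPU count are assumptions:

  export INSTANCE_TEMPLATE_NAME="tf-inference-template"   # hypothetical name
  gcloud compute instance-templates create $INSTANCE_TEMPLATE_NAME \
    --machine-type=n1-standard-16 \
    --maintenance-policy=TERMINATE \
    --accelerator="type=nvidia-tesla-t4,count=1" \
    --image-family=tf-latest-gpu \
    --image-project=deeplearning-platform-release \
    --boot-disk-size=100GB \
    --metadata startup-script-url=gs://your-bucket/start_agent_and_inf_server.sh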

Notes:

  • This instance template includes a startup script that is specified by the metadata parameter.

  • The startup script runs during instance creation on every instance that uses this template, and performs the following steps:

    • Installs NVIDIA drivers on each new instance. Without NVIDIA drivers, inference will not work.

    • Installs a monitoring agent that monitors GPU usage on the instance

    • Downloads the model

    • Starts the inference service

  • The startup script runs tf_serve.py, which contains the inference logic. For this example, I have created a very small Python file based on the TFServe package.

  • To view the startup script, see start_agent_and_inf_server.sh.

Create a managed instance group

You’ll need to set up a managed instance group, to allow you to run multiple instances in specific zones. The instances are created based on the instance template generated in the previous step.
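
A hedged sketch, using the group name deeplearning-instance-group that appears later in the tutorial; the zones, region, and initial size are assumptions, so pick zones with T4 capacity and quota:

  gcloud compute instance-groups managed create deeplearning-instance-group \
    --template=$INSTANCE_TEMPLATE_NAME \
    --base-instance-name=deeplearning-instance \
    --size=2 \
    --zones=us-central1-a,us-central1-b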

Notes:

  • INSTANCE_TEMPLATE_NAME is the name of the instance template that you created in the previous step.

  • You can create these instances in any available zones that support T4 GPUs. Ensure that you have available GPU quota in those zones.

  • Creating the instance takes some time. You can watch the progress with the following command:
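
For example (the region matches the zones assumed above):

  gcloud compute instance-groups managed list-instances deeplearning-instance-group \
    --region us-central1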

Once the managed instance group is created, you should see the new instances listed with a RUNNING status.

Confirm metrics in Stackdriver

1. Open Stackdriver’s Metrics Explorer in the Google Cloud Console.

2. Search for gpu_utilization (Stackdriver > Resources > Metrics Explorer).

[Figure: searching for gpu_utilization in Stackdriver Metrics Explorer]

3. If data is coming in, you should see something like this:

[Figure: gpu_utilization data in Metrics Explorer]

Enable auto-scaling

Now, you’ll need to enable auto-scaling for your managed instance group.
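
A hedged sketch of the autoscaling configuration, using the custom GPU-utilization metric and the target level described in the notes that follow; the replica counts and cool-down period are assumptions:

  gcloud compute instance-groups managed set-autoscaling deeplearning-instance-group \
    --region us-central1 \
    --min-num-replicas 2 \
    --max-num-replicas 4 \
    --cool-down-period 360 \
    --custom-metric-utilization metric=custom.googleapis.com/gpu_utilization,utilization-target=85,utilization-target-type=GAUGE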

Notes:

  • custom.googleapis.com/gpu_utilization is the full path to our metric.

  • We use a utilization target of 85, which means that whenever GPU utilization reaches 85%, the platform will create a new instance in our group.

Test auto-scaling

To test auto-scaling, perform the following steps:

1. SSH to the instances. See Connecting to Instances for more details.

2. Use the gpu-burn tool to load your GPU to 100% utilization for 600 seconds:
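
One commonly used implementation is the open-source gpu-burn tool; a sketch, assuming CUDA is installed in its default location on the Deep Learning VM:

  git clone https://github.com/wilicc/gpu-burn
  cd gpu-burn
  make              # builds against the locally installed CUDA toolkit
  ./gpu_burn 600    # load the GPU(s) at ~100% utilization for 600 seconds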

Notes:

  • During the make process, you may receive some warnings; you can ignore them.

  • You can monitor GPU usage while the load test runs (a sketch follows).
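
For example, using nvidia-smi with a 5-second refresh interval:

  nvidia-smi -l 5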

3. You can observe the autoscaling in Stackdriver as new instances are added one at a time.

[Figure: autoscaling activity in Stackdriver]

4. Go to the Instance Groups page in the Google Cloud Console.

5. Click on the deeplearning-instance-group managed instance group.

6. Click on the Monitoring tab.

[Figure: the instance group’s Monitoring tab]

At this point your auto-scaling logic should be trying to spin up as many instances as needed to reduce the load. And that is exactly what is happening:

[Figure: instance group monitoring showing new instances being added]

At this point you can safely stop the load generated by the gpu-burn tool and watch the cluster scale back down.

Set up a load balancer

Let's revisit what we have so far:

  • A trained model, optimized with TensorRT 5 (using INT8 quantization)

  • A managed instance group. These instances have auto-scaling enabled based on GPU utilization.

Now you can create a load balancer in front of the instances.

Create health checks

Health checks are used to determine if a particular host on our backend can serve the traffic.

Create an inference forwarder

Configure named ports on the instance group so that the load balancer can forward inference requests, received on port 80, to the inference service, which is served on port 8888.
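
A sketch, assuming the named port is called http and the instance group is regional (both names are assumptions):

  gcloud compute instance-groups set-named-ports deeplearning-instance-group \
    --named-ports http:8888 \
    --region us-central1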

Create a backend service

Create a backend service that has an instance group and health check.

First, create the health check:
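
For example (the health-check name is an assumption; the port matches the inference service):

  gcloud compute health-checks create http http-basic-check \
    --port 8888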

Then, add the instance group to the new backend service:
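
Since the command that creates the backend service itself is not reproduced above, this sketch first creates it with the health check attached, then adds the instance group (all names are assumptions):

  gcloud compute backend-services create tf-inference-backend \
    --protocol HTTP \
    --port-name http \
    --health-checks http-basic-check \
    --global

  gcloud compute backend-services add-backend tf-inference-backend \
    --instance-group deeplearning-instance-group \
    --instance-group-region us-central1 \
    --global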

Set up the forwarding URL

The load balancer needs to know which URL can be forwarded to the backend services.
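
A sketch using a URL map that sends all requests to the backend service created above (the map name is an assumption):

  gcloud compute url-maps create tf-inference-map \
    --default-service tf-inference-backend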

Create the load balancer

Add an external IP address to the load balancer:
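
For example (the address name is an assumption):

  gcloud compute addresses create tf-inference-ip --global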

Find the allocated IP address:
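
For example:

  gcloud compute addresses describe tf-inference-ip --global --format="value(address)"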

Set up the forwarding rule that tells GCP to forward all requests from the public IP to the load balancer:
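
A sketch; note that a global forwarding rule points at a target HTTP proxy, which in turn references the URL map, so the proxy is created first (names are assumptions):

  gcloud compute target-http-proxies create tf-inference-proxy \
    --url-map tf-inference-map

  gcloud compute forwarding-rules create tf-inference-forwarding-rule \
    --global \
    --address tf-inference-ip \
    --target-http-proxy tf-inference-proxy \
    --ports 80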

After creating the global forwarding rules, it can take several minutes for your configuration to propagate.

Enable the firewall

You need to add a firewall rule to your project; otherwise it will be impossible to connect to your VM instances from the external internet. To create a firewall rule for your instances, run the following command:
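
A sketch that opens the two ports used in this tutorial (the rule name is an assumption):

  gcloud compute firewall-rules create www-firewall \
    --allow tcp:80,tcp:8888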

Running inference

You can use the following Python script to convert images to a format that can be uploaded to the server.
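
A sketch of such a converter; the exact payload format expected by tf_serve.py is not shown in the post, so the {"image": [...]} structure, file names, and image size are assumptions:

  # Convert an image into a JSON payload for the inference service (format is an assumption).
  import json
  import numpy as np
  from PIL import Image

  def image_to_payload(path, size=(224, 224)):
      img = Image.open(path).convert("RGB").resize(size)
      arr = np.asarray(img, dtype=np.float32) / 255.0   # normalize pixel values to [0, 1]
      return json.dumps({"image": arr.tolist()})

  with open("payload.json", "w") as f:
      f.write(image_to_payload("cat.jpg"))   # "cat.jpg" is a hypothetical sample image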

Finally, run the inference request:
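
A sketch of the request, assuming $PUBLIC_IP holds the load balancer address found earlier and that the serving script accepts a JSON POST at the root path (the path and payload format are assumptions):

  curl -X POST -H "Content-Type: application/json" \
    -d @payload.json http://$PUBLIC_IP/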

That’s it!

Toward TensorFlow inference bliss

Running ML inference workloads with TensorFlow has come a long way. Together, the combination of NVIDIA T4 GPUs and the TensorRT framework makes running inference workloads a relatively trivial task, and with T4 GPUs available on Google Cloud, you can spin them up and down on demand. If you have feedback on this post, please reach out to us here.

Acknowledgements: Viacheslav Kovalevskyi, Software Engineer, Gonzalo Gasca Meza, Developer Programs Engineer, Yaboo Oyabu, Machine Learning Specialist and Karthik Ramasamy, Software Engineer contributed to this post.

1. Inference benchmarks show ResNet to be 27x faster and GNMT to be 36x faster than CPU-only platforms.

Unless otherwise noted, the content of this article is licensed under the Creative Commons Attribution 3.0 License, and code samples are licensed under the Apache 2.0 License. For details, see our Terms of Service.

Tags: Cloud
