谷歌中国开发者社区 (GDG)
  • 主页
  • 博客
    • Android
    • Design
    • GoogleCloud
    • GoogleMaps
    • GooglePlay
    • Web
  • 社区
    • 各地社区
    • 社区历史
    • GDG介绍
    • 社区通知
  • 视频
  • 资源
    • 资源汇总
    • 精选视频
    • 优酷频道

Balanced Partitioning and Hierarchical Clustering at Scale

2018-03-15adminGoogleDevFeedsNo comments

Posted by Hossein Bateni, Research Scientist and Kevin Aydin, Software Engineer, NYC Algorithms and Optimization Research Team

Solving large-scale optimization problems often starts with graph partitioning, which means partitioning the vertices of the graph into clusters to be processed on different machines. The need to make sure that clusters are of near equal size gives rise to the balanced graph partitioning problem. In simple terms, we need to partition the vertices of a given graph into k almost equal clusters, while we minimize the number of edges that are cut by the partition. This NP-hard problem is notoriously difficult in practice because the best approximation algorithms for small instances rely on semidefinite programming which is impractical for larger instances.

This post presents the distributed algorithm we developed which is more applicable to large instances. We introduced this balanced graph-partitioning algorithm in our WSDM 2016 paper, and have applied this approach to several applications within Google. Our more recent NIPS 2017 paper provides more details of the algorithm via a theoretical and empirical study.

Balanced Partitioning via Linear Embedding
Our algorithm first embeds vertices of the graph onto a line, and then processes vertices in a distributed manner guided by the linear embedding order. We examine various ways to find the initial embedding, and apply four different techniques (such as local swaps and dynamic programming) to obtain the final partition. The best initial embedding is based on “affinity clustering”.

Affinity Hierarchical Clustering
Affinity clustering is an agglomerative hierarchical graph clustering based on Borůvka’s classic Maximum-cost Spanning Tree algorithm. As discussed above, this algorithm is a critical part of our balanced partitioning tool. The algorithm starts by placing each vertex in a cluster of its own: v0, v1, and so on. Then, in each iteration, the highest-cost edge out of each cluster is selected in order to induce larger merged clusters: A0, A1, A2, etc. in the first round and B0, B1, etc. in the second round and so on. The set of merges naturally produces a hierarchical clustering, and gives rise to a linear ordering of the leaf vertices (vertices with degree one). The image below demonstrates this, with the numbers at the bottom corresponding to the ordering of the vertices.

Our NIPS’17 paper explains how we run affinity clustering efficiently in the massively parallel computation (MPC) model, in particular using distributed hash tables (DHTs) to significantly reduce running time. This paper also presents a theoretical study of the algorithm. We report clustering results for graphs with tens of trillions of edges, and also observe that affinity clustering empirically beats other clustering algorithms such as k-means in terms of “quality of the clusters”. This video contains a summary of the result and explains how this parallel algorithm may produce higher-quality clusters even compared to a sequential single-linkage agglomerative algorithm.

Comparison to Previous Work
In comparing our algorithm to previous work in (distributed) balanced graph partitioning, we focus on FENNEL, Spinner, METIS, and a recent label propagation-based algorithm. We report results on several public social networks as well as a large private map graph. For a Twitter followership graph, we see a consistent improvement of 15–25% over previous results (Ugander and Backstrom, 2013), and for LiveJournal graph, our algorithm outperforms all the others for all cases except k = 2, where ours is slightly worse than FENNEL’s.

The following table presents the fraction of cut edges in the Twitter graph obtained via different algorithms for various values of k, the number of clusters. The numbers given in parentheses denote the size imbalance factor: i.e., the relative difference of the sizes of largest and smallest clusters. Here “Vanilla Affinity Clustering” denotes the first stage of our algorithm where only the hierarchical clustering is built and no further processing is performed on the cuts. Notice that this is already as good as the best previous work (shown in the first two columns below), cutting a smaller fraction of edges while achieving a perfect (and thus better) balance (i.e., 0% imbalance). The last column in the table includes the final result of our algorithm with the post-processing.

k
UB13
(5%)
Spinner
(5%)
Vanilla Affinity
Clustering
(0%)
Final Algorithm
(0%)
20
37.0%
38.0%
35.71%
27.50%
40
43.0%
40.0%
40.83%
33.71%
60
46.0%
43.0%
43.03%
36.65%
80
47.5%
44.0%
43.27%
38.65%
100
49.0%
46.0%
45.05%
41.53%

Applications
We apply balanced graph partitioning to multiple applications including Google Maps driving directions, the serving backend for web search, and finding treatment groups for experimental design. For example, in Google Maps the World map graph is stored in several shards. The navigational queries spanning multiple shards are substantially more expensive than those handled within a shard. Using the methods described in our paper, we can reduce 21% of cross-shard queries by increasing the shard imbalance factor from 0% to 10%. As discussed in our paper, live experiments on real traffic show that the number of multi-shard queries from our cut-optimization techniques is 40% less compared to a baseline Hilbert embedding technique. This, in turn, results in less CPU usage in response to queries. In a future blog post, we will talk about application of this work in the web search serving backend, where balanced partitioning helped us design a cache-aware load balancing system that dramatically reduced our cache miss rate.

Acknowledgements
We especially thank Vahab Mirrokni whose guidance and technical contribution were instrumental in developing these algorithms and writing this post. We also thank our other co-authors and colleagues for their contributions: Raimondas Kiveris, Soheil Behnezhad, Mahsa Derakhshan, MohammadTaghi Hajiaghayi, Silvio Lattanzi, Aaron Archer and other members of NYC Algorithms and Optimization research team.



Source: Balanced Partitioning and Hierarchical Clustering at Scale

除非特别声明,此文章内容采用知识共享署名 3.0许可,代码示例采用Apache 2.0许可。更多细节请查看我们的服务条款。

Tags: Develop

Related Articles

Google I/O 2017:助力开发者在各个平台上打造最佳体验

2017-05-23admin

DeepVariant:利用深度神经网络重构高度精确的基因组

2017-12-27admin

Updates from Coral: A new compiler and much more

2019-04-12admin

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">

Recent Posts

  • Admin Essentials: know your options for Modern Enterprise Browser Management
  • TheVentureCity and Google Consolidate Miami as a Tech Powerhouse
  • Keep a better eye on your Google Cloud environment
  • Using HLL++ to speed up count-distinct in massive datasets
  • Season of Docs Announces Results of 2019 Program

Recent Comments

  • admin on Using advanced Kubernetes autoscaling with Vertical Pod Autoscaler and Node Auto Provisioning
  • Martijn on Using advanced Kubernetes autoscaling with Vertical Pod Autoscaler and Node Auto Provisioning
  • Martijn on Using advanced Kubernetes autoscaling with Vertical Pod Autoscaler and Node Auto Provisioning
  • Chen Zhixiang on Concurrent marking in V8
  • admin on 使用 Android Jetpack 加快应用开发速度

Archives

  • December 2019
  • November 2019
  • October 2019
  • September 2019
  • August 2019
  • July 2019
  • June 2019
  • May 2019
  • April 2019
  • March 2019
  • February 2019
  • January 2019
  • December 2018
  • November 2018
  • October 2018
  • September 2018
  • August 2018
  • July 2018
  • June 2018
  • May 2018
  • April 2018
  • March 2018
  • February 2018
  • January 2018
  • December 2017
  • November 2017
  • October 2017
  • September 2017
  • August 2017
  • July 2017
  • June 2017
  • May 2017
  • April 2017
  • March 2017
  • February 2017
  • January 2017
  • December 2016
  • November 2016
  • October 2016
  • September 2016
  • August 2016
  • May 2016
  • April 2016
  • March 2016
  • February 2016
  • January 2016
  • December 2015
  • November 2015
  • October 2015
  • September 2015
  • August 2015
  • July 2015
  • June 2015
  • January 1970

Categories

  • Android
  • Design
  • Firebase
  • GoogleCloud
  • GoogleDevFeeds
  • GoogleMaps
  • GooglePlay
  • Google动态
  • iOS
  • Uncategorized
  • VR
  • Web
  • WebMaster
  • 社区
  • 通知

Meta

  • Log in
  • Entries RSS
  • Comments RSS
  • WordPress.org

最新文章

  • Admin Essentials: know your options for Modern Enterprise Browser Management
  • TheVentureCity and Google Consolidate Miami as a Tech Powerhouse
  • Keep a better eye on your Google Cloud environment
  • Using HLL++ to speed up count-distinct in massive datasets
  • Season of Docs Announces Results of 2019 Program
  • Admin Insider: What's new in Chrome Enterprise, Release 79
  • Discover insights from text with AutoML Natural Language, now generally available
  • Introducing Storage Transfer Service for on-premises data
  • How Mynd uses G Suite to manage a flurry of acquisitions
  • W3C Trace Context Specification: What it Means for You

最多查看

  • 如何选择 compileSdkVersion, minSdkVersion 和 targetSdkVersion (25,371)
  • Google 推出的 31 套在线课程 (22,456)
  • 谷歌招聘软件工程师 (22,336)
  • Seti UI 主题: 让你编辑器焕然一新 (13,823)
  • Android Studio 2.0 稳定版 (9,420)
  • Android N 最初预览版:开发者 API 和工具 (8,036)
  • 像 Sublime Text 一样使用 Chrome DevTools (6,323)
  • 用 Google Cloud 打造你的私有免费 Git 仓库 (6,076)
  • Google I/O 2016: Android 演讲视频汇总 (5,608)
  • 面向普通开发者的机器学习应用方案 (5,539)
  • 生还是死?Android 进程优先级详解 (5,228)
  • 面向 Web 开发者的 Sublime Text 插件 (4,341)
  • 适配 Android N 多窗口特性的 5 个要诀 (4,311)
  • 参加 Google I/O Extended,观看 I/O 直播,线下聚会! (3,620)
© 2019 中国谷歌开发者社区 - ChinaGDG