Google China Developer Community (GDG)

Analyzing 3024 rice genomes characterized by DeepVariant

2019-03-19 | admin | GoogleCloud

Source: Analyzing 3024 rice genomes characterized by DeepVariant from Google Cloud

Rice is an ideal candidate for study in genomics, not only because it’s one of the world’s most important food crops, but also because centuries of agricultural cross-breeding have created unique, geographically-induced differences. With the potential for global population growth and climate change to impact crop yields, the study of this genome has important social considerations.

This post explores how to identify and analyze different rice genome mutations with a tool called DeepVariant. To do this, we re-analyzed the Rice 3K dataset and have made the data publicly available as part of the Google Cloud Public Dataset Program, pre-publication and under the terms of the Toronto Statement.

We aim to show how AI can improve food security by accelerating genetic enhancement to increase rice crop yield. According to the Food and Agriculture Organization of the United Nations, crop improvements will reduce the negative impact of climate change and loss of arable land on rice yields, as well as support an estimated 25% increase in rice demand by 2030.

Why catalog genetic variation for rice on Google Cloud?

In March 2018, Google AI showed that deep convolutional neural networks can identify genetic variation in aligned DNA sequence data. This approach, called DeepVariant, outperforms existing methods on human data, and we showed that the approach used to call variants in humans can also call variants in other animal species. This blog post demonstrates that DeepVariant is also effective at calling variants in a plant, which shows the effectiveness of deep neural network transfer learning in genomics.

In April 2018, three research institutions—the Chinese Academy of Agricultural Sciences (CAAS), the Beijing Genomics Institute (BGI) Shenzhen, and the International Rice Research Institute (IRRI)—published the results of a collaboration to sequence and characterize the genomic variation of the Rice 3K dataset, which consists of genomes from 3,024 varieties of rice from 89 countries. Variant calls used in this publication were identified against a Nipponbare reference genome using best practices and are available from the SNP-Seek database (Mansueto et al, 2017).

We recharacterized the genomic variation of the Rice 3K dataset with DeepVariant. Preliminary results indicate a larger number of variants discovered at a similar or lower error rate than those detected by conventional best practice, i.e. GATK.

In total, the Rice 3K DeepVariant dataset contains ~12 billion variants at ~74 million genomic locations (SNPs and indels). These are available in a 1.5 terabyte (TB) table that uses the BigQuery Variants Schema.

Even at this size, you can still run interactive analyses, thanks to the scalable design of BigQuery. The queries we present below run in a few seconds to a few minutes. Speed matters, because genomic data are often interlinked with data generated by other precision agriculture technologies.

Illustrative queries and analyses

Below, we present some example queries and visualizations of how to query and analyze the Rice 3K dataset. Our analyses focus on two topics:

  • The distribution of genome variant positions, across 3024 rice varieties.
  • The distribution of allele frequencies across the rice genome.

For a step-by-step tutorial on how to work with variant data in BigQuery using the Rice 3K data or another variant dataset of your choosing, consider trying out the Analyzing variants with BigQuery codelab.
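
As a small illustration of the second topic, the allele frequency at a single position can be computed from per-sample genotype calls. The sketch below is not the query used for this post. It assumes each call is a tuple of allele indices (0 for the reference allele, 1 and up for alternates) and that -1 marks a missing call; the function name and the sample data are hypothetical.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return {allele_index: frequency} over all called alleles.

    `genotypes` is a list of per-sample allele-index tuples, e.g. (0, 1).
    Index 0 is the reference allele; -1 marks a missing call (assumed encoding).
    """
    counts = Counter(
        allele
        for genotype in genotypes
        for allele in genotype
        if allele >= 0  # skip missing calls
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in sorted(counts.items())}


# Hypothetical calls for four varieties at one position.
calls = [(0, 0), (0, 1), (1, 1), (-1, -1)]
freqs = allele_frequencies(calls)
print(freqs)
```

Missing calls are excluded from the denominator, so the frequencies always sum to one over the alleles that were actually observed.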

Analysis 1: Genetic variants are not uniformly distributed

Genomic locations with very high or very low levels of variation can indicate regions of the genome that are under unusually high or low selective pressure.

In the case of these rice varieties, high selective pressure (which corresponds to low genetic variation) indicates regions of the genome under high artificial selective pressure (i.e. domestication). Moreover, these regions contain genes responsible for traits that regulate important cultivational or nutritional properties of the plant.

We can measure the magnitude of the regional pressure by calculating at each position the Z statistic of each individual variety vs. all varieties. Here’s the query we used to produce the heatmap below, which shows the distribution of genetic variation across all 1Mbase-sized regions across all 12 chromosomes as columns (labeled by the top colored row), vs. all 3024 rice varieties as rows. Red indicates very low variant density relative to other samples within a particular genomic region, while pale yellow indicates very high variant density within a particular genomic region. The dendrogram below shows the similarity among samples (branch length) and groups similar rice varieties together:

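[Figure: rice_genomes_plot.png. Heatmap of variant-density Z scores for 1 Mbase regions across the 12 chromosomes (columns) versus the 3024 rice varieties (rows), with a dendrogram grouping similar varieties.]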

A high resolution PDF of this plot is available, as well as the R script used to generate it.

Some interesting details of the dataset are highlighted (in yellow) in the heatmap above:

  1. Closer inspection of chromosome 5 (cyan columns, 1Mbase blocks 9-12) shows that the distinct distribution of Z scores across samples likely occurs due to two factors:

    1. this region includes many centromeric satellites resulting in a high false-positive rate of variants detected, and

    2. a genomic introgression present in some of the rice varieties multiplies this effect (yellow rows).

  2. Nearly all of the 3024 rice varieties included in the Rice 3K dataset are from rice species Oryza sativa. However, 5 Oryza glaberrima varieties were also included. These have a high level of detected genetic variation because they are from a different species, and are revealed as a bright yellow band at the top of the heatmap.

  3. The majority of samples can be partitioned into one group with high variant density and another group with low variant density. This partition fits with previously used methods for classification by admixture. For example, the bottom rows that are mostly red correspond to rice varieties in the japonica and circum-basmati (aromatic) groups that are similar to the Nipponbare reference genome we used.
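
The per-region Z statistic described in this analysis can be sketched in a few lines. This is only an illustration, not the query behind the heatmap: it assumes variant counts have already been tallied per variety and per 1 Mbase region, and it standardizes each region using the population standard deviation across varieties. The function name and the counts are hypothetical.

```python
from statistics import mean, pstdev


def region_z_scores(counts_by_sample):
    """Z-score each sample's variant count against all samples, per region.

    `counts_by_sample` maps sample name -> list of variant counts, one per
    genomic region (same region order for every sample).
    """
    samples = list(counts_by_sample)
    n_regions = len(counts_by_sample[samples[0]])
    z = {s: [] for s in samples}
    for r in range(n_regions):
        column = [counts_by_sample[s][r] for s in samples]
        mu, sigma = mean(column), pstdev(column)
        for s in samples:
            # A region with no spread across samples gets Z = 0 for everyone.
            score = 0.0 if sigma == 0 else (counts_by_sample[s][r] - mu) / sigma
            z[s].append(score)
    return z


# Hypothetical counts for three varieties in two regions.
counts = {
    "variety_a": [10, 5],
    "variety_b": [20, 5],
    "variety_c": [30, 5],
}
z = region_z_scores(counts)
print(z)
```

A strongly negative score marks a variety with unusually few variants in that region relative to the others, which corresponds to the red cells in the heatmap.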

Analysis 2: Some specific regions are under selective pressure

According to the Hardy-Weinberg Principle, the expected genotype frequencies within a randomly mating population, in the absence of selective evolutionary pressure, can be calculated from the component allele frequencies. For a bi-allelic position having alleles P and Q and corresponding population frequencies $p$ and $q$, the expected genotype proportions for PP, PQ, and QQ are $p^2$, $2pq$, and $q^2$, which satisfy $p^2 + 2pq + q^2 = 1$. However, we need to modify this formula by adding an inbreeding coefficient $F$ to reflect the population structure (see: Wahlund effect) and the self-pollination tendency of rice: $PP = p^2 + Fpq$; $PQ = 2(1-F)pq$; $QQ = q^2 + Fpq$, where $F = 0.95$.
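
As a quick consistency check on the modified proportions, the inbreeding terms added to the two homozygote classes are exactly what is removed from the heterozygote class, so the three proportions still sum to one for any F, given that p + q = 1:

```latex
\begin{aligned}
(p^2 + Fpq) + 2(1-F)pq + (q^2 + Fpq)
  &= p^2 + 2pq + q^2 + 2Fpq - 2Fpq \\
  &= (p + q)^2 = 1 .
\end{aligned}
```

For example, with illustrative frequencies p = 0.6 and q = 0.4 and F = 0.95, the expected proportions are PP = 0.36 + 0.228 = 0.588, PQ = 2(0.05)(0.24) = 0.024, and QQ = 0.16 + 0.228 = 0.388. Heterozygotes are expected at only 2.4% of samples, compared with 48% under random mating.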

The deviation of a genomic position from the expected genotype distribution can be tested with a $\chi^2$ statistic, allowing a p-value to be derived and thus identification of positions that are either under significant selective pressure or neutral. In short, this analysis highlights the fact that rice is highly inbred.
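
A minimal sketch of this test for one bi-allelic position follows. It is an illustration rather than the query used for this post. It assumes the allele frequency is estimated from the observed genotype counts, that F is held fixed, and that the statistic is compared against a chi-square distribution with one degree of freedom (three genotype classes, minus one, minus one estimated frequency). The function name and the counts are hypothetical.

```python
import math


def hwe_inbreeding_test(n_pp, n_pq, n_qq, f=0.95):
    """Chi-square test of observed genotype counts against inbreeding-adjusted
    Hardy-Weinberg expectations. Returns (chi2, p_value)."""
    n = n_pp + n_pq + n_qq
    if n == 0:
        raise ValueError("no genotype calls")
    p = (2 * n_pp + n_pq) / (2 * n)  # estimated frequency of allele P
    q = 1.0 - p
    expected = [
        n * (p * p + f * p * q),    # PP
        n * (2 * (1 - f) * p * q),  # PQ
        n * (q * q + f * p * q),    # QQ
    ]
    observed = [n_pp, n_pq, n_qq]
    # Classes with zero expectation (monomorphic positions) are skipped.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom
    # (assumed here; see the lead-in).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Counts matching the inbred expectation for p = 0.6, F = 0.95.
chi2_fit, p_fit = hwe_inbreeding_test(588, 24, 388)
# Counts matching random mating (F = 0), which an inbred model rejects.
chi2_out, p_out = hwe_inbreeding_test(250, 500, 250)
print(chi2_fit, p_fit, chi2_out, p_out)
```

The second example has far more heterozygotes than a highly inbred population should produce, so it would count as a position that is out of the modified equilibrium.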

Below you can find a plot of 10-kilobase genome regions from the Oryza sativa genome, colored according to the proportion of variant positions that are significantly (p<0.05) out of (inbreeding modified) Hardy-Weinberg equilibrium, with white regions corresponding to those under low selective pressure and red regions corresponding to those under high selective pressure:

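[Figure: Oryza sativa genome plot.png. 10-kilobase regions of the Oryza sativa genome, colored from white (low selective pressure) to red (high selective pressure).]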

The data shown above were retrieved using this query and plotted using this R script. The query used to make this figure was adapted to the BigQuery Variants Schema from one of a number of quality control metrics found in the Google Genomics Cookbook.
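
The binning step behind this kind of plot can be sketched as follows. This is an illustration only, not the linked query or R script: it assumes a p-value has already been computed for each variant position, and it groups positions into fixed 10-kilobase bins per chromosome. The function name, record layout, and sample values are hypothetical.

```python
from collections import defaultdict


def significant_fraction_by_bin(records, bin_size=10_000, alpha=0.05):
    """Map (chromosome, bin_start) -> fraction of variant positions with p < alpha.

    `records` is an iterable of (chromosome, position, p_value) tuples.
    """
    totals = defaultdict(int)
    significant = defaultdict(int)
    for chrom, pos, p_value in records:
        key = (chrom, (pos // bin_size) * bin_size)
        totals[key] += 1
        if p_value < alpha:
            significant[key] += 1
    return {key: significant[key] / totals[key] for key in totals}


# Hypothetical per-position test results.
records = [
    ("chr1", 1_200, 0.001),
    ("chr1", 9_999, 0.40),
    ("chr1", 10_000, 0.03),
    ("chr2", 25_500, 0.70),
]
fractions = significant_fraction_by_bin(records)
print(fractions)
```

Because each bin is normalized by its own number of variant positions, the result is independent of variant density, which is why the two analyses can disagree for the same region.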

Note that selective pressure on the genome is not uniformly distributed, indicated by the clumps of red visible in the plot. Interestingly, there is little correspondence between the prevalence of variants within a region (previous figure) and the proportion of variants within that same region that are under significant selective pressure. The bin size (10 kilobases) used in this visualization is on the order of the average Oryza sativa gene size (3 kilobases) and, given the low correlation between high selective pressure and variant density, it may be useful to guide a gene hunting expedition aimed at identifying genomic loci associated with phenotypes of interest (i.e. those that affect caloric areal yield, nutritive value, and drought- and pest-resistance).

Data availability and conclusion

Genome sequencer reads in FastQ format from Sequence Read Archive Project PRJEB6180 were aligned to the Oryza sativa Os-Nipponbare-Reference-IRGSP-1.0 reference genome using the Burrows-Wheeler Aligner (BWA), producing a set of aligned read files in BAM format.
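
For readers who want to reproduce an alignment step like this on their own data, the sketch below assembles the command lines without running them. The post does not state which BWA subcommand or options were used, so `bwa mem` with paired-end reads, piped into `samtools sort` to write the BAM file, is an assumption. All file names are hypothetical, and the reference is assumed to have been indexed with `bwa index` beforehand.

```python
import shlex


def alignment_commands(reference, fastq_1, fastq_2, out_bam, threads=4):
    """Return (bwa_argv, samtools_argv).

    bwa writes SAM to standard output; it is meant to be piped into
    samtools, which sorts the alignments and writes a BAM file.
    """
    bwa = ["bwa", "mem", "-t", str(threads), reference, fastq_1, fastq_2]
    sort = ["samtools", "sort", "-o", out_bam, "-"]
    return bwa, sort


# Hypothetical file names for one rice variety.
bwa, sort = alignment_commands(
    "IRGSP-1.0.fa", "sample_1.fastq.gz", "sample_2.fastq.gz", "sample.bam"
)
pipeline = shlex.join(bwa) + " | " + shlex.join(sort)
print(pipeline)
```

Building the argument lists separately keeps the example free of side effects; the two lists can be passed to `subprocess.Popen` with a pipe between them once the tools are installed.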

Subsequently, the BAM files were processed with the Cloud DeepVariant Pipeline, a Cloud TPU-enabled, managed service that executes the DeepVariant open-source software. The pipeline produced a list of variants detected in the aligned reads, and these variants were written out to storage as a set of variant call files in VCF format.

Finally, all VCF files were processed with the Variant Transforms Cloud Dataflow Pipeline, which wrote records to a BigQuery Public Dataset table in the BigQuery Variants Schema format.

For additional guidance on how to use DeepVariant and BigQuery to analyze your own data on Google Cloud, please check out the following resources:

  • Variant Calling on a Rice Genome with DeepVariant

  • Analyzing variants with BigQuery

  • The Google Genomics Cookbook

  • DeepVariant on GitHub

Acknowledgments

We’d like to thank our collaborators and their organizations—both within and outside Google—for making this post possible:

  • Allen Day, Google Cloud

  • Ryan Poplin, Google AI

  • Ken McNally, IRRI

  • Dmytro Chebotarov, IRRI

  • Ramil Mauleon, IRRI

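Except as otherwise noted, the content of this article is licensed under the Creative Commons Attribution 3.0 License, and code samples are licensed under the Apache 2.0 License. For details, see our Terms of Service.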
