Source: Improved Grading of Prostate Cancer Using Deep Learning from Google Research
Posted by Martin Stumpe, Technical Lead and Craig Mermel, Product Manager, Healthcare, Google AI
Approximately 1 in 9 men in the United States will develop prostate cancer in their lifetime, making it the most common cancer in males. Despite being common, prostate cancers are frequently non-aggressive, making it challenging to determine if the cancer poses a significant enough risk to the patient to warrant treatment such as surgical removal of the prostate (prostatectomy) or radiation therapy. A key factor that helps in the “risk stratification” of prostate cancer patients is the Gleason grade, which classifies the cancer cells based on how closely they resemble normal prostate glands when viewed on a slide under a microscope.
However, despite its widely recognized clinical importance, Gleason grading of prostate cancer is complex and subjective, as evidenced by studies reporting inter-pathologist disagreements ranging from 30-53%. Furthermore, there are not enough specialty-trained pathologists to meet the global demand for prostate cancer pathology, especially outside the United States. Recent guidelines also recommend that pathologists report the percentage of tumor of different Gleason patterns in their final report, which adds to the workload and is yet another subjective challenge for the pathologist. Overall, these issues suggest an opportunity to improve the diagnosis and clinical management of prostate cancer using deep learning–based models, similar to how Google and others used such techniques to demonstrate the potential to improve metastatic breast cancer detection.
In “Development and Validation of a Deep Learning Algorithm for Improving Gleason Scoring of Prostate Cancer”, we explore whether deep learning could improve the accuracy and objectivity of Gleason grading of prostate cancer in prostatectomy specimens. We developed a deep learning system (DLS) that mirrors a pathologist’s workflow by first categorizing each region in a slide into a Gleason pattern, with lower patterns corresponding to tumors that more closely resemble normal prostate glands. The DLS then summarizes an overall Gleason grade group based on the two most common Gleason patterns present. The higher the grade group, the greater the risk of further cancer progression and the more likely the patient is to benefit from treatment.
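The second stage described above can be sketched in a few lines. This is a simplified illustration, not the DLS itself: it assumes each tumor patch has already been assigned a Gleason pattern of 3, 4, or 5, takes the two most common patterns as primary and secondary, and applies the standard grade group mapping (3+3 is group 1, 3+4 is group 2, 4+3 is group 3, a sum of 8 is group 4, and 9 or 10 is group 5). The function names `grade_group` and `summarize_slide` are hypothetical, and real clinical reporting rules (for example, minimum-percentage thresholds for the secondary pattern) are deliberately left out.

```python
from collections import Counter


def grade_group(primary: int, secondary: int) -> int:
    """Map primary and secondary Gleason patterns (3, 4, or 5) to a grade group 1-5."""
    score = primary + secondary
    if score <= 6:
        return 1
    if score == 7:
        # 3+4 and 4+3 have the same sum but different prognosis.
        return 2 if primary == 3 else 3
    if score == 8:
        return 4
    return 5


def summarize_slide(patch_patterns):
    """Summarize per-patch Gleason patterns into one grade group.

    patch_patterns: iterable of ints, one per tumor patch (hypothetical input
    format). Returns None when no tumor patches are present.
    """
    counts = Counter(p for p in patch_patterns if p in (3, 4, 5))
    if not counts:
        return None
    ranked = counts.most_common(2)
    primary = ranked[0][0]
    # With a single pattern present, it counts as both primary and secondary.
    secondary = ranked[1][0] if len(ranked) > 1 else primary
    return grade_group(primary, secondary)


if __name__ == "__main__":
    # Toy slide: 70 patches of pattern 3 and 30 patches of pattern 4.
    print(summarize_slide([3] * 70 + [4] * 30))
```

Keeping the mapping separate from the patch counting makes it easy to swap in a different summarization rule without touching the grade group table.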
|Visual examples of Gleason patterns, which are used in the Gleason system for grading prostate cancer. Individual cancer patches are assigned a Gleason pattern based on how closely the cancer resembles normal prostate tissue, with lower numbers corresponding to more well-differentiated tumors. Image Source: National Institutes of Health.|
To develop and validate the DLS, we collected de-identified images of prostatectomy samples, which contain a greater amount and diversity of prostate cancer than needle core biopsies, even though biopsies are the more common clinical procedure. For the training data, a cohort of 32 pathologists provided detailed annotations of Gleason patterns (resulting in over 112 million annotated image patches) and an overall Gleason grade group for each image. To overcome the previously referenced variability in Gleason grading, each slide in the validation set was independently graded by 3 to 5 general pathologists (selected from a cohort of 29 pathologists) and had a final Gleason grade assigned by a genitourinary-specialist pathologist to obtain the ground-truth label for that slide.
In the paper, we show that our DLS achieved an overall accuracy of 70%, compared to an average accuracy of 61% achieved by US board-certified general pathologists in our study. Of 10 high-performing individual general pathologists who graded every slide in the validation set, the DLS was more accurate than 8. The DLS was also more accurate than the average pathologist at Gleason pattern quantitation. These improvements in Gleason grading translated into better clinical risk stratification: the DLS better identified patients at higher risk for disease recurrence after surgery than the average general pathologist, potentially enabling doctors to use this information to better match patients to therapy.
|Comparison of scoring performance of the DLS with pathologists. a: Accuracy of the DLS (in red) compared with the mean accuracy among a cohort-of-29 pathologists (in green). Error bars indicate 95% confidence intervals. b: Comparison of risk stratification provided by the DLS, the cohort-of-29 pathologists, and the genitourinary specialist pathologists. Patients are divided into low and high risk groups based on their Gleason grade group, where a larger separation between the Kaplan-Meier curves of these risk groups indicates better stratification.|
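For readers unfamiliar with the Kaplan-Meier curves mentioned in the caption, the estimator behind them is short enough to sketch. The code below is a generic textbook implementation, not the analysis code from the paper, and the function name `kaplan_meier` and the toy follow-up data are made up for illustration. At each time where a recurrence is observed, the running survival probability is multiplied by the fraction of at-risk patients who did not recur; censored patients (those whose follow-up ended without recurrence) simply leave the at-risk pool.

```python
def kaplan_meier(durations, events):
    """Kaplan-Meier estimate of recurrence-free survival.

    durations: follow-up time for each patient.
    events: True if recurrence was observed at that time, False if censored.
    Returns a list of (time, survival_probability) at each observed event time.
    """
    pairs = sorted(zip(durations, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        recurrences = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(pairs) and pairs[i][0] == t:
            if pairs[i][1]:
                recurrences += 1
            leaving += 1
            i += 1
        if recurrences:
            survival *= 1 - recurrences / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


if __name__ == "__main__":
    # Toy data only; these values are not from the study.
    low_risk = kaplan_meier([12, 30, 45, 60], [False, True, False, False])
    high_risk = kaplan_meier([6, 10, 18, 40], [True, True, False, True])
    print("low risk: ", low_risk)
    print("high risk:", high_risk)
```

Computing one curve per risk group and comparing how far apart they sit is the idea behind panel b: the wider the gap, the better the grading separates patients by outcome.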
We also found that the DLS was able to characterize tissue morphology that appeared to lie at the cusp of two Gleason patterns, which is one reason for the disagreements in Gleason grading observed between pathologists. This suggests the possibility of creating finer-grained “precision grading” of prostate cancer. While the clinical significance of these intermediate patterns (e.g., Gleason pattern 3.3 or 3.7) is not known, the increased precision of the DLS will enable further research into this interesting question.
|Assessing the region-level classification of the DLS. a: Annotations from 3 pathologists compared to DLS predictions. The pathologists show general concordance on the location and the extent of tumor areas, but poor agreement in classifying Gleason patterns. The DLS’s precision Gleason pattern for each region is represented by interpolating between the DLS’s prediction patterns for Gleason patterns 3 (green), 4 (yellow), and 5 (red). b: DLS prediction patterns compared to the distribution of pathologists’ Gleason pattern classifications on 41 million annotated image patches from the test dataset. On patches where pathologists are discordant, where the tissue is more likely to be on the cusp of two patterns, the DLS reflects this ambiguity in its prediction scores.|
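One plausible way to obtain an intermediate value such as 3.3 or 3.7 is to take a weighted average of the pattern numbers, using the model's prediction scores as weights. The sketch below assumes exactly that; the post does not spell out the interpolation formula, so both the averaging rule and the function names `precision_gleason_pattern` and `blend_color` are assumptions for illustration. The colors follow the caption (green for 3, yellow for 4, red for 5).

```python
# RGB colors named in the caption for Gleason patterns 3, 4, and 5.
PATTERN_COLORS = {
    3: (0, 255, 0),    # green
    4: (255, 255, 0),  # yellow
    5: (255, 0, 0),    # red
}


def precision_gleason_pattern(p3: float, p4: float, p5: float) -> float:
    """Score-weighted average of Gleason patterns 3, 4, and 5.

    p3, p4, p5: non-negative prediction scores for a tumor region
    (assumed input format). Scores are normalized before averaging.
    """
    total = p3 + p4 + p5
    if total <= 0:
        raise ValueError("at least one score must be positive")
    return (3 * p3 + 4 * p4 + 5 * p5) / total


def blend_color(p3: float, p4: float, p5: float) -> tuple:
    """Blend the three pattern colors using the same normalized weights."""
    total = p3 + p4 + p5
    if total <= 0:
        raise ValueError("at least one score must be positive")
    weights = {3: p3 / total, 4: p4 / total, 5: p5 / total}
    return tuple(
        round(sum(weights[k] * PATTERN_COLORS[k][channel] for k in (3, 4, 5)))
        for channel in range(3)
    )


if __name__ == "__main__":
    # A region the model considers mostly pattern 3 with some pattern 4.
    print(precision_gleason_pattern(0.7, 0.3, 0.0))
    print(blend_color(0.7, 0.3, 0.0))
```

Under this assumption, a region scored 70% pattern 3 and 30% pattern 4 lands near 3.3, which is how ambiguity between two patterns would show up as a value between the integer grades.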
While these initial results are encouraging, there is much more work to be done before systems like our DLS can be used to improve the care of prostate cancer patients. First, the accuracy of the model can be further improved with additional training data and should be validated on independent cohorts containing a larger and more diverse group of patients. In addition, we are actively working on refining our DLS to work on diagnostic needle core biopsies, which occur prior to the decision to undergo surgery and where Gleason grading therefore has a significantly greater impact on clinical decision-making. Further work will be needed to assess how best to integrate our DLS into the pathologist’s diagnostic workflow, and to measure the impact of such artificial intelligence-based assistance on the overall efficiency, accuracy, and prognostic ability of Gleason grading in clinical practice. Nonetheless, we are excited about the potential of technologies like this to significantly improve cancer diagnostics and patient care.
This work involved the efforts of a multidisciplinary team of software engineers, researchers, clinicians and logistics support staff. Key contributors to this project include Kunal Nagpal, Davis Foote, Yun Liu, Po-Hsuan (Cameron) Chen, Ellery Wulczyn, Fraser Tan, Niels Olson, Jenny L. Smith, Arash Mohtashamian, James H. Wren, Greg S. Corrado, Robert MacDonald, Lily H. Peng, Mahul B. Amin, Andrew J. Evans, Ankur R. Sangoi, Craig H. Mermel, Jason D. Hipp and Martin C. Stumpe. We would also like to thank Tim Hesterberg, Michael Howell, David Miller, Alvin Rajkomar, Benny Ayalew, Robert Nagle, Melissa Moran, Krishna Gadepalli, Aleksey Boyko, and Christopher Gammage. Lastly, this work would not have been possible without the aid of the pathologists who annotated data for this study.