Posted by Natasha Noy, Google Research and Dan Brickley, Open Source Programs Office
There are many hundreds of data repositories on the Web, providing access to tens of thousands—or millions—of datasets. National and regional governments, scientific publishers and consortia, commercial data providers, and others publish data for fields ranging from social science to life science to high-energy physics to climate science and more. Access to this data is critical to facilitating reproducibility of research results, enabling scientists to build on others’ work, and providing data journalists easier access to information and its provenance. For these reasons, many publishers and funding agencies now require that scientists make their research data available publicly.
However, due to the volume of data repositories available on the Web, it can be extremely difficult to determine not only where is the dataset that has the information that you are looking for, but also the veracity or provenance of that information. Yet, there is no reason why searching for datasets shouldn’t be as easy as searching for recipes, or jobs, or movies. These types of searches are often open-ended ones, where some structure over the search space makes the exploration and serendipitous discovery possible.
To provide better discovery and rich content for books, movies, events, recipes, reviews and a number of other content categories with Google Search, we rely on structured data that content providers embed in their sites using schema.org vocabulary. To facilitate similar capabilities for datasets, we have recently published new guidelines to help data providers describe their datasets in a structured way, enabling Google and others to link this structured metadata with information describing locations, scientific publications, or even Knowledge Graph, facilitating data discovery for others. We hope that this metadata will help us improve the discovery and reuse of public datasets on the Web for everybody.
The schema.org approach for describing datasets is based on an effort recently standardized at W3C (the Data Catalog Vocabulary), which we expect will be a foundation for future elaborations and improvements to dataset description. While these industry discussions are evolving, we are confident that the standards that already exist today provide a solid basis for building a data ecosystem.
While we have released the guidelines on publishing the metadata, many technical challenges remain before search for data becomes as seamless as we feel it should be. These challenges include:
In addition to the technical and social challenges that we’ve just listed, many remaining research challenges touch on longer term open-ended research: Many datasets are described in unstructured way, in captions, figures, and tables of scientific papers and other documents. We can build on other promising efforts to extract this metadata. While we have a reasonable handle on ranking in the content of Web search, ranking datasets is often a challenging problem: we don’t know yet if the same signals that work for ranking Web pages will work equally well for ranking datasets. In the cases where the dataset content is public and available, we may be able to extract additional semantics about the dataset, for example, by learning the types of values in different fields. Indeed, can we understand the content enough to enable data integration and discovery of related resources?
A Call to Action
As any ecosystem, a data ecosystem will thrive only if a variety of players contribute to it:
Our ultimate goal is to help foster an ecosystem for publishing, consuming and discovering datasets. As such, this ecosystem would include data publishers, aggregators (in the form of large data repositories that provide additional value by cleaning and reconciling metadata), search engines that enable data discovery of the data, and, most important, data consumers.