The next time you’re on the Internet, type the word “dataset” into your browser. In half a second your selected search engine will provide you with a group of about 33 million results organized in a way you might find useful, if, that is, you had a reason to search for the word “dataset” without having been prompted to. You might perform a similar, though slightly more complicated, task by searching for Italian restaurants on Main Street that open for dinner at five.
The speed of such searches is incredible enough, but add the fact that the results are ranked according to your search engine’s sense of what’s important to you, and you’ve got something special, even if it’s something you’ve never thought about.
Not all of us working with datasets are lucky enough to have such tools to take for granted. Scientists, for instance, have seen an explosion in the number and type of data-collecting sensors being used in the field over the past decade. These sensors have turned what was a trickle of information in the not too distant past into a great deluge of data today. Many of these new sensors collect data continuously—terabytes of data, billions of records flowing into observational archives day in and day out. With this immense amount of information stored in diverse databases and thousands of datasets, it is becoming increasingly difficult for scientists to search for and find relevant observations. There is, however, help on the horizon.
“Data Near Here” is a technology that promises to help scientists perform ranked searches for data using geospatial, temporal, and observational-variable characteristics. Developed by Dr. David Maier, Professor of Computer Science in the Maseeh College of Engineering & Computer Science, and Veronika Megler, a doctoral student in the Department of Computer Science, it lets scientists who collect data from environmental sensors query their datasets with geospatial, temporal, and variable terms and quickly get results ranked and scored against the parameters of the query. The process is not dissimilar to searching for that Italian restaurant, only in this case the searcher is a marine biologist or an oceanographer looking into an archive or database, and the result, instead of a list of Italian restaurants, is a hierarchical summary of historical observations.
So how did a couple of computer scientists get involved with creating a tool for ranked searches of data collected by microbiologists and oceanographers working in and around Oregon’s coast?
“We asked them what they needed,” Dr. Maier said.
Working with the Center for Coastal Margin Observation and Prediction (CMOP), a group that studies Oregon’s estuarine and coastal environment, Dr. Maier and Megler found that while CMOP had some tools for working with data, its scientists lacked a tool to help them pull relevant observations from the billions of historical records they had collected and stored in databases and file systems.
“If the scientists knew exactly what they were looking for, they could find it,” Megler said. “But they were wasting all this time hunting and pecking through their information to find it. And if they didn’t know exactly what was there, they’d never look. We thought, ‘clearly ideas from information retrieval apply here,’ so I sat down and began to think about whether it was possible to apply those ideas to scientific data.”
It is a research problem Megler and Dr. Maier think they’ve solved.
“It works,” Dr. Maier said, “because we scan each dataset in advance to produce a ‘footprint’ that summarizes its spatial and temporal extent, as well as the ranges of observational variables. Then when someone is actually conducting a search, we compare their search query to our collection of footprints, so that we don’t have to access the billions of individual observations.”
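The footprint idea Dr. Maier describes can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the Data Near Here implementation: the record layout, the `Footprint` class, and the function names are all hypothetical, and the sample values are made up. Each dataset is scanned once to record the minimum and maximum of every dimension, and a query is then compared against those small summaries rather than the underlying observations. For simplicity the sketch only filters by overlap, whereas the real tool ranks and scores its results.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

Range = Tuple[float, float]
# Hypothetical record layout: (latitude, longitude, time in years, salinity).
Observation = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Footprint:
    """Per-dataset summary: the (min, max) extent of each dimension."""
    name: str
    lat: Range
    lon: Range
    time: Range
    salinity: Range


def build_footprint(name: str, observations: Iterable[Observation]) -> Footprint:
    """Scan a dataset once, ahead of any search, and keep only its extents."""
    rows = list(observations)
    if not rows:
        raise ValueError("cannot summarize an empty dataset")
    lats, lons, times, sals = zip(*rows)
    return Footprint(
        name,
        (min(lats), max(lats)),
        (min(lons), max(lons)),
        (min(times), max(times)),
        (min(sals), max(sals)),
    )


def ranges_overlap(a: Range, b: Range) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def matching_datasets(
    footprints: Iterable[Footprint], lat: Range, lon: Range, time: Range
) -> List[str]:
    """Compare the query to footprints only; raw observations are never read."""
    return [
        fp.name
        for fp in footprints
        if ranges_overlap(fp.lat, lat)
        and ranges_overlap(fp.lon, lon)
        and ranges_overlap(fp.time, time)
    ]


# Made-up sample data, for illustration only.
estuary = build_footprint(
    "estuary_2009",
    [(46.19, -123.85, 2009.1, 12.0), (46.24, -123.80, 2009.9, 18.5)],
)
offshore = build_footprint(
    "offshore_2011",
    [(45.00, -125.20, 2011.2, 32.1), (45.10, -125.00, 2011.8, 33.0)],
)
hits = matching_datasets(
    [estuary, offshore], lat=(46.0, 46.5), lon=(-124.0, -123.5), time=(2009.0, 2010.0)
)
print(hits)
```

The point of the design is that the expensive scan happens once per dataset, while each search touches only one small summary per dataset, however many observations sit behind it.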
Scientists at CMOP can now comb through their databases using a compound query, with geospatial, temporal, and variable terms that will provide them with a set of results ranked by parameters they themselves set down in their search.
“My approach is to use the size of the query to define how important the parameters were,” Megler said. “If you’re a scientist examining a particular location over a period of a decade, the Astoria Bridge, for example, you could input those parameters into the query. You’d get the dataset for the decade at the top of the list and sets by years, months, weeks, or days further down the list. The system lets users handle variations in their searches smoothly.”
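One way to picture ranking by query size is the Python sketch below. The scoring formula is an assumption chosen for illustration, not the published Data Near Here method, and every name and value in it is hypothetical. Each dimension is measured in units of the query's own width: a dataset that covers the whole query range scores 1, one that covers a tenth of it scores 0.1, and one that falls outside the range is penalized by its distance, again counted in query widths.

```python
from typing import Dict, List, Tuple

Range = Tuple[float, float]


def range_score(query: Range, data: Range) -> float:
    """Score one dimension on a scale set by the width of the query range.

    A positive value is the fraction of the query range the dataset covers.
    A negative value is the gap between the two ranges, in query widths,
    limited to -1 so that one distant dimension cannot dominate the total.
    """
    q_lo, q_hi = query
    d_lo, d_hi = data
    width = q_hi - q_lo
    if width <= 0:
        raise ValueError("query range must have positive width")
    overlap = min(q_hi, d_hi) - max(q_lo, d_lo)
    return max(-1.0, overlap / width)


def score(query: Dict[str, Range], footprint: Dict[str, Range]) -> float:
    """Average the per-dimension scores over the dimensions in the query."""
    return sum(range_score(query[k], footprint[k]) for k in query) / len(query)


def rank(
    query: Dict[str, Range], footprints: Dict[str, Dict[str, Range]]
) -> List[Tuple[str, float]]:
    """Return (dataset name, score) pairs, best match first."""
    scored = [(name, score(query, fp)) for name, fp in footprints.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up footprints with time in years, echoing the decade example above.
footprints = {
    "one_month": {"time": (2004.0, 2004.0 + 1 / 12)},
    "one_year": {"time": (2004.0, 2005.0)},
    "earlier_survey": {"time": (1980.0, 1985.0)},
    "full_decade": {"time": (2000.0, 2010.0)},
}
ranking = rank({"time": (2000.0, 2010.0)}, footprints)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```

With a decade-long query, the decade-long dataset comes first, the year and month datasets follow in that order, and the survey that ended long before the query window comes last.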
Dr. Maier and Megler imagine a future catalog of templates that could be plugged into the architecture of their ranked search tool, opening it up to other fields in which scientists and researchers have to locate particular datasets in a massive repository.
“It’s simple to extend the search tool to other kinds of spatial/temporal datasets,” Dr. Maier said. “I’ve thought about applying the tool to anatomical data, brain scans, gene expression, and other datasets that express a similar pattern.”
“Right now we’re focused on geospatial and temporal searches because they’re of great interest to scientists,” Megler said. “Space and time are where we started from, but they’re not the limit of where we can go.”
Living, as we do, during a time that’s been called “The Data Deluge,” we need innovators like Dr. Maier and Ms. Megler to help researchers perform their research, and help us sort through – and make sense of – the mountains of data we collect. As Megler points out, too much information “is a constraint on scientific discovery.” However, as is clear from the example of this new technology, constraints can often lead to research and innovation, and with an award Dr. Maier recently received, the work with CMOP will continue for another five years. This is why we’re glad Megler and Dr. Maier are out there asking questions like ‘what is it that you need?’
For more information about the Center for Coastal Margin Observation and Prediction, visit their website at www.stccmop.org.