Google Wants to Do What?
Does it comfort you to know that it took countless hours of human ingenuity and expertise for the Google search engine to return hundreds of thousands of hits for your query about “best digital pens” in only 30-80 milliseconds?
The audience that turned out recently in Lipsett Amphitheater for the first talk in a new series on data science may have needed a search engine simply to decode the language of guest speaker Dr. Andrew Moore, dean of the School of Computer Science at Carnegie Mellon University.
He spoke of planet scale data systems, entity stores, decorated entities, fact stores, knowledge graphs, the ingestion of unstructured facts and the “architecturing” of big systems. It was a relief to hear him say, “We always need a human-in-the-loop to make sure the system isn’t screwing up or hallucinating.”
Moore, who may be presiding over the most esteemed computer science faculty in the United States, is also an émigré from Google; he once served as vice president of engineering at Google Pittsburgh, where he was responsible for the retail segment.
He came to NIH to learn how best to harness burgeoning mountains of medical data so that the public may one day be able to use search engines to get useful information about their health.
In computer parlance, that’s known as query-to-result. “I foresee being able to ask medical questions on behalf of ourselves, our friends and our families,” said Moore. “That’s what I’m most excited about.
“Thousands of human years of expertise have gone into optimizing search engines,” he said. The effort has relied on many inputs, including “taking advantage of the wisdom of crowds,” with a goal of better results pages.
Moore is not one of those people who quake at Moore’s (no relation) law—computer processors double in complexity about every 2 years—fearing that one day computers will be smarter than people. But he is able to explain to you quite clearly why that ad for a Gibson electric guitar keeps following you from page to page as you surf the Internet. And why two different people who plug in the exact same search terms on their PCs get different results.
People who use the web generate “click streams”—billions of data points that provide a context for their next search. There are machines out there that learn your tastes, because as a web cruiser, that’s all you’re doing—leaving hints. Advertisers, such as music stores, gobble that evidence up.
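To make the idea concrete, here is a minimal sketch of how a click stream could reorder the same results for two different people. It is an illustration only, not a description of how Google or any advertiser actually ranks pages; the `rerank` function, the topic labels and the sample results are all hypothetical.

```python
from collections import Counter

def rerank(results, click_stream):
    """Order results so topics the user clicked on most often come first."""
    interest = Counter(click_stream)  # topic -> number of past clicks
    # sorted() is stable, so results with equal interest keep their original order.
    return sorted(results, key=lambda r: -interest[r["topic"]])

# Hypothetical results returned for the same search terms.
results = [
    {"title": "Digital pen reviews", "topic": "office"},
    {"title": "Gibson guitars on sale", "topic": "music"},
    {"title": "Hiking boots guide", "topic": "outdoors"},
]

# Hypothetical click streams: the "hints" each user has left behind.
guitar_fan = ["music", "music", "office"]
hiker = ["outdoors", "outdoors", "outdoors"]

print([r["title"] for r in rerank(results, guitar_fan)])
print([r["title"] for r in rerank(results, hiker)])
```

In this toy version the guitar fan sees the Gibson listing first and the hiker sees the boots first, even though both typed the same query.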
But Moore thinks the world needs more than sophisticated ads, so he is now in academia, where a data scientist is free to wonder, “What is edible underwear? Is it clothing or food?” That’s an example of the need for a human in the loop, he said, and number 5 on his Top 5 topics in 21st century computing.
“The world is going to throw you all kinds of ambiguity,” he said, noting that the market for pet Halloween costumes now tops $100 million a year. “The challenge of ontology creation is that you can’t be static.”
Data scientists are fond of the term stack, and Moore’s “top of the stack” may explain why he visited NIH: “You cannot do data science in its own right. Some kind of action is needed, such as providing data that will influence clinical practice.”
Moore says the world is veering away from desktop query. “The end is in sight for that kind of use of the Internet,” he said. “We like handheld devices…There is a huge bet being made on the company side, where question-answering is crucial.” Apple’s Siri and Microsoft’s Cortana are examples of this trend. People will begin to use their devices to negotiate on their behalf, he predicted.
Moore said there is widespread academic work internationally in his field as applied to public health, and offered several examples.
A data scientist at the University of Pittsburgh has taken advantage of breakthroughs in the science of parsing facial expressions, he reported, which can yield a measure of predictability in people’s social affect. Imagine tracking a college population for signs of depression—you could get a snapshot of campus mood by monitoring the sound of laughter, or speed of students’ movement, all of which is predictive, Moore said.
Another researcher is mapping the eye movements of new readers to determine the origins of cognition.
But whether the field is health or the Amazon.com warehouse, the basic computing building block is the entity, or named entity, which becomes a node in a knowledge graph encompassing any concept from business, science, administration, entertainment—any topic.
“There is a massive hierarchy of them, and they are all interrelated,” Moore said.
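A knowledge graph of this kind can be sketched in a few lines. The example below is a toy under stated assumptions, not the design of any system Moore described: the `KnowledgeGraph` class, the `is_a` and `sold_by` relation names and the sample facts are all invented for illustration.

```python
from collections import defaultdict

class KnowledgeGraph:
    """Toy knowledge graph: entities are nodes, labeled relations are edges."""

    def __init__(self):
        # entity -> list of (relation, other entity)
        self.edges = defaultdict(list)

    def add_fact(self, subject, relation, obj):
        self.edges[subject].append((relation, obj))

    def ancestors(self, entity):
        """Follow 'is_a' edges upward and return every broader category."""
        found = []
        stack = [entity]
        while stack:
            current = stack.pop()
            for relation, other in self.edges.get(current, []):
                if relation == "is_a" and other not in found:
                    found.append(other)
                    stack.append(other)
        return found

graph = KnowledgeGraph()
graph.add_fact("Gibson electric guitar", "is_a", "electric guitar")
graph.add_fact("electric guitar", "is_a", "musical instrument")
graph.add_fact("Gibson electric guitar", "sold_by", "music store")
# An ambiguous entity can sit under more than one parent category.
graph.add_fact("pet Halloween costume", "is_a", "costume")
graph.add_fact("pet Halloween costume", "is_a", "pet supply")

print(graph.ancestors("Gibson electric guitar"))
print(graph.ancestors("pet Halloween costume"))
```

Letting an entity have several parents is one simple way to represent the kind of ambiguity Moore mentioned, where an item does not fit neatly into a single category.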
Valiant efforts have been made to build global entity stores, an undertaking that must be somewhat like the one faced by the first makers of dictionaries. Success stories include GIS, or geographic information systems, which are massive accumulations of fact that may one day be capable of guiding driverless cars; UMLS (Unified Medical Language System) codes, pioneered at NIH; Cyc, an enormous accumulation of facts cataloguing everyday items; Freebase, which Moore said is used in-house at Amazon; and Schema.org, which he said is the largest and most successful tagging solution for structured data.
“All of these, along with Facebook, Twitter and Yelp, are trying to become repositories of known human fact,” Moore said.
He finished his talk with warnings about a possible dystopian future in which there would be stock, single answers to questions, or massive invasions of privacy if people divulge too much personal information.
But overall, Moore seemed to be a good human to have in the loop as the future bears down on us; you can see for yourself at http://videocast.nih.gov/summary.asp?Live=17064&bhcp=1.