News

Wired: AI Can Pass Standardized Tests—But It Would Fail Preschool
Author: Melanie Mitchell
Posted: September 17, 2019

To read the original story, visit Wired.

Artificial intelligence researchers have long dreamed of building a computer as knowledgeable and communicative as the one in Star Trek, which could interact with humans in natural (i.e., human) language. Last week, we seemed to boldly go toward that ideal. The New York Times reported that a team at the Allen Institute for Artificial Intelligence (AI2) had achieved “an artificial-intelligence milestone.” AI2’s program, Aristo, not only passed but also excelled on a standardized eighth-grade science test. The machine, the Times heralded, “is ready for high school science. Maybe even college.”

Or maybe not. Aristo isn’t the first AI system to shine on a test designed to gauge human knowledge and reasoning abilities. In 2015 one system matched a 4-year-old’s performance on an IQ test, prompting the BBC headline “AI had IQ of four-year-old child.” Another group reported that its system could solve SAT geometry questions “as well as the average American 11th-grade student.” More recently, Stanford researchers created a question-answering test that prompted the New York Post to announce that “AI systems are beating humans in reading comprehension.” The truth is that while these systems perform well on specific language-processing tests, their skill extends no further than taking the test: none comes anywhere close to matching humans in reading comprehension or the other general abilities the tests were designed to measure.

The problem is that today’s machines, which excel at certain narrow tasks, still lack what we might call common sense. This includes the vast, and mostly unconscious, background knowledge that we use to understand the situations we encounter and the language we communicate with. Common sense also includes our ability to apply this knowledge quickly and flexibly to new circumstances.

The goal of endowing machines with common sense is as old as the field of AI itself, and is, I would venture, AI’s hardest open problem. Beginning in the 1990s, research on common sense took a back seat to statistical, data-driven AI approaches—especially in the form of neural networks and “deep learning.” But researchers have recently found that deep learning systems lack the robustness and generality of human learning, primarily because they lack our broad knowledge and flexible reasoning capabilities. Giving machines humanlike common sense is now at the top of AI’s to-do list.

Open-ended question-answering, like that of the Star Trek computer, is still too hard for current AI systems, so researchers make progress by creating programs that can perform well on “benchmarks”—particular data sets that represent a specific task. Aristo’s benchmark consists of a set of multiple-choice questions from the New York State Regents Exam in science. A sample question:

Which equipment will best separate a mixture of iron filings and black pepper?
(a) magnet (b) filter paper (c) triple-beam balance (d) voltmeter

Aristo’s creators believe that developing AI systems to answer such questions is one of the best ways to push the field forward. “While not a full test of machine intelligence,” they note, these questions “do explore several capabilities strongly associated with intelligence, including language understanding, reasoning, and use of common-sense knowledge.”

Aristo is a complicated system that combines several AI methods. However, the component that accounts for almost all of the system’s success is a deep neural network that has been trained to be a so-called language model—a mechanism that, given a sequence of words, can predict what the next word will be. “I was driving way too fast when I was stopped by the ...” What’s the next word? Maybe “police.” Probably not “grapefruit.” Given a sequence of words, a language model computes the probability that each of the hundreds of thousands of words in its vocabulary will be the next one in the sequence.
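The idea can be made concrete with a toy model. The sketch below is a simple bigram counter, not the deep neural network Aristo actually uses, and the three-sentence corpus is invented for illustration; but the interface is the same: given the preceding word(s), assign each candidate next word a probability.

```python
from collections import Counter, defaultdict

# Tiny invented corpus, purely for illustration.
corpus = (
    "i was stopped by the police . "
    "the police stopped the car . "
    "i was driving the car ."
).split()

# Count how often each word follows each other word (bigram counts).
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_probs(prev):
    """Estimate P(next word | previous word) from bigram counts."""
    c = counts[prev]
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

probs = next_word_probs("the")
# "police" gets substantial probability; "grapefruit" gets zero,
# because it never followed "the" in the training text.
print(probs.get("police", 0.0), probs.get("grapefruit", 0.0))
```

A real neural language model does the same job over a vocabulary of hundreds of thousands of words, with probabilities smoothed and generalized far beyond literal co-occurrence counts.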

Aristo’s language model was trained on word sequences from millions of documents (including all of English Wikipedia). After training with this vast collection of English, the neural network has presumably learned some useful things about language in general. At this point the network can be “fine-tuned” to learn to answer multiple-choice questions. When it takes the Regents exam, its input is the question plus the four possible answers; the output is the probability that each answer is correct. The network returns the highest-probability answer as its guess.

Aristo was tested on 119 questions from the eighth-grade exam and was correct on over 90 percent of them, a remarkable performance. It was also correct on over 83 percent of 12th-grade questions. While the Times reported that Aristo “passed the test,” the AI2 team noted that the actual tests New York students take include questions that refer to diagrams, as well as “direct answer” questions, neither of which Aristo was able to handle.

This is exciting progress, but we must keep in mind that a high score on a particular data set does not always mean that a machine has actually learned the task its human programmers intended. Sometimes the data used to train and test a learning system has subtle statistical patterns—I’ll call these giveaways—that allow the system to perform well without any real understanding or reasoning.

For example, one neural-network language model—similar to the one Aristo uses—was reported in 2019 to capably determine whether one sentence logically implies another. However, the reason for the high performance was not that the network understood the sentences or their connecting logic; rather, it relied on superficial syntactic properties such as how much the words in one sentence overlapped those in the second sentence. When the network was given sentences for which it could not take advantage of these syntactic properties, its performance plummeted.

Dozens of papers have been published over the past few years revealing the existence of subtle giveaways in benchmark data sets used to evaluate machine-learning systems. This has led some researchers to question the extent to which deep learning systems are exhibiting “true understanding” or merely responding to superficial cues in the data.

The Aristo team argued that its Regents exam questions are less likely to be vulnerable to such giveaways than the more commonly used crowdsourced question-answering data sets. They note that “many of the benchmark questions intuitively appear to require reasoning to answer” and that Aristo’s excellent performance “suggests that the machine has indeed learned something about language and the world, and how to manipulate that knowledge.”

But to what extent is reasoning, comprehension, or knowledge of science actually needed to answer these questions? For example, consider the sample question above. The Aristo team asserts, “To answer this kind of question robustly, it is not sufficient to understand magnetism. Aristo also needs to have some model of ‘black pepper’ and ‘mixture,’ because the answer would be different if the iron filings were submerged in a bottle of water.”

I’ll make a competing hypothesis: Given Aristo’s language model, no such knowledge or reasoning is needed to answer this specific question; instead, the language model will have captured statistical associations between words that allow it to answer the question without any real understanding whatsoever. To illustrate, consider the following four sentences.

1. Magnet will best separate a mixture of iron filings and black pepper.
2. Filter paper will best separate a mixture of iron filings and black pepper.
3. Triple-beam balance will best separate a mixture of iron filings and black pepper.
4. Voltmeter will best separate a mixture of iron filings and black pepper.

A language model can take each of these sentences as input, output the sentence’s “probability”—how well the sentence fits the word associations the model has learned—and choose the option with the highest probability. As a very rough simulation, I typed a version of each of these sentences into Google (making sure it found no exact matches) and looked at how many “hits” each received. Indeed, the sentence beginning with “magnet” got the most hits. My crude language model answered the question correctly without any intelligence other than word associations on the web.

I tried this same experiment with other randomly chosen questions from the Regents exam and found that the correct answer received the most hits in six out of 10 cases. My Googling experiment is just an illustration, not meant to be scientific, but it does agree pretty well with the score the Aristo team itself reported for “baseline retrieval methods.” It’s far less than 90 percent, but it highlights that there are “giveaways” that can boost a learning system’s performance without requiring any knowledge or reasoning at all. Moreover, this may be only the tip of the iceberg of the subtle giveaways that a machine-learning system could use to choose an answer.

Neural networks are notoriously opaque; it’s typically very hard to tease out exactly what they’ve learned. It may be that Aristo’s impressive performance is actually due to an ability to extract and reason about scientific concepts. But given the history of natural-language processing systems that exploit giveaways and are “right for the wrong reasons,” it’s essential to more fully probe these claims. The Aristo team itself offered one telling step in this direction: They performed an experiment in which they added four additional incorrect answers to each question, specifically choosing new answers that might confuse the system. Aristo’s performance dropped to less than 60 percent correct. Probing the weakness of one’s own AI system is essential to making progress on these very hard problems.

True understanding of human language requires extensive background knowledge and mental models that allow flexible reasoning. Developing systems with such understanding remains the hardest problem in AI. Notably, the US Defense Advanced Research Projects Agency has begun pouring money into research on machine common sense. One of Darpa’s challenge problems is to develop an AI system with the common sense of an 18-month-old child—something the field seems quite far from achieving. Rather than being ready for high school or college, AI has a lot of growing to do before it’s even ready for preschool.