Jul 25 2016

A Tougher Turing Test

In 1950 Alan Turing, as a thought experiment, proposed a test for telling the difference between a human and an artificial intelligence (AI). If a person had an extensive conversation with the AI and could not tell it apart from a real person, that would be a good indication that the AI had human-like intelligence.

This process became known as the Turing Test, and every year various groups administer their version of the Turing Test to AI contestants. The test has limits, however, and is generally considered to be too easy. It is also dependent on the skills of the human questioner.

Parsing Language

A recent AI contest used a different approach, the Winograd Schema Challenge (WSC). This is one of many alternatives to the Turing Test being explored. Here is the format of the challenge:

  1. Two entities or sets of entities, not necessarily people or sentient beings, are mentioned in the sentences by noun phrases.
  2. A pronoun or possessive adjective is used to reference one of the parties (of the right sort so it can refer to either party).
  3. The question involves determining the referent of the pronoun.
  4. There is a special word that is mentioned in the sentence and possibly the question. When replaced with an alternate word, the answer changes although the question still makes sense (e.g., “big” can be changed to “small”; “feared” can be changed to “advocated”).

Here is an example question:

I. The trophy would not fit in the brown suitcase because it was too big (small). What was too big (small)?
Answer 0: the trophy
Answer 1: the suitcase
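To make the structure of a schema concrete, here is a minimal sketch of one as a data type. The field names and the `WinogradSchema` class are my own illustration, not the official WSC data format:

```python
from dataclasses import dataclass

@dataclass
class WinogradSchema:
    # Illustrative field names; not the official WSC data format.
    sentence: str        # contains the ambiguous pronoun
    special_word: str    # the word whose swap flips the answer
    alternate_word: str  # the swapped-in word
    candidates: tuple    # the two possible referents
    answer: int          # index of the referent with the special word
    answer_alt: int      # index of the referent with the alternate word

    def variants(self):
        """Yield (sentence, correct referent) for both word choices."""
        yield self.sentence, self.candidates[self.answer]
        yield (self.sentence.replace(self.special_word, self.alternate_word),
               self.candidates[self.answer_alt])

trophy = WinogradSchema(
    sentence="The trophy would not fit in the brown suitcase "
             "because it was too big.",
    special_word="big",
    alternate_word="small",
    candidates=("the trophy", "the suitcase"),
    answer=0,
    answer_alt=1,
)
```

Swapping a single word flips the correct referent while leaving the sentence grammatical, which is exactly what makes the test hard for statistical pattern matching: both readings look equally plausible on the surface.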

This is an interesting format, because language can be ambiguous and we need to think about context and common sense in order to make sense of it. Even intelligent and well-educated people will occasionally misinterpret what other people say or be temporarily confused because of an ambiguous sentence.

This is especially true of spoken language. When writing we tend to be more formal and careful. In conversations, however, people tend to jump around more, reference back to previous points without clarifying the shift in subject, and depend more on context.

Ambiguous statements can often be humorous, and sometimes people will deliberately exploit such ambiguity for humor (“When I nod my head, you hit it.”).

The WSC uses statements that most people should have no problem parsing based on context. It is obvious that the suitcase needs to be big enough to fit the trophy. We rely on general knowledge and common sense.

As an interesting aside, neurologists sometimes use similar complex or ambiguous statements to test patients for language ability or cognition.

Types of AI

It is often necessary to clarify what we mean by AI when we discuss it. AI does not necessarily refer only to self-aware conscious computers. It is essentially any software that mimics intelligence in a dynamic way – software that is interactive, will react to what the user is doing, or will learn from experience.

If you have ever played a modern video game, you have experienced AI.

One type of AI is the chat bot, a program that mimics human dialogue. You can speak with a chat bot and it will hold a conversation with you. This requires some understanding of language and a general knowledge base. These are the programs that have been taking the traditional Turing Test. They can now often fool humans in casual conversation, especially people who are not experienced with them or are not specifically trying to determine whether they are talking with a chat bot.

There are also expert systems, like IBM’s Watson, that are programmed with vast databases and can give contextual information as needed (like when playing Jeopardy).

Such systems, however, are all top-down, meaning that the software is not conscious and is not thinking as humans think. It does not have any common sense. Rather, the software is following a complex algorithm. What computers do well is store vast amounts of data accurately and, in the case of supercomputers, process it at very high speed.

These systems are very good at things like playing chess, or Go, which are games with finite rules that benefit from being able to process many possible moves and calculate outcomes.

What they are not good at is inferring new information or meaning based on context and common sense.

The bottom line is that computer hardware and software simply function differently than the human brain. We will not get to actual human-like consciousness by improving AI as it currently is – better algorithms, more data, faster processing. We will need to design computers that function differently, more like a vertebrate brain. We don’t really understand what that means, although we are making progress. We may duplicate the function of a human brain by simply copying it before we fully understand it.

How Did They Do?

So, how did the AI competitors do in the WSC? Horribly.

It was reported that random guessing would produce 45% correct. This is not fully explained, however. Why isn’t it 50%? The WSC page says there are only two possible answers, but then they also state that:

“Finally, a list of possible referents is given, labelled “A”, “B”, “C” …”

This implies that sometimes there are three choices. Using common sense inference, I conclude that some of the questions have three possible answers, so that random guessing produces the 45% correct.
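As a sanity check on that inference (this is my own arithmetic, not from the contest write-up): if a fraction f of the questions offer three choices and the rest offer two, random guessing scores (1 − f)/2 + f/3 on average. Setting that equal to 0.45 and solving for f:

```python
# Solve (1 - f)/2 + f/3 = 0.45 for f, the fraction of 3-choice questions.
# Rearranging: f * (1/2 - 1/3) = 1/2 - 0.45
f = (0.5 - 0.45) / (0.5 - 1/3)

# Check: plugging f back in should recover the reported chance rate.
expected_chance = (1 - f) / 2 + f / 3
```

This gives f = 0.3, so a mix of roughly 30% three-choice and 70% two-choice questions would produce the reported 45% chance rate.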

The best contestants scored 48% on 60 questions. This was described as slightly better than chance, which is true of the individual score, but it is probably not statistically different from chance (I would need to see the whole distribution of outcomes to tell). With the best performers at 48%, it is most likely that we are seeing a bell curve centered around 45%, meaning the contestants overall were performing at chance level.
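A quick back-of-the-envelope check supports that read. Treating the best score as a draw from a binomial distribution with the reported 45% chance rate (the 60-question count and 48% score are from the contest report; modeling the questions as independent guesses is my assumption), we can compute the probability of scoring that well or better by guessing alone:

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), the exact tail probability."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

n = 60                      # number of questions
p_chance = 0.45             # reported chance rate
correct = round(0.48 * n)   # best score, 48% of 60 -> 29 correct

# Probability of getting at least this many right by random guessing.
p_value = binom_sf(correct, n, p_chance)
```

The resulting tail probability is well above any conventional significance threshold, consistent with the top score being indistinguishable from lucky guessing.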

Either way, this was nowhere near the 90% correct threshold set by the contest. So, as far as we have come in AI, in chat bots and natural language algorithms, a test that requires context and common sense resulted in complete failure.

I suspect that, as a result of this contest, programmers will figure out new algorithms that will be progressively better at parsing statements with ambiguous pronouns. Once they crack this nut, someone will come up with a better Turing Test that requires a mental function not specifically tackled by the current algorithms.

Still, these AIs are mastering specific skills that will have useful applications. Being able to interact with humans using natural language is extremely useful. The technology is advancing as AIs get better and better. This is simply not consciousness technology.
