[This is a transcript with links to references.]

Artificial Intelligence. It seems like lately everyone talks about it, everywhere, all at once. But is artificial intelligence really intelligent? What do we even mean by intelligent? If it’s not yet intelligent, how would we find out if it were to become intelligent? That’s what we’ll talk about today.

Paul Graham, computer scientist, writer, and investor, recently wrote that what worries him about artificial intelligence is that “Few things are harder to predict than the ways in which someone much smarter than you might outsmart you.”

Sabine Hossenfelder, theoretical physicist, writer, and involuntary comedian, knows from personal experience that you don’t have to be smart to appear smart, and that worries her much more.

When it comes to AI, the question of consciousness receives all the attention, but intelligence is much more difficult to define. For example, few of my colleagues would doubt I’m conscious, but opinions on my intelligence diverge.

While there is no agreed-upon definition of intelligence, there are two ingredients that I believe most of us would agree are aspects of it.

First, there’s the ability to solve a large variety of problems, especially new ones. This requires knowledge transfer, creativity, and the ability to learn from mistakes.

Second, there is the ability to think abstractly, to understand concepts and their relations, and deduce new properties. This relies heavily on the use of logic and reasoning.  

The relation between knowledge and intelligence is especially difficult to untangle. This is well illustrated by a 1980 thought experiment from the philosopher John Searle, known as the Chinese Room.

Searle asks us to imagine we’re in a room with a book that contains instructions for responding to Chinese text. From outside the room, people send in pieces of paper with Chinese characters, because that’s totally a thing people do. We use our book to figure out how to respond to these messages and return an answer. The person outside might come away thinking there’s someone in the room who understands Chinese, even though that isn’t so.

Searle used this thought experiment to argue that a piece of software doesn’t really understand what it’s doing. I explained in my earlier video why that’s a bad analogy for artificial intelligence. But while the Chinese Room doesn’t tell us much about understanding, it illustrates nicely the difference between knowledge and intelligence.

In the Chinese Room, we with our instruction book undoubtedly have knowledge. But we’re not intelligent. We can’t solve any new problem. We can’t transfer knowledge. We have no capacity for abstract thought or logical reasoning. We’re just following instructions. But as long as the person outside is just asking for knowledge, they won’t be able to tell whether we’re intelligent.
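
Just how little machinery this takes is easy to see in code. Here’s a toy sketch of the room as a lookup table in Python; the messages and canned replies are of course invented for illustration.

```python
# Toy sketch of the Chinese Room: the "rule book" is a plain
# dictionary, and answering is blind lookup. No understanding needed.
RULE_BOOK = {
    "你好": "你好！",                  # "Hello" -> "Hello!"
    "今天天气怎么样？": "天气很好。",  # "How's the weather?" -> "It's nice."
}

def chinese_room(message: str) -> str:
    """Reply by following the book; anything not in it gets a shrug."""
    return RULE_BOOK.get(message, "？")

print(chinese_room("你好"))      # looks fluent from outside the room
print(chinese_room("为什么？"))  # "Why?" isn't in the book -> "？"
```

Plenty of knowledge in the book, no intelligence in the lookup.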

Ok, so we have a vague intuitive idea for what it means to be intelligent, but can we make this more precise? If you can’t measure it, does it even exist?

Trying to measure intelligence is not a new idea. The first attempt is usually attributed to Sir Francis Galton, a British polymath and cousin of Charles Darwin. But Galton’s work on intelligence focused on sensory and perceptual tasks. He developed tests for skills such as reaction time and visual acuity. A good start, I guess, but it conflates cognitive abilities with physical ones.

The next step was taken by Charles Spearman at the beginning of the 20th century. He introduced the g-factor, a measure for what he called general intelligence, as opposed to specific factors, which might be talents in particular domains. Spearman wasn’t trying to say you’re dumb if you don’t know everything. It was rather that he thought everyone is a genius in some area, but it takes suitable tests to figure out just what that area is.
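
Spearman got to his g-factor with factor analysis on the correlations between different test scores. As a rough illustration of the idea, here’s a toy Python sketch that uses the first principal component of synthetic subtest scores as a simple stand-in; the numbers are random, not real test data.

```python
# Toy "general factor" extraction: generate correlated subtest scores
# from a latent ability, then recover it from the correlation matrix.
import numpy as np

rng = np.random.default_rng(0)
g_true = rng.normal(size=500)                     # latent "general ability"
noise = rng.normal(size=(500, 4))
subtests = g_true[:, None] * [0.8, 0.7, 0.6, 0.5] + 0.5 * noise

# Standardize, then take the leading eigenvector of the correlations.
z = (subtests - subtests.mean(0)) / subtests.std(0)
corr = np.corrcoef(z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)           # ascending order
g_estimate = z @ eigvecs[:, -1]                   # first-component scores

print(np.corrcoef(g_true, g_estimate)[0, 1])      # close to +/- 1
```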

Shortly after that, but unrelated to Spearman’s work, the psychologists Alfred Binet and Théodore Simon took on the task of developing a test to quantify intellectual ability in children. It was an assignment by the French government with the aim of identifying children who needed special support. The Binet-Simon test first came out in 1905. It included comprehension questions, for example about the content of short stories or the meaning of words, along with tasks like counting backwards, memorizing images, and stuff like this.

Similar tests are still used in some places to assess whether children are old enough to start primary school. In a story that my mum never tires of telling, I failed my primary school assessment because I refused to hop on one leg. So please forgive me if I’m somewhat cynical about assessment tests.

The idea of the Intelligence Quotient, IQ for short, goes back to these early ideas from Spearman, Binet, and Simon. If you want to take an IQ test today, you could try to assemble a piece of furniture from a certain Swedish retailer. But the standard way to do it is to see a psychologist and have them walk you through a certified test. There are several institutions that regularly update these tests and sell the most recent version to licensed practitioners. These tests are not meant to be taken on your own, though you can find similar versions online.

The most commonly used IQ test might be the Wechsler Intelligence Scale, which comes in versions for children and adults. It measures intellectual ability with several subtests that assess verbal comprehension, perceptual reasoning, working memory, mathematical ability, and processing speed.

One big disadvantage of this test, however, is that verbal comprehension strongly depends on what language you’ve grown up with and what education you’ve gone through. In most cases it’s a good proxy for general intelligence, but some people fall through the cracks because of their social and cultural background.

A popular alternative is therefore Raven’s Progressive Matrices, a non-verbal test of abstract reasoning skills that works by completing patterns. There’s also the Cattell Culture Fair Test, which likewise relies on abstract patterns and was specifically designed to minimize cultural bias.

The scores for these tests are calculated by recruiting a sample group, setting their average score to one hundred points, and one standard deviation to fifteen points. This means that about sixty-eight percent of people will score between 85 and 115 points.
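
In code, that normalization is a two-liner: standardize a raw score against the norming sample, then rescale to mean 100 and standard deviation 15. Here’s a minimal Python sketch with invented raw scores; real tests additionally norm within age bands and map through percentile tables, which I’m skipping here.

```python
import numpy as np

# Invented raw scores of a hypothetical norming sample.
norming_sample = np.array([31, 42, 38, 45, 36, 40, 44, 33, 39, 41])

def iq_score(raw_score, sample=norming_sample):
    """Standardize against the sample, rescale to mean 100, SD 15."""
    z = (raw_score - sample.mean()) / sample.std()
    return 100 + 15 * z

print(round(iq_score(45)))  # above the sample average -> above 100
print(round(iq_score(39)))  # near the sample average  -> close to 100
```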

One tricky point is that, since IQ is defined relative to a sample group, the average IQ of the entire population that the sample supposedly represents will generally not be exactly 100. More generally, the IQ of any group depends on the comparison sample. This is why, in international comparisons, the average IQ of people in different countries can noticeably deviate from 100: the score of 100 was pinned to a global reference sample, so individual nations can come out higher or lower.

In a recent global comparison, the highest-scoring country was Japan with an average IQ of 106, closely followed by Taiwan. Germany came in 10th with almost exactly 100 on average, so we’re world-best at being average if nothing else. The UK came in 20th with 99, and the United States 28th with 97.5. At the bottom of the list, you find countries where malnutrition is rampant and illiteracy is common, such as Nicaragua and Nepal.

This brings up the question of just what these IQ tests measure. Studies have shown that IQ changes with education, health, environment, and age. We also know that it has a strong heritable component, but it can fluctuate wildly. According to a 2011 study by researchers from University College London, the IQ of teenagers changed by as much as 21 points over four years, and that was only on the days when the teenagers could find their brain to begin with. There is also some evidence that cognitive training can improve parts of the performance on IQ tests, such as working memory and verbal abilities.

So in the end, what does the IQ tell us? As the psychologist Edwin Boring put it back in 1923, “Intelligence is what the tests test.” Basically the psychologist’s version of “shut up and calculate.” And this is fine if what you want is just a number to find out what it’s correlated with. But in the end, we don’t just want a number. We want to know what someone, or something, can do with this supposed intelligence. We want to know what we can expect from them. We want to know how far we can trust them not to be stupid.

And while measuring things is all well and fine, the idea that intelligence can be reduced to a single number is itself problematic. Of course, we’re not the first to point that out, which is why some smart people have thought of more intelligent ways of measuring intelligence.

One popular alternative to the IQ is the theory of multiple intelligences, proposed in 1983 by the developmental psychologist Howard Gardner. According to Gardner, the IQ is too narrow a measure for intelligence. He argued that intelligence is not a singular entity, but rather a collection of eight different ones: linguistic, logical-mathematical, musical, spatial, bodily-kinaesthetic, interpersonal, intrapersonal, and naturalistic intelligence.

While Gardner’s theory of multiple intelligences offers an appealingly inclusive view of intelligence, it’s been criticised for being subjective, lacking empirical evidence, and being in practice too difficult to use. You see, it’s a nice idea in principle to say there’s no easy way to measure intelligence, but in reality, it’s rather useless.

If you feel that eight types of intelligence are a little excessive, maybe Robert Sternberg’s triarchic theory of intelligence is more to your liking.

In his 1985 book “Beyond IQ”, Sternberg proposed that intelligence is not solely based on cognitive abilities, but also on practical intelligence and creative intelligence. By practical intelligence he means the ability to adapt to real-life situations, and creative intelligence is the capacity to generate novel and valuable ideas. But Sternberg’s triarchic theory has the same basic problem as Gardner’s eight intelligences: the definition is vague and in practice it’s difficult to assess, so it’s never been widely used.

Ok, now that we have some idea of what we mean by intelligence in humans, let’s talk about intelligent software. Computer scientists distinguish three different types of artificial intelligence.

First, there’s narrow AI, also known as weak AI. It’s designed to perform specific tasks with high proficiency. This includes AI systems that can recognize speech, play games, or identify objects in images. Every AI you have met so far was narrow AI.

Second, there’s the idea of general artificial intelligence, also referred to as strong AI. It’s supposed to have an intelligence comparable to that of humans, useful across a wide range of tasks and domains, except understanding English spelling which is something humans weren’t meant to understand.

Finally, there could be superintelligent AI, which is one that surpasses human intelligence in most if not all tasks. This brings up the interesting question of whether we’d notice if a superintelligent AI pretended to be dumb, so as not to scare us into shutting it down.

So how could we find out whether an AI is intelligent? Well, the most obvious thing you can do is to give it an IQ test. There are a few problems with this though.

The first problem is that the current IQ tests for humans test things such as working speed and memory, and even a dumb computer will easily outperform humans on these measures. This makes me wonder whether such tasks should ever have been part of an intelligence test. They arguably matter for cognitive function and are therefore correlated with intelligence, but are a good memory and fast processing really signs of intelligence on their own?

The second problem is that IQ tests for humans use different types of verbal and visual input, and at the moment few AIs are good at all of these tasks. But leaving aside memory and speed tests and focusing only on the tasks ChatGPT can do, Eka Roivainen, an assessment psychologist in Finland, gave everybody’s favourite AI the verbal part of the previously mentioned Wechsler test.

This includes questions about the meaning of words, or the similarity between words or phrases. It also includes questions of general knowledge, like who elects the president or what’s the capital of Iceland. ChatGPT scored 155.  

But as we’ve all learned in the past couple of months, ChatGPT’s knowledge is extremely narrow, even when it comes to verbal intelligence. A stunning example is that while it’s been trained on text, it doesn’t understand the idea of letters. If you ask it to write a paragraph on a particular topic, say, animals, that doesn’t contain, say, the letter “n”, it has no idea what to do, but doesn’t know that it doesn’t know what to do. It also can’t count the number of letters in a sentence, though it will still answer the question if you ask it to.  
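
Checks like these are trivial for a short program, which makes the failure all the more striking. A minimal sketch in Python:

```python
def avoids_letter(text: str, letter: str) -> bool:
    """True if the text contains no occurrence of the letter."""
    return letter.lower() not in text.lower()

def count_letters(text: str) -> int:
    """Count alphabetic characters; spaces and punctuation don't count."""
    return sum(ch.isalpha() for ch in text)

paragraph = "Cats and dogs are popular pets."
print(avoids_letter(paragraph, "n"))  # False: "and" contains an "n"
print(count_letters(paragraph))       # 25
```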

This tells us three things. First, the issue isn’t that an AI doesn’t know some things but rather that it doesn’t know what it doesn’t know. Second, in practice we often measure dumbness rather than intelligence. We look for negatives. We want to know where someone or something fails, and use that to assess its intelligence. Third, we use our own intelligence to search for these failures, which is an approach that will inevitably, eventually fail.  

The Turing test is based on this idea, that we can use our own intelligence to judge that of an AI. The Turing test is named after the British mathematician and computer scientist Alan Turing who proposed it in 1950. It relies on an operator asking questions to an artificial intelligence and a human and trying to figure out which is which. If the human evaluator can’t distinguish the machine’s responses from the human’s, the machine passes the test.
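
As a protocol, the test is simple enough to write down. Here’s a minimal sketch in Python, where human_respond and machine_respond are hypothetical callables mapping a question to an answer, and interrogator is a hypothetical object with ask and guess_machine methods; none of these exist anywhere, they just mark where the actual participants would plug in.

```python
import random

def turing_test(human_respond, machine_respond, interrogator, n_questions=10):
    """Referee loop for the imitation game (all participants are stubs)."""
    # Hide who is who behind randomly assigned labels A and B.
    labels = ["A", "B"]
    random.shuffle(labels)
    players = {labels[0]: human_respond, labels[1]: machine_respond}

    transcript = []
    for _ in range(n_questions):
        question = interrogator.ask(transcript)
        transcript.append({
            "question": question,
            "A": players["A"](question),
            "B": players["B"](question),
        })

    guess = interrogator.guess_machine(transcript)  # "A" or "B"
    return players[guess] is machine_respond        # True: machine identified
```

In Turing’s setup, the machine passes if, over many such runs, the interrogator’s guesses are right no more often than chance.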

There have been a few claims in the past 10 years that a chatbot has passed the Turing test, but those claims didn’t meet the best scientific standards. I suspect that someone’s going to stage a Turing test later this year just to make a point, and we’ll see AI pass it. However, the Turing test doesn’t so much prove that the machine is intelligent, but rather that it has the ability to pretend to be intelligent.

Another problem with the Turing test is that, well, it depends on how smart the person is who asks the questions. To take at least this issue out of the picture, computer scientists came up with the Winograd Schema Challenge for AI, named after Terry Winograd, a professor of computer science at Stanford University, who devised the original example.

The challenge involves a series of questions that are designed to be easy for humans to answer, but hard for computers. Each question is based on a sentence that contains an ambiguity, and the correct interpretation of the sentence requires common-sense reasoning about the context. Here’s an example:

“The city councilmen refused the demonstrators a permit because they feared violence.”

The challenge is to determine who “they” refers to in this sentence: the city councilmen or the demonstrators? Or maybe I should say that’s what the challenge “was”, because several AIs already passed the test in 2019, with more than 90 percent accuracy.
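
In case you’re wondering what such a test looks like to a machine, here’s a minimal sketch of how a Winograd-style benchmark might be stored and scored, assuming a hypothetical resolve function supplied by the AI under test. In the real challenge, each schema comes with a twin sentence where a single word swap, say “feared” to “advocated”, flips the answer.

```python
# One schema from the benchmark; its "twin" would swap "feared"
# for "advocated" and flip the answer to the demonstrators.
schemas = [
    {
        "sentence": ("The city councilmen refused the demonstrators "
                     "a permit because they feared violence."),
        "pronoun": "they",
        "candidates": ["the city councilmen", "the demonstrators"],
        "answer": "the city councilmen",
    },
]

def accuracy(resolve, schemas):
    """Fraction of schemas where resolve() picks the right referent.

    resolve(sentence, pronoun, candidates) -> candidate string is the
    hypothetical interface of the AI under test.
    """
    correct = sum(
        resolve(s["sentence"], s["pronoun"], s["candidates"]) == s["answer"]
        for s in schemas
    )
    return correct / len(schemas)
```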

But this test only looks for verbal ability. You can also test AIs based on the IQ tests used for human intelligence that we discussed earlier.

For example, Raven’s Progressive Matrices have been proposed for testing the visual abilities of AIs. In 2020, two German scientists developed an AI that was able to generate a fitting image from scratch. It solved an AI-adapted variant of the test with 98 percent accuracy. More recently, in March 2023, another group developed an AI that scored 87 percent accuracy on the regular test.

There are further tests that are specifically designed for AIs, such as variants of the Bongard Problems. This is a software-generated dataset of abstract sketches, some of which match a pattern and some of which don’t. The AI must decide whether new sketches match the pattern. In a 2020 study, researchers found that AIs reach 60 to 70 percent accuracy, whereas humans tend to reach more than 90 percent. So this one’s still somewhat of a challenge for AIs. Good news for our egos, I guess.
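
To give you an idea of the format, here’s a minimal sketch of how such a problem might be represented and scored, with the sketches boiled down to feature dictionaries; classify is a hypothetical function standing in for the AI under test.

```python
# One toy Bongard-style problem: the hidden rule is "has three sides".
problem = {
    "positive": [{"sides": 3, "filled": True}, {"sides": 3, "filled": False}],
    "negative": [{"sides": 4, "filled": True}, {"sides": 5, "filled": False}],
    "queries":  [({"sides": 3, "filled": True}, True),
                 ({"sides": 6, "filled": False}, False)],
}

def score(classify, problem):
    """Fraction of query items the AI labels correctly.

    classify(positive, negative, item) -> bool must infer the hidden
    rule from the examples and apply it to the new item.
    """
    hits = sum(
        classify(problem["positive"], problem["negative"], item) == label
        for item, label in problem["queries"]
    )
    return hits / len(problem["queries"])
```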

In summary, researchers are using a number of ways to measure how “intelligent” artificial intelligence is and how it progresses. AIs have always been ahead of humans in terms of memory and processing speed, and have recently rapidly caught up on verbal and visual tasks. While they’re still nowhere near humans in terms of general reasoning skills, I think it’s only a matter of time until they get there. And I agree with Paul Graham, we should worry less about consciousness and more about intelligence. Because of the two, intelligence is much more dangerous.   

How will we know when an AI becomes intelligent? Probably because it’ll start arguing with us about what we mean by intelligence.
