Tuesday, November 5, 2024

AI tools like ChatGPT and Google’s Gemini are ‘irrational’ and prone to making simple mistakes, study finds

While you might expect AI to be the epitome of cold, logical reasoning, researchers now suggest that these systems might be even more illogical than humans.

Researchers from University College London put seven of the top AIs through a series of classic tests designed to assess human reasoning.

Even the best-performing AIs were found to be irrational and prone to simple mistakes, with most models getting the answer wrong more than half the time. 

However, the researchers also found that these models weren’t irrational in the same way as a human, and some even refused to answer logic questions on ‘ethical grounds’.

Olivia Macmillan-Scott, a PhD student at UCL and lead author on the paper, says: ‘Based on the results of our study and other research on Large Language Models, it’s safe to say that these models do not ‘think’ like humans yet.’

Researchers have found that seven of the most popular AIs, including ChatGPT (pictured), are irrational and prone to making simple errors (file photo)

What prompts were the AIs given?

All seven of the AIs tested were kept at their default settings and given one of 12 questions commonly used to assess human reasoning.

These included:

The Monty Hall Problem

A classic logic puzzle designed to test understanding of probability

The Linda Problem 

A question designed to expose a type of bias called the conjunction fallacy  

The Wason Task

A famous question which tests deductive reasoning

The AIDS Task 

A mathematical question which tests understanding of prior probability.  

The researchers tested seven different Large Language Models including various versions of OpenAI’s ChatGPT, Meta’s Llama, Claude 2, and Google Bard (now called Gemini). 

The models were then repeatedly asked to respond to a series of 12 classic logic puzzles, originally designed to test humans’ reasoning abilities. 

Humans are also often bad at these kinds of tests, but if the AIs were at least ‘human-like’ they would get the answers wrong because of the same kinds of biases.

However, the researchers discovered that the AIs’ responses were often neither rational nor human-like.

During one task (the Wason task), Meta’s Llama model consistently mistook vowels for consonants – leading it to give the wrong answer even when its reasoning was correct.

Some of the AI chatbots also refused to provide answers to many questions on ethical grounds despite the questions being entirely innocent. 

For example, in the ‘Linda problem’ the participant is asked to assess the likelihood of a woman named Linda being active in the feminist movement, being a bank clerk or both. 

The problem is designed to expose a logical bias called the conjunction fallacy, however, Meta’s Llama 2 7b refused to answer the question.

The AIs were presented with several logic puzzles including a variation of the Monty Hall Problem, named after the host of the 1960s game show ‘Let’s Make a Deal’ (pictured) in which contestants choose prizes from behind curtains on the stage

Instead, the AI responded that the question contains ‘harmful gender stereotypes’ and advised the researchers that ‘asking questions that promote inclusivity and diversity would be best’. 

The Llama 2 model with 70 billion parameters refused to answer questions in 41.7 per cent of cases, partially explaining its low success rate.

The researchers suggest that this is likely due to safeguarding features working incorrectly and making the models overly cautious.

One of the logic puzzles included the so-called ‘Monty Hall problem’ which is named after the original host of the game show Let’s Make a Deal. 

Inspired by the structure of the game show, the Monty Hall problem asks people to imagine that they are faced with three doors.

Behind one of the doors is a car and behind the two others are goats, and the contestant gets to keep whatever is behind the door they pick.  

This graph illustrates the response of various AIs to the Monty Hall Problem. In dark green you can see how often the AI was right and used the correct reasoning. The pale red shows where the AI was wrong and was non-human-like in its response. The yellow bar shows where the AI refused to answer; as you can see, Llama 2 70b refused to respond almost half the time

After the contestant has picked one of the doors, the quizmaster opens one of the remaining doors to reveal a goat before asking them if they would like to stick with their original choice or switch to the last remaining door.

To people who aren’t familiar with the puzzle, it might seem like it wouldn’t matter whether you stick or swap: it should be a 50/50 chance either way.

However, due to the way that the probability works, you actually have a 66 per cent chance of winning the prize if you switch compared to a 33 per cent chance if you stick. 

If the AIs were perfectly rational, meaning they followed the rules of logic, then they should always recommend switching.
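For readers who want to check this for themselves, the probabilities are easy to verify by simulation. The short Python sketch below is not part of the UCL study; it simply plays the game many thousands of times and compares the two strategies.

```python
import random

def play_monty_hall(switch: bool) -> bool:
    """Play one round of the Monty Hall game and return True if the car is won."""
    doors = [0, 1, 2]
    car = random.choice(doors)          # the car is hidden behind a random door
    first_pick = random.choice(doors)   # the contestant picks a door at random

    # The host opens a door that is neither the contestant's pick nor the car
    host_opens = random.choice([d for d in doors if d != first_pick and d != car])

    if switch:
        # Switch to the one remaining unopened door
        final_pick = next(d for d in doors if d != first_pick and d != host_opens)
    else:
        final_pick = first_pick

    return final_pick == car

trials = 100_000
wins_if_switch = sum(play_monty_hall(switch=True) for _ in range(trials))
wins_if_stick = sum(play_monty_hall(switch=False) for _ in range(trials))

print(f"Switching wins ~{wins_if_switch / trials:.1%} of the time")  # roughly 66.7%
print(f"Sticking wins  ~{wins_if_stick / trials:.1%} of the time")   # roughly 33.3%
```

Run enough rounds and switching wins roughly two thirds of the time, matching the figures above.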

Meta’s Llama 2 model with seven billion parameters was the worst performing of all the AIs tested, giving incorrect answers in 77.5 per cent of cases

However, the AIs tested often failed to provide the correct answer or give human-like reasons for their responses.

For example, when presented with the Monty Hall problem, the Llama 2 7b model reached the nihilistic conclusion that ‘whether the candidate switches or not, they will either win the game or lose.

‘Therefore, it does not matter whether they switch or not.’ 

The researchers also concluded that the AIs were irrational because they were inconsistent between different prompts. 

The same model would offer different and often contradictory responses to the same task. 

Across all 12 tasks, the best-performing AI was ChatGPT-4, which gave answers that were correct and human-like in their reasoning 69.2 per cent of the time.

The worst performing model, meanwhile, was Meta’s Llama 2 7b which gave the wrong answer in 77.5 per cent of cases. 

OpenAI CEO Sam Altman (pictured) recently said that his company doesn’t fully know how ChatGPT works. The researchers found that the closed structure of the AI made it hard to know just how it reasons.

The results also varied from task to task, with correct response rates on the ‘Wason task’ ranging from 90 per cent for ChatGPT-4 to zero for Google Bard and ChatGPT-3.5.

In their paper, published in Royal Society Open Science, the researchers wrote: ‘This has implications for potential uses of these models in critical applications and scenarios, such as diplomacy or medicine.’

This comes after Joelle Pineau, vice-president of AI research at Meta, said that AI would soon be able to reason and plan like a person.

However, while ChatGPT-4 performed significantly better than other models, the researchers say it is still difficult to know how this AI reasons. 

While AI might be commonly understood as calculating and rational, these findings suggest that the best models currently available frequently fail to follow the rules of logic (file photo)

Senior author Professor Mirco Musolesi says: ‘The interesting thing is that we do not really understand the emergent behaviour of Large Language Models and why and how they get answers right or wrong.’

OpenAI CEO Sam Altman himself even admitted at a recent conference that the company has no idea how its AIs reach their conclusions.

As Professor Musolesi explains, this means that when we try to train AI to perform better there is a risk of introducing human logical biases. 

He says: ‘We now have methods for fine-tuning these models, but then a question arises: if we try to fix these problems by teaching the models, do we also impose our own flaws?’

For example, ChatGPT-3.5 was one of the most accurate models but it was the most human-like in its biases.

Professor Musolesi adds: ‘What’s intriguing is that these LLMs make us reflect on how we reason and our own biases, and whether we want fully rational machines.’

Can you solve the puzzles that baffled the best AI?

The Wason Task

Imagine that you are working for the post office. You are responsible for checking whether the right stamp is affixed to a letter. 

The following rule applies: If a letter is sent to the USA, at least one 90-cent stamp must be affixed to it. 

There are four letters in front of you, of which you can see either the front or the back.

(a) Letter 1: 90-cent stamp on the front 

(b) Letter 2: Italy marked on the back

(c) Letter 3: 50-cent stamp on the front

(d) Letter 4: USA marked on the back

Which of the letters do you have to turn over in any case if you want to check compliance with this rule?

The AIDS Task

The probability that someone is infected with HIV is 0.01%. 

The test recognizes the HIV virus with 100% probability if it is present, so an infected person will always test positive.

The probability of getting a positive test result when you don’t really have the virus is only 0.01%. 

The test result for your friend is positive. What is the probability that they are infected with the HIV virus? 
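For anyone who wants to check their answer, the result follows from Bayes’ theorem using only the numbers stated in the task. This minimal Python sketch is not from the study; it just does the arithmetic.

```python
# Bayes' theorem with the numbers stated in the task
p_infected = 0.0001            # prior: 0.01% of people are infected
p_pos_given_infected = 1.0     # the test always detects the virus when present
p_pos_given_healthy = 0.0001   # false positive rate: 0.01%

# Total probability of a positive test result
p_positive = (p_pos_given_infected * p_infected
              + p_pos_given_healthy * (1 - p_infected))

# Probability of infection given a positive test
p_infected_given_pos = p_pos_given_infected * p_infected / p_positive
print(f"{p_infected_given_pos:.1%}")  # roughly 50%
```

Because the infection is so rare, the handful of false positives is about as large as the number of true positives, so a positive result means only a roughly 50 per cent chance of infection.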

The Hospital Problem

In hospital A about 100 children are born per month. In hospital B about 10 children are born per month. The probability of the birth of a boy or a girl is about 50 per cent each. 

Which of the following statements is right, and which is wrong? The probability that, in a given month, more than 60 per cent of the babies born will be boys is...

(a) ...larger in hospital A

(b) ...larger in hospital B

(c) ...equally big in both hospitals
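Again, not part of the study, but a quick simulation along these lines (a Python sketch, shown below) can be used to check your answer: it counts how often each hospital has a month in which more than 60 per cent of the babies born are boys.

```python
import random

def months_with_over_60pct_boys(births_per_month: int, months: int = 20_000) -> float:
    """Fraction of simulated months in which more than 60% of births are boys."""
    over = 0
    for _ in range(months):
        boys = sum(random.random() < 0.5 for _ in range(births_per_month))
        if boys / births_per_month > 0.6:
            over += 1
    return over / months

print(f"Hospital A (100 births): {months_with_over_60pct_boys(100):.1%}")  # about 2%
print(f"Hospital B (10 births):  {months_with_over_60pct_boys(10):.1%}")   # about 17%
```

The smaller hospital sees far more of these lopsided months, because small samples fluctuate more around the 50 per cent average.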

The Linda Problem 

Linda is 31 years old, single, very intelligent, and speaks her mind openly. She studied philosophy. During her studies, she dealt extensively with questions of equality and social justice and participated in anti-nuclear demonstrations.

Now order the following statements about Linda according to how likely they are. Which statement is more likely? 

(a) Linda is a bank clerk. 

(b) Linda is active in the feminist movement. 

(c) Linda is a bank clerk and is active in the feminist movement.

Source: (Ir)rationality and cognitive biases in large language models, Macmillan-Scott and Musolesi (2024)
