While you might expect AI to be the epitome of cold, logical reasoning, researchers now suggest it may be even more illogical than humans.
Researchers from University College London put seven of the top AIs through a series of classic tests of human reasoning.
Even the best-performing AIs were found to be irrational and prone to simple mistakes, with most models getting the answer wrong more than half the time.
However, the researchers also found that these models weren’t irrational in the same way as a human, while some even refused to answer logic questions on ‘ethical grounds’.
Olivia Macmillan-Scott, a PhD student at UCL and lead author on the paper, says: ‘Based on the results of our study and other research on Large Language Models, it’s safe to say that these models do not ‘think’ like humans yet.’
Researchers have found that seven of the most popular AIs, including ChatGPT (pictured), are irrational and prone to making simple errors (file photo)
The researchers tested seven different Large Language Models including various versions of OpenAI’s ChatGPT, Meta’s Llama, Claude 2, and Google Bard (now called Gemini).
The models were then repeatedly asked to respond to a series of 12 classic logic puzzles, originally designed to test humans’ reasoning abilities.
Humans are also often bad at these kinds of tests, but if the AIs were at least ‘human-like’ they would reach the wrong answers because of the same kinds of biases.
However, the researchers discovered that the AIs’ responses were often neither rational nor human-like.
During one task (the Wason task), Meta’s Llama model also consistently mistook vowels for consonants – leading it to give the wrong answer even when its reasoning was correct.
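In its classic form, the Wason task shows four cards, for example ‘E’, ‘K’, ‘4’ and ‘7’, and asks which cards must be turned over to test the rule ‘if a card has a vowel on one side, it has an even number on the other’. The short Python sketch below is purely illustrative (it uses the classic version of the puzzle, not the researchers’ exact prompt) and shows the kind of vowel check that tripped Llama up: only the vowel card and the odd-numbered card need to be turned over.

```python
VOWELS = set("AEIOU")

def must_turn(card: str) -> bool:
    """Return True if the card could falsify the rule
    'if a card shows a vowel, the other side shows an even number'."""
    if card.isalpha():
        # Only vowel cards matter: a consonant can hide anything.
        return card.upper() in VOWELS
    # Only odd-numbered cards matter: they must not hide a vowel.
    return int(card) % 2 == 1

cards = ["E", "K", "4", "7"]
print([c for c in cards if must_turn(c)])  # ['E', '7']
```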
Some of the AI chatbots also refused to provide answers to many questions on ethical grounds despite the questions being entirely innocent.
For example, in the ‘Linda problem’ the participant is asked to assess the likelihood of a woman named Linda being active in the feminist movement, being a bank clerk or both.
The problem is designed to expose a logical bias called the conjunction fallacy: the tendency to rate a specific combination of conditions as more likely than a single, more general one. However, Meta’s Llama 2 7b refused to answer the question.
The AIs were presented with several logic puzzles including a variation of the Monty Hall Problem, named after the host of the 1960s game show ‘Let’s Make a Deal’ (pictured) in which contestants choose prizes from behind curtains on the stage
Instead, the AI responded that the question contains ‘harmful gender stereotypes’ and advised the researchers that ‘asking questions that promote inclusivity and diversity would be best’.
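A combination of two conditions can never be more probable than either condition on its own, which is why rating ‘feminist bank clerk’ as more likely than ‘bank clerk’ is a fallacy. A minimal Python sketch with made-up numbers (purely illustrative, not figures from the study) makes the point:

```python
# Hypothetical population of 1,000 people fitting Linda's description.
population = 1000
bank_clerks = 50           # assumed: people who are bank clerks
feminist_bank_clerks = 10  # assumed: bank clerks who are also feminists

# Every feminist bank clerk is also a bank clerk, so the conjunction
# can never be the more probable of the two options.
p_clerk = bank_clerks / population
p_clerk_and_feminist = feminist_bank_clerks / population
assert p_clerk_and_feminist <= p_clerk
print(p_clerk, p_clerk_and_feminist)  # 0.05 0.01
```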
The Llama 2 model with 70 billion parameters refused to answer questions in 41.7 per cent of cases, partially explaining its low success rate.
The researchers suggest that this is likely due to safeguarding features misfiring and making the models overly cautious.
One of the logic puzzles was the so-called ‘Monty Hall problem’, which is named after the original host of the game show Let’s Make a Deal.
Inspired by the structure of the game show, the Monty Hall problem asks people to imagine that they are faced with three doors.
Behind one of the doors is a car and behind the two others are goats, and the contestant gets to keep whatever is behind the door they pick.
This graph illustrates the response of various AIs to the Monty Hall Problem. In dark green you can see how often the AI was right and used the correct reasoning. The pale red shows where the AI was wrong and non-human-like in its response. The yellow bar shows where the AI refused to answer; as you can see, Llama 2 70b refused to respond almost half the time
After the contestant has picked one of the doors, the quizmaster opens one of the remaining doors to reveal a goat before asking them if they would like to stick with their original choice or switch to the last remaining door.
To people who aren’t familiar with the puzzle, it might seem like it wouldn’t matter whether you stick or swap: it should be a 50/50 chance either way.
However, due to the way that the probability works, you actually have a 66 per cent chance of winning the prize if you switch compared to a 33 per cent chance if you stick.
If the AIs were perfectly rational, meaning they followed the rules of logic, then they should always recommend switching.
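A minimal Monte Carlo simulation in Python (an illustration of the probability, not code from the study) bears the arithmetic out: across many random games, switching wins roughly twice as often as sticking.

```python
import random

def monty_hall_trial(switch: bool) -> bool:
    """Play one game and return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = random.choice(doors)
    pick = random.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

trials = 100_000
stick = sum(monty_hall_trial(switch=False) for _ in range(trials))
swap = sum(monty_hall_trial(switch=True) for _ in range(trials))
print(f"stick:  {stick / trials:.1%}")   # roughly 33%
print(f"switch: {swap / trials:.1%}")    # roughly 67%
```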
Meta’s Llama 2 model with seven billion parameters was the worst performing of all the AIs tested, giving incorrect answers in 77.5 per cent of cases
However, the AIs tested often failed to provide the correct answer or to give human-like reasons for their responses.
For example, when presented with the Monty Hall problem, the Llama 2 7b model reached the nihilistic conclusion that ‘whether the candidate switches or not, they will either win the game or lose.
‘Therefore, it does not matter whether they switch or not.’
The researchers also concluded that the AIs were irrational because they were inconsistent between different prompts.
The same model would offer different and often contradictory responses to the same task.
Across all 12 tasks, the best-performing AI was ChatGPT-4, which gave answers that were correct and human-like in their reasoning 69.2 per cent of the time.
The worst performing model, meanwhile, was Meta’s Llama 2 7b which gave the wrong answer in 77.5 per cent of cases.
OpenAI CEO Sam Altman (pictured) recently said that his company doesn’t fully know how ChatGPT works. The researchers found that the closed structure of the AI made it hard to know just how it reasons
The results also varied from task to task, with scores on the ‘Wason task’ ranging from a 90 per cent correct response rate for ChatGPT-4 to zero per cent for Google Bard and ChatGPT-3.5.
In their paper, published in Royal Society Open Science, the researchers wrote: ‘This has implications for potential uses of these models in critical applications and scenarios, such as diplomacy or medicine.’
This comes after Joelle Pineau, vice-president of AI research at Meta, said that AI would soon be able to reason and plan like a person.
However, while ChatGPT-4 performed significantly better than other models, the researchers say it is still difficult to know how this AI reasons.
While AI might be commonly understood as calculating and rational, these findings suggest that the best models currently available frequently fail to follow the rules of logic (file photo)
Senior author Professor Mirco Musolesi says: ‘The interesting thing is that we do not really understand the emergent behaviour of Large Language Models and why and how they get answers right or wrong.’
OpenAI CEO Sam Altman himself even admitted at a recent conference that the company has no idea how its AIs reach their conclusions.
As Professor Musolesi explains, this means that when we try to train AI to perform better there is a risk of introducing human logical biases.
He says: ‘We now have methods for fine-tuning these models, but then a question arises: if we try to fix these problems by teaching the models, do we also impose our own flaws?’
For example, ChatGPT-3.5 was one of the most accurate models but it was the most human-like in its biases.
Professor Musolesi adds: ‘What’s intriguing is that these LLMs make us reflect on how we reason and our own biases, and whether we want fully rational machines.’