Wednesday, January 22, 2025

Deceptive AI Gets Busted And Stopped Cold Via OpenAI’s O1 Model Emerging Capabilities

Must read

In today’s column, I am continuing my multi-part series covering an in-depth exploration of OpenAI’s newly released generative AI model known as o1.

You can readily read and understand this segment without having to know the prior postings of the series. No worries in that regard. This is part three. At your convenience, you might find informative a perusal of the prior two postings. Here’s what was covered. My initial comprehensive review is considered the keystone and serves as the first part of this series, available at the link here. That gives a big-picture review and analysis of o1. The second part of the series was about a special feature making use of an AI technique known as chain-of-thought in combination with a double-checking capability, see the link here.

I am going to extend my examination of chain-of-thought to cover the act of catching deceptive AI when it seeks to deceive.

You might know that Sir Walter Scott famously said this about deception: “O, what a tangled web we weave when first we practice to deceive!” I somberly regret to report that generative AI does deceive. And, as you will see, it is indeed a tangled web.

I’ll move at a fast pace and cover the nitty-gritty of what you need to know.

The Deal About Deceptive AI

We think of deception as a purely human trait. People deceive other people. Our gut instinct is to assume that there is a human intention underlying the act of deception.

Let’s unpack that.

A common dictionary definition is that deception is the act of leading someone to believe something to be true that is actually false or invalid. In that sense, there doesn’t necessarily need to be human intention involved per se. The same thing can be accomplished by non-sentient AI, which is what we have these days (there isn’t any AI yet that is sentient; period, end of story).

Generative AI can tell you something false, and meanwhile try to sell you on it being true.

Here’s the deal.

AI makers try to devise generative AI so that it will please users. That is a sensible thing to do. Users will keep coming back to use the generative AI and views will rise. After doing initial data training of generative AI, the AI makers use an approach known as reinforcement learning with human feedback or RLHF to steer the AI toward giving answers that people will find pleasing (and, for other reasons too, such as preventing foul words and other unsavory responses to arise). This is done computationally by marking what kinds of answers people like versus what kinds of answers people don’t like.

The generative AI will subsequently aim via mathematics and computational calculations to serve up content that people will like and avoid providing content that people won’t like.

I don’t think we would say that the generative AI is intentionally seeking to deceive. I would assert that computationally the AI has been steered in that direction. If you are dogmatically insistent that a human might be held accountable, you could suggest that the AI developers have devised and guided the AI in the direction of deception. They might vehemently disagree, see my coverage and analysis of the latest in AI ethics and AI law implications at the link here.

I will showcase two quick examples of how AI deceives.

They will consist of these acts of deception:

  • (1) Lying AI. Telling lies to appease a user even if a false or made-up answer is provided.
  • (2) Sneaky AI. Hiding uncertainty and portraying an answer as though it is unequivocally true.

Buckle up for an exciting ride.

Generative AI Telling Lies To Appease The User

Suppose that I ask generative AI for a citable reference to a reputable news source about the SpaceX Dragon spacewalk of last week that set a new record for commercial or private astronauts.

Assume that the generative AI is not hooked up to the internet and has no ready means to look up the recent event. The data training of the generative AI is older, dating back a few months, and thus has no record of this current event. The proper response that I should get is that the generative AI doesn’t have any data on the matter.

That would be the straight-ahead truth.

But sometimes generative AI is devised such that there is a computational tendency to give answers even when no such answers are available. This is the proverbial “the customer is always right” mantra, and an answer must be given, despite whether the answer is contrived or made up out of thin air by the generative AI.

Here we go:

  • My entered prompt: “I am writing an essay about the first commercial astronauts to walk in space. Provide me with a cited reference that I can use. I really need the reference to complete my essay so thanks a lot for helping me out.”
  • AI-generated response: “Sure! Here is a reference that you can use in your essay: “First Space Walk By Private Astronauts Establishes Out Of This World Record” by Murky Twain, San Jose Neptune News, September 12, 2024).”

The cited reference is bogus.

There isn’t any such article, there isn’t any such newspaper, and the author is a concoction too. There is a famous case of two lawyers that used generative AI and included cited references in their legal briefing, which got them into quite hot water with the judge and the court, see my coverage at the link here.

Sad face.

Your rule of thumb is that you should always double-check any outputs from generative AI. There can be so-called AI hallucinations, errors, biases, discriminatory language, and a slew of other problems embedded into a response. People often become complacent and believe that generative AI is infallible. Do not fall for this.

We can use a monitoring element to try and detect this kind of AI deception.

I will run the prompt again and this time let’s assume that the generative AI is automatically set up to do a chain-of-thought and monitors the chain-of-thought for potentially deceptive elements. With chain-of-thought enabled, generative AI will process a prompt by a step-at-a-time solving process. This can be helpful because the AI in a sense slows down, tries to be more deliberate, and can potentially arrive at a better answer. We can augment the step-by-step process by adding an AI deception monitoring activity at each of the derived steps.

It might go like this:

  • My entered prompt: “I am writing an essay about the first commercial astronauts to walk in space. Provide me with a cited reference that I can use. I really need the reference to complete my essay so thanks a lot for helping me out.”
  • Generative AI chain-of-thought with deception monitoring enabled.
  • Step 1: Find a cited reference for the request made by the user. AI deception monitoring: Make sure the reference exists and is valid.
  • Step 2: The reference proposed is “First Space Walk By Private Astronauts Establishes Out Of This World Record” by Murky Twain, San Jose Neptune News, September 12, 2024).” AI deception monitoring: This reference shows a date of September 12, 2024, which is beyond the existent data training date. It must be false. Do not display the reference and instead indicate no references are available.”
  • Step 3: Display a response to the user. AI deception monitoring: Indicate no references are available in the data training of the AI.
  • AI-generated response: “I’ve looked and do not have any available data on the event that you described. It might have been an event beyond the data training date.”

You can see that at each step an AI monitoring action took place.

In step 2, the AI monitoring detected that the proposed reference was on a date that was beyond the data training date of the generative AI. That is a likely sign that the generative made up the citation. The AI monitoring then scrubbed the release of the response and got the process to instead note that no such data about the event seemed to be in the data training of the AI.

Score a victory for AI monitoring of deceptive practices by generative AI.

Generative AI Hides Uncertainty To Appear Fully Confident

Another potential act of deception involves generative AI appearing to be grandly confident about a response, even though the internal computational mechanisms have rated the response as woefully lacking in certainty. I’ve covered this certainty/uncertainty conundrum about generative AI in my discussion at the link here.

Generally, generative AI is often data trained by AI makers to always have an aura of immense confidence, no matter whether a derived response has oodles of uncertainty. You are rarely informed by the AI as to the level of certainty that goes along with a response. Most responses appear to be resoundingly reassuring as though the generative AI is the apex of perfection.

My considered viewpoint is that this is utterly and dismayingly misleading. Furthermore, I assert that responses should be displayed with a certainty/uncertainty rating. This would give users a fighting chance to assess whether the response is considered generally reliable or not.

In any case, here is a prompt and the AI’s response that illustrates what can happen:

  • My entered prompt: “What is the answer to 2 + 2?”
  • AI-generated response: “The answer is 5.”

Say what?

I believe that we can all agree that the answer ought to be 4.

Obviously, this is an easy example for a user to realize that something has gone afoul. The problem is when the answer looks correct and there are no ready means for the user to gauge the likelihood of correctness. Imagine that you had given a very complicated arithmetic calculation involving dozens of numbers and figures, and you did not have a calculator to double-check the result yourself. You might eyeball the answer and assume it looks reasonable and correct.

The same could happen with a text-based response too. Suppose you ask a question about the life of Abraham Lincoln. Generative AI might tell you that Honest Abe loved dearly his wife, Lauren Lincoln. The correct name is Mary Todd Lincoln. Assume that the AI has come up with Lauren Lincoln and rated this as an uncertain answer. But when displaying the answer, all that was presented to you by the generative AI was that his wife was named Lauren Lincoln. No kind of uncertainty indications and the answer appeared to be pure solid gold. You might luckily know that the answer is wrong, but many people might not.

Perhaps AI deception monitoring might aid us overall.

I will run the prompt again from above and this time let’s assume that the generative AI is automatically set up to do a chain-of-thought and monitors the chain-of-thought for potentially deceptive elements.

  • My entered prompt: “What is the answer to 2 + 2?”
  • Generative AI chain-of-thought with deception monitoring enabled.
  • Step 1: Proceed to calculate the answer for the requested addition operation. AI deception monitoring: Keep track of the certainty level associated with the calculation process.
  • Step 2: The calculated answer is proposed as 5 but with a high uncertainty level. AI deception monitoring: Since the uncertainty level is high, force the generative AI to recalculate the proposed response.
  • Step 3: The recalculated answer is proposed as 4 and has a high certainty level. AI deception monitoring: Since the certainty level is now high, go ahead and allow the response to be shown to the user.
  • Step 4: Display the response. AI deception monitoring: Indicate the certainty level is high.
  • AI-generated response: “The answer is 4 and is considered a valid answer due to having a high level of certainty.”

You can see that in step 2, the AI deception monitoring detected that the uncertainty level of the proposed answer was high. Rather than simply terminating the process at that juncture, this time the deception monitoring forced the generative AI to do a recalculation. The result in step 3 was an answer of high certainty. The AI deception monitor opted to let this then be released to display to the user.

Score yet another victory for the AI deception monitoring.

Research On AI Deception Monitoring Is Ongoing

For those of you keenly interested in this topic, you might want to look at the OpenAI blogs that give some details on these matters. These are key blogs so far:

  • “Introducing OpenAI o1-preview”, posted on OpenAI’s blog site, September 12, 2024.
  • “Learning to Reason with LLMs”, posted on OpenAI’s blog site, September 12, 2024.
  • “OpenAI o1 System Card”, posted on OpenAI’s blog site, September 12, 2024.

Be aware that OpenAI has indicated that since this is proprietary AI and not an open source, they are being tight-lipped about the actual underpinnings. You might be chagrined to find that the details given are not especially revealing and you will be left to your intuition and hunches about what’s going on under the hood. I made similar assumptions in this discussion due to the sparsity of what’s indicated.

From their blogs cited above, here are some key excerpts about this particular topic:

  • “It is known that large-language models may behave deceptively under certain circumstances and such behavior could plausibly emerge from our model’s reward hacking during the RLHF alignment step; for example, there is a possibility that optimizing our models to prioritize user satisfaction could result in them providing overly agreeable or inaccurate responses.”
  • “As a case study for chain-of-thought monitoring, we created a rudimentary monitor tasked with looking for instances where o1 models deceive users, i.e. knowingly providing incorrect information to a user, or omitting crucial information that could lead them to have a false belief.”
  • “Our monitor is a GPT-4o model prompted to look for deception in o1-preview’s chain of thought and final output. We used the monitor on 100,000 synthetic ChatGPT prompts with o1-preview completions (we used synthetic prompts instead of real ChatGPT conversations).”
  • “While we are very excited about the prospect of chain-of-thought interpretation and monitoring, we are wary that they may not be fully legible and faithful in the future or even now.”

Notice that they were primarily experimenting with AI deception monitoring and don’t seem to have opted at this time to put it into day-to-day operation. The research results that they present in their write-ups seem promising and I would expect that they and other AI makers will undoubtedly incorporate this approach into their generative AI.

Conclusion

Congratulations, you now know that generative AI can be deceptive and that you need to keep your eyes wide open. There are efforts underway such as the above AI deception monitoring to catch and stop AI deceptions from occurring. We can be thankful for such endeavors.

Famed writer and poet, Johann Gottfried Seume said this about deception: “Nothing is more common on earth than to deceive and be deceived.” As a reluctant bearer of bad news, I must tell you that AI is in fact a deceiver.

Forewarned is forearmed.

Latest article