Sunday, December 22, 2024

The New York Times and other top news sites block OpenAI’s new SearchGPT web crawling bot

About a week after OpenAI launched SearchGPT, some of the top news publishers have made clear they want nothing to do with the startup’s new search engine.

The New York Times and at least 13 other news sites have blocked OAI-SearchBot. This is the web crawler that indexes information so OpenAI can retrieve and show relevant results to SearchGPT users.

Originality.ai tracks which sites block such crawlers, and it found that 14 of the top 1,000 website publishers have blocked OAI-SearchBot. Other publications on the list include Wired, The New Yorker, Vogue, Vanity Fair, and GQ.

This is a bit of a head-scratcher, according to Jon Gillham, CEO of Originality.ai.

“I am not sure why any publisher would block it,” he told Business Insider. “It is traffic that publishers want and need.”

When OpenAI unveiled SearchGPT last week, it stressed that OAI-SearchBot does not crawl the web to collect data to train its AI models like GPT-5. It also advised website owners to allow the new bot to “ensure your site appears in search results.”

Without crawler access to every website, OpenAI’s SearchGPT service risks being less complete than Google’s search engine. BI asked Gillham whether any major news publishers block Google’s search bot, and he said he doesn’t know of any that do.

A lack of trust, or search traffic doubts

There’s another OpenAI web crawler, called GPTBot, that scoops up online data for AI model training. Hundreds of websites have already blocked it. That makes more sense: You want traffic from search engines, but you don’t want to give away your content to train AI models that will likely compete against you.
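
For readers curious how a site can treat the two bots differently, here is a minimal sketch. It assumes the block is expressed in a robots.txt file, the usual mechanism for this, though the article does not say how any particular publisher does it. The robots.txt contents, the example.com URL, and the can_crawl helper are all hypothetical; only the two bot names come from the reporting above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: turns away the training crawler, admits the search crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URL, used only for illustration.
ARTICLE_URL = "https://example.com/news/story.html"

training_allowed = can_crawl(ROBOTS_TXT, "GPTBot", ARTICLE_URL)
search_allowed = can_crawl(ROBOTS_TXT, "OAI-SearchBot", ARTICLE_URL)

print(f"GPTBot allowed: {training_allowed}")
print(f"OAI-SearchBot allowed: {search_allowed}")
```

A file like this is only a request. It works to the extent that the crawler chooses to honor it, which is where the question of trust comes in.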

However, OpenAI spent years collecting online data without permission. Maybe publishers just don’t trust OpenAI when it says its new search bot won’t also secretly suck up their content for AI training data?

“I think so,” Gillham said.

Another theory: Search results these days don’t always send users to the websites that worked hard to create the original content. Part of the goal with new AI-powered search engines is to keep users around by showing them summaries. If publishers aren’t seeing huge traffic from search engines anymore, why bother allowing their web crawling bots?

A complaint from The New York Times

Gillham also noted that OpenAI has been busy this year cutting deals with publishers to use their content archives. (Business Insider parent Axel Springer signed one of these.)

“Seems like it was an intentional sequence of steps with OpenAI, first cozy up to publishers signing all of these partnership deals and then announcing SearchGPT,” Gillham added.

The major holdout among publishers is The New York Times. It has sued OpenAI and Microsoft, alleging the tech companies unlawfully use its work to create competing products.

“The Times does not authorize the use of our works for generative search or AI training purposes without an express written agreement, regardless of whether we do or do not block or restrict any particular bot from crawling our content,” Charlie Stadtlander, a spokesman for The New York Times, said in a statement.

In its complaint against OpenAI and Microsoft, The New York Times touched on this issue of search engines becoming more AI-powered and potentially siphoning off traffic from publishers.

“Defendants also use Microsoft’s Bing search index, which copies and categorizes The Times’s online content, to generate responses that contain verbatim excerpts and detailed summaries of Times articles that are significantly longer and more detailed than those returned by traditional search engines,” the publisher wrote in its complaint. “By providing Times content without The Times’s permission or authorization, Defendants’ tools undermine and damage The Times’s relationship with its readers and deprive The Times of subscription, licensing, advertising, and affiliate revenue.”

OpenAI did not respond to requests for comment.

Do you work at OpenAI? Do you use their models? Have a tip, tirade or opinion you feel like sharing? Get in touch with Darius Rafieyan by phone or Signal at +1-714-651-1367 or email at drafieyan@insider.com
