Sunday, December 22, 2024

Is Reddit trying to have its cake and eat it with Google and generative AI?


Image: Shutterstock

On the earnings call for Reddit’s very first quarter as a publicly-traded company, CEO Steve Huffman revealed that Reddit gained an additional 9.6 million users in Q1 2024 – its largest jump in three years – driven largely by increased Google traffic.

Huffman stated on the call that more than 60% of the newcomers to Reddit were logged-out users – so, users who don’t have an account (or aren’t signed in to one), but are nevertheless counted in Reddit’s metrics – and that Google “is driving the increase … largely in the logged-out users”. Reddit also saw a 27% year-over-year jump in logged-in users, whom Huffman calls “the core of our business”.

Reddit attributes this major influx from Google to some recent performance improvements. On the same call, Huffman announced that a new web platform for Reddit, dubbed ‘Shreddit’, has now been rolled out for all users.

The new platform is “more than twice as fast as the platform it replaces” – and as SEOs and web developers will be well aware, page speed is an important ranking factor for Google and forms part of Google’s Core Web Vitals metrics. Core Web Vitals measure how quickly a page loads for searchers, how responsive it is to interactions, and how visually stable it is as it loads.

Or as Huffman put it: “Googlebot likes speed”.
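For readers who want to see what the three Core Web Vitals look like in practice, here is a minimal sketch in Python. It assumes the metric names and "good" / "poor" boundaries that Google publishes for Largest Contentful Paint (LCP), Interaction to Next Paint (INP) and Cumulative Layout Shift (CLS); the function names and the sample measurements are hypothetical, and nothing here reflects how Reddit or Googlebot actually measures pages.

```python
# Illustrative only: rate a page's Core Web Vitals against Google's
# published boundaries. Each entry is (good_up_to, poor_above).
THRESHOLDS = {
    "lcp_seconds": (2.5, 4.0),       # loading: Largest Contentful Paint
    "inp_milliseconds": (200, 500),  # responsiveness: Interaction to Next Paint
    "cls_score": (0.1, 0.25),        # visual stability: Cumulative Layout Shift
}


def rate(metric: str, value: float) -> str:
    """Return 'good', 'needs improvement' or 'poor' for one measurement."""
    good_up_to, poor_above = THRESHOLDS[metric]
    if value <= good_up_to:
        return "good"
    if value <= poor_above:
        return "needs improvement"
    return "poor"


def rate_page(measurements: dict[str, float]) -> dict[str, str]:
    """Rate every supplied metric for a single page."""
    return {name: rate(name, value) for name, value in measurements.items()}


if __name__ == "__main__":
    # Hypothetical measurements for a single page, not real Reddit data.
    sample = {"lcp_seconds": 1.8, "inp_milliseconds": 320, "cls_score": 0.3}
    for name, verdict in rate_page(sample).items():
        print(f"{name}: {verdict}")
```

A page that loads twice as fast will usually see its LCP figure drop, which is the kind of gain Huffman is pointing to; the other two metrics depend on how the page behaves once it has loaded.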

Reddit’s status as a source of unguarded, authentic advice and information on a huge range of topics has never been more valuable, as searchers increasingly adopt the trick of appending ‘Reddit’ to searches in order to surface discussions and advice rather than product or article results that they may perceive to be gaming Google.

Huffman highlighted this on the earnings call, noting that Reddit’s growth is likely also because “in the AI era people value authentic content more or content written by humans, and that’s what Reddit is and that’s what Reddit has. … [T]here’s a thirst for authentic opinions and advice and commentary and just conversations.”

I made the same point in my previous piece on why generative AI is no substitute for online community – contrasting the quality of advice offered by a 2009 LiveJournal group with that of today’s generative AI. However, there’s a certain irony to Huffman celebrating the authenticity that leads searchers to favour Reddit – given that Reddit is also licensing its content as AI training data.


The generative AI conundrum

The use of online content to train generative AI models without requesting permission has proven to be one of the major controversies surrounding generative AI, raising questions about the nature of copyright and intellectual property.

Without a high-quality set of training data, after all, these models would not exist – but publishers object to not having been consulted on this usage, and to not seeing any benefit from the major role they played in bringing large language models (LLMs) into being.

Reddit has played a prominent role in creating these datasets since the days of GPT-2, which was trained on webpages linked from Reddit posts that had received at least three karma, thus using the site as a marker of quality. Such pages also formed a major part of the training data for GPT-3.

Now, tech companies that want to train generative AI are casting the net progressively wider, looking for new bodies of text to bolster their training data. Reddit, which is increasingly valued as an information resource in its own right – not simply as a directory of webpages – has entered the picture as a source of training data.

Reports surfaced in February, just ahead of Reddit’s prospective IPO, that it had signed an AI content licensing deal worth $60 million annually. Less than a week later, Reddit and Google both confirmed their new partnership with official announcements.

Reddit is at pains to state that its partnership with Google – and other data licensing agreements that it’s pursuing, since Google is not the only beneficiary – is “aligned with our belief that everyone should be able to find the information they need and the experiences they want online”.

The announcement reiterated Reddit’s status as “one of the internet’s largest open archives of authentic and constantly-updated human-generated conversations and experiences”. But can Reddit manage to have its cake – trade on its reputation for unvarnished authenticity – and eat it – license its content to train the very AI models it wants to stand apart from?

AI and Reddit: Uneasy bedfellows?

In February, in the wake of reports surrounding Reddit’s data licensing deal with Google, Kyle Orland at Ars Technica took a look at Reddit’s Form S-1 SEC filing that had been published ahead of the company’s planned IPO.

He observed that “while Reddit sees LLMs as a new revenue opportunity, the site also sees their popularity as a potential threat. The S-1 filing notes that ‘some users are also turning to LLMs such as ChatGPT, Gemini, and Anthropic’ for seeking information, putting them in the same category of Reddit competition as ‘Google, Amazon, YouTube, Wikipedia, X, and other news sites’.”

It’s true that users have been employing LLMs for information seeking since the launch of ChatGPT – even when the chat tool was not connected to the internet and had a finite, time-restricted dataset. This has shaped a trend in web search that is gathering momentum: AI-powered search, referred to by Google as SGE (Search Generative Experience), which uses an LLM to synthesise the information about a topic and present an answer to the user’s query, usually complete with links to sources.

Bing was first off the starting block with Bing Chat in early 2023, while Google, after some stops and starts, is now testing SGE in the US and the UK. New players like Perplexity.ai, whose proposition is founded on AI-powered search, have been attracting more attention, and most recently, reports have suggested that ChatGPT could be launching a search product ahead of Google I/O on Tuesday (although Sam Altman has denied that OpenAI’s product news will be a search engine).

The problem with AI-powered search is that while it may link to sources of information, it’s also designed to replace them. The whole appeal of generative AI in search is that the user doesn’t have to sift through an array of different search results to find the answer to a question, which is instead packaged neatly for them at the top of search.

And while search engines like Bing and Google have reiterated that they provide plenty of links to sources, the need to click on those sources is also being reduced. There’s conspicuously little data on how this style of search actually impacts clickthrough: when Bing finally added Bing Chat data into Bing Webmaster Tools, it was grouped together with web data and not presented on its own.

The potential impact depends considerably on just how prevalent generative AI-powered search proves to be. The addition of Bing Chat didn’t noticeably budge Bing’s share of search, and while Google’s dominance would be an instant boost to adoption, Google is taking it slow, as errors and issues in generative AI-powered search would be highly damaging to its reputation. There’s also the question of the impact on Google’s business model: one report from the Financial Times suggested Google could ultimately put SGE behind a paywall.

However, publishers have reason to be concerned given the rise in prevalence of what have been called ‘zero-click’ searches: searches that don’t continue on to a website because the information has already been provided on the SERP thanks to features like rich snippets, featured snippets, or Google’s Knowledge Panel. A 2022 SEMrush study put the percentage of zero-click searches at around 25%.

Even if publishers can rank for these searches, they ultimately don’t gain a visitor to their website who might browse other pages, and SGE seems set to exacerbate that. Given how much Reddit is currently growing thanks to users finding and, crucially, visiting its website through Google, an interruption to that trend spells bad news for Reddit’s long-term health.

“You have to link to us”

So, what is Reddit thinking? One investor probed Huffman on the apparent contradictions of Reddit’s approach during the earnings call, saying, “…when I go to Google and do a Google search, I get a very clear set of links that say these are from Reddit, and I click in and I go to a Reddit page. That doesn’t happen in AI. Will that happen in AI? Is that the future that we see?”

In response, Huffman emphasised that when Reddit enters into data licensing agreements, “we … are very clear that, if you’re using Reddit’s data in an answer, like in your example, you have to link to Reddit or you have to say Reddit, you have to use our branding, you have to link to us.

“…will people be able to take Reddit’s data, not send any users to us, take credit for themselves and enrich themselves? Going forward, the answer is no.”

At least in theory, then, Reddit’s approach to generative AI is about taking control. This is also shown in Reddit’s newly-publicised policy, which reiterates that any use of Reddit’s data requires a contractual agreement (with some carve-out exceptions for research and non-commercial usage, though these will still need to go through Reddit officially). The argument from Reddit – and from other publishers who have struck agreements around generative AI, like the Financial Times and Thomson Reuters – seems to be that if generative AI is here to stay, they may as well see some benefit.

But as illustrated, medium-term benefits may lead to longer-term struggles. Many of Reddit’s users have not welcomed the news that their posts will be used to train generative AI, with some opting to delete past posts or replace them with nonsense to withdraw their contributions from the dataset.

Meanwhile, Google is trialling another initiative that spells bad news for Reddit on a different front. Last year, it announced a ‘new Search Labs experiment’ called Notes, which adds user commentary on search results directly on the SERP.

As Google writes, “Often, the best place to find answers to your most specific questions is with someone who’s been in your shoes before.” This is Google moving directly into Reddit’s territory – recognising that what searchers often want is human advice and expertise, and aiming to provide that on the SERP in a form that Google controls.

While Google might not be able to instantly substitute for the years of community and trust that Reddit has built (to say nothing of the moderation practices it will need to develop from scratch), it has recognised the demand for thoughts and input from real people and is looking to satisfy it.

Although Reddit is right to benefit as much as it can from Google’s traffic through organic search, in the long run it – like publishers – needs a sustainable strategy that doesn’t depend on Google sharing its bounty.
