Wednesday, December 18, 2024

This London startup wants to build an eBay for AI data

If data is the new oil, a London-based startup is vying to become the equivalent of the New York Mercantile Exchange—a marketplace where AI companies looking for data to train their AI models can strike deals with publishers and other businesses that have data to sell.

The startup, called Human Native AI, has recently hired a number of prominent former Google executives with experience in striking content licensing deals and partnerships, as well as top legal eagles experienced in intellectual property and copyright issues.

To date, companies building the large language models (LLMs) that have powered the generative AI revolution have mostly harvested data, for free, by scraping the public internet, often with little regard for copyright.

But there are signs this era is rapidly drawing to a close. In the U.S., a number of lawsuits against AI companies for allegedly violating copyright law when training AI models on material taken from the internet without permission are making their way through the courts. While it is possible judges will rule that such activity can be considered “fair use,” companies creating AI models would rather not risk being tied up in court for years.

In Europe, the new EU AI Act mandates that companies disclose whether they trained AI models on copyrighted material, potentially opening companies up to legal action there too.

AI companies have already been striking deals with major publishers and news organizations to license data, both for training and to make sure their models have access to up-to-date, accurate information. OpenAI signed a three-year licensing deal with publisher Axel Springer, which owns Business Insider, Politico, and a number of German news organizations, reportedly worth “tens of millions of dollars.” It has also signed deals with the Financial Times, The Atlantic, and Time magazine. Google has similar deals with many publishers. Fortune has a licensing agreement with generative AI startup Perplexity.

Startups may have trouble securing business insurance if their data gathering practices potentially expose them to legal risk, providing another incentive for many of these companies to license the data they need.

Scraping is also becoming technically harder, as many businesses have begun deploying measures to block bots from harvesting their data. Some artists have also begun applying special digital masks to images they post online that can corrupt AI models trained on those images without permission.

In addition, the largest LLMs, the type of AI that powers OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude, have already ingested the entire internet’s worth of publicly available data. Meanwhile, training effective smaller AI models, especially those designed for special purposes, such as helping lawyers draft specific types of contracts, scientists design new drugs, or engineers create blueprints, requires curated datasets of high-quality information pertaining to that task. Very little of this kind of specialized data is available on the public internet, so it can only be obtained through licensing arrangements.

That’s why James Smith, a veteran Google and Google DeepMind engineer and product manager, decided to cofound Human Native with Jack Galilee, a software engineer who worked on machine learning systems at medical technology company Grail. “We were wondering why there was not an easy way for companies to acquire the data they needed to train AI models,” Smith, now Human Native’s CEO, said.

Even when AI companies wanted to source data ethically and legally, it was often difficult, he said, for them to find out who held what data, and then figure out who at that company to speak to in order to strike a licensing deal. The time currently required to negotiate such deals could also be an impediment for fast-moving AI model developers—with some taking the view that if they took the time to do the right thing, they risked falling behind rivals commercially, he said.

Human Native intends to be a digital marketplace that will enable those who need data for AI systems to easily connect with those who have it and to strike a deal using relatively standardized legal contracts. In June, it raised a $3.6 million seed round led by London-based venture capital firms LocalGlobe and Mercuri to begin to make good on that vision. It also counts among its advisors entrepreneur, AI developer, and musician Ed Newton-Rex, who headed the audio team at genAI company Stability AI, but has since emerged as a prominent critic of AI companies’ disregard for copyright.

The startup is among just a handful of companies offering data brokering services. And even Human Native is only in the early stages of setting up its market, with a beta version of the platform currently available to select customers. Human Native plans to make money in several ways, including taking a commission on the transactions it brokers, as well as offering tools to help customers clean up datasets and implement data governance policies. The company has not disclosed if it is currently making any revenue from its nascent platform.

Others already offering data for sale to AI companies include Nomad Data and data analytics platform Snowflake. But Human Native may soon face more competition. For instance, Matthew Prince, the founder and CEO of the computing company Cloudflare, has talked about creating a similar marketplace for AI data.

To work, Human Native needs to build a critical mass of buyers and sellers on its platform and create those standardized contract terms. That is where the startup’s recent hiring of some well-pedigreed experts from the worlds of digital partnerships and IP law comes in.

The hires include Madhav Chinnappa, who spent a decade working for the rights and development department at the BBC and then 13 years at Google running the search giant’s partnerships with news organizations, and who is now Human Native’s vice president of partnerships; Tim Palmer, a veteran of Disney and Google, where he also spent 13 years, mostly working on product partnerships, and who is now advising on partnerships and business development for Human Native; and Matt Hervey, a former partner at the international law firm Gowling WLG who co-chaired the AI subcommittee of the American Intellectual Property Law Association and edited a new book on the legal issues surrounding AI. Hervey is now Human Native’s head of legal and policy.

Both Palmer and Chinnappa were let go from Google during its large round of lay-offs in the summer of 2024, highlighting the extent to which that tech giant’s belt tightening has resulted in the loss of experienced employees who are now helping to grow a new generation of startups.

“Human Native is focused on what is maybe the most interesting problem in tech right now,” Palmer told me, explaining why he was interested in helping the nascent data marketplace. He said that while lawsuits represented one attempt to establish rules for how AI companies can use data, commercial licensing represented a more productive approach.

Palmer said his experience at Google acquiring content means he has “a good idea of what is out there and who has what content and who are the professional licensors and a good sense of what is acceptable and what is not” as far as licensing terms.

Chinnappa said he sees Human Native as helping to level the playing field, especially for smaller publishers and rights holders, who he says might otherwise get frozen out of any deal with AI companies.

“I helped write the playbook for this when I was at Google, and what you do [if you are Google, OpenAI, Anthropic, Meta, or one of the other big AI model companies] is you do a minimum number of big deals with big media companies,” he said.

Human Native may be able to help smaller publishers find ways to monetize their data by helping to pool data from multiple publishers into packages that will be large enough, or tailored enough, to interest AI model makers, he said.

Hervey said that Human Native could play a major role in helping to establish norms and standardized contracts for data licensing for AI. “The broader piece here is not about the law but market practice and the amazing opportunity we have to influence market practice,” he said.

Palmer said it would take time for Human Native to be able to create a technology platform that makes purchasing data for AI models truly seamless. “It is not eBay yet,” he said. “It’s not a zero human touch proposition.”

For now, Human Native’s own staff is working to source datasets for AI companies, realizing that the platform needs a critical mass of both buyers and sellers to function. And once it has facilitated a match between a data seller and an AI model company, the startup’s staff also has to do a lot of work with both sides to help them strike a deal.

Hervey said that some of the commercial terms will always be bespoke, and that Human Native wants to be able to support bespoke licensing arrangements, as well as working to try to standardize licensing terms.
