Subtitles from YouTube videos by Dhruv Rathee, Marques Brownlee, and PewDiePie were used to train AI models, according to a tool shared by the outlet Proof News.
Anthropic, Nvidia, Apple, and Salesforce were among the leading tech firms that used a dataset of YouTube video subtitles to train their AI models, according to the outlet.
The outlet said it found subtitles from 173,536 YouTube videos that were pulled from over 48,000 channels, but warned that the tool could return false negatives.
The videos used to train AI included uploads by tech reviewer Marques Brownlee and by content creators such as PewDiePie and Dhruv Rathee, as well as by news publications and talk shows worldwide.
Based on a search using the tool, a 2020 video by The Hindu was also seen in the results.
Most of the videos were from 2020 or earlier, suggesting a cut-off of sorts.
Brownlee criticised companies that scraped video transcripts for AI training content.
“Fun fact, I pay a service (by the minute) for more accurate transcriptions of my own videos, which I then upload to YouTube’s back-end. So companies that scrape transcripts are stealing *paid* work in more than one way. Not great,” Brownlee posted on X on Tuesday.
Anthropic and Salesforce confirmed using training datasets that included the scraped video subtitles, but did not admit any wrongdoing, per the outlet. Nvidia, Apple, Databricks, and Bloomberg did not confirm or deny the allegations.
The question of scraping YouTube videos, or their transcripts, to train AI models is a contentious one.
Earlier in the year, when OpenAI official Mira Murati was asked whether the ChatGPT-maker used YouTube videos for AI training, she struggled with the question and could not answer clearly.