Major tech companies used thousands of YouTube videos to train their AI models, report says

AI (artificial intelligence) has been the talk of the tech world ever since OpenAI unveiled ChatGPT in 2022. The AI tool, capable of answering a wide range of questions, quickly went viral as people started experimenting with it. ChatGPT is trained on GPUs made by chipmaker Nvidia, which are often described as the backbone of the AI tool. As AI has advanced, several questions have surfaced. One of them is how tech companies train their AI models and what that training requires.

One thing we all know by now is that training any AI model requires a huge amount of data. According to a new report, major tech companies are using content available on YouTube to train their AI tools. An ongoing investigation by Proof News revealed that several leading tech companies have been utilising subtitles from YouTube videos to train their artificial intelligence models. According to the report, this practice directly violates YouTube’s policies against downloading and using its content without explicit permission.

The dataset in question reportedly comprises transcripts from 173,536 YouTube videos across more than 48,000 channels. These channels range from educational sources like Khan Academy and MIT to popular creators such as MrBeast and Marques Brownlee. The dataset also includes some translated subtitles in languages such as German and Arabic, although it does not contain any visual content from the videos.

EleutherAI, a non-profit research lab committed to promoting open science, compiled this dataset as part of a larger collection known as The Pile. The Pile brings together material from sources such as the European Parliament and English Wikipedia, and was released under a permissive licence for academic and research purposes.

According to a report in Wired, David Pakman, a political commentator with over 2 million subscribers, expressed frustration over the situation. More than 160 of his videos were found in the dataset without his consent. Pakman stressed the impact on his livelihood, pointing out that he invests significant time and resources in producing his content. He argued that AI companies profiting from this data should compensate the creators.

Published By: Divyanshi Sharma

Published On: Jul 17, 2024
