• AutoTL;DR
    English
    18 months ago

    This is the best summary I could come up with:


    As AI companies keep building bigger and better models, they’re running into a shared problem: sometime soon, the internet won’t be big enough to provide all the data they need.

    While some companies are looking into ways to train larger and smarter models with less data and fewer resources, among them Dataology, founded by ex-Meta and Google DeepMind researcher Ari Morcos, most big companies are pursuing novel and controversial means of sourcing training data.

    OpenAI, for instance, has, per the WSJ’s sources, discussed training GPT-5 on transcriptions of public YouTube videos, even as its own chief technology officer, Mira Murati, struggles to answer questions about whether its Sora video generator was trained on YouTube data.

    Synthetic data, meanwhile, has been the subject of ample debate in recent months, after researchers found last year that training an AI model on AI-generated data amounts to a digital form of “inbreeding” that ultimately leads to “model collapse,” sometimes dubbed “Habsburg AI.”

    Concerns about AI running out of data have spooked researchers for some time, but researcher Pablo Villalobos told the newspaper that although his firm, Epoch, estimates AI will exhaust usable training data within the next few years, there’s no reason for panic.

    Then again, there is another obvious solution to this manufactured problem: AI companies could simply stop trying to create bigger and better models, which, beyond straining the supply of training data, also consume huge amounts of electricity and expensive computing chips that require the mining of rare-earth minerals.


    The original article contains 407 words, the summary contains 258 words. Saved 37%. I’m a bot and I’m open source!