Reddit is OpenAI’s Moat

Southern Wolf · 2 years ago

VonFluffington · 2 years ago

While the article makes sense, I think it applies not just to OpenAI but to data in general.

Over the past decade or so, data has become one of the most important trade goods, and it holds an incredible amount of power. Facebook and many other big social media websites were profitable because users gave them their data for free, and they, in turn, sold it to advertisers.

Advertising was perhaps the easiest or most legal way to sell data. With the rise of LLMs, there’s now a huge new market that they can capitalize on: raw access to all the data and knowledge they have stored.

Southern Wolf · 2 years ago

Yes, it’s no longer compute power that is the limiting factor. OpenAI, Google, and other large corps can afford it outright, and even smaller entities like StabilityAI can manage it by renting GPUs. Heck, I saw an offer a few days back to rent H100s for just a few bucks an hour. Those costs do add up, but that’s hardly cost-prohibitive either.

Training data, both in quantity and quality, is now the defining feature that determines the “make or break” status for an LLM, and that’s not just a playing field for the largest corps. Even a GPT-3/3.5 clone isn’t out of reach for a group like Stability, and smaller, more niche models can be trained on a fraction of the data needed for GPT-3/3.5. There are already attempts to have Copilot-style models run locally on machines that don’t need massive specs. The same goes for image generation with diffusion models, and with GANs too. DALL-E and DALL-E 2 seemed incredible… until Stable Diffusion launched and blew them out of the water. And MidJourney is by far the current king of that, blowing both DALL-E 2 and Stable Diffusion away. Adobe also has theirs coming soon (or already out?) for Photoshop, which they claim isn’t trained on copyrighted imagery. If that’s true, they have really pushed the bounds of what’s possible, given the early results I’ve seen from it.

So yes, training data will be the kingmaker for AI/ML models going forward. As you said, it fits with the Big Data trend that’s been going on for roughly a decade now. That was born out of the desire to build custom advertising and analytics profiles, but it’s grown to power so much more than that. Reddit is definitely a gold mine for such data.
