Reddit is OpenAI’s Moat

Southern Wolf · 2 years ago

VonFluffington · 2 years ago

While the article makes sense, I think it’s not just about OpenAI, but about data in general.

For the past decade or so, data has become one of the most important trade goods, and holds an incredible amount of power. Facebook and many other big social media websites were profitable because users gave them their data for free, and they, in turn, sold it to advertisers.

Advertising was perhaps the easiest or most legal option to sell data. With the rise of LLMs, there’s now a huge new market that they can capitalize on: Raw access to all the data and knowledge they have stored.

Southern Wolf · 2 years ago

Yes, compute power is no longer the limiting factor. OpenAI, Google, and other large corps can afford it outright, and even smaller entities like Stability AI can manage it by renting GPUs. Heck, I saw an offer a few days back to rent H100s for just a few bucks an hour. Those costs do add up, but they’re hardly prohibitive.

Training data, both in quantity and quality, is now what makes or breaks an LLM, and that’s not just a playing field for the largest corps. Even a GPT-3/3.5 clone isn’t out of reach for a group like Stability, and smaller, more niche models can be trained on a fraction of the data needed for GPT-3/3.5. There are already attempts to get Copilot-style models running locally on machines that don’t need massive specs. The same goes for image-generation diffusion models, and for GANs too. DALL-E and DALL-E 2 seemed incredible… until Stable Diffusion launched and blew them out of the water. And Midjourney is by far the current king, blowing both DALL-E 2 and Stable Diffusion away. Adobe also has theirs coming soon (or already out?) for Photoshop, which they claim isn’t trained on copyrighted imagery. If that’s true, they have really pushed the bounds of what’s possible, given the early results I’ve seen from it.

So yes, training data will be the kingmaker for AI/ML models going forward. Much like you said, it fits with the Big Data trend that’s been going on for roughly a decade now. That was born out of the desire to build custom advertising and analytics profiles, but it’s grown to power so much more than that. Reddit is definitely a gold mine for such data.

Orion (awooo) · 2 years ago

OpenAI definitely owes a lot to Reddit threads. I’ve even been able to trace a GPT-4 hallucination to a single thread where the things it was talking about appeared, though it seemed to have merged two completely different names together.

It definitely could be a contributing factor. The biggest players are caught in a war among themselves while trying to fend off open models at the same time. That may explain why everyone values their publicly accessible data all of a sudden.

Maybe even the recent stuff with YouTube (Invidious and ad blockers) can be explained by this; maybe they want to set the stage for restricting access to videos. Why? Videos have proven to be a good way of training open-ended agents, ones that play Minecraft for example. Google has PaLM-E (which is based on an LM and another transformer for performing physical movements) and is working on general-purpose robots. They also said they were training some kind of model that’s built to be multimodal from the very start, which will probably be a successor to that.

Bersl · 2 years ago

@awooo @southernwolf So instead of making training data opt-in, everyone’s just going to enclose the Internet even further.

Wonderful.

Orion (awooo) · 2 years ago

Yeah, pretty much…

tbh training robots on videos wouldn’t even be bad copyright-wise: there’s nothing copyrightable about the way people move and do things, and people mostly want boring manual jobs to be automated (at least if we get rid of capitalism first so we don’t fucking starve). But of course Google wants its robots to have an edge, and they can get that by siloing off the data from everyone else…

AI research should be public and the results made as accessible as possible. I hate the intersection of AI and capitalism.

Southern Wolf · 2 years ago

It’s less purely capitalism than it is corporatism. Especially so with Altman running around demanding they, and they alone, be given the ability to make AIs. Emad and Stability AI prove you absolutely don’t need that model whatsoever. Further still, the potential commercial projects born out of what Stability released are… many.

What absolutely cannot be allowed is for OpenAI, or Google, to be given the sole right to create AIs, enforced by law. That’s a scary world I 100% do not want to live in…

Orion (awooo) · 2 years ago

Meh, that’s the logical conclusion of capitalism.

I suspect these supposedly good companies will either rise and fall as they run out of VC money, or become another OpenAI or Google at some point, only using their initial investment to kick-start their tech.

But also we have to think about getting replaced by automation anyway (large corporations having exclusive access to it only exacerbates it). It’s different from previous forms of technology, because it won’t really create enough jobs for people.

And while we’re at it, if we can get an abundance of labour, why not give people a bit more agency over everything, rather than just delegating it to some rich fucks who will turn around the moment they sniff out a way to make extra money and abuse it in a multitude of ways to keep their influence?
