In the rapidly evolving world of artificial intelligence, large language models (LLMs) are pushing the boundaries of what’s possible in fields like natural ...
Perhaps it would also be useful to have a name for models that release their weights […]
open-weight?
I think the companies mostly stopped releasing the training data after a lot of them got sued for copyright infringement. I believe Meta’s first LLaMA still came with a complete list of the datasets that went in. And I forget the name of the project, but the community actually recreated it, because the official model’s license at the time only allowed research use. But things have changed since then. Meta opened up a lot. Training got more extensive and is still prohibitively expensive (maybe even more so). And the landscape got riddled with legal issues, compared to the very early days, when it was mostly research and drew far less attention.
Could this lead to increased difficulties in releasing open-source models? By keeping their models closed-source, companies may avoid potential copyright infringement issues that could arise from making everything publicly available.
Sure. That’s already how it is. Most of the OpenAIs, AI music generators, smart appliance vendors and other AI companies out there keep everything a trade secret. They just offer a service. And by not telling how it works, and what went in, they avoid a) regulation and b) lawsuits. (And of course they get ahead of their competition.) I think liability is another issue, but it seems to me it’s still the wild west with AI, so this might just be an issue for services, not model data.
It’s only a handful of companies who release their models. If OpenAI hadn’t released Whisper, there just wouldn’t be any good, local, multilingual speech recognition available. If Zuckerberg decides to stop releasing the Llama models, it’d be down to Mistral and maybe one or two other companies/institutes. And if I remember correctly, even Zuckerberg said the company lawyers advised not to release Llama at all. I think it’s a delicate balance.
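(To give a sense of what “local” means here: a minimal sketch of transcribing audio on your own machine with the openai-whisper Python package. The model size and file name are just placeholders.)

```python
# pip install openai-whisper
import whisper

# Downloads the weights once; after that everything runs locally.
# "small" is a placeholder; larger checkpoints trade speed for accuracy.
model = whisper.load_model("small")

# The language is auto-detected, so the same call works multilingually.
result = model.transcribe("recording.mp3")
print(result["language"], result["text"])
```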
Plus we’ve seen that features can just vanish. We’ve seen pornography being removed from Stable Diffusion. It took the community quite some time and effort to work around that and restore the functionality to some degree. But it shows how much depends on minor details and policies, and how arbitrarily capabilities of “our” AI models can be removed at any time, if someone decides to. (For whatever reason.)
And they haven’t told us much about the datasets for some time now. And it’s not like we could recreate those. Reddit sells its data for $60 million. Microsoft has GitHub available to train on… Unless you’re a big tech company, you have none of that.
I mean, we don’t know what’s going to happen. AI history is still being written. I personally don’t think the current situation compares to open source. AI models are limited in how we can study them. Most licenses restrict how they can be used. We can still modify them, but that may very well change at any point. And the overall situation is likely to change in some way as lawmakers try to come up with regulation.