- cross-posted to:
- technology
European Union lawmakers are set to give final approval to the 27-nation bloc’s artificial intelligence law Wednesday, putting the world-leading rules on track to take effect later this year.
Lawmakers in the European Parliament are poised to vote in favor of the Artificial Intelligence Act five years after it was first proposed. The AI Act is expected to act as a global signpost for other governments grappling with how to regulate the fast-developing technology.
“The AI Act has nudged the future of AI in a human-centric direction, in a direction where humans are in control of the technology and where it — the technology — helps us leverage new discoveries, economic growth, societal progress and unlock human potential,” said Dragos Tudorache, a Romanian lawmaker who was a co-leader of the Parliament negotiations on the draft law.
Big tech companies generally have supported the need to regulate AI while lobbying to ensure any rules work in their favor. OpenAI CEO Sam Altman caused a minor stir last year when he suggested the ChatGPT maker could pull out of Europe if it can’t comply with the AI Act — before backtracking to say there were no plans to leave.
Doesn’t seem like it outside this:
Which makes me think that it’ll be used to require models to truly open their “source”
The FOSS community really needs to come up with a better definition and licensing model for LLMs and other neural networks, though. I’ve seen multiple times where people refer to freely provided pre-trained models as “open source”
AIs aren’t truly open source unless their training code and the training data is fully provided. Anything else is at most semi-obfuscated and definitely not “open”
I forgot to mention: That’s unlikely. It only requires a “summary”, which will be of limited use for reverse engineering the big models. It does, however, provide a club with which to beat small developers.
I don’t think many people who publish finetunes on huggingface (think github for AI models) will bother with this. I’m not sure what that would mean for the legality of HF on the whole.
HF already has mechanisms for sharing datasets through the hub so I don’t think this would be a big lift for them legally
Yes, and some of those datasets might be illegal in some EU countries, but that’s not the point. You need to have the copyright summary so that the model is compliant with EU regulations. Just hosting them for free download is probably fine, if I understand correctly.
Why do you need the training data? To me, if you can use it and modify it as you wish then it’s open source. If you need a copy of the training data then that’s a problem, even outside the EU.
Many (all?) of the so-called open source models have “ethical” restrictions on use, so technically not open. It’s close enough to me, for now. In the future, such clauses will become an issue. Imagine if printing presses came with restrictions on what you can and can’t print.
all models carry bias (see recent gemini headlines for an extreme example), and what exactly those are can range from important to extremely important, depending on the use case!
it’s also important if you want to iterate on a model: if you use the same data set and train the model slightly differently, you could end up with entirely different models!
these are just 2 examples, there’s many more.
also, you are thinking of LLMs, which is just one kind of model. this legislation applies to all AI models, not just LLMs!
(and your definition of open source is…unique.)
Meaning what?
I omitted requirements on freely sharing it as implied, but otherwise?
meaning the model’s training data is what lets you work around or improve on that bias. without the training data, that’s (borderline) impossible. so in order to tweak models and further development, you need to know what exactly went into the model, or you’ll waste a lot of time guessing around.
you disregarded half of what makes an AI model. the half that actually results in a working model. without the training data, you’d only have some code that does…something.
and that something is entirely dependent on the training data!
so it’s essential, not optional, for any kind of “open source” AI, because without it you’re working with a black box. which is by definition NOT open source.
Asking for the training data is more like asking for detailed design documentation in addition to source code, so that others can rewrite the code from scratch.
Neural networks are inherently black boxes. Knowing the training data does little to change that. Given the sheer volume of data used in training the interesting models, more than very high level knowledge is not possible in any case.
There are open datasets, as well as open models. If open source models are only those trained on open datasets, then we need a new word for the status of most models. As it is, open source model and open source dataset is pretty clear. There’s no need to make it complicated.
If it is also a requirement that the data itself should be downloadable, then open source AI would be illegal in many countries. Much of the data will be under copyright, meaning that it can’t be shared in many countries. E.g., the original Stable Diffusion was trained on an open dataset. The dataset only contained links to images, since sharing the actual images would have been illegal in their jurisdiction. Link rot being what it is, the original data became unavailable pretty quickly. It has been alleged that some of the links pointed to CSAM, so now even the links are a hot potato.
Do you have any source that explains how this would work?
Open sourcing the training method without open sourcing the training data is essentially like making only part of your full source open to the public.
Even going as far as making your training method source available, and a pre-trained kernel available (like what Mistral does) is essentially the same as what a lot of open source-adjacent companies provide.
A pre-trained neural kernel isn’t any different effectively than a pre-compiled binary library (like a dll). So what these companies are providing is closed-source binaries alongside the compilation instructions for them. But without the data that trained the kernel it can hardly be called “open source” as the actual “source” of the logic behind the kernel (the training data) is still closed to the public.
You can fine-tune and re-train and re-quantize the models all you want but you’re not really manipulating the “source” if all you have is the gptq or safetensors or some other pre-trained set of weights.
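To make the "weights only" point concrete, here is a rough sketch of what a safetensors file actually holds. As far as I know the layout is an 8-byte little-endian header length, a JSON header, then the raw tensor bytes. The toy tensor names and the helper functions below are made up for illustration and are not part of any library; real files may also carry an optional `__metadata__` entry in the header.

```python
import json
import struct


def build_safetensors_bytes(tensors):
    """Serialize {name: (dtype, shape, raw_bytes)} into the safetensors layout.

    Hypothetical helper for this example only, not the official library API.
    """
    header = {}
    offset = 0
    blobs = []
    for name, (dtype, shape, raw) in tensors.items():
        # Each header entry records dtype, shape and the byte range of the tensor.
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [offset, offset + len(raw)],
        }
        offset += len(raw)
        blobs.append(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    # 8-byte little-endian length prefix, then JSON header, then tensor data.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + b"".join(blobs)


def read_safetensors_header(blob):
    """Return the parsed JSON header without touching the tensor bytes."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8:8 + header_len].decode("utf-8"))


# Toy "model": two tiny float32 tensors with invented names.
toy = {
    "layer0.weight": ("F32", [2, 2], struct.pack("<4f", 0.1, -0.2, 0.3, 0.4)),
    "layer0.bias": ("F32", [2], struct.pack("<2f", 0.0, 1.0)),
}

blob = build_safetensors_bytes(toy)
header = read_safetensors_header(blob)
for name, meta in header.items():
    print(name, meta["dtype"], meta["shape"], meta["data_offsets"])
```

Everything in there is names, dtypes, shapes and numbers. Nothing in the file points back at the data the numbers were learned from, which is the sense in which it behaves like a compiled binary.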
Damn, it’s actually helping FOSS. Another good one by the EU. Yeah, people calling the Llama models FOSS is just plain wrong and gives the Zucc more credit than he deserves.