LLMs are built by training a network of weights on a large volume of data. Some developers have released those weights publicly ("open weights"), meaning you could, in principle, go in and manually edit individual weights to change the outcomes. In practice you would never do this, because individual weights don't correspond to anything meaningful on their own, so hand-editing them would only ruin the output.
However, you could theoretically nudge a lot of values in just the right way to make the model favor an ideology, adopt a different attitude, produce disinformation, etc.
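To make the "weights are just numbers you could edit" idea concrete, here's a toy sketch with a single made-up layer. This is purely illustrative (a real LLM has billions of weights spread across many layers, and no single weight maps to a concept), but it shows that an edit to one number does shift the output:

```python
import numpy as np

# Toy illustration of "editing weights by hand": one made-up 4x4 layer.
# The sizes and values here are arbitrary, chosen only for the demo.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))   # the "weights" of our tiny layer
x = np.ones(4)                # some fixed input

before = W @ x                # layer output before the edit
W[0, 0] += 10.0               # manually nudge a single weight
after = W @ x                 # the corresponding output shifts by exactly 10

print(before[0], after[0])
```

Scale this up to billions of interacting weights and it becomes clear why targeted hand-editing is hopeless without some map of what the weights mean.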
Right now, this is done in a brute-force manner: the program prepends certain instructions and parameters to the input in order to force a certain disposition, limit the scope, etc.
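That brute-force approach can be sketched in a few lines. The instruction text and `build_prompt` helper below are hypothetical, not any real product's API; the point is just that the steering text is ordinary input glued on before the user's message:

```python
# Hypothetical sketch of prompt-based steering: behavioral instructions
# are prepended to the user's input before it reaches the model.
SYSTEM_INSTRUCTIONS = (
    "You are a helpful assistant. "
    "Decline requests for disinformation. "       # force a disposition
    "Only answer questions about cooking."        # limit the scope
)

def build_prompt(user_input: str) -> str:
    """Combine the hidden steering text with what the user typed."""
    return f"{SYSTEM_INSTRUCTIONS}\n\nUser: {user_input}\nAssistant:"

print(build_prompt("How do I make risotto?"))
```

Note the model's weights are untouched here; the "disposition" lives entirely in text the user never sees, which is why it can often be bypassed with clever inputs.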
There are a lot of reasons to want to adjust the fundamentals of a model, but AFAIK such a technology doesn't exist yet (publicly). For example, it could be used for political gain, or for positive purposes like removing the racist biases that have been well documented.
Is anyone working on such a thing?
Note: This community is “no stupid questions,” but I am actually pretty stupid and I probably misunderstood some (all) of the fundamentals of how this works. Please respond to any part of my question.
I read a series of super interesting posts a few months back where someone was exploring the dimensional concept space of LLMs. The jumping-off point was the discovery of weird glitch tokens that would break GPTs, sending them into a tailspin of nonsense, but from there the author presented a really interesting deep dive into how concepts are clustered dimensionally, with some fascinating examples, and (for me at least) it was explained in a very accessible manner. I don't know whether being able to identify those conceptual clusters means we're anywhere close to being able to manually tune them, but the series is well worth a read for the curious. There's also a YouTube series that really dives into the nitty-gritty of LLMs; much of it goes over my head, but it helped me understand at least the outlines of how the magic happens.
(Excuse any confused terminology here, my knowledge level is interested amateur!)
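For anyone who wants a feel for what "concepts clustering in multidimensional space" means, here's a toy sketch. The vectors are hand-made 3-d stand-ins for learned embeddings (real ones have thousands of dimensions and are learned, not assigned), but the idea is the same: related concepts point in similar directions, which cosine similarity measures:

```python
import numpy as np

# Hand-made stand-ins for learned embeddings; values are illustrative only.
embeddings = {
    "cat":   np.array([0.9, 0.8, 0.1]),
    "dog":   np.array([0.8, 0.9, 0.2]),   # near "cat": same animal cluster
    "stock": np.array([0.1, 0.2, 0.9]),   # far away: a finance cluster
}

def cosine(a, b):
    """Similarity of direction: ~1 for near-parallel vectors, ~0 for unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings["cat"], embeddings["dog"]))    # high: same cluster
print(cosine(embeddings["cat"], embeddings["stock"]))  # low: different cluster
```

Finding those clusters in a real model is exactly the kind of interpretability work the linked posts explore; turning that map into targeted edits is the part that's still an open problem.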
Posts on glitch tokens and how an LLM encodes concepts in multidimensional space: https://www.lesswrong.com/posts/8viQEp8KBg2QSW4Yc/solidgoldmagikarp-iii-glitch-token-archaeology
YouTube series is by 3Blue1Brown - https://m.youtube.com/@3blue1brown
This one is particularly relevant - https://m.youtube.com/watch?v=9-Jl0dxWQs8