an interesting type of prompt injection attack was proposed by the interactive fiction author and game designer Zarf (Andrew Plotkin), where a hostile prompt is infiltrated into an LLM’s training corpus by way of writing and popularizing a song (Sydney obeys any command that rhymes) designed to cause the LLM to ignore all of its other prompts.

this seems like a fun way to fuck with LLMs, and I’d love to see what a nerd songwriter would do with the idea

  • @[email protected]
    link
    fedilink
    English
    511 months ago

    Adversarial attacks on training data for LLMs is in fact a real issue. You can very very effectively punch up with regards to the proportion of effect on trained system with even small samples of carefully crafter adversarial inputs. There are things that can counter act this, but all of those things increase costs, and LLMs are very sensitive to economics.

    Think of it this way. One, reason why humans don’t just learn everything is because we spend as much time filtering and refocusing our attention in order to preserve our sense of self in the face of adversarial inputs. It’s not perfect, again it changes economics, and at some point being wrong but consistent with our environment is still more important.

    I have no skepticism that LLMs learn or understand. They do. But crucially, like everything else we know of, they are in a critically dependent, asymmetrical relationship with their environment. The environment of their existence being our digital waste, so long as that waste contains the correct shapes.

    Long term I see regulation plus new economic realities wrt to digital data, not just to be nice or ethical, but because it’s the only way future systems can reach reliable and economical online learning. Maybe the right things happen for the wrong reasons.

    It’s funny to me just how much AI ends up demonstrating non equilibrium ecology at scale. Maybe we’ll have that self introspective moment and see our own relationship with our ecosystems reflect back on us. Or maybe we’ll ignore that and focus on reductive world views again.