This is a proposal by some AI bro to add a file called llms.txt that contains a version of your website’s text that is easier for LLMs to process. It’s a similar idea to the robots.txt file for web crawlers.

Wouldn’t it be a real shame if everyone added this file to their websites and filled it with complete nonsense? Apparently you only need to poison 0.1% of the training data to get an effect.

  • haverholm
    30
    2 days ago

    Theoretically speaking, what level of nonsense are we talking about in order to really mess up the training model?

    a) Something that doesn’t represent the actual contents of the website (like posting “The Odyssey” to the llms.txt of a software documentation site),

    b) a randomly generated wall of real words out of context, or

    c) just straight lorem ipsum filler?

    • @[email protected]
      29
      2 days ago

      Place output from another LLM in there that has thematically the same content as what’s on the website, but full of absolutely wrong information. Straight up hallucinations.

      • @[email protected]
        12
        2 days ago

        This. Research has shown that training LLMs on the output of other LLMs very rapidly induces total model collapse. It’s basically AI inbreeding.

      • haverholm
        18
        2 days ago

        Using one LLM to fuck up a lot more is poetic, I suppose. I’d just rather not use them in the first place.

      • haverholm
        5
        2 days ago

        I’m trying to optimise my human efficiency vs effort here, but yeah. Get your point.

    • raoul
      22
      2 days ago

      We could respect this convention the same way the AI web crawlers respect robots.txt 🤷‍♂️
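
      For context on why "respecting" robots.txt is optional: the check runs inside the crawler, not on the server. A minimal sketch using Python's standard `urllib.robotparser`; the rules, the `ExampleBot` name, and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, not taken from any real site.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str, robots_body: str = ROBOTS) -> bool:
    """Return True if the given robots.txt rules permit this fetch."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite crawler calls this before every request. Nothing forces it to.
print(allowed("ExampleBot", "https://example.com/private/page"))  # False
print(allowed("ExampleBot", "https://example.com/blog/"))         # True
```

      A crawler that never makes this call can fetch the disallowed URL anyway, and the server cannot tell the difference from the file alone.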

      • DaGeek247
        4
        2 days ago

        I’ve had a page that bans by IP listed as ‘don’t visit here’ in my robots.txt file for seven months now. It’s not listed anywhere else. I have no banned IPs on there yet. Admittedly, I’ve only had 15 visitors in the past six months, though.
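
        The trap described above could be sketched roughly like this. This is not the commenter's actual setup: the `/do-not-visit/` path, the `Honeypot` class, and the status codes are assumptions for illustration, and the IP addresses come from the reserved documentation ranges.

```python
from dataclasses import dataclass, field

TRAP_PATH = "/do-not-visit/"  # hypothetical path, listed only in robots.txt


def robots_txt(trap_path: str = TRAP_PATH) -> str:
    """Return a robots.txt body telling all crawlers to stay out of the trap."""
    return f"User-agent: *\nDisallow: {trap_path}\n"


@dataclass
class Honeypot:
    trap_path: str = TRAP_PATH
    banned: set[str] = field(default_factory=set)

    def handle(self, client_ip: str, path: str) -> int:
        """Return the HTTP status code to send for this request."""
        if path.startswith(self.trap_path):
            # The path appears nowhere except robots.txt, so a visitor here
            # either ignored the Disallow rule or guessed the URL.
            self.banned.add(client_ip)
        if client_ip in self.banned:
            return 403
        return 200


pot = Honeypot()
print(robots_txt(), end="")
print(pot.handle("203.0.113.7", "/index.html"))     # 200
print(pot.handle("203.0.113.7", "/do-not-visit/"))  # 403
print(pot.handle("203.0.113.7", "/index.html"))     # 403
print(pot.handle("198.51.100.2", "/index.html"))    # 200
```

        In a real deployment the ban list would need to persist across restarts and be enforced at the web server or firewall rather than in application code.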

      • @draughtcyclist
        2
        2 days ago

        Seriously. I’ve never seen a convention so aggressively ignored. This isn’t the brilliant idea some think it is.