It’s pretty easy to see the problem here: The Internet is brimming with misinformation, and most large language models are trained on a massive body of text obtained from the Internet.

Ideally, substantially higher volumes of accurate information would overwhelm the lies. But is that really the case? A new study by researchers at New York University examines how much medical misinformation can be included in a large language model (LLM) training set before it spits out inaccurate answers. While the study doesn’t identify a lower bound, it does show that by the time misinformation accounts for 0.001 percent of the training data, the resulting LLM is compromised.
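For a rough sense of scale (the corpus size below is an illustrative assumption, not a figure from the study), 0.001 percent is a tiny fraction but still a lot of text in absolute terms:

```python
# Back-of-envelope scale check: how many tokens is 0.001 percent of a large corpus?
# The corpus size is a hypothetical assumption for illustration, not from the NYU study.
corpus_tokens = 1_000_000_000_000     # assume a 1-trillion-token training corpus
poison_fraction = 0.001 / 100         # 0.001 percent expressed as a fraction (1e-5)
poison_tokens = int(corpus_tokens * poison_fraction)
print(f"{poison_tokens:,} tokens of misinformation")  # 10,000,000
```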

  • @ribhu · 14 hours ago

    How old is this study? The LLMs mentioned are Llama 2 and GPT-3.5, which in current terms are almost archaic.

    • @Zron · 14 hours ago

      Unfortunately, it’s a lot harder to rigorously test something than it is to shit a new product out into the wild with no regard for its impact.