When German journalist Martin Bernklau typed his name and location into Microsoft’s Copilot to see how his articles would be picked up by the chatbot, the answers horrified him. Copilot’s results asserted that Bernklau was an escapee from a psychiatric institution, a convicted child abuser, and a conman preying on widowers. For years, Bernklau had served as a courts reporter, and the AI chatbot had falsely blamed him for the crimes whose trials he had covered.

The accusations against Bernklau weren’t true, of course, and are examples of generative AI’s “hallucinations.” These are inaccurate or nonsensical responses to a prompt provided by the user, and they’re alarmingly common. Anyone attempting to use AI should always proceed with great caution, because information from such systems needs validation and verification by humans before it can be trusted.

But why did Copilot hallucinate these terrible and false accusations?

  • @[email protected]
    2 months ago

    More or less that. There’s a point along the path the input takes through the language model where the induced randomness may or may not significantly affect the output. If all the weights point to the same end node because the “confidence” is high, then the output will be the same no matter the random seed. When the seed greatly affects the final result, it’s because the weights don’t point with that confidence to a unique end node, so the small randomness introduced at the beginning (the seed, so to speak) greatly changes the result. This is where you are most likely to get a hallucination.
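
    A minimal sketch of that idea, assuming the model's last step is a softmax over a few candidate "end nodes" followed by seeded sampling. The logit values and the helper names (`softmax`, `sample_token`, `distinct_outputs`) are made up for illustration and don't come from any real model.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, seed):
    """Sample one candidate index; the seed is the only source of randomness."""
    rng = random.Random(seed)
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def distinct_outputs(logits, seeds):
    """Collect every candidate that gets picked across the given seeds."""
    return {sample_token(logits, s) for s in seeds}

# Hypothetical scores: one clear winner vs. four near-equal candidates.
confident_logits = [50.0, 0.0, 0.0, 0.0]
uncertain_logits = [1.0, 0.9, 1.1, 1.0]

seeds = range(200)
confident = distinct_outputs(confident_logits, seeds)
uncertain = distinct_outputs(uncertain_logits, seeds)

print("confident case picks:", confident)
print("uncertain case picks:", uncertain)
```

    With the peaked scores every seed lands on the same candidate, while with the flat scores the seed decides which candidate comes out. That second situation is the one I'm describing as hallucination-prone.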

    To put it again in terms of the much easier to picture early neural networks: when you didn’t train the model enough, Mario just made random movements without attempting to complete the level, because the weights of the neurons could not reliably take the input and transform it into a useful output. That is something that could be solved in smaller models. For larger models it gets incredibly complicated because of the massive amount of data, the complexity of the data, and the complexity of proper training. But it’s not something impossible or that can’t be gotten rid of. In the same way you can get Mario to finally complete all levels every time without issues, you can get a non-hallucinating chatbot; it just takes more technological improvements.

    I suppose it could be said that the nature of language is chaotic like weather and not deterministic like a Mario level, and thus it would actually be “impossible” to get reliable results at large scale, just as it’s impossible to get a precise weather forecast a month in advance. But I’m not sure there is enough evidence to support that, as hallucinations don’t happen across the board; they tend to happen on matters that had little training data. Matters with plenty of training data do not produce hallucinations even in today’s models.

    I searched for SLM online and found the small models you mentioned. I wasn’t referring to those. Those are just small large language models, IMO, if that makes any sense. A proper SLM should also have a small purpose; it cannot be a general chat. I mostly mean the current chatbots that point you to predefined answers, or summarizing ones. Nothing that could really elaborate a written answer word by word.

    Currently, to my knowledge, there isn’t any general language model that can just write up answers and is good enough not to hallucinate. But we are certainly getting closer each year.

    Edit: I’ve been looking for an example; here is one: https://www.tax.service.gov.uk/ask-hmrc/chat/self-assessment These kinds of chatbots know when their answer is not precise and default to a polite “ask again” answer instead of just telling you the first “hallucination” that comes to them. They are powered by similar AI technology, but they are not general-purpose and cannot write word by word. But they “know” whether the answer is precise or not.
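
    Roughly what I mean by defaulting to “ask again”, as a toy sketch. I don’t know how the HMRC bot is actually built, so everything here is an assumption: the canned answers, the word-overlap score standing in for a real model’s confidence, and the 0.5 threshold are all invented for illustration.

```python
import re

# Hypothetical canned answers, each paired with one example question.
CANNED = [
    ("when is the deadline to file my tax return",
     "See the filing deadlines page."),
    ("how do i register for self assessment",
     "See the registration guide."),
]
FALLBACK = "Sorry, I did not understand that. Could you ask again?"

def tokens(text):
    """Lowercased set of words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap(a, b):
    """Jaccard similarity, used here as a stand-in confidence score."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def reply(question, threshold=0.5):
    """Return the best canned answer, or the fallback if confidence is low."""
    q = tokens(question)
    score, answer = max(
        ((overlap(q, tokens(example)), answer) for example, answer in CANNED),
        key=lambda pair: pair[0],
    )
    return answer if score >= threshold else FALLBACK

print(reply("When is the deadline to file my tax return?"))
print(reply("Who won the match yesterday?"))
```

    The point is only that the bot picks among fixed answers and refuses when the best score is too low, so it never has to invent text.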

    • @[email protected]
      2 months ago

      The example you shared is not an LLM. It’s a classic chatbot with predefined answers. It basically maps keywords to KB articles. If no term is recognized, it will say “I don’t know”. It will also suggest an incorrect KB article if it picks up one keyword and ignores the rest of the context. It has no idea whether the answer is correct by any means. At best, somebody will periodically check a sample of the questions whose answers users didn’t consider correct to evaluate the pairings, but it’s not AI, or at least not a good one.
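
      To make the keyword-to-KB idea concrete, here is a minimal sketch. The keywords, article titles and the `answer` function are hypothetical; this is not the actual HMRC implementation, just the general pattern I’m describing.

```python
import re

# Hypothetical keyword -> KB article table, maintained by hand.
KEYWORD_TO_ARTICLE = {
    "refund": "KB-101: How to claim a refund",
    "penalty": "KB-202: Late payment penalties",
    "deadline": "KB-303: Filing deadlines",
}
FALLBACK = "I don't know"

def answer(question):
    """Return the article for the first known keyword found in the question."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    for keyword, article in KEYWORD_TO_ARTICLE.items():
        if keyword in words:
            return article  # the rest of the question is ignored
    return FALLBACK

print(answer("What is the weather today?"))
print(answer("Why did I get a penalty instead of a refund?"))
```

      The second question is about a penalty, but the bot latches onto “refund” because that keyword comes first in the table. Nothing in the lookup measures whether the returned article is actually right.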

      • @[email protected]
        2 months ago

        If you read my answers you’ll see that I said they are not LLMs. They are language models trained on smaller datasets and with smaller neural networks.

        I picked a tax agency in particular because I know firsthand that tax agencies (it would surprise me if the UK didn’t) do use language models with neural networks (notice that, again, I’m not saying generative LLMs) to parse the question and select a proper answer. Not the keyword method you think they use.

        I would have provided the firsthand example I know, but it is in Spanish and people may not be able to understand it well. But I do know that tax agencies usually use very similar tools from one country to another, so the UK probably uses one too. If you want to test the Spanish one, here it is, along with sources on what type of AI is used.

        https://sede.agenciatributaria.gob.es/Sede/ayuda/herramientas-asistencia-virtual.html

        https://es.newsroom.ibm.com/2018-02-28-La-Agencia-Tributaria-utiliza-IBM-Watson-para-ayudar-a-las-empresas-en-la-gestion-del-IVA

        Again, because it seems that I need to repeat this so people can properly train on the info I’m writing: not an LLM, not GPT, not a large general-purpose language model. As for models with that many parameters, cutting the answers that are not confident would probably cut most answers, at least with today’s state of technology. Things keep improving each year.

        Edit: found an English source on the matter: https://www.investinspain.org/en/news/2024/ibm

        The chatbot is still available only in Spanish and the co-official languages.

        • @[email protected]
          2 months ago

          That’s what you’re missing. Those are not language models, nor do they use neural networks. At best they use NLP classification. They do not generate text; they pick pre-constructed answers based on the inputs. Because of this, there’s no confidence beyond “what’s generally correct based on this keyword”.

          I’ve worked with IBM Watson. It existed and was used for basic bots a decade ago. You have to manually map the terms to outputs.

          And I’ve used the tax agency’s website to confirm what I’m saying.