I wanted to extract some crime statistics broken by the type of crime and different populations, all of course normalized by the population size. I got a nice set of tables summarizing the data for each year that I requested.

When I shared these summaries I was told this is entirely unreliable due to hallucinations. So my question to you is how common of a problem this is?

I compared results from Chat GPT-4, Copilot and Grok and the results are the same (Gemini says the data is unavailable, btw :)

So is are LLMs reliable for research like that?

  • @ralakus
    link
    4
    edit-2
    4 months ago

    If you’re using an LLM, you should limit the output via a grammar to something like json, jsonl, or csv so you can load it into scripts and validate that the generated data matches the source data. Though at that point you might as well just parse the raw data and do it yourself. If I were you, I’d honestly use something like pandas/polars or even excel to get it done reliably without people bashing you for using the forbidden technology even if you can 100% confirm that the data is real and not hallucinated.

    I also wouldn’t use any cloud LLM solution like OpenAI, Gemini, Grok, etc. Since those can change and are really hard to validate and give you little to no control of the model. I’d recommend using a local solution like running an open weight model like Mistral Nemo 2407 Instruct locally using llama.cpp or vLLM since the entire setup will not change unless you manually go in and change something. We use a custom finetuned version of Mixtral 8x7B Instruct at work in a research setting and it works very well for our purposes (translation and summarization) despite what critics think.

    Tl;dr Use pandas/polars if you want 100% reliable (Human error not accounted). LLMs require lots of work to get reliable output from

    Edit: There’s lots of misunderstanding for LLMs. You’re not supposed to use the bare LLM for any tasks except extremely basic ones that could be done by hand better. You need to build a system around them for your specific purpose. Using a raw LLM without a Retrieval Augmented Generation (RAG) system and complaining about hallucinations is like using the bare ass Linux kernel and complaining that you can’t use it as a desktop OS. Of course an LLM will hallucinate like crazy if you give it no data. If someone told you that you have to write a 3 page paper on the eating habits of 14th century monarchs in Europe and locked you in a room with absolutely nothing except 3 pages of paper and a pencil, you’d probably write something not completely accurate. However, if you got access to the internet and a few databases, you could write something really good and accurate. LLMs are exceptionally good at summarization and translation. You have to give them data to work with first.