Assessing the relevance of responses from systems such as web search engines, question answering tools, and knowledge bases has traditionally been done by having humans judge whether a response meets the user's information need. With recent advances in large language models (LLMs) such as ChatGPT, however, researchers have begun exploring whether LLMs themselves can make these relevance judgments automatically.
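
As a concrete illustration of what such automatic judging can look like, here is a minimal sketch using the OpenAI Python client. The prompt wording, the 0-3 grading scale, and the model name are illustrative assumptions, not the setup of any particular study:

```python
# Minimal sketch of LLM-based relevance judgment. The prompt, the 0-3
# grading scale, and the model choice are illustrative assumptions,
# not a prescription from the summarized paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """You are a relevance assessor. Given a query and a response,
output a single integer grade:
0 = irrelevant, 1 = marginally relevant, 2 = relevant, 3 = perfectly relevant.

Query: {query}
Response: {response}
Grade:"""

def judge_relevance(query: str, response: str) -> int:
    """Ask the LLM for a graded relevance judgment and parse the integer."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice of judging model
        temperature=0,        # reduce randomness in grading
        messages=[{"role": "user",
                   "content": PROMPT.format(query=query, response=response)}],
    )
    return int(completion.choices[0].message.content.strip()[0])

if __name__ == "__main__":
    print(judge_relevance("symptoms of vitamin D deficiency",
                          "Fatigue, bone pain, and muscle weakness are common signs."))
```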

While some studies have found that LLM judgments often agree with human judgments (see the agreement-measurement sketch after this list), there are several potential issues with fully automating relevance assessment:

  • Bias toward the particular LLM used for judging (e.g., favoring responses that resemble its own output)
  • Bias against underrepresented user groups
  • Inability to detect misinformation or factually incorrect responses
  • Potential for a cycle where LLM-generated web content is used to train new LLMs
  • Hallucination, where the judging LLM itself generates false information
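
Agreement between LLM and human judgments is commonly quantified with chance-corrected statistics such as Cohen's kappa. A minimal sketch of that measurement, assuming parallel lists of grades are already available (the data below is made up):

```python
# Sketch: measuring human-LLM agreement with Cohen's kappa.
# The judgment lists here are fabricated toy data.
from sklearn.metrics import cohen_kappa_score

human_grades = [3, 2, 0, 1, 2, 3, 0, 1]   # human relevance judgments
llm_grades   = [3, 2, 1, 1, 2, 3, 0, 0]   # LLM judgments for the same items

# Raw agreement overstates reliability because some agreement occurs by chance;
# kappa corrects for that.
raw_agreement = sum(h == l for h, l in zip(human_grades, llm_grades)) / len(human_grades)
kappa = cohen_kappa_score(human_grades, llm_grades)

print(f"raw agreement: {raw_agreement:.2f}, Cohen's kappa: {kappa:.2f}")
```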

Rather than trying to fully replace humans with LLMs, the authors suggest exploring ways for humans and LLMs to collaborate. They propose a spectrum from full human judgment, to humans assisted by LLM summaries, to humans verifying LLM judgments, to potential full automation if LLMs prove reliable enough.
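
To make the middle of this spectrum concrete, the sketch below shows one way a "humans verify LLM judgments" stage could be wired up: confident LLM judgments are accepted automatically, and the rest are queued for human review. The Judgment structure, the confidence scores, and the threshold are all hypothetical:

```python
# Sketch of the "humans verify LLM judgments" point on the spectrum:
# auto-accept confident LLM judgments, route the rest to human assessors.
# The data structure, confidence values, and threshold are all assumptions.
from dataclasses import dataclass

@dataclass
class Judgment:
    query: str
    doc_id: str
    grade: int         # LLM-assigned relevance grade
    confidence: float  # e.g., derived from token log-probabilities

def triage(judgments: list[Judgment], threshold: float = 0.9):
    """Split LLM judgments into auto-accepted and human-review sets."""
    accepted = [j for j in judgments if j.confidence >= threshold]
    for_review = [j for j in judgments if j.confidence < threshold]
    return accepted, for_review

judgments = [
    Judgment("flu symptoms", "d1", grade=3, confidence=0.97),
    Judgment("flu symptoms", "d2", grade=1, confidence=0.62),
]
accepted, for_review = triage(judgments)
print(f"{len(accepted)} auto-accepted, {len(for_review)} sent to human assessors")
```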

The key is to find the right division of labor, with humans doing what they do best and LLMs assisting where they excel. More research is needed on how to combine human and artificial intelligence for relevance assessment in a way that is cost-effective, fair, and high-quality.

Summarized by Claude 3 Sonnet