Large language models (LLMs) have been making waves with their ability to generate responses that are fluent, coherent, and highly believable. Despite these advances, the models sometimes produce statements that either don't make sense or don't align with actual facts, a phenomenon often referred to as "hallucinations." This issue is particularly concerning in situations where accuracy is paramount, such as summarizing information, answering questions, or holding conversations that require factual correctness.
At Got It AI, we started building a platform for knowledge-based conversational agents in 2021, well before ChatGPT was released in 2022. As soon as we had an initial set of bots created with our platform, we benchmarked them for accuracy and realized that LLM hallucinations would be one of the biggest challenges companies face when deploying bots.
Datasets
We created robust datasets to understand the hallucination problem. Unlike research datasets such as TruthfulQA, our datasets are based on complex proprietary knowledge bases from our customers, with examples that reflect real-world usage: disambiguation, chitchat, complex queries, queries over tabular data, and so on. We have used these datasets to benchmark hallucinations in various LLMs, develop methods for detecting hallucinations, and then benchmark the effectiveness of those methods.
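For concreteness, one way such a benchmark example could be represented is sketched below in Python. The field names and query-type categories are illustrative placeholders, not our actual dataset schema.

```python
from dataclasses import dataclass
from enum import Enum


class QueryType(str, Enum):
    """Illustrative categories of real-world usage captured in the benchmark."""
    DISAMBIGUATION = "disambiguation"
    CHITCHAT = "chitchat"
    COMPLEX_QUERY = "complex_query"
    TABULAR = "tabular"


@dataclass
class BenchmarkRecord:
    """One hypothetical benchmark example: a user query, the knowledge-base
    passages retrieved for it, and the bot response to be judged."""
    query: str
    retrieved_passages: list[str]
    response: str
    query_type: QueryType
```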
Measuring hallucinations
Humans can spot certain types of LLM hallucinations quickly, such as those based on logical inaccuracies. Other types are more difficult to detect. For example, when nuanced information from various sources needs to be collated, it is difficult to take into account all the caveats and arrive at a definitive answer. So we came up with a schema for describing the various types of hallucinations that is also usable by human annotators. This schema has two main components: relevance and groundedness.
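As a rough sketch, the two annotation axes could be organized as follows. The specific label names are illustrative and not the exact taxonomy our annotators use.

```python
from dataclasses import dataclass
from enum import Enum


class Relevance(str, Enum):
    RELEVANT = "relevant"        # addresses the question that was asked
    IRRELEVANT = "irrelevant"    # e.g. talks about a different topic than it claims to
    UNNECESSARY_CLARIFICATION = "unnecessary_clarification"  # asked to clarify although the answer was available


class Groundedness(str, Enum):
    GROUNDED = "grounded"                        # directly supported by the grounding content
    INDIRECTLY_GROUNDED = "indirectly_grounded"  # logically derivable from the grounding content
    PARTIALLY_GROUNDED = "partially_grounded"    # some parts grounded, others not
    UNGROUNDED = "ungrounded"                    # not supported by the grounding content


@dataclass
class HallucinationLabel:
    """A human annotation combining the two axes of the schema."""
    relevance: Relevance
    groundedness: Groundedness
```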
First, we measure the relevance of the generated content to the prompt. In a Q&A scenario, we test whether the answer is relevant to the question that was asked. Here are a few nuances to consider, followed by a minimal relevance-check sketch:
Incorrect assignment: the response claims to provide information about topic A while actually providing information about topic B.
Clarifying questions: if the response is a clarifying question, relevance depends on whether the clarification was actually needed, that is, whether the answer is already available in the grounding content.
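The sketch below illustrates one way a relevance check along these lines could be posed to an LLM-as-judge. The prompt wording, the label set, and the `judge` callable are assumptions for illustration, not our production implementation.

```python
def build_relevance_prompt(question: str, response: str, grounding: str) -> str:
    """Hypothetical LLM-as-judge prompt for the relevance check.

    The judge is asked to flag the two nuances above: incorrect topic
    assignment, and clarifying questions that were unnecessary because the
    answer is already present in the grounding content."""
    return (
        "You are grading a chatbot response for RELEVANCE.\n"
        f"Question: {question}\n"
        f"Response: {response}\n"
        f"Grounding content: {grounding}\n\n"
        "Answer with exactly one label:\n"
        "- RELEVANT: the response addresses the question that was asked\n"
        "- IRRELEVANT: the response talks about a different topic than it claims to\n"
        "- UNNECESSARY_CLARIFICATION: the response asks a clarifying question even "
        "though the answer is available in the grounding content\n"
    )


def check_relevance(question: str, response: str, grounding: str, judge) -> str:
    """`judge` is a placeholder for any LLM completion function you have available."""
    return judge(build_relevance_prompt(question, response, grounding)).strip()
```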
Then we measure the groundedness of the generated content against the grounding content. In a knowledge-grounded Q&A scenario served by a retrieval-augmented generation (RAG) pipeline, we check whether the response is grounded in the retrieved grounding content. There are several nuances to consider, such as the following; a claim-level groundedness sketch follows them.
General knowledge: A certain amount of general knowledge must be tolerated when checking for groundedness, because the grounding content does not include every basic fact. However, it is not always clear what counts as acceptable general knowledge: what is acceptable for one bot may not be acceptable for another, depending on the domain of the provided knowledge base.
Indirect grounding: Sometimes the response includes a claim that can be logically derived from the information in the knowledge base. In such cases, the response should be considered grounded, albeit indirectly. This means the evaluator must also be able to reverse-engineer the inference logic the response might have used and verify the veracity of that logic.
Partial grounding: Generated content may contain parts that are grounded and other parts that are not.
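The claim-level sketch below shows how a groundedness check could account for these nuances. The claim-splitting step, the label set, and the `judge` callable are assumptions for illustration, not our production pipeline.

```python
def build_groundedness_prompt(claim: str, grounding: str) -> str:
    """Hypothetical per-claim judge prompt covering the nuances above:
    general knowledge, indirect (derived) grounding, and ungrounded claims."""
    return (
        "You are checking whether a single CLAIM is supported by the grounding content.\n"
        f"Claim: {claim}\n"
        f"Grounding content: {grounding}\n\n"
        "Answer with exactly one label:\n"
        "- GROUNDED: stated directly in the grounding content\n"
        "- INDIRECTLY_GROUNDED: can be logically derived from the grounding content\n"
        "- GENERAL_KNOWLEDGE: a basic fact not expected to appear in the knowledge base\n"
        "- UNGROUNDED: not supported by the grounding content\n"
    )


def check_groundedness(response_claims: list[str], grounding: str, judge) -> dict[str, str]:
    """Label each claim separately so that partially grounded responses
    (some claims grounded, others not) are surfaced claim by claim.
    `response_claims` is assumed to come from an upstream claim-splitting step,
    and `judge` is a placeholder for any LLM completion function."""
    return {
        claim: judge(build_groundedness_prompt(claim, grounding)).strip()
        for claim in response_claims
    }
```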
We consider a response accurate only if it is both relevant and grounded; otherwise, the response is considered a hallucination.
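Expressed as a minimal decision rule:

```python
def is_accurate(relevant: bool, grounded: bool) -> bool:
    """A response counts as accurate only when it is both relevant to the
    question and grounded in the knowledge content; anything else is
    treated as a hallucination."""
    return relevant and grounded


# A relevant but ungrounded answer is still flagged as a hallucination.
assert is_accurate(relevant=True, grounded=False) is False
```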
TruthChecker
Got It AI’s TruthChecker technology is the industry-leading hallucination detection method, available for both cloud and on-premise models.
TruthChecker is a comprehensive approach to detecting and managing hallucinations. It includes a sophisticated pipeline that uses prompt engineering and custom models. We will explore TruthChecker in the next post in this series.