99.85% accuracy with GPT-4 Turbo and 99.3% accuracy with Llama3-70B
If you run an Enterprise IT helpdesk or customer support conversational chatbot serving 100K messages per month, even a low hallucination rate of 1% to 2% (most frontier LLMs today are in this range) still means 1,000-2,000 bot errors every month, with the corresponding escalations and potential brand and legal risk. What is needed is a conversational chatbot system that can detect hallucinations, including relevance and grounding issues, explain the decisions it makes for transparency, and correct those issues before sending a response to the user.
At Got It AI, that is how AutoRAG 2.0 is built. Our vision is to achieve 99.99% accuracy in conversational RAG applications, the level of SLA that allowed cloud computing to grow explosively. In this blog, we share how AutoRAG 2.0 achieves virtually zero hallucinations. The key highlights are:
Outperform GPT-4o with local models: We’ve demonstrated hallucination levels below GPT-4o's, a net rate of <1%, even when using local, open-source models for RAG response generation
Task-specific models trained on public-domain data: How we trained our models using an enterprise-friendly, public-domain dataset that Got It AI has curated. We did not fine-tune models on any particular enterprise dataset to achieve our results for that dataset, thus creating generic, task-specific models rather than enterprise-data-specific models
Does not violate terms of service: OpenAI models were not used to generate training data; instead, we demonstrate the use of the permissively licensed Llama 3 models for synthetic data generation
Domain-agnostic: We developed a general purpose hallucination detection model (TruthChecker), rigorously tested for multi-turn QA conversations on a variety of domains
Explainability/Transparency: TruthChecker generates a detailed reason for any detected hallucinations
Generative LLM agnostic: We have made TruthChecker capable of working with any cloud or open-source RAG response generation model, without requiring the LLM to be fine-tuned on the target enterprise-specific dataset
Response correction: TruthChecker can identify hallucinations and rewrite the response generated by any RAG application that uses any LLM, even correcting responses generated by GPT-4 Turbo, GPT-4o and Claude 3.5 Sonnet
Fine-grained hallucination schema: The definition of hallucination is often vague, and can lead to a mismatch in expectations. We developed a rich schema to identify several classes of hallucinations prevalent in RAG applications, and spent over a year with human annotators to develop strict definitions of hallucinations
Motivation
RAG applications built using frontier models have exploded in popularity because of rapid iteration cycles and the ability to interact with data in a conversational manner. The prototypes are quick to build and show promise. However, when enterprises consider deploying these applications to production, they run into show-stoppers such as sharing their data with third parties, high cost of LLM inference and the lack of control over their ML infrastructure.
There is a strong desire to replace frontier models with open-source local models due to these concerns. However, it has been difficult to match the performance of proprietary frontier LLMs when building RAG applications with local models. This is evident when we compare the relevance and faithfulness of a RAG pipeline using a proprietary model like GPT-4o against an open-source counterpart like Llama3-70B.
Open-source LLMs hallucinate more, producing output that is not faithful to the retrieved context. They also tend to give more irrelevant responses in the context of a conversation.
The Solution
Got It AI addresses the concerns of enterprises by building AutoRAG 2.0, a no-code platform for building Enterprise RAG applications. It comes pre-configured with TruthChecker to minimize hallucinations and maximize accuracy.
AutoRAG achieves close to zero hallucinations through several advanced RAG techniques and TruthChecker. First, we find that by making several iterations of prompt improvements, we can bridge the gap between GPT-4o and an open-source model like Llama3-70B as a response generator. To ensure a fair comparison across LLMs, AutoRAG uses the best prompt variation for each individual model rather than a single prompt across all models that may be biased towards one model. Secondly, AutoRAG’s Generative Metadata System (GEMS) automatically synthesizes graphs, tables, taxonomy, and other metadata from the Enterprise knowledge base. This metadata is used extensively throughout the AutoRAG inference engine to deliver highly accurate retrieval and response generation, cutting the LLM hallucination rate significantly, from 2-9% to 1-4% (we will blog about GEMS separately). Still, even a 2% hallucination rate could mean thousands of bad responses per month, which would lead to user confusion, brand risk and an unacceptable level of support escalations.
So, to further improve performance, AutoRAG passes each response to an inline TruthChecker model that can, during inference (a minimal sketch of this flow follows the list):
Identify if the response is relevant and faithful
Provide explanations for its label
Identify information sources that were used while generating the response
Rewrite the response to address the concerns it identified
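Below is a minimal sketch of that inline flow in Python. The function names (generate_response, truth_checker) and the result fields are illustrative assumptions for this sketch, not AutoRAG's actual API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TruthCheckResult:
    relevant: bool            # is the response on-topic for the user's last message?
    faithful: bool            # is every claim grounded in the retrieved context?
    explanation: str          # human-readable reason for the labels
    sources: List[str]        # context passages the response relied on
    corrected_response: str   # rewrite that removes the detected issues

def answer(user_message: str, history: List[str], context: List[str]) -> str:
    # generate_response and truth_checker are hypothetical stand-ins for the
    # RAG response generator (any cloud or local LLM) and the TruthChecker model.
    draft = generate_response(user_message, history, context)
    check: TruthCheckResult = truth_checker(user_message, history, context, draft)
    if check.relevant and check.faithful:
        return draft
    # Otherwise send the corrected rewrite instead of the flawed draft.
    return check.corrected_response
```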
This results in the elimination of virtually all hallucinations; AutoRAG can achieve a net hallucination rate as low as 0.015%.
Let’s discuss how the TruthChecker model is trained.
TruthChecker is trained on public data only
While developing fine-tuned models for enterprises, we heard one concern repeated several times: enterprises want to use LLMs that are trained only on public-domain data. To address this concern, we carefully identified public knowledge bases (such as government sites) that are widely accepted as safe data for training specialized models. TruthChecker is trained on a meticulously crafted dataset based on public-domain data across varied knowledge bases. We leverage Llama3-70B in a human-in-the-loop process to generate synthetic training data.
Does not violate terms of service
A popular approach to building fine-tuned models is to generate synthetic data using frontier models and train smaller open-source LLMs on this data. This approach often yields powerful models that approach GPT-4-level performance. However, using data from certain frontier models, such as those from OpenAI, to fine-tune smaller open-source models violates their terms of service and opens the door to legal risk.
Recent open-source models like Llama3-70B have bridged the gap to the frontier models, making it possible to leverage them to produce high-quality synthetic data. We went through several iterations of prompt engineering and quality control to craft detailed prompts that yield the high-quality synthetic data used for model fine-tuning; a sketch of one such annotation prompt is shown below.
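For illustration, an annotation prompt of this kind might look like the following; the wording and the llama3_70b() helper are assumptions for this example, not our production prompts.

```python
# Illustrative only: the prompt wording and llama3_70b() helper are assumptions.
ANNOTATION_PROMPT = """You are reviewing one turn of a RAG chatbot conversation.

Knowledge base passages:
{context}

Conversation so far:
{conversation}

Bot response:
{response}

1. Label the response RELEVANT or IRRELEVANT to the user's last message.
2. Label the response FAITHFUL or UNFAITHFUL to the passages above.
3. Explain your labels in one or two sentences.
4. If the response is irrelevant or unfaithful, rewrite it so that it only
   makes claims supported by the passages."""

def synthesize_training_example(context: str, conversation: str, response: str) -> str:
    prompt = ANNOTATION_PROMPT.format(
        context=context, conversation=conversation, response=response
    )
    # llama3_70b() is a hypothetical wrapper around a Llama3-70b inference
    # endpoint; its outputs are then reviewed by humans (human-in-the-loop).
    return llama3_70b(prompt)
```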
Domain Agnostic
The TruthChecker model is trained on data from multiple industries, allowing it to generalize quite well to unseen data. To validate this, we built a test set consisting solely of data from knowledge bases that it did not see during training.
Method
Creating a dataset using human annotations and LLM based automated annotations
Set up 6 different baseline RAG chatbots using various public-domain datasets
Human annotators had a few conversations with each chatbot (approx 20 conversations per chatbot)
Created advanced prompts on Llama3-70B to mimic conversations along various axes, extending the set to hundreds of conversations per chatbot
Created AutoEval prompts on Llama3-70B to evaluate the quality, relevance and groundedness of chatbot responses, along with explanations and corrections where appropriate.
The dataset has 3,000 rows, each representing a user message, the bot response, and annotations (an illustrative row is sketched after this list). It was split into a 2,350-row training set and a 650-row test set.
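For illustration, a single annotated row might look like the following; the field names and the shuffled split are assumptions for this sketch (the split sizes are the only details given above).

```python
import random

# Hypothetical field names; the exact dataset schema is not public.
example_row = {
    "user_message": "How do I renew my passport by mail?",
    "conversation_history": ["..."],
    "retrieved_context": ["..."],
    "bot_response": "...",
    "labels": {"relevant": True, "faithful": False},
    "explanation": "The response cites a processing time not present in the context.",
    "corrected_response": "...",
}

def split_dataset(rows, train_size=2350, seed=0):
    """Split the 3,000 annotated rows into a 2,350-row train and a 650-row test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    return rows[:train_size], rows[train_size:]
```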
Training a TruthChecker model
We tested several open-source models as base models for fine-tuning: Mistral-7B, Mixtral-8x7B, and Llama3-8B
Each selected base model was fine-tuned with LoRA or QLoRA using a hyper-parameter grid search, with each resulting model evaluated on the test set. We also experimented with various prompt variations, training on subsets of annotations, with and without explanations, and with and without response correction (a minimal LoRA configuration sketch follows this list)
After extensive experimentation, we selected a fine-tuned Mixtral-8x7B model as the TruthChecker model to evaluate further
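As referenced above, here is a minimal LoRA fine-tuning sketch using the Hugging Face transformers and peft libraries; the hyper-parameters and target modules shown are illustrative defaults, not the values selected by our grid search.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

# Illustrative LoRA settings; in practice these were chosen via grid search.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Training (e.g. supervised fine-tuning on the 2,350-row training set) and
# evaluation on the 650-row test set are omitted from this sketch.
```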
Setting up evaluation RAG applications
We set up multiple instances of RAG applications using our AutoRAG platform
The variations include: using Llama3-70B or GPT-4o as the response generation model, and running with or without TruthChecker (the resulting grid is sketched below)
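The resulting evaluation grid is a simple cross product; the structure below is just an illustration.

```python
# 2 response generators x (with / without TruthChecker) = 4 RAG pipelines.
generators = ["Llama3-70B", "GPT-4o"]
pipelines = [
    {"generator": g, "truthchecker": use_tc}
    for g in generators
    for use_tc in (False, True)
]
```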
Performance evaluation
We evaluated the performance of the RAG applications, comparing hallucination levels and accuracy
Results
Zero hallucinations? Yes, with TruthChecker, AutoRAG can use GPT-4 Turbo to deliver a hallucination rate below 0.015%. This is significantly better than human accuracy on this dataset.
Do we have a local model pipeline that performs better than GPT-4o? Yes. An AutoRAG application using GPT-4o for response generation had a 0.92% hallucination rate, while the RAG application using Llama3-70B and TruthChecker had a net hallucination rate of 0.79%, i.e. 14% fewer hallucinations than GPT-4o.
Can TruthChecker improve the performance of GPT-4o? Yes, with TruthChecker, GPT-4o's hallucination rate dropped to 0.77%, a 16% improvement over GPT-4o alone (a quick check of these relative figures follows).
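A quick check of the relative figures quoted above (hallucination rates in percent):

```python
gpt4o_alone = 0.92   # GPT-4o without TruthChecker
llama_tc    = 0.79   # Llama3-70B + TruthChecker
gpt4o_tc    = 0.77   # GPT-4o + TruthChecker

print(f"{(gpt4o_alone - llama_tc) / gpt4o_alone:.0%} fewer hallucinations")  # ~14%
print(f"{(gpt4o_alone - gpt4o_tc) / gpt4o_alone:.0%} fewer hallucinations")  # ~16%
```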
Conclusion
In this blog, we outlined our approach to creating a dataset based on public-domain information to train TruthChecker, and shared results showing that AutoRAG with TruthChecker can achieve virtually zero hallucinations when using frontier GPT-4 Turbo class models, and <1% hallucinations when using a local model. Given the recent release of the Llama-3.1 family of models, we expect that AutoRAG will soon enable virtually zero hallucinations in RAG applications even with local, open-source models.