LLM hallucinations and RAG
To address the issue of hallucinations and enhance the reliability of Large Language Models (LLMs), researchers have developed a technique called Retrieval-Augmented Generation (RAG), which allows the model to retrieve knowledge from external sources. RAG has proven to be an important component of many LLM systems as it leverages up-to-date, domain-specific information to help the model produce more accurate responses.
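At its core, a RAG pipeline retrieves the most relevant documents for a query and conditions the model's answer on them. The minimal sketch below illustrates that flow with a naive word-overlap retriever and a stubbed-out `generate()` call; it is illustrative only and does not reflect our production retrieval or generation stack.

```python
# Minimal RAG sketch (illustrative only): retrieve relevant articles by simple
# word overlap, then condition the answer on the retrieved context.
# Both the retriever scoring and the generate() stub are placeholders.

def retrieve(question: str, articles: list[str], top_k: int = 2) -> list[str]:
    """Rank knowledge base articles by naive word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        articles,
        key=lambda a: len(q_words & set(a.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def generate(prompt: str) -> str:
    """Stand-in for an LLM call; a real system would query the model here."""
    return f"<model response conditioned on a prompt of {len(prompt)} chars>"

articles = [
    "Refunds are processed within 5 business days of approval.",
    "Premium plans include priority support and a 99.9% uptime SLA.",
]
question = "How long do refunds take?"
context = "\n".join(retrieve(question, articles))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
print(generate(prompt))
```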
Despite remarkable advances in LLMs and assistance from RAG, hallucinations remain a challenge even for state-of-the-art closed models and are particularly prevalent in smaller language models.
At Got It AI, we are employing knowledge distillation techniques to fine-tune small task-specific language models, while still maintaining conversational capabilities and ensuring low hallucination rates.
ELMAR training process
ELMAR, which stands for Enterprise Language Model ARchitecture, is Got It AI's most advanced response generation model and plays a key role in our automated customer service chatbot. Given a user question and retrieved knowledge base articles, ELMAR produces a relevant answer grounded in that context.
ELMAR was trained using a combination of knowledge distillation and instruction fine-tuning. During the distillation process, we applied a weighted loss function to train the student model to imitate the teacher model's responses in terms of both phrasing and groundedness. In our setup, the base teacher model is Flan-UL2 (20 billion parameters) and the base student model is Flan-T5-XL (3 billion parameters).
The training process consists of three steps:
1. Fine-tune the Flan-UL2 teacher model using a large-scale synthetic dataset (7500 rows).
2. Distill knowledge from the teacher model into the smaller Flan-T5-XL student model.
3. Fine-tune the student model using data from our customer's articles (2400 rows) to obtain a specialized model for that particular customer.
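To make the distillation step (step 2) more concrete, here is a hedged sketch of a weighted distillation objective in PyTorch, mixing a hard-label cross-entropy term with a soft-label KL term against the teacher's distribution. The weight alpha, the temperature, and the exact formulation are illustrative assumptions, not our exact training loss.

```python
import torch
import torch.nn.functional as F

# Sketch of a weighted distillation loss. alpha and T are illustrative choices;
# the precise weighting used to train ELMAR is not described here.
def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    # Hard-label term: cross-entropy against the teacher-generated target tokens.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    # Soft-label term: KL divergence between temperature-scaled distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1 - alpha) * kl

# Example with random tensors shaped (batch, seq_len, vocab_size).
student = torch.randn(2, 8, 32128)
teacher = torch.randn(2, 8, 32128)
labels = torch.randint(0, 32128, (2, 8))
print(distillation_loss(student, teacher, labels))
```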
With this training pipeline, we observed an 85% reduction in model size while preserving question-answering capabilities. The resulting model has an acceptable base hallucination rate of 11.48% on the test set, as measured by AutoEval, our evaluation tool for response quality and factuality. The held-out test set contains 400 synthetic examples written by experts and 450 examples from live customer conversations.
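AutoEval's internals are beyond the scope of this post, but the bookkeeping behind a hallucination-rate metric is simple. The sketch below shows it with a placeholder `judge()` that merely checks string containment; this is not how AutoEval actually judges groundedness.

```python
# Illustrative bookkeeping for a hallucination rate over a held-out test set.
# judge() is a crude placeholder for AutoEval, which is not public.
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    context: str
    response: str

def judge(example: Example) -> bool:
    """Placeholder: flag the response as hallucinated if it is not grounded.

    A real judge would assess factuality against the retrieved context; this
    stub only checks literal containment for illustration.
    """
    return example.response.lower() not in example.context.lower()

def hallucination_rate(test_set: list[Example]) -> float:
    flagged = sum(judge(ex) for ex in test_set)
    return flagged / len(test_set)

test_set = [
    Example("How long do refunds take?",
            "Refunds are processed within 5 business days.",
            "Refunds are processed within 5 business days."),
    Example("Do you offer phone support?",
            "Support is available via email only.",
            "Yes, call us any time."),
]
print(f"hallucination rate: {hallucination_rate(test_set):.2%}")
```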
Active Learning loop
To improve ELMAR's response generation, we implemented an active learning loop that continually retrains the model on live customer interactions. More specifically, we performed two rounds of active learning using live conversations collected over a period of one month. The chatbot's response quality and accuracy are assessed using the AutoEval tool, which does not require any manual annotations.
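Schematically, each round of active learning looks like the sketch below, where `fine_tune()` and `evaluate()` are hypothetical stubs standing in for retraining and AutoEval; only the loop structure mirrors the process described here.

```python
import random

# Hedged sketch of the active-learning loop. fine_tune() and evaluate() are
# stubs; the real pipeline retrains ELMAR and scores it with AutoEval.

def fine_tune(model: dict, train_data: list[dict]) -> dict:
    """Stub: a real pipeline would fine-tune the student model on this data."""
    return {**model, "trained_on": len(train_data)}

def evaluate(model: dict, test_set: list[dict]) -> float:
    """Stub: a real pipeline would run AutoEval over the held-out test set."""
    return random.uniform(0.05, 0.12)  # placeholder hallucination rate

def active_learning_round(model, train_data, live_batch, test_set):
    augmented = train_data + live_batch     # fold live Q/A pairs into training data
    model = fine_tune(model, augmented)     # retrain the model
    return model, augmented, evaluate(model, test_set)

model, train_data = {"name": "ELMAR"}, [{"q": "...", "a": "..."}] * 100
test_set = [{"q": "...", "a": "..."}] * 10
# Round 1 adds ~500 live exchanges; round 2 adds ~650 more (see below).
for live_batch in ([{"q": "...", "a": "..."}] * 500, [{"q": "...", "a": "..."}] * 650):
    model, train_data, rate = active_learning_round(model, train_data, live_batch, test_set)
    print(f"hallucination rate after this round: {rate:.2%}")
```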
In the first round of active learning, we augmented ELMAR’s training dataset with the first half of live data, which contains about 500 user messages and bot responses. The resulting model is called ELMAR Active Learning 1 (ELMAR-AL1). When evaluated on the same test set as above, ELMAR-AL1 showed a hallucination rate of 9.62%.
We then performed a second round of active learning, in which the training data for ELMAR was augmented with the whole month of live customer conversations. Compared to the first round, the training set for the second round contains an additional 650 user questions and bot responses. This model, called ELMAR-AL2, showed a respectable base hallucination rate of 7.13% on the same test set, which is on par with GPT-3.5 and Mixtral-8x7B according to our benchmarking. Moreover, we observed that ELMAR's base hallucination rate decreases by roughly 2 percentage points with each round of active learning.
TruthChecker further reduces hallucinations
TruthChecker is our hallucination manager, whose role is to detect ungrounded responses from LLMs. The TruthChecker model can assess the factuality and relevance of LLM responses based on the given user question, conversation history, and retrieved knowledge base articles. TruthChecker is based on Flan-T5’s encoder-decoder architecture and was trained using a synthetic dataset of hallucinations, generated from a large collection of documents and articles.
Depending on the use case, we can fine-tune this generic TruthChecker model for a specific customer or a specific LLM. In this instance, we fine-tuned TruthChecker to detect hallucinations in ELMAR as well as ELMAR-AL1 and ELMAR-AL2. We evaluated TruthChecker's performance at the threshold at which it classifies 10% of responses as hallucinations. At this 10% rejection rate, TruthChecker correctly identifies 50% of ELMAR-AL2's hallucinations, meaning that about half of the model's hallucinations are caught and never shown to the end user.
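The thresholding step can be illustrated with synthetic scores: pick the cutoff that flags 10% of responses, then measure what fraction of true hallucinations lands above it. The score distributions below are made up for illustration and are not TruthChecker outputs.

```python
import numpy as np

# Illustrative thresholding: choose the score cutoff that rejects 10% of
# responses and compute the resulting detection (recall) of hallucinations.
rng = np.random.default_rng(0)
n = 1000
is_hallucination = rng.random(n) < 0.07  # ~7% base hallucination rate (synthetic)

# Assume hallucinated responses tend to receive higher scores (synthetic data).
scores = np.where(
    is_hallucination,
    rng.normal(0.7, 0.15, n),
    rng.normal(0.3, 0.15, n),
)

rejection_rate = 0.10
threshold = np.quantile(scores, 1 - rejection_rate)  # flag the top 10% of scores
flagged = scores >= threshold

detection_rate = (flagged & is_hallucination).sum() / max(is_hallucination.sum(), 1)
print(f"threshold={threshold:.2f}, detected {detection_rate:.0%} of hallucinations")
```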
With the combination of ELMAR-AL2 and its corresponding fine-tuned TruthChecker model, we achieved a 3.60% net hallucination rate, which is comparable to GPT-4 in accuracy.
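As a rough sanity check, this is consistent with the arithmetic: catching about half of a 7.13% base hallucination rate leaves roughly 7.13% × (1 − 0.50) ≈ 3.57% of responses as undetected hallucinations, in line with the 3.60% net rate we measured end to end.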
ELMAR-AL2 (3B parameters) is far smaller than GPT-4 (whose size is undisclosed but much larger) and Mixtral (8x7B), yet its hallucination rate is competitive when combined with TruthChecker. This significantly reduces deployment and maintenance costs while maintaining the accuracy of our conversational AI applications.
Future Work
In a recent blog post, we explored Mixtral and found it to be the most accurate open-source model we have evaluated in our RAG pipeline thus far. When TruthChecker was used in conjunction with Mixtral, we obtained a 3.40% residual hallucination rate, similar to GPT-4's performance.
Seeing that Mixtral noticeably outperforms Flan-UL2 and Flan-T5, we expect increased model robustness if Mixtral were used as the base model for ELMAR. We are currently running this experiment and expect that Mixtral, fine-tuned on the knowledge base, will significantly improve ELMAR's factual accuracy.