In this blog post, we share the results of our Mixtral-based RAG pipeline, which consists of fine-tuned Mixtral-8x7B models for response generation and for truth-checking. Base Mixtral has an 8% hallucination rate, which dropped to 4.5% after fine-tuning, making it our best response generation model to date. When used alongside a fine-tuned Mixtral TruthChecker model, we achieved a net 2.0% hallucination rate, better than OpenAI’s GPT-4 hallucination rate of 2.1% on the same dataset.
Mixtral-8x7B baseline performance
Mixtral-8x7B is one of the most powerful open-source large language models (LLMs) available at the moment. Its instruction-tuned variant, “Mixtral-Instruct”, has demonstrated impressive results on various benchmarks. When integrated into our RAG pipeline, Mixtral-Instruct achieved an 8% base hallucination rate without any fine-tuning, making it the most accurate open-source model we have evaluated so far and matching GPT-3.5-Turbo’s performance. As the logical next step, we wanted to fine-tune Mixtral on our customer’s dataset to see whether we could further improve the model’s performance in our automated chatbot system.
Fine-tuning Mixtral-8x7B for response generation
We started by fine-tuning Mixtral on the response generation task. For this, we used a high-quality dataset of 2400 questions and answers based on a given knowledge base. The examples in this dataset are either synthetically generated by experts or taken from live customer interactions with the chatbot, with customer permission. Out of these, 2100 examples were randomly sampled for training and the remaining 300 examples were held out for testing.
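As a rough illustration of the split (the file name and record layout below are hypothetical, since the dataset itself is not public), the 2100/300 partition is just a seeded random sample:

```python
import json
import random

# Hypothetical file and record layout: one {"question", "answer", "context"} object per line.
with open("qa_dataset.jsonl") as f:
    examples = [json.loads(line) for line in f]  # 2400 examples in total

random.seed(0)           # fixed seed so the split is reproducible
random.shuffle(examples)

train_set = examples[:2100]  # used for fine-tuning
test_set = examples[2100:]   # 300 held-out examples for evaluation
```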
During instruction fine-tuning, we used a simple prompt that provides the context along with the user message and asks the model to generate a suitable response. We fine-tuned both Mixtral and Mixtral-Instruct as base models, training each for 5 epochs.
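We are not publishing the exact prompt, but a minimal sketch of this kind of template (the wording and field names here are illustrative, not the prompt we actually trained with) looks roughly like this:

```python
RESPONSE_PROMPT = """You are a customer-support assistant. Answer the user's message
using only the information in the context below. If the context does not contain
the answer, say that you don't know.

Context:
{context}

User message:
{user_message}

Response:"""


def build_prompt(context: str, user_message: str) -> str:
    # Fill the template with the retrieved knowledge-base passages and the user's message.
    return RESPONSE_PROMPT.format(context=context, user_message=user_message)
```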
To assess the model’s performance, we employed AutoEval, our evaluation tool for response quality and factuality. The resulting base hallucination rate on the held-out test set is 4.67% for fine-tuned Mixtral and 4.33% for fine-tuned Mixtral-Instruct.
| Base model | Training epochs | AutoEval Relevance | AutoEval Groundedness | AutoEval Correctness | Hallucination rate |
| --- | --- | --- | --- | --- | --- |
| Mixtral-8x7B-v0.1 | 5 | 99.3% | 96.0% | 95.3% | 4.67% |
| Mixtral-8x7B-Instruct-v0.1 | 5 | 98.3% | 97.3% | 95.7% | 4.33% |
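As a quick sanity check on the scale of these numbers (assuming the rate is simply the fraction of flagged responses among the 300 held-out examples):

```python
test_size = 300

# Hallucination rate = hallucinated responses / test size (our assumed reading of the metric).
print(f"{14 / test_size:.2%}")  # 4.67% -> fine-tuned Mixtral
print(f"{13 / test_size:.2%}")  # 4.33% -> fine-tuned Mixtral-Instruct
```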
We later performed additional testing of the fine-tuned Mixtral-Instruct response generation model on 1800 recent live customer conversations. According to AutoEval, the model maintained a low base hallucination rate of 4.50% on this larger, more challenging test set. This is the best response generation model we have trained so far: it consistently produces responses that are helpful to the user and faithful to the given context, as confirmed by both AutoEval and human testers.
TruthChecker: Fine-tuning Mixtral-8x7B for hallucination detection
In our RAG pipeline, once a response is generated, it is checked for relevance and factuality using a TruthChecker model. Given a knowledge base, a user message, and a bot response, TruthChecker can assess the response’s faithfulness to the knowledge base and relevance to the user message. A response that is classified as inaccurate by TruthChecker will not be displayed to the end user.
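In code terms, the gate works roughly like the sketch below; the function names, `Verdict` type, and fallback message are placeholders rather than our actual implementation:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    is_accurate: bool   # faithful to the knowledge base and relevant to the user message
    reasoning: str = ""


FALLBACK = "I'm sorry, I can't answer that reliably right now."


def answer_user_message(
    user_message: str,
    knowledge_base: str,
    generate_response: Callable[[str, str], str],
    truth_checker: Callable[[str, str, str], Verdict],
) -> str:
    # 1. The fine-tuned Mixtral model drafts a candidate response from the retrieved context.
    candidate = generate_response(knowledge_base, user_message)

    # 2. TruthChecker judges the candidate against the knowledge base and the user message.
    verdict = truth_checker(knowledge_base, user_message, candidate)

    # 3. Responses classified as inaccurate are never shown to the end user.
    return candidate if verdict.is_accurate else FALLBACK
```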
To obtain a corresponding TruthChecker model for the fine-tuned Mixtral-Instruct response generation model, we applied two training approaches:
1. Fine-tune Flan-T5 via distillation step-by-step
2. Fine-tune Mixtral-Instruct with AutoEval’s outputs
The first approach with Flan-T5 is our standard recipe for training TruthChecker. As an initial fine-tuning step, Flan-T5 is trained on a large synthetic dataset of hallucinations generated from various documents and articles. This generic TruthChecker model is then further fine-tuned on Mixtral responses so that it specializes in detecting Mixtral’s hallucinations. At a 5% rejection rate, the Flan-T5 TruthChecker correctly identifies 51% of hallucination cases, resulting in a 2.2% residual hallucination rate.
The second approach, fine-tuning Mixtral-Instruct for hallucination detection, is a new TruthChecker training method for our team. We trained Mixtral-Instruct to analyze the relevance and faithfulness of the LLM response, identify any mistakes in the response, and provide reasoning for its outputs. This Mixtral-based TruthChecker model correctly filters out 56% of hallucinations at a 4.5% rejection rate, resulting in a net 2.0% residual hallucination rate for the RAG pipeline.
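These residual rates are consistent with simply scaling the 4.5% base hallucination rate by the fraction of hallucinations that slip past each TruthChecker (our reading of the numbers, not a published formula):

```python
base_rate = 0.045  # base hallucination rate of the fine-tuned Mixtral-Instruct generator

# Residual hallucination rate = base rate * (1 - fraction of hallucinations caught).
flan_t5_residual = base_rate * (1 - 0.51)   # ~2.2% with the Flan-T5 TruthChecker
mixtral_residual = base_rate * (1 - 0.56)   # ~2.0% with the Mixtral TruthChecker

print(f"{flan_t5_residual:.1%}, {mixtral_residual:.1%}")  # 2.2%, 2.0%
```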
Conclusion
Our fine-tuned Mixtral response generation model delivers highly accurate responses, with a low base hallucination rate of 4.5%. When combined with a corresponding TruthChecker model, our pipeline achieves better-than-GPT-4 performance on factuality and faithfulness. This is an encouraging result: it shows that open-source models, when fine-tuned for a specific task, can be competitive with some of the best closed models currently available.
In a future blog post, we would like to describe our experience fine-tuning Mixtral in more detail and share some of the insights we learned along the way.