Retrieval-Augmented Generation (RAG) systems combine pretrained knowledge of a Large Language Model (LLM) with additional business knowledge, such as enterprise documents and knowledge bases for various use cases (e.g. HR support, IT Service Management, customer support, business analytics, etc.). The knowledge is typically indexed into and retrieved from vector databases. This “in-context learning” in a RAG system allows LLMs to generate responses that contain proprietary information which was not present when the model was initially trained.
However, the complexity of constructing robust RAG systems poses substantial challenges for developers. Failure to address these challenges correctly can result in significant brand and/or financial consequences. Several open-source frameworks have been recently developed for constructing RAG systems, such as LangChain and LlamaIndex, along with components used to build RAG pipelines. Thousands of such components - data loaders, parsers, embedding models, closed/open source LLMs, vector databases, and so on - are now available. Rather than simplifying development of RAG systems, this wide selection has resulted in a paradox of choice, wherein finding the right set of components that work well together, and configuring them to achieve desired system capabilities has proven to be very difficult for software developers. “Naïve” RAG systems are prone to a litany of gotchas, listed below, which must be overcome before the system is ready for real-world usage.
In response to these challenges, over the last two years, Got It AI has built AutoRAG, an opinionated RAG platform specifically designed to streamline the development and deployment of highly capable and performant RAG systems in the cloud or VPCs or on-premise servers. By providing pre-configured best practices, fine-tuned models for key RAG associated tasks, and optimized settings tailored for enterprise use cases, AutoRAG simplifies the implementation process, reducing the need for extensive customization and tuning, while still allowing for customizations where necessary. This platform enables developers, even those with very limited AI expertise, to efficiently configure, deploy and manage powerful RAG systems. AutoRAG thus ensures faster deployment, enhanced reliability, and superior performance, thereby enabling rapid enterprise adoption of GenAI. It is available in two product configurations, each of which has been deployed in enterprise use cases.
Let’s dive deeper.
Why is it hard to build an Enterprise-class RAG system?
Although it is very easy to get a LangChain-based RAG demo chatbot “working”, the road from there to deploying a production system is mostly uncharted and full of roadblocks and gotchas.
Let’s see what a typical developer embarking on this journey would need to overcome (if creating a system without the benefits of Got It AI’s AutoRAG platform).
Selecting an orchestration framework
The core of a RAG system is a pipeline of components used to retrieve information, call LLMs, and so on. Such a pipeline is typically managed by an orchestration framework such as LangChain or LlamaIndex.
Each framework offers distinct features, so selecting one requires careful evaluation against the specific use case. Thousands of modules are available, including:
Loaders
Parsers
Extractors
Evaluators
Embedding models
Vector databases
Retrieval systems
Output parsers
Pre and post processors
Orchestrating these components into a cohesive and efficient system is a daunting task. The diversity in tools and models requires not just technical expertise but also a strategic approach to system integration to ensure compatibility and optimal performance.
Ingesting and processing enterprise data
During the process of ingesting enterprise documents and knowledge bases, a RAG developer needs to
choose what types of unstructured, structured and semi-structured data to support,
decide how to augment the data,
decide how to chunk the data,
select methods for data embedding and indexing into the vector database,
set up the retrieval system (for example, a weighted mix of semantic embedding similarity and keyword-based retrieval).
All these choices affect the efficiency and reliability of the end system.
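To make two of these choices concrete, here is a minimal sketch of overlapping chunking and of a weighted mix of semantic and keyword scores. It is an illustration under stated assumptions, not how AutoRAG or any particular framework does it: `toy_embed` is a placeholder (character trigram counts) standing in for a real embedding model, and the function names, the `alpha` weight, and the default sizes are all hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows (one simple chunking strategy)."""
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + max_words]) for i in range(0, last_start, step)]


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_embed(text: str) -> Counter:
    """Placeholder for a real embedding model: character-trigram counts."""
    s = " ".join(tokenize(text))
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def keyword_score(query: str, doc: str) -> float:
    """Fraction of query terms that appear verbatim in the document."""
    q = set(tokenize(query))
    return len(q & set(tokenize(doc))) / len(q) if q else 0.0


def hybrid_search(query: str, docs: list[str], alpha: float = 0.6, top_k: int = 2):
    """Rank docs by alpha * semantic score + (1 - alpha) * keyword score."""
    q_vec = toy_embed(query)
    scored = [
        (alpha * cosine(q_vec, toy_embed(d)) + (1 - alpha) * keyword_score(query, d), d)
        for d in docs
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


if __name__ == "__main__":
    docs = [
        "How to reset your router password",
        "Vacation policy for full-time employees",
        "Router firmware update steps",
    ]
    for score, doc in hybrid_search("reset router password", docs):
        print(f"{score:.2f}  {doc}")
```

Even in this toy form, the chunk size, the overlap, and the `alpha` weight each change which passages reach the LLM, which is why these decisions affect the reliability of the end system.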
Selecting a foundation model (LLM)
Various open and closed source LLMs are available. While accessing LLMs via third party APIs is convenient, that option may not be available due to security and privacy concerns, and the choice is also typically gated by the Cloud platform used by the business.
Open source LLMs offer less expensive and faster inference characteristics, but deploying such models requires significantly more infrastructure work.
The choice between GPUs, AI accelerators, and other specialized hardware affects both the performance and cost-efficiency of such models.
Model optimizations, such as kernel optimizations, quantization, and distillation are essential for reducing latency and improving throughput, yet they require a sophisticated understanding of machine learning techniques.
Inference at scale introduces additional challenges and requires an even deeper understanding of hardware characteristics (such as memory bandwidth), kernel optimizations, networking, load balancing, scheduling, batching, inference on edge devices, and so on.
Although many frameworks claim support for “Any LLM”, LLMs are not interchangeable. When initially selecting or later switching between LLMs, one must
tune prompts for that LLM,
potentially adjust model fine-tuning, distillation, and/or quantization scripts, and
adjust UI to mask any loss of features
As an example, for one of our customers, we needed to switch from a strong foundation LLM in our RAG pipeline to a smaller, more cost-efficient, but less capable language model. We had to rework the entire pipeline, which resulted in splitting 2 LLM calls into 8 separate calls (some run in parallel to minimize latency).
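The general pattern of splitting one large LLM call into several smaller ones, and running the independent ones concurrently, can be sketched as follows. This is not the customer pipeline described above: `call_llm` is a hypothetical stand-in for a real LLM client, and the sub-task names are invented for illustration.

```python
import asyncio


async def call_llm(task: str, prompt: str) -> str:
    """Hypothetical stand-in for a real LLM client call."""
    await asyncio.sleep(0.05)  # simulate network and inference latency
    return f"[{task}] {prompt}"


async def answer(query: str, context: str) -> dict:
    # Independent sub-tasks run concurrently, so their latencies overlap.
    intent, entities, summary = await asyncio.gather(
        call_llm("classify_intent", query),
        call_llm("extract_entities", query),
        call_llm("summarize_context", context),
    )
    # The final call depends on the results above, so it must run afterwards.
    draft = await call_llm("generate_answer", f"{intent} | {entities} | {summary}")
    return {"intent": intent, "entities": entities, "summary": summary, "answer": draft}


if __name__ == "__main__":
    result = asyncio.run(answer("How do I reset my router?", "Router manual excerpt"))
    print(result["answer"])
```

With this structure, the total latency is roughly that of the slowest parallel call plus the final call, rather than the sum of all four.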
Evaluating the Performance of RAG Systems
The techniques for evaluating RAG systems are still being developed. Academic datasets and benchmarks such as TruthfulQA, FaithDial, HaluEval, Ragas and FinanceBench are good starting points, but they do not capture the nuances of real-world RAG use cases, such as off-topic questions, product comparison questions that require tabular information lookup, and so on.
We have blogged in detail about this before and have outlined our approach for developing proprietary datasets for RAG evaluation. A typical developer building a RAG pipeline does not have access to these datasets, nor do they have a nuanced understanding of how RAG systems need to be evaluated, which often leads to unwanted surprises during production usage.
Gotchas!
After putting together a RAG pipeline, a developer faces several gotchas as they try to make the system robust enough to handle real-world usage.
Missing or ambiguous content: Incorrect or confusing content gets retrieved but LLM hallucinates or replies with “I don’t know”
Reading comprehension: Content with the correct answer is retrieved, but LLM may not be able to understand difficult, cross-referenced content or content represented in tables with merged cells
Incomplete context: The retrieved content snippets may not have full context, so LLM misrepresents the information
Risks with general knowledge: LLM may use outdated business-related information from its original training, or may be overly cautious and won’t use common sense knowledge
Query ambiguity: When user query is ambiguous, LLM may make assumptions and provide an answer instead of asking user to clarify
Complex queries: Certain user queries require query decomposition, multiple retrievals, partial evaluations, and agentic request handling
Too much relevant information: Facts needed to answer a question may be spread across too many documents, so LLM cannot have all the information it needs
Wrong response format: If JSON response is requested, LLM may produce invalid JSON or valid JSON but using a different schema. LLM may not follow instructions on how to represent tables, lists, images, references.
Not relevant: LLM may answer a different question, may provide unnecessary information
Not complete: LLM may omit important information that was retrieved
Not reliable: LLM may hallucinate and make up stuff, misrepresent retrieved information or use outdated information from its pretraining
Brand risk: LLM may not follow business-specific policies around brand, tone and personality
Recognizing the complexities of building a robust RAG pipeline, Got It AI decided to build an opinionated RAG platform that natively includes solutions for these issues, saving users months of experimentation and engineering.
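As an illustration of guarding against one of these gotchas, the wrong response format, the sketch below validates an LLM's JSON output before it is used. The schema (`answer` and `sources` fields) and the function name are assumptions made up for this example; a production system would typically retry or re-prompt when validation fails.

```python
import json
from typing import Optional

REQUIRED_FIELDS = {"answer": str, "sources": list}  # hypothetical schema


def parse_llm_json(raw: str) -> tuple[Optional[dict], Optional[str]]:
    """Return (data, None) when valid, or (None, reason) so the caller can retry."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "top-level value is not an object"
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            return None, f"missing field: {field}"
        if not isinstance(data[field], expected_type):
            return None, f"wrong type for field: {field}"
    return data, None


if __name__ == "__main__":
    print(parse_llm_json('{"answer": "Two years", "sources": ["warranty.pdf"]}'))
    print(parse_llm_json('{"reply": "Two years"}'))
```

Returning the failure reason, rather than raising, makes it easy to feed the reason back into a corrective prompt.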
What is an “Opinionated” framework?
An “opinionated” framework enforces conventions and best practices an expert would use and applies constraints to ensure reliable, optimal results. A non-opinionated framework leaves all decisions to the developer and does not include mechanisms to prevent unreliable or suboptimal results.
Let’s take an example from web development frameworks.
Node.js with Express.js is an example of an un-opinionated web development framework. Developers are given the freedom to mix and match from thousands of available components, which allows for high customization and flexibility, but it also demands more decision-making and can lead to greater complexity.
On the other hand, Ruby on Rails is a prime example of an opinionated web development framework. Rails enforces a set of conventions and best practices, providing a clear structure and predefined ways of accomplishing tasks. This "convention over configuration" approach simplifies the development process, reduces the amount of boilerplate code, and accelerates the time to market by minimizing the decisions developers need to make.
We saw a similar trend for RAG application development with the recent explosion of open source options leading to cognitive overload. So, we decided to build an opinionated RAG framework and a fully enterprise-ready platform around it that would simplify RAG application development.
Got It AI’s AutoRAG is an Industry-first Opinionated RAG Platform
The industry is in dire need of a streamlined, efficient, and effective approach to building RAG applications. Enter Got It AI's AutoRAG - the industry-first opinionated RAG platform designed specifically for enterprise use cases. When we say opinionated, we mean that the AutoRAG platform comes pre-configured with best practices and optimized settings for enterprise use cases. We address almost all the gotchas listed above through:
advanced content pre-processing
prompt engineering
specialized model fine-tuning
runtime heuristics
disambiguation
sentiment-aware behavior
hallucination detection and correction
Other RAG frameworks/platforms are not opinionated enough
Nvidia has developed a “NIM” architecture which includes some of the key endpoints and containers that could be used in a RAG pipeline. However, that does not make an opinionated RAG platform. The endpoints and containers are “shells” and the difficult work of choosing configurations, tuning strategies and data ingestion approaches must still be done by a sophisticated RAG developer.
An industry consortium has created a platform called OPEA, which offers a simple reference approach with some sample LangChain-based code. Again, this offering would require a very sophisticated developer to make opinionated choices to create a usable RAG platform.
Finally, some start-ups are offering various pieces of the RAG pipelines - vector databases, vector embedding APIs, fine-tuned models, specialized data ingestors etc. These are components and cannot on their own deliver a RAG platform - again a very sophisticated developer would have to bring these components together and make opinionated choices to create a usable RAG platform.
System Configuration Options
We have distilled down the many available choices around LLMs, RAG techniques, user query complexity, etc. into two system configurations. The goal of a pre-configured system is to minimize additional services/integration work that could add to the Total Cost of Ownership for an enterprise customer.
Configuration | HoneyBadger: nimble, high performance | Cheetah: powerful, versatile |
Summary | Faster, uses smaller models, cheaper to operate, accurate, basic use case scenarios | Utilizes larger models, very accurate, complex use case scenarios |
Use cases | Conversational Q&A over industry-standard knowledge bases. Simple factual answers, predefined process automation | Conversational Q&A with complex queries over more complex data sources and documents, including tabular data. Logical inferences, dynamic process automation |
Data sources, parsers, augmentation | 100s of connectors from LlamaHub and other open source libraries; specialized parsers for specific sources, PDFs, web pages; advanced data augmentation, taxonomy extraction, automated entity tagging, hierarchical indexes | Same as HoneyBadger |
RAG techniques | Configurable RAG pipeline, entity disambiguation, choice of vector DBs and indexing methods, soft and hard caching | Same as HoneyBadger, plus advanced prompts, multi-stage RAG pipelines, Recursive RAG, Agentic RAG (roadmap) |
Trust and Safety Guardrails | TruthChecker, PII filtering | Same as HoneyBadger, plus Truth Correction, Prompt Injection Prevention, Policy |
OSS LLMs | 0.7B to 30B (self-hosted or through APIs) | 8x7B Mixtral MoE, Llama3 70B or better (self-hosted or through APIs) |
Private LLMs | Less capable private API LLMs like GPT-3.5-Turbo | More capable private API LLMs like Claude Opus, GPT-4o |
Once you select one of these configurations, within minutes, a RAG application with sane defaults will be ready for you to evaluate. You can then continue tuning its configuration as desired.
Benefits of Got It AI’s AutoRAG Platform
Choice of LLM
AutoRAG can be configured to use a variety of LLMs, including Mixtral, Llama3, Flan T5, GPT-4, Claude, Gemini, and so on. You can also use Got It AI’s fine-tuned RAG response generation and TruthChecker models described below. We fine tune our models for specific tasks so that they do not need to be further fine-tuned for specific customers. To support Active Learning, which improves model accuracy post initial deployment, an enterprise may choose to do enterprise specific fine tuning of a model.
Pre-tuned for specific use cases
AutoRAG includes RAG pipelines tuned to handle various common use cases such as agent assist, customer support virtual agents, and copilots. We plan to add more use cases around content creation, summarization and others.
Hallucination detection and response correction
One of the key innovations we’ve made in AutoRAG is a proprietary recipe for a model-based TruthChecker. Running as a separate language model in the RAG pipeline, it independently evaluates each generated response for relevance and groundedness before the response is sent to the user.
We’ve blogged previously that we have been able to achieve better than GPT-4-Turbo accuracy using Mixtral MoE models as task specific LLMs for response generation and TruthChecking, as well as TruthCorrection which involves regenerating a grounded and relevant response instead of a factually incorrect and/or irrelevant response.
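The control flow of such a check can be sketched in a few lines. This is emphatically not the TruthChecker itself, whose recipe is proprietary: `toy_groundedness` is a crude word-overlap placeholder for a separate checker model, and the function names, threshold, and fallback message are assumptions for illustration only.

```python
import re
from typing import Callable


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_groundedness(response: str, context: str) -> float:
    """Placeholder checker: share of response words found in the context.
    A real system would call a separate checker model here."""
    words = tokens(response)
    return len(words & tokens(context)) / len(words) if words else 0.0


def checked_answer(
    query: str,
    context: str,
    generate: Callable[[str, str], str],
    regenerate: Callable[[str, str], str],
    threshold: float = 0.8,
) -> dict:
    """Generate, verify against the context, and correct before returning."""
    response = generate(query, context)
    if toy_groundedness(response, context) >= threshold:
        return {"response": response, "corrected": False}
    # The first response failed the check: try a corrected regeneration.
    corrected = regenerate(query, context)
    if toy_groundedness(corrected, context) >= threshold:
        return {"response": corrected, "corrected": True}
    # Neither attempt passed, so fall back to a safe refusal.
    return {"response": "I could not find a reliable answer.", "corrected": True}


if __name__ == "__main__":
    context = "The warranty period is two years from purchase."
    result = checked_answer(
        "How long is the warranty?",
        context,
        generate=lambda q, c: "The warranty lasts ten years and covers theft.",
        regenerate=lambda q, c: "The warranty period is two years.",
    )
    print(result)
```

The key design point is that the check runs independently of the generator, so a failed check never reaches the user unverified.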
A Customizable, Extensible and Enterprise-ready Product
While AutoRAG comes with sane defaults and pre-tuned configurations, it is designed to be fully customizable - pipeline modules have an opinionated set of configurations/decisions, and the developer may change them as needed.
After getting started with an AutoRAG project, developers can also extend it for specific needs, tweaking existing modules or adding new modules, which may even come from other developers or open source, to accommodate specific project requirements.
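The idea of opinionated defaults that remain overridable, with constraints that reject unsafe combinations, can be sketched as below. The field names and default values are hypothetical and are not AutoRAG's actual configuration schema.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PipelineConfig:
    """Illustrative defaults only; not a real AutoRAG configuration."""
    chunk_size: int = 512
    chunk_overlap: int = 64
    retriever: str = "hybrid"
    top_k: int = 5
    truth_check: bool = True

    def __post_init__(self) -> None:
        # Constraints guard against combinations that cannot work.
        if self.chunk_overlap >= self.chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")
        if self.top_k < 1:
            raise ValueError("top_k must be at least 1")


DEFAULTS = PipelineConfig()

if __name__ == "__main__":
    # Override one decision while keeping every other default.
    custom = replace(DEFAULTS, top_k=10)
    print(custom)
```

Because `replace` builds a new instance, the validation in `__post_init__` also runs on every override, so a customized configuration is held to the same constraints as the defaults.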
Automated evaluation and improvement
Got It AI’s platform comes with automated evaluation capabilities to understand the relevance, groundedness, trustworthiness, reliability, and scalability characteristics of a RAG solution. This automated evaluation capability is also a cornerstone of AutoRAG’s fully automated active learning system, which continuously tunes models and system configuration to achieve optimal performance. To make automated evaluation possible, we’ve had to define a clear set of SOPs for what counts as grounded/factual and what counts as relevant, and we’ve spent over one year working with human annotators to achieve very high agreement between the human annotators and our auto evaluation system.
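Agreement between human annotators and an automated evaluator is commonly quantified with a chance-corrected statistic such as Cohen's kappa. The text above does not say which measure was used, so the choice of kappa, along with the label names, is an assumption made for this sketch.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both raters labeled at random with their own label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    human = ["grounded", "grounded", "hallucinated", "grounded", "hallucinated", "grounded"]
    auto = ["grounded", "grounded", "hallucinated", "hallucinated", "hallucinated", "grounded"]
    print(f"kappa = {cohens_kappa(human, auto):.3f}")
```

Raw percentage agreement can look high simply because one label dominates; kappa discounts the agreement that would be expected by chance.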
Proof Points
AutoRAG has been deployed in enterprise cloud and VPC environments, on real-world enterprise use cases, passing stringent security, system architecture and compliance requirements. Enterprise IT teams with minimal GenAI skill sets have been able to apply AutoRAG to their enterprise use cases.
We have deployed the HoneyBadger system configuration for one enterprise customer (a consumer electronics firm) and the Cheetah configuration for another (a financial services firm). Through these deployments we have gained extensive experience in generalizing our configurations and in scalability and performance.
The deployed HoneyBadger RAG system works as a Customer Service Autopilot. It ingests data from an industry-standard knowledge base which is supported out of the box by AutoRAG. It uses a 3B parameter fine-tuned response generator model, and a 0.7B TruthChecker model. User queries are typical customer service questions.
The Cheetah RAG system in pilot deployment works as a Sales team Copilot. It ingests data from over 1000 complex documents, which include many types of pds documents with tabular data. User queries can be more complex and span multiple documents, helping a salesperson answer questions about, and compare across, dozens of internal products and their features/benefits.
Conclusion
Got It AI's AutoRAG is an opinionated RAG platform. By providing tuned pipelines, pre-configured system settings, a choice of LLMs, ease of use, customizability, and enterprise-class features, AutoRAG radically simplifies how powerful RAG applications can be built and deployed. AutoRAG comes packaged as a scalable, generalizable product in two system configurations, each of which has been deployed with a separate enterprise customer.