Enhancing LLMs using Retrieval-Augmented Generation (RAG) models

  1. LLMs today & their Many Challenges
  2. Emerging Approaches & Solutions
  3. Contextualization using Retrieval-Augmented Generation (RAG)
    • Optimizing the System
    • Frozen RAG
  4. Advanced Retrieval Methods
    • Sparse Retrieval
    • Dense Retrieval
    • FAISS by Facebook
    • ColBERT Model
    • SOTA (State-Of-The-Art) Methods
  5. Contextualizing the Retriever for the Generator
    • RePlug Model
    • RAG Model by Lewis et al. (2020)
  6. End-to-End Contextualization
    • REALM Model
    • Timing of Retrieval
    • Multimodal RAG

1. LLMs today & their Many Challenges

Output Accuracy

While language models are impressive, they still face significant challenges, especially in enterprise settings where high accuracy is essential. Here are some of the main issues and potential solutions for making language models more reliable:

  1. Hallucination: Language models sometimes generate incorrect information with high confidence, which can be misleading.
  2. Attribution: It’s often unclear why a model produces certain outputs, making it hard to trace the source of its “knowledge.”
  3. Staleness: Models go out of date quickly, especially if trained on static data. Regular updates are required to keep the information relevant.
  4. Revisions: For data privacy and regulatory compliance (e.g., GDPR), companies may need to delete or revise certain data, but this is currently difficult in models.
  5. Customization: Organizations want to tailor models to specific use cases or proprietary data, but integrating custom data effectively remains a challenge.

Next-Word Prediction

The foundation of many language models, including GPT, is next-word prediction. The model takes a sequence of words as input and predicts the following word, a simple concept that statistical language models have relied on for decades. This method, however, originally suffered from usability issues, as users had to input complex and often awkward prompts to guide the model’s behavior effectively. Here is the most critical usability challenge that early approaches faced:

  • Broken User Interface: Early language models required carefully crafted prompts, making it difficult for users to achieve desired outcomes. The interaction lacked naturalness, as users had to guess the exact phrasing the model would understand.

2. Emerging Approaches & Solutions

Solutions for Improved Interaction in Next-Word Prediction:

  1. Prompting: The model is trained to respond to specific prompts, allowing users to provide straightforward instructions rather than convoluted phrases.
  2. Instruction Tuning: By exposing the model to data that teaches it how to follow direct instructions, developers improved its ability to understand and act upon user prompts. This tuning allows the model to respond more naturally and accurately to varied user commands.
  3. Alignment with Human Preferences: Fine-tuning the model to align with user expectations further enhances the model’s responsiveness and reliability. This alignment makes the model’s output more consistent with what users expect or find helpful.

These advancements have significantly improved the user interface of language models, making them easier and more intuitive to use for both casual and professional applications.

Solutions for Improving Output Accuracy:

One emerging approach is to couple language models with external memory. This setup allows models to reference up-to-date, customizable information without relying solely on static, pretrained data. External memory integration is a promising direction to tackle many of these issues, bringing us closer to language models that are both accurate and adaptable.

3. Contextualization using Retrieval-Augmented Generation (RAG)

An effective way to enhance language models is to incorporate external memory through Retrieval-Augmented Generation (RAG). This method provides the model with additional context, allowing it to draw on relevant information stored outside of its parameters. Here’s how it works:

1. Input and Prompt: Like standard models, RAG uses an input sequence and a prompt to guide the generation of output.

2. Retrieval of External Context: Instead of relying solely on internal parameters, the model accesses an external context by retrieving relevant documents or data. This is achieved through a retriever component, which finds documents based on encoded queries. This retriever serves as the model’s “open book,” allowing it to look up information rather than memorizing everything.

3. Generator Output: Using both the input prompt and the retrieved context, the model generates an output that reflects both its internal knowledge and the externally provided information.

This setup resembles an open-book exam approach, where the model doesn’t need to store all information internally. Instead, it can access external sources, making it less dependent on parameterized memory (traditional memorization within neurons). This combination of parametric (internal memory) and non-parametric (external retrieval) elements makes RAG a powerful tool for tasks requiring specific or up-to-date knowledge.
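The retrieve-then-generate flow described above can be sketched end to end. Everything here is a toy stand-in, not a real RAG stack: the bag-of-words "encoder", the two-document corpus, and the prompt format are all invented for illustration.

```python
# Toy sketch of the retrieve-then-generate loop: encode the query,
# score every document against it, and build a context-augmented prompt.

def embed(text):
    # Toy "encoder": bag-of-words counts over a tiny fixed vocabulary.
    vocab = ["rag", "retrieval", "paris", "capital", "france"]
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

CORPUS = [
    "Paris is the capital of France",
    "RAG couples retrieval with generation",
]

def retrieve(query, k=1):
    # Score every document against the query and keep the top-k.
    q = embed(query)
    ranked = sorted(CORPUS, key=lambda d: dot(q, embed(d)), reverse=True)
    return ranked[:k]

def rag_answer(query):
    # The generator sees the prompt *and* the retrieved context.
    context = " ".join(retrieve(query))
    prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
    return prompt  # a real system would pass this prompt to an LLM

answer_prompt = rag_answer("What is the capital of France?")
```

A real system would swap the toy encoder for a pre-trained embedding model and send the assembled prompt to a generator; the shape of the loop stays the same.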

The key advantage of using an external retriever with a language model lies in its flexibility and customizability. Since the retriever component (or index) is separate from the model itself, it can be swapped in or out as needed, allowing users to adjust the source of information based on specific needs. This approach not only makes the model adaptable to different use cases but also allows regular updates to the index, ensuring that the information remains fresh and relevant.

Another benefit is grounding. By pulling in external information, the language model is less likely to “hallucinate” or generate unsupported claims, as it relies on verified sources. This grounding allows for attribution, meaning the model’s output can be traced back to a specific source, enhancing accountability and transparency. Grounding also opens up possibilities for multimodal integration, where text generation can be informed by images, audio, or other media. Overall, Retrieval-Augmented Generation provides a robust, reliable foundation for language models, addressing common issues like staleness, hallucination, and attribution.

Optimizing the System

Optimizing a system that includes components like a retriever and generator involves addressing a multitude of questions, each of which influences performance and efficiency. Key considerations include:

  • Data Structuring: Should documents be processed as full texts, paragraphs, sentences, or smaller chunks? How you structure data impacts retrieval relevance and speed.
  • Encoding Strategy: How should we encode both queries and documents to best capture the information needed for retrieval and response generation?
  • Retrieval Timing and Criteria: When is the optimal time to retrieve documents, and under what criteria? Do we retrieve every query or only when context shifts?
  • Prompting and Context Passing: How do we craft prompts to elicit the desired output, and how do we ensure that context is effectively integrated throughout the model’s workflow?
  • Processing and Verification: Once the generator provides an output, what steps are necessary for post-processing, and how do we verify that the output aligns with user needs?

Answering these questions requires a careful balance of efficiency, scalability, and accuracy, and different use cases may require different approaches to each component of the system. This ongoing experimentation and refinement help in creating a robust and optimized retrieval-augmented generation system.
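As one concrete illustration of the data-structuring question above, here is a minimal fixed-size chunker with overlap; the size and overlap values are arbitrary choices for the sketch, not recommendations.

```python
# A minimal fixed-size chunker with overlap: neighbouring chunks share
# `overlap` characters so text cut at a boundary still appears intact
# somewhere in the index.

def chunk(text, size=40, overlap=10):
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

text = "Retrieval-augmented generation splits long documents into chunks."
pieces = chunk(text, size=40, overlap=10)
```

Production systems often chunk on sentence or paragraph boundaries instead of raw character counts, but the overlap trick carries over unchanged.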

Frozen RAG

Frozen RAG, or Retrieval-Augmented Generation without any training, is a setup where no updates are made to the model parameters. It operates entirely in “test mode” with “in-context learning” as the basis of its performance. Here’s a simplified breakdown of how it works:

  1. Chunking and Embedding: Documents are split into fixed-sized chunks and embedded using a pre-trained encoder. These embeddings are stored in a vector database (VDB), ready for quick retrieval.
  2. Query Encoding and Retrieval: Queries are also embedded with a similar model, and the vector database performs a similarity search to find relevant document chunks.
  3. In-Context Learning: The retrieved information is passed as context to a frozen large language model (LLM) at inference time. The model, which has not been trained or fine-tuned on the documents, uses the provided context to generate relevant outputs.

Frozen RAG relies on the language model’s ability to perform “in-context learning” based on retrieved information. This setup can feel limited since there’s no adaptability, but it’s straightforward and eliminates the need for complex training or fine-tuning. The question that arises, however, is whether we can enhance this system with more sophisticated approaches to surpass the limitations of the frozen RAG setup.
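The three frozen-RAG steps can be sketched with a deterministic toy "encoder": character trigrams stand in for a pre-trained embedding model, and a plain Python list stands in for the vector database.

```python
# Toy frozen-RAG flow: embed chunks once into an in-memory "VDB",
# then answer queries by cosine similarity over the stored vectors.
import math

def embed(text):
    # Count character trigrams as a crude, deterministic embedding.
    t = text.lower()
    counts = {}
    for i in range(len(t) - 2):
        tri = t[i:i + 3]
        counts[tri] = counts.get(tri, 0) + 1
    return counts

def cosine(u, v):
    num = sum(w * v.get(t, 0) for t, w in u.items())
    den = math.sqrt(sum(w * w for w in u.values())) * \
          math.sqrt(sum(w * w for w in v.values()))
    return num / den if den else 0.0

# Step 1: chunk and embed the documents once; store them in the "VDB".
chunks = ["the capital of France is Paris", "BM25 is a sparse retriever"]
vdb = [(c, embed(c)) for c in chunks]

# Step 2: embed the query and run a similarity search over the VDB.
def search(query, k=1):
    q = embed(query)
    ranked = sorted(vdb, key=lambda cv: cosine(q, cv[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

# Step 3: the top hits would be passed as in-context evidence to a
# frozen LLM (not shown here).
hits = search("capital of France")
```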

4. Advanced Retrieval Methods

Sparse Retrieval

Sparse retrieval represents queries and documents as high-dimensional, mostly-zero vectors of term weights, with TF-IDF as the classic weighting scheme. BM25, an extension of TF-IDF, fine-tunes this approach by incorporating parameters (\(k_1\) and \(b\)) that adjust the impact of term frequency and document length, making it more effective in practice. Interestingly, “BM25” stands for “Best Match 25” because it was the 25th iteration of the weighting scheme that yielded the best performance.
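A compact sketch of the BM25 scoring function, assuming the standard Okapi formulation with commonly used defaults \(k_1 = 1.5\) and \(b = 0.75\); the three-document corpus is invented for illustration.

```python
# Okapi BM25: each query term contributes its IDF, scaled by a
# saturating function of its in-document frequency and the document's
# length relative to the corpus average.
import math

DOCS = [
    "the cat sat on the mat".split(),
    "dogs and cats living together".split(),
    "the quick brown fox".split(),
]

N = len(DOCS)
AVGDL = sum(len(d) for d in DOCS) / N

def idf(term):
    n_t = sum(term in d for d in DOCS)  # docs containing the term
    return math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)

def bm25(query, doc, k1=1.5, b=0.75):
    score = 0.0
    for term in query.split():
        f = doc.count(term)                          # term frequency
        norm = k1 * (1 - b + b * len(doc) / AVGDL)   # length penalty
        score += idf(term) * f * (k1 + 1) / (f + norm)
    return score

scores = [bm25("cat mat", d) for d in DOCS]
```

Note how a document with zero matching terms scores exactly zero: this term-overlap requirement is precisely the limitation that dense retrieval, discussed next, is designed to relax.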

These sparse retrieval methods were crucial in early neural open-domain question answering systems, such as DrQA. DrQA utilized BM25 to retrieve relevant documents from Wikipedia in response to user queries. The retrieved text was then passed to a neural network-based document reader, which extracted the answer from the retrieved documents, representing one of the earliest neural systems that combined retrieval and generation to tackle open-domain questions.

Dense Retrieval

Dense retrieval techniques, such as ORQA and Dense Passage Retriever (DPR), utilize dense embeddings rather than sparse term-based representations like BM25. This approach leverages semantic similarity by using vector representations, which are particularly effective at capturing synonyms and related terms. Unlike sparse retrieval, dense retrieval finds relevance based on meaning rather than exact term overlap.

Systems like ORQA (Lee et al., 2019) and DPR (Karpukhin et al., 2020) employ BERT embeddings to represent entire sentences or passages, enabling more accurate retrieval of semantically relevant documents. DPR, for instance, uses a supervised training approach to fine-tune the retriever on question-answer pairs, outperforming BM25 in top-k retrieval accuracy. In dense retrieval, the dot product is commonly used as the scoring function due to its computational efficiency, allowing fast matching of queries against large collections of passages.
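To see why the dot product scales so well, consider that scoring a query against every passage reduces to a single matrix-vector product. In this NumPy sketch, random vectors stand in for DPR-style BERT embeddings.

```python
# Scoring one query against 1000 passages in a single matmul.
import numpy as np

rng = np.random.default_rng(0)
passages = rng.normal(size=(1000, 768))             # passage embeddings
query = passages[42] + 0.01 * rng.normal(size=768)  # close to passage 42

scores = passages @ query          # one matrix-vector product scores all
top_k = np.argsort(-scores)[:5]    # indices of the 5 best passages
```

Because the whole corpus can be scored with one dense linear-algebra call, this operation maps cleanly onto BLAS routines and GPUs, which is exactly what libraries like FAISS exploit at much larger scale.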

FAISS by Facebook

FAISS (Facebook AI Similarity Search) is an advanced library developed by Facebook AI Research to handle billion-scale similarity searches efficiently, particularly when using GPUs. It finds semantically related items in large-scale datasets by comparing dense embeddings, typically via inner product or Euclidean distance (cosine similarity is obtained by normalizing the vectors first). The core of FAISS lies in its support for maximum inner product search (MIPS) and approximate nearest neighbor (ANN) search, making it highly effective for retrieving relevant information from massive collections.

FAISS is particularly beneficial for use in vector databases, which rely on dense vector representations to capture the meaning of data. Many of today’s popular vector databases are essentially re-implementations or adaptations of the principles developed in FAISS, optimized in various programming languages like Rust and Go. By leveraging these principles, FAISS allows for extremely fast similarity searches across vast datasets, significantly speeding up tasks that require high-performance retrieval, such as document search, recommendation systems, and large-scale machine learning applications.
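The cosine-via-inner-product reduction that inner-product indexes rely on can be shown in a few lines of NumPy, without any FAISS dependency; the random vectors here are purely illustrative.

```python
# On unit-normalized vectors, the inner product *is* cosine similarity,
# which is why a MIPS index can serve cosine search.
import numpy as np

rng = np.random.default_rng(1)
xb = rng.normal(size=(100, 16))    # database vectors
xq = rng.normal(size=(16,))        # query vector

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

ip = normalize(xb) @ normalize(xq)   # inner product on unit vectors

# Direct cosine similarity, for comparison.
cos = (xb @ xq) / (np.linalg.norm(xb, axis=1) * np.linalg.norm(xq))
```

In practice one normalizes vectors once at indexing time, after which every query is just an inner-product search.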

ColBERT Model

ColBERT (Khattab & Zaharia, 2020) introduces a “late interaction” approach for query-document matching, which goes beyond the single dot product used in representation-based similarity. Unlike simpler models that rely on one-to-one vector comparisons, ColBERT computes maximum similarity scores between individual words in the query and document embeddings. This approach retains the contextual richness of each word, allowing for more nuanced scoring.

ColBERT utilizes Siamese networks with dual BERT-based encoders (or other encoders) for the query and document. Instead of collapsing all information into single vectors and performing a simple dot product, ColBERT allows each word embedding in the query to interact with each word embedding in the document, which enhances interpretability and precision. The model was humorously named after “The Late Show with Stephen Colbert,” reflecting the “late interaction” technique it employs.

This model offers a balance between complexity and interpretability, making it highly effective for tasks requiring deeper semantic matching between queries and documents.
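ColBERT's "MaxSim" operator can be shown in miniature: every query-token embedding is compared against every document-token embedding, the best match per query token is kept, and the maxima are summed. The tiny hand-picked 2-D vectors below stand in for real per-token BERT outputs.

```python
# Late interaction: per-query-token maxima over document tokens, summed.
import numpy as np

def maxsim(Q, D):
    """Q: (query_tokens, dim), D: (doc_tokens, dim) token embeddings."""
    sim = Q @ D.T                        # all pairwise token similarities
    return float(sim.max(axis=1).sum())  # best doc token per query token

Q = np.array([[1.0, 0.0],
              [0.0, 1.0]])               # two query-token embeddings

D_close = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.5, 0.5]])         # document echoing both tokens

D_far = np.array([[0.5, 0.0],
                  [0.0, 0.25]])          # weaker matches only

score_close = maxsim(Q, D_close)   # 1.0 + 1.0 = 2.0
score_far = maxsim(Q, D_far)       # 0.5 + 0.25 = 0.75
```

Because each query token keeps its own best match, a document only needs to cover the query's tokens somewhere, not compress everything into one vector that matches on average.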

SOTA (State-Of-The-Art) Methods

SOTA methods in information retrieval are now leveraging hybrid models that combine both sparse and dense representations to improve search efficiency and relevance.

SPLADE (Formal et al., 2021): This method merges sparse and dense retrieval by using a query expansion technique, allowing sparse models to incorporate synonyms and context-based associations that dense models excel at. For example, if a query mentions “Indonesia,” SPLADE can suggest related terms like “rainforest” or “orangutans” for better alignment with the document.

DRAGON (Lin et al., 2023): DRAGON focuses on dense retrieval by using progressive data augmentation, continually training with increasingly challenging negative samples. This process sharpens the model’s representation of relevant and irrelevant information, making it a powerful choice for dense retrieval tasks.

Hybrid search strategies, which combine results from models like SPLADE (sparse-based) and DRAGON (dense-based), are becoming common. By blending sparse methods (e.g., BM25) with dense embeddings, hybrid approaches capture both semantic richness and computational efficiency, achieving superior relevance and performance in retrieval tasks.
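One common way to blend a sparse and a dense ranking (not specific to SPLADE or DRAGON) is reciprocal rank fusion, sketched here with invented document IDs; the constant 60 is the value commonly used in the literature.

```python
# Reciprocal rank fusion (RRF): each document earns 1/(k + rank) from
# every ranked list it appears in, and fused order follows total score.

def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["d1", "d2", "d3"]   # e.g. a BM25 ordering
dense = ["d2", "d4", "d1"]    # e.g. a dense-retriever ordering
fused = rrf([sparse, dense])
```

RRF needs only ranks, not raw scores, so it sidesteps the problem that BM25 scores and dot products live on incompatible scales.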

5. Contextualizing the Retriever for the Generator

In recent developments of Retrieval-Augmented Generation (RAG) models, researchers are focusing on how retrievers can be optimized for specific generative tasks, even in situations where the language model’s internal weights are inaccessible, such as when using API-based models like GPT-4.

RePlug Model

A prime example is the RePlug framework (Shi et al., 2023), which improves the retriever by minimizing the KL divergence between the retriever’s distribution over documents and the distribution implied by the generator’s likelihood of the correct output given each document. RePlug first computes retrieval likelihoods for the top documents, then scores each document by how much it lowers the language model’s perplexity on the target output. The framework is particularly powerful because it is model-agnostic: it can work with any generator, even a black-box API model, as long as likelihood (perplexity) scores are available. The adaptability of RePlug highlights the ongoing advancements in retrieval methods that are grounded in evidence, making RAG systems more accurate and reliable.

RAG Model by Lewis et al. (2020)

In 2020, Retrieval-Augmented Generation (RAG) introduced a new approach to retriever-generator architectures. Unlike previous setups where the generator remained fixed and only the retriever was updated, RAG enables end-to-end backpropagation through both the retriever and generator, allowing adjustments across the model for improved accuracy.

In this structure, a query encoder processes the input query, retrieving the top-K relevant documents using Maximum Inner Product Search (MIPS). The generator then conditions on the retrieved information to produce a contextual response. RAG’s design makes it highly adaptable for tasks like question answering and fact verification, where it can jointly refine both retrieval and generation, leading to more reliable and relevant outputs. This setup marks a significant step forward by integrating the strengths of retrieval and generation in a unified framework.

The RAG model (Retrieval-Augmented Generation) by Lewis et al. (2020) introduces two primary approaches to document retrieval during generation: the RAG-Sequence Model and the RAG-Token Model.

  1. RAG-Sequence Model: This approach retrieves the top-K relevant documents once and uses the same retrieved document for generating the entire sequence. The retrieved document acts as a single latent variable, which is marginalized across the sequence to get the final probability. This makes it efficient for tasks where the context remains constant throughout the sequence.

  2. RAG-Token Model: In contrast, the RAG-Token model retrieves a different document for each generated token. This allows the generator to select content from multiple documents, providing more flexibility and relevance for each token. The probability distribution is computed iteratively, with marginalization occurring at each token step.

The core idea is that RAG learns the retrieval step jointly with generation rather than treating it as a fixed preprocessing stage, with the retrieval frequency set either by the model variant or by hyperparameters. This contrasts with “frozen” RAG setups, which cannot adapt the retriever to the task. The original RAG paper emphasizes this end-to-end optimization, showing that a learned retriever significantly outperforms a frozen one.
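The RAG-Sequence marginalization above can be worked through with made-up numbers: the probability of an answer is the retrieval probability of each document times the generator's probability of the answer given that document, summed over the retrieved documents.

```python
# RAG-Sequence in miniature: p(y|x) = sum_z p(z|x) * p(y|x, z).
# All probabilities below are invented to show the mechanics only.

p_doc = {"doc_a": 0.7, "doc_b": 0.3}   # retriever: p(z|x)

p_answer = {                            # generator: p(y|x, z)
    ("yes", "doc_a"): 0.9,
    ("yes", "doc_b"): 0.2,
}

def p_sequence(y):
    # Marginalize the latent document z out of the joint probability.
    return sum(p_doc[z] * p_answer.get((y, z), 0.0) for z in p_doc)

p_yes = p_sequence("yes")   # 0.7 * 0.9 + 0.3 * 0.2 = 0.69
```

The RAG-Token variant applies the same sum, but once per generated token instead of once per sequence, so different tokens can lean on different documents.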

6. End-to-End Contextualization

In the groundbreaking RAG (Retrieval-Augmented Generation) model proposed by Lewis et al. (2020), a non-parametric retriever is combined with a parametric generator. This setup allows end-to-end backpropagation through both the retriever and the generator, unlike previous models that kept the generator static. RAG includes two modes: RAG-Sequence, where a single document is used to generate an entire sequence, and RAG-Token, which allows the model to retrieve a different document for each token generated. This flexibility can improve accuracy but introduces challenges in determining the frequency of retrievals during generation.

REALM Model

REALM by Guu et al. (2020) pushed this concept further by enabling backpropagation through the entire retrieval system, including the document encoder. Because updating the document encoder changes every document embedding, REALM refreshes its search index asynchronously, periodically re-encoding the whole corpus with the latest encoder. While conceptually powerful, this method is resource-intensive due to the constant re-encoding requirement.

Timing of Retrieval

In advanced Retrieval-Augmented Generation, timing of retrieval is crucial. Models like RAG-Token and RAG-Sequence retrieve either per token or sequence, while newer strategies, such as the FLARE model (Forward-Looking Active Retrieval Augmentation) by Jiang et al. (2023), introduce active retrieval augmentation. Here, the language model determines the optimal moments to retrieve additional context based on specific needs during generation, rather than at a fixed interval. This allows for efficient use of compute resources by only retrieving when necessary.
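FLARE's actual trigger inspects the generator's token probabilities for a draft continuation; this toy sketch shows only the idea, with illustrative probabilities and an arbitrary threshold.

```python
# FLARE-style active retrieval trigger: generate a draft, and retrieve
# only when some token in it falls below a confidence threshold.

def should_retrieve(token_probs, threshold=0.4):
    # Retrieve when any token in the draft is low-confidence.
    return min(token_probs) < threshold

confident_draft = [0.9, 0.8, 0.95]   # model is sure: keep generating
shaky_draft = [0.9, 0.2, 0.7]        # a low-confidence token appears

trigger_a = should_retrieve(confident_draft)   # no retrieval needed
trigger_b = should_retrieve(shaky_draft)       # retrieve before committing
```

The compute saving comes from the asymmetry: retrieval, the expensive step, fires only on the drafts that actually need external evidence.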

In the FLARE paper’s examples, the model actively decides retrieval points while completing a task, such as generating a summary of a specific topic, by selectively issuing queries for the facts it needs; the paper’s results across datasets show that this dynamic retrieval improves accuracy compared to static retrieval schedules.

Multimodal RAG

The expansion of Retrieval-Augmented Generation (RAG) into the multimodal realm demonstrates exciting potential by combining language models with visual understanding. Recent advancements, such as the LENS model, augment frozen language models with vision capabilities by running a computer vision pipeline and feeding the resulting visual context to the model. This approach rivals models like DeepMind’s Flamingo on Visual Question Answering (VQA) tasks, which is notable given that Flamingo is difficult to replicate because it is not open source.

In this setup, images are processed through an image encoder, akin to how text is retrieved and fed to a language model, effectively creating cross-modal retrieval systems. This trend reflects a broader push in the AI community toward multimodal integration, indicating a future where language models interact not only with text but also with diverse sensory inputs like images, video, and possibly audio.

Architectures in this space, such as RA-CM3 and XTRA, combine image and text encoders for vision-augmented language processing. This blending of vision with language is rapidly advancing and represents a promising avenue for enhanced AI comprehension.

That’s it. We have reached the end of this post. Before we go, let’s quickly revise what we learnt today.

  • Coupling a model with external memory is called ‘contextualization’; it resembles an open-book exam, where the model can access external sources
  • The key advantages of contextualization are flexibility, customizability, and grounding, which result in better attribution, accountability, and transparency
  • Grounding also opens up possibilities for multimodal integration
  • There are two types of retrieval methods – Sparse and Dense
  • Frozen RAG operates without training, relying on pre-trained models and in-context learning
  • FAISS by Facebook and other advanced retrieval tools enable scalable and efficient similarity searches across datasets
  • Multimodal RAG expands capabilities by integrating visual, text, and audio contexts for various tasks