Made For Builders

What is RAG (Retrieval-Augmented Generation)

Retrieval-Augmented Generation (RAG) is an AI architecture in which a language model's response is grounded in dynamically retrieved documents rather than relying solely on information encoded in the model's weights during training. The pattern was formalised by Lewis et al. at Facebook AI Research (NeurIPS 2020). In a RAG pipeline, the user's query is first used to retrieve a set of relevant passages from an external knowledge source (a vector database, a search index, or a live web crawl), and those passages are added to the model's context window before generation. RAG is the architecture behind answer engines such as Perplexity, Microsoft Copilot (formerly Bing Chat), Google AI Overviews, and ChatGPT's web-browsing mode.

edu-lopez-parada

Full definition

Retrieval-Augmented Generation (RAG) is an AI system design pattern that separates the knowledge retrieval step from the language generation step, combining the strengths of search systems (precision, recency, source traceability) with those of large language models (fluent synthesis, reasoning, multi-step inference).

The canonical RAG pipeline has three stages:

  1. Indexing: a corpus of documents is chunked, converted to dense vector embeddings using an encoder model, and stored in a vector database (e.g., Pinecone, Weaviate, pgvector).
  2. Retrieval: at query time, the user's input is embedded using the same encoder and a nearest-neighbour search returns the most semantically similar passages. Hybrid retrieval combines dense (semantic) and sparse (BM25 keyword) search to balance recall and precision.
  3. Generation: the retrieved passages are inserted into the language model's context window, typically in the system prompt or ahead of the user's question, and the model generates a response grounded in that material. In citation-enabled interfaces, the source URLs are surfaced alongside the answer.
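As a rough illustration of the chunking step in the indexing stage, the sketch below splits a document into overlapping word windows. The names `Chunk` and `chunk_document`, and the default sizes, are assumptions made for this example; real pipelines often chunk by tokens, sentences, or headings instead, and the embedding and vector-store steps are not shown.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    position: int  # index of the chunk within its document
    text: str


def chunk_document(doc_id: str, text: str, size: int = 50, overlap: int = 10) -> list[Chunk]:
    """Split text into windows of `size` words that share `overlap` words with the previous window.

    Hypothetical helper: sizes are counted in whitespace-separated words,
    which is a simplification of token-based chunking.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks: list[Chunk] = []
    for position, start in enumerate(range(0, len(words), step)):
        window = words[start:start + size]
        chunks.append(Chunk(doc_id, position, " ".join(window)))
        if start + size >= len(words):
            break  # this window already reached the end of the document
    return chunks
```

The overlap is one way to reduce the chance that a sentence relevant to a query is cut in half at a chunk boundary; each resulting `Chunk.text` would then be passed to the encoder and stored with its `doc_id` so the source can be cited later.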

Advanced variants include re-ranking (a cross-encoder scores retrieved passages before they enter the context), iterative RAG (the model issues follow-up queries if initial retrieval is insufficient), and agentic RAG (a planning layer decides when and how to retrieve).
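The re-ranking variant can be sketched as a second scoring pass over the passages that first-stage retrieval returned. In the example below, `token_overlap_score` is a deliberately simple stand-in for a cross-encoder (it is not one), and both function names are hypothetical; a production system would plug a learned model into the `score` parameter.

```python
from typing import Callable


def token_overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query tokens present in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)


def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float] = token_overlap_score,
    keep: int = 3,
) -> list[str]:
    """Score every (query, passage) pair and keep the `keep` best passages.

    Ties keep their first-stage order because Python's sort is stable.
    """
    ordered = sorted(candidates, key=lambda passage: score(query, passage), reverse=True)
    return ordered[:keep]
```

The design point is that the scorer sees the query and the passage together, whereas first-stage retrieval compares embeddings that were computed independently of each other.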

Why it matters in 2026

RAG is the architectural reason why content quality and accessibility directly determine whether a business appears in AI-generated answers. Unlike traditional SEO — where ranking signals include link graph, click-through rates, and Core Web Vitals — RAG retrieval is primarily driven by semantic relevance and document authority.

For a construction company or tradesperson, RAG means that a well-structured service page, FAQ document, or glossary entry can be surfaced verbatim in an answer to a query like "What is the typical cost of a kitchen renovation in London?", provided the page is crawlable, semantically relevant, and written to be highly citable.

The inverse also holds: pages blocked by robots.txt, hidden behind interstitials, or written in vague marketing language have a near-zero probability of appearing in RAG-powered answers, regardless of their traditional SEO performance.
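One way to check the robots.txt part of this is with Python's standard-library `urllib.robotparser`. The sketch below parses a robots.txt body that is passed in as a string, so it makes no network request; the helper name `is_crawlable`, the example rules, and the example.com URLs are assumptions for illustration, and `GPTBot` is used only as an example crawler user agent.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from text instead of fetching them
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: one AI crawler is blocked site-wide, everyone else only from /private/.
EXAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    page = "https://example.com/services/kitchens"
    print(is_crawlable(EXAMPLE_ROBOTS_TXT, "GPTBot", page))
    print(is_crawlable(EXAMPLE_ROBOTS_TXT, "SomeOtherBot", page))
```

This only covers the robots.txt condition; interstitials and content quality would need separate checks.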

How it works

At the retrieval stage, semantic similarity is commonly computed as the cosine similarity between the query embedding and each document chunk embedding (equivalently, the smallest cosine distance, which is one minus the similarity). The top-k chunks (typically k = 3 to 10) are selected. A re-ranker may then reorder them based on relevance to the specific query rather than general similarity.
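A minimal sketch of this top-k selection, assuming embeddings are plain lists of floats held in memory, could look like the following. The function names and the dictionary layout are hypothetical; a vector database would normally replace the exhaustive loop with an approximate nearest-neighbour index.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: list[float],
    chunk_vecs: dict[str, list[float]],
    k: int = 3,
) -> list[tuple[str, float]]:
    """Return the k (chunk_id, similarity) pairs most similar to the query, best first."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Because cosine similarity ignores vector length, a long chunk and a short chunk pointing in the same direction score the same; only the direction of the embedding matters.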

At the generation stage, the model receives a grounding instruction along the lines of: "Answer using only the provided context; if the context does not contain the answer, say so." This grounding reduces hallucination because the model is directed to the retrieved material rather than its parametric memory. It does not eliminate hallucination entirely, because the model may still misread or extrapolate beyond the provided context.
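As an illustration, a grounded prompt might be assembled as in the sketch below. The function name `build_grounded_prompt`, the numbering scheme, and the exact wording are assumptions for this example rather than a format required by any particular model or vendor.

```python
def build_grounded_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """Assemble a prompt from (source_url, text) pairs and the user's question.

    Each passage is numbered so the model can cite it and the interface
    can map the citation back to its source URL.
    """
    context_lines = [
        f"[{number}] ({url}) {text}"
        for number, (url, text) in enumerate(passages, start=1)
    ]
    context = "\n".join(context_lines) if context_lines else "(no passages retrieved)"
    return (
        "Answer using only the provided context. "
        "If the context does not contain the answer, say so. "
        "Cite passages by their bracketed number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Keeping the source URL next to each passage is what allows a citation-enabled interface to show where a statement came from.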

Difference from fine-tuning

| Approach | Knowledge storage | Update frequency | Traceability | Cost to update |
| --- | --- | --- | --- | --- |
| RAG | External index | Real-time or near-real-time | High (sources cited) | Low (add or update documents) |
| Fine-tuning | Model weights | Requires a new fine-tuning run | Low (no source attribution) | High (GPU compute and iteration cycles) |
| Prompt engineering | Context window only | Per-request | Medium | Zero, but limited by context length |

RAG and fine-tuning are complementary rather than mutually exclusive; many production systems use a fine-tuned base model with a RAG retrieval layer on top.

Related terms

Hallucination (LLM), Citability, Fan-out query.
