What is RAG (Retrieval-Augmented Generation)?
Retrieval-Augmented Generation (RAG) is an AI architecture in which a language model's response is grounded in documents retrieved at query time rather than relying solely on information encoded in the model's weights during training. The pattern was formalised by Lewis et al. at Facebook AI Research (NeurIPS 2020). In a RAG pipeline, the user query is first used to retrieve a set of relevant passages from an external knowledge source (a vector database, a search index, or a live web crawl), and those passages are then inserted into the model's context window before generation. RAG is the dominant architecture behind Perplexity, Microsoft Copilot (formerly Bing Chat), Google AI Overviews, and ChatGPT's web-browsing mode.
Full definition
Retrieval-Augmented Generation (RAG) is an AI system design pattern that separates the knowledge retrieval step from the language generation step, combining the strengths of search systems (precision, recency, source traceability) with those of large language models (fluent synthesis, reasoning, multi-step inference).
The canonical RAG pipeline has three stages:
- Indexing: a corpus of documents is chunked, converted to dense vector embeddings using an encoder model, and stored in a vector database (e.g., Pinecone, Weaviate, pgvector).
- Retrieval: at query time, the user's input is embedded using the same encoder and a nearest-neighbour search returns the most semantically similar passages. Hybrid retrieval combines dense (semantic) and sparse (BM25 keyword) search to balance recall and precision.
- Generation: the retrieved passages are inserted into the language model's context window, typically as part of the prompt ahead of the user's question, and the model generates a response grounded in that material. In citation-enabled interfaces, the source URLs are surfaced alongside the answer.
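The indexing stage can be sketched in a few lines of Python. This is a minimal illustration, not a production design: `chunk_words`, `toy_embed`, and `build_index` are hypothetical names, the hashed bag-of-words vector is only a stand-in for a real encoder model, and the in-memory list stands in for a vector database.

```python
import hashlib
import math


def chunk_words(text, size=50, overlap=10):
    """Split text into word windows of `size` words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def toy_embed(text, dim=64):
    """Assumption: toy stand-in for an encoder model (hashed word counts, L2-normalised)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(documents, size=50, overlap=10, embed=toy_embed):
    """Chunk every document and store one record per chunk, keeping the source URL."""
    index = []
    for source, text in documents.items():
        for piece in chunk_words(text, size, overlap):
            index.append({"source": source, "text": piece, "vector": embed(piece)})
    return index
```

Keeping the source URL on every record is what later allows a citation-enabled interface to show where a passage came from. The overlap between windows is a common way to avoid cutting a sentence's context in half at a chunk boundary; the values 50 and 10 are arbitrary here.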
Advanced variants include re-ranking (a cross-encoder scores retrieved passages before they enter the context), iterative RAG (the model issues follow-up queries if initial retrieval is insufficient), and agentic RAG (a planning layer decides when and how to retrieve).
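As an illustration of the re-ranking variant, the sketch below reorders first-stage candidates using a scorer that sees the query and the passage together. The names `rerank` and `toy_pair_score` are hypothetical, and the term-overlap scorer is only a placeholder for a real cross-encoder model.

```python
def toy_pair_score(query, passage):
    """Assumption: placeholder for a cross-encoder; fraction of query terms found in the passage."""
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0


def rerank(query, candidates, pair_score=toy_pair_score, keep=3):
    """Score each (query, passage) pair and keep the `keep` best passages.

    The sort is stable, so passages with equal scores stay in first-stage order.
    """
    scored = [(pair_score(query, text), text) for text in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [text for _, text in scored[:keep]]


candidates = [
    "kitchen renovation cost in london",
    "garden fencing prices",
    "bathroom renovation guide",
]
best = rerank("kitchen renovation cost", candidates, keep=2)
```

In a real system the first stage would return more candidates than the context can hold (for example a few dozen), and the slower pair scorer would only run on that short list.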
Why it matters in 2026
RAG is the architectural reason why content quality and accessibility directly determine whether a business appears in AI-generated answers. Unlike traditional SEO — where ranking signals include link graph, click-through rates, and Core Web Vitals — RAG retrieval is primarily driven by semantic relevance and document authority.
For a construction company or tradesperson, RAG means that a well-structured service page, FAQ document, or glossary entry can be surfaced verbatim in an answer to a query like "What is the typical cost of a kitchen renovation in London?" — provided the page is crawlable, semantically relevant, and written with high citability.
The inverse also holds: pages blocked by robots.txt or hidden behind interstitials cannot be indexed by compliant crawlers, and pages written in vague marketing language rarely match a specific query closely enough to be retrieved. Such pages are unlikely to appear in RAG-powered answers, regardless of their traditional SEO performance.
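One quick way to check the robots.txt part of this is Python's standard `urllib.robotparser` module. The sketch below parses a robots.txt body from a string and reports which crawlers may fetch a URL. The helper name `crawler_access`, the example rules, and the example.com URL are illustrative assumptions; `GPTBot` and `PerplexityBot` are used as example user-agent tokens, and each vendor's documentation should be checked for the current ones.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one AI crawler entirely, blocks /admin/ for everyone else.
EXAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def crawler_access(robots_txt, url, user_agents):
    """Return {user_agent: True/False} for whether each agent may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in user_agents}


access = crawler_access(
    EXAMPLE_ROBOTS,
    "https://example.com/services/kitchens",
    ["GPTBot", "PerplexityBot"],
)
```

With these example rules, the service page is closed to the first crawler and open to the second. This only checks robots.txt; it says nothing about interstitials or content quality.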
How it works
At the retrieval stage, semantic similarity is computed as the cosine similarity between the query embedding and each document chunk embedding (equivalently, chunks are ranked by ascending cosine distance, which is 1 minus the similarity). The top-k chunks (typically k = 3 to 10) are selected. A re-ranker may then reorder them by scoring each query-passage pair directly rather than relying on embedding similarity alone.
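The scoring and top-k selection can be sketched as follows, ranking by highest cosine similarity (which is the same ordering as lowest cosine distance). The function names are hypothetical and the vectors are plain Python lists; a vector database would normally run this search with an approximate nearest-neighbour index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=3):
    """Return up to k (chunk_index, similarity) pairs, most similar first."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]


# Tiny 2-dimensional example: chunk 1 points the same way as the query.
hits = top_k([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], k=2)
```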
At the generation stage, the model receives a grounding instruction of the form: "Answer using only the provided context; if the context does not contain the answer, say so." This grounding reduces hallucination because the model is directed to the retrieved material rather than its parametric memory. It does not eliminate hallucination entirely, as the model may still misread or extrapolate beyond the provided context.
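A grounded prompt of this kind might be assembled as in the sketch below. The function name `build_grounded_prompt`, the numbered-passage layout, and the extra instruction to cite passages by number are assumptions made for illustration; real systems differ in how they format context and where they place it.

```python
def build_grounded_prompt(question, passages):
    """Build a prompt from a question and a list of (source_url, text) passages."""
    context_lines = []
    for n, (source, text) in enumerate(passages, start=1):
        # Number each passage so the answer can refer back to it.
        context_lines.append(f"[{n}] {text}\n    Source: {source}")
    context = "\n".join(context_lines)
    return (
        "Answer using only the provided context. "
        "If the context does not contain the answer, say so. "
        "Cite passages by their bracketed number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = build_grounded_prompt(
    "What does a kitchen renovation involve?",
    [("https://example.com/kitchens", "A kitchen renovation usually covers units, worktops and wiring.")],
)
```

The resulting string would be sent to whichever language model the system uses; that call is left out because it depends on the provider.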
Difference from fine-tuning
| Approach | Knowledge storage | Update frequency | Traceability | Cost to update |
|---|---|---|---|---|
| RAG | External index | Real-time or near-real-time | High: sources can be cited | Low: add or update documents |
| Fine-tuning | Model weights | Requires a new fine-tuning run | Low: no source attribution | High: GPU compute and iteration cycles |
| Prompt engineering | Context window only | Per-request | Medium | Near zero, but limited by context length |
RAG and fine-tuning are complementary rather than mutually exclusive; many production systems use a fine-tuned base model with a RAG retrieval layer on top.
Related terms
Hallucination (LLM), Citability, Fan-out query.