
What is RAG (Retrieval-Augmented Generation)

Retrieval-Augmented Generation (RAG) is an AI architecture in which a language model grounds its response in documents retrieved at query time rather than relying solely on information encoded in its weights during training. The pattern was formalised by Lewis et al. at Facebook AI Research (NeurIPS 2020). In a RAG pipeline, a user query is first used to retrieve a set of relevant passages from an external knowledge source (a vector database, a search index, or a live web crawl), and those passages are then added to the model's context window before generation. RAG is the dominant architecture behind Perplexity, Bing Copilot, Google AI Overviews, and ChatGPT's web-browsing mode.

edu-lopez-parada · Published 27 May 2026 · Updated 27 May 2026

Full definition

Retrieval-Augmented Generation (RAG) is an AI system design pattern that separates the knowledge retrieval step from the language generation step, combining the strengths of search systems (precision, recency, source traceability) with those of large language models (fluent synthesis, reasoning, multi-step inference).

The canonical RAG pipeline has three stages:

Indexing: a corpus of documents is chunked, converted to dense vector embeddings using an encoder model, and stored in a vector database (e.g., Pinecone, Weaviate, pgvector).
Retrieval: at query time, the user's input is embedded using the same encoder and a nearest-neighbour search returns the most semantically similar passages. Hybrid retrieval combines dense (semantic) and sparse (BM25 keyword) search to balance recall and precision.
Generation: the retrieved passages are inserted into the language model's context window — typically as a system prompt prefix — and the model generates a response grounded in that material. In citation-enabled interfaces, the source URLs are surfaced alongside the answer.
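
The first two stages above can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not a production design: `embed` is a toy bag-of-words counter standing in for a real encoder model, the in-memory list stands in for a vector database, and a shared-term count stands in for nearest-neighbour search. The function names and the sample documents are invented for the example.

```python
import re
from collections import Counter

def chunk(text: str, max_words: int = 40) -> list[str]:
    """Split a document into fixed-size word windows (a deliberately naive chunker)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text: str) -> Counter:
    """Toy stand-in for an encoder model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def build_index(docs: dict[str, str]) -> list[tuple[str, str, Counter]]:
    """Indexing: chunk every document and store (source, chunk, vector) rows."""
    return [(source, piece, embed(piece))
            for source, text in docs.items()
            for piece in chunk(text)]

def retrieve(index, query: str, k: int = 2) -> list[tuple[str, str]]:
    """Retrieval: embed the query the same way and keep the k best-matching chunks."""
    query_vec = embed(query)

    def overlap(row) -> int:
        # Counter intersection keeps the minimum count of each shared term.
        return sum((query_vec & row[2]).values())

    ranked = sorted(index, key=overlap, reverse=True)
    return [(source, piece) for source, piece, _ in ranked[:k]]

# Invented sample corpus, for illustration only.
docs = {
    "services.html": "Kitchen renovation quotes include cabinetry, worktops and labour.",
    "about.html": "Our team has been building loft conversions for many years.",
}
index = build_index(docs)
top = retrieve(index, "What does a kitchen renovation quote include?", k=1)
print(top)
```

Swapping `embed` for a real encoder and the list for a vector store changes the quality of the matches but not the shape of the pipeline: the same function must embed both documents and queries.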

Advanced variants include re-ranking (a cross-encoder scores retrieved passages before they enter the context), iterative RAG (the model issues follow-up queries if initial retrieval is insufficient), and agentic RAG (a planning layer decides when and how to retrieve).
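
As a rough illustration of the re-ranking variant, the sketch below re-scores a handful of first-stage candidates against the query and keeps the best one. The scoring function is a hypothetical stand-in: a real cross-encoder is a neural model that reads the query and passage together, whereas `rerank_score` here only measures the fraction of query terms found in the passage. All names and passages are invented for the example.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric terms of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rerank_score(query: str, passage: str) -> float:
    """Hypothetical stand-in for a cross-encoder: share of query terms present in the passage."""
    query_terms = tokens(query)
    if not query_terms:
        return 0.0
    return len(query_terms & tokens(passage)) / len(query_terms)

def rerank(query: str, candidates: list[str], keep: int = 2) -> list[str]:
    """Re-score first-stage candidates against the query and keep the best few."""
    return sorted(candidates, key=lambda p: rerank_score(query, p), reverse=True)[:keep]

# Invented first-stage results, in the order a retriever might have returned them.
candidates = [
    "Bathroom renovation costs vary by fixture quality.",
    "Kitchen renovation cost depends on cabinetry and labour.",
    "We renovate kitchens and bathrooms across London.",
]
best = rerank("kitchen renovation cost", candidates, keep=1)
print(best)
```

The design point is the two-stage split: a cheap retriever narrows the corpus to a few candidates, and a slower, more precise scorer decides which of them enter the context window.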

Why it matters in 2026

RAG is the architectural reason why content quality and accessibility directly determine whether a business appears in AI-generated answers. Unlike traditional SEO — where ranking signals include link graph, click-through rates, and Core Web Vitals — RAG retrieval is primarily driven by semantic relevance and document authority.

For a construction company or tradesperson, RAG means that a well-structured service page, FAQ document, or glossary entry can be surfaced verbatim in an answer to a query like "What is the typical cost of a kitchen renovation in London?", provided the page is crawlable, semantically relevant, and written so that individual passages can be quoted on their own.

The inverse also holds: pages blocked by robots.txt, hidden behind interstitials, or written in vague marketing language have near-zero probability of appearing in RAG-powered answers, regardless of their traditional SEO performance.
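
One way to check the robots.txt part of this is with Python's standard-library `urllib.robotparser`. The sketch below parses an inline robots.txt and asks whether a given crawler may fetch a URL. The crawler name `ExampleAIBot`, the rules, and the URLs are assumptions made up for the example; the user-agent strings that real AI crawlers send should be taken from each vendor's documentation.

```python
from urllib.robotparser import RobotFileParser

def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical rules: one named crawler is kept out of /private/, everyone else is allowed.
robots_txt = """\
User-agent: ExampleAIBot
Disallow: /private/

User-agent: *
Allow: /
"""

print(is_crawlable(robots_txt, "ExampleAIBot", "https://example.com/private/quote.html"))
print(is_crawlable(robots_txt, "ExampleAIBot", "https://example.com/services/kitchens.html"))
```

This only covers robots.txt rules. Interstitials, login walls, and content rendered exclusively by client-side scripts need to be checked separately.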

How it works

At the retrieval stage, semantic similarity is typically computed as the cosine similarity between the query embedding and each document chunk embedding (equivalently, one minus the cosine distance). The top-k chunks with the highest scores (typically k = 3 to 10) are selected. A re-ranker may then reorder them based on relevance to the specific query rather than general similarity.
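
The scoring and top-k selection can be written out directly. In the sketch below a higher cosine similarity (that is, a lower cosine distance) means a closer match. The three-dimensional vectors and chunk ids are invented for illustration; real embeddings have hundreds or thousands of dimensions, and a vector database would normally use an approximate index instead of scoring every chunk.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Score every chunk against the query and return the k highest-scoring (id, score) pairs."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec)) for chunk_id, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Invented toy embeddings.
chunk_vecs = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.6, 0.8, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}
results = top_k([1.0, 0.0, 0.0], chunk_vecs, k=2)
print(results)
```

Here `chunk-a` points in the same direction as the query, `chunk-b` is partly aligned, and `chunk-c` is orthogonal, so with k = 2 the first two are returned in that order.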

At the generation stage, the model receives a prompt grounding instruction of the form: answer using only the provided context; if the context does not contain the answer, say so. This grounding reduces hallucination because the model is directed to the retrieved material rather than its parametric memory — though it does not eliminate hallucination entirely, as the model may still misread or extrapolate beyond the provided context.
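
A grounding prompt of this kind might be assembled as follows. The layout (numbered passages with their source URLs, then the question) is one plausible convention rather than a fixed standard, and `build_grounded_prompt`, the URL, and the passage text are hypothetical. The resulting string would be sent to whichever model API the system uses, which is outside the scope of this sketch.

```python
def build_grounded_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """Combine the grounding instruction, numbered source passages, and the user question."""
    context = "\n".join(
        f"[{number}] ({url}) {text}"
        for number, (url, text) in enumerate(passages, start=1)
    )
    return (
        "Answer using only the provided context. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Invented passage, for illustration only.
passages = [
    ("https://example.com/kitchens", "Quotes include cabinetry, worktops and labour."),
]
prompt = build_grounded_prompt("What does a kitchen quote include?", passages)
print(prompt)
```

Numbering the passages gives the model something to cite, which is how citation-enabled interfaces can map statements in the answer back to source URLs.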

Difference from fine-tuning

Approach	Knowledge storage	Update frequency	Traceability	Cost to update
RAG	External index	Real-time or near-real-time	High (sources cited)	Low (add or update documents)
Fine-tuning	Model weights	Requires a new training run	Low (no source attribution)	High (GPU compute and iteration cycles)
Prompt engineering	Context window only	Per-request	Medium	Zero, but limited by context length

RAG and fine-tuning are complementary rather than mutually exclusive; many production systems use a fine-tuned base model with a RAG retrieval layer on top.
