What Is RAG (Retrieval-Augmented Generation)
RAG (Retrieval-Augmented Generation) is an architecture that combines a generative language model with an external information retrieval module. Rather than responding solely from parametric knowledge, the model retrieves relevant text fragments from an external knowledge base at inference time and uses them as context to generate the response. RAG is the most widely adopted technical strategy for reducing hallucinations and keeping an LLM's knowledge current without retraining.
Full definition
RAG (Retrieval-Augmented Generation) is a paradigm that extends the capability of a large language model (LLM) by connecting it to an external knowledge base that can be queried at inference time. The concept was formally introduced by Lewis et al. in 2020 and has become the de facto standard for question-answering systems, AI agents, and generative search engines.
A typical RAG architecture has three components:
Retriever. Receives the user's query and searches an indexed knowledge base for the most relevant text fragments. Many modern retrievers combine dense retrieval (vector embeddings) with sparse retrieval (such as BM25) to improve both recall and precision.
Re-ranker. Reorders the retrieved fragments by relevance before passing them to the generator, filtering noise and prioritizing those most pertinent to the specific query.
Generator. Receives the original query together with the retrieved fragments and produces the final response, grounded in external context rather than relying exclusively on parametric knowledge.
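One common way to combine a dense ranking with a sparse ranking is reciprocal rank fusion (RRF). The text above does not name a fusion method, so RRF is an assumption chosen for illustration, as are the function name, the constant `k=60`, and the document ids. A minimal sketch:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document earns 1 / (k + rank) from every list it appears in.
    k=60 is a conventional default, used here as an assumption.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a dense (embedding) and a sparse (BM25) retriever.
dense_ranking = ["doc_b", "doc_a", "doc_c"]
sparse_ranking = ["doc_a", "doc_d", "doc_b"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while documents found by only one retriever remain in the candidate set for the re-ranker to judge.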
Why it matters in 2026
RAG is the dominant technical answer to the hallucination problem. Because generation is conditioned on verifiable text fragments, the model has less room to fabricate data. A broad study of hallucination mitigation strategies published by MDPI (2025) identified RAG as the technique with the highest adoption in production systems.
For generative search engines such as Google AI Mode, Perplexity, and ChatGPT Search, RAG is the foundation on which cited responses are built. When an engine cites a source in its response, it is usually because its RAG system retrieved a fragment from that source and the generator used it as context. The citability of a source therefore depends substantially on whether its content passes the relevance and quality filters of the engine's retriever.
A second practical consequence, relevant to home services and construction companies, is corporate RAG: deploying RAG over internal knowledge bases (product catalogs, technical manuals, service FAQs) to build AI agents that answer accurately, with far less risk of hallucinating company-specific data.
How it works
The standard RAG flow follows these steps:
- The user's query arrives at the system.
- The retriever encodes the query as a vector and searches the vector index for the most similar fragments.
- The top-k retrieved fragments (k is typically between 3 and 10) pass through a re-ranker that reorders them by relevance.
- The selected fragments are concatenated with the original query and sent to the LLM as context.
- The LLM generates the response conditioned on that context, citing sources when the system requires it.
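The steps above can be sketched end to end in a few lines. This is an illustration under stated assumptions, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, `rerank` is a simple term-coverage heuristic rather than a trained re-ranker, the knowledge base is invented sample data, and the final LLM call is left out, so the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: term counts (stand-in for a dense embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, fragments, top_k=3):
    """Encode the query and return the top_k most similar fragments."""
    q = embed(query)
    ranked = sorted(fragments, key=lambda f: cosine(q, embed(f)), reverse=True)
    return ranked[:top_k]


def rerank(query, candidates):
    """Toy re-ranker: order by the share of query terms a fragment covers."""
    q_terms = set(embed(query))

    def coverage(fragment):
        if not q_terms:
            return 0.0
        return len(q_terms & set(embed(fragment))) / len(q_terms)

    return sorted(candidates, key=coverage, reverse=True)


def build_prompt(query, fragments):
    """Concatenate numbered fragments with the query as LLM context."""
    context = "\n".join(f"[{i}] {frag}" for i, frag in enumerate(fragments, start=1))
    return (
        "Answer using only the context below and cite fragment numbers.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical knowledge base for a home services company.
KNOWLEDGE_BASE = [
    "Water heater installation takes about three hours.",
    "Our warranty covers roof repairs for ten years.",
    "Emergency plumbing service is available on weekends.",
    "Kitchen remodels require a permit in most cities.",
]

query = "How long is the warranty on roof repairs?"
top = rerank(query, retrieve(query, KNOWLEDGE_BASE, top_k=3))
prompt = build_prompt(query, top)
print(prompt)
```

In a real system the string returned by `build_prompt` would be sent to the LLM. Numbering the fragments is one simple way to let the model cite its sources.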
Advanced variants include iterative RAG (the output of each generation step guides a new retrieval), hybrid RAG (combines dense and sparse retrieval), and agentic RAG (the agent decides when and what to retrieve based on its reasoning state).
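As a rough illustration of the iterative variant, the sketch below feeds each draft answer back into the next retrieval query. The function names, the fixed number of rounds, and the toy retriever and generator are all assumptions made for the example; real systems typically use an LLM call for `generate` and a learned or rule-based stopping criterion.

```python
from typing import Callable, List


def iterative_rag(
    query: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    rounds: int = 2,
) -> str:
    """Alternate retrieval and generation; each draft guides the next search."""
    context: List[str] = []
    search_text = query
    draft = ""
    for _ in range(rounds):
        for fragment in retrieve(search_text):
            if fragment not in context:  # avoid duplicate context
                context.append(fragment)
        draft = generate(query, context)
        # The draft introduces new terms that the next retrieval can match.
        search_text = f"{query} {draft}"
    return draft


# Hypothetical keyword-indexed fragments and toy components.
DOCS = {
    "permit": "Deck construction requires a building permit.",
    "building": "Building permits are issued by the city office.",
}


def toy_retrieve(text: str) -> List[str]:
    words = text.lower().split()
    return [fragment for key, fragment in DOCS.items() if key in words]


def toy_generate(query: str, context: List[str]) -> str:
    # Stand-in for an LLM call: simply joins the retrieved context.
    return " ".join(context)


answer = iterative_rag("Do I need a permit for a deck?", toy_retrieve, toy_generate)
print(answer)
```

With a single round, only the fragment matching "permit" is found. The second round picks up the fragment about the city office because the first draft introduced the word "building".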
Difference from other knowledge-update strategies
| Strategy | How it updates knowledge | Cost | Added latency |
|---|---|---|---|
| RAG | Retrieves external context at inference | Low | Moderate |
| Fine-tuning | Retrains the model on new data | High | None in production |
| Prompt engineering | Injects fixed context directly into the prompt | Minimal (extra prompt tokens) | Negligible |
| RAG + fine-tuning | Combines retrieval and training | Very high | Moderate |
RAG is the preferred option when knowledge changes frequently (prices, regulation, inventory) because it does not require retraining the model. Fine-tuning is better suited to adapting the model's style or linguistic domain than to injecting facts that need regular updates. Prompt engineering is only viable when the relevant context is brief and known in advance.
Related terms
LLM Hallucination, Citability, Fan-out query.