
RAG (Retrieval-Augmented Generation)

Retrieval-Augmented Generation (RAG) is an architecture pattern that combines a retrieval system (like vector search) with a generative LLM — fetching relevant documents at query time and feeding them to the model as context, dramatically reducing hallucinations and enabling answers grounded in private or up-to-date data.


RAG is an architecture pattern where a generative AI system fetches relevant external information at query time and includes it in the LLM prompt, rather than relying solely on the model's training knowledge. The flow: user asks a question → retrieval system finds the most relevant passages from a document store → those passages are concatenated into the prompt → the LLM generates an answer grounded in the retrieved context.

RAG solves three core LLM limitations: (1) knowledge cutoff (models don't know events after training), (2) hallucinations (models invent facts when uncertain), and (3) inability to access private data. By 2026, RAG is the dominant pattern for enterprise AI deployments — a Menlo Ventures survey of 600 AI leaders found 51% of production AI applications use RAG, more than any other architecture.

How RAG works

A typical RAG pipeline has four stages:

  1. Indexing (offline) — Documents are chunked into passages (typically 200–800 tokens), embedded into vectors via an embedding model (e.g., OpenAI text-embedding-3-large, Cohere Embed v3), and stored in a vector database (Pinecone, Weaviate, pgvector).
  2. Retrieval (per query) — User query is embedded and the vector store returns the top-K (usually 5–20) nearest passages by cosine similarity. Hybrid systems also use BM25 keyword search and rerank with a cross-encoder.
  3. Augmentation — Retrieved passages are formatted into the prompt with instructions ("Answer using only the information below; cite source IDs").
  4. Generation — The LLM produces an answer, ideally with citations that link back to the source passages.

Modern RAG goes well beyond this baseline: query rewriting, multi-hop retrieval, agentic retrieval, hybrid sparse/dense search, contextual chunking, and structured RAG over knowledge graphs are all common in 2026 production systems.
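The baseline four-stage pipeline above can be sketched end to end in a few lines. This is a toy illustration: the bag-of-words `embed` function stands in for a real embedding model (e.g. text-embedding-3-large), the in-memory list stands in for a vector database, and step 4 would be an actual LLM call.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real pipeline would call an
    # embedding model and get back a dense vector instead.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Indexing (offline): chunk documents and store one vector per chunk.
docs = [
    "RAG retrieves passages at query time",
    "Fine-tuning changes model behavior and style",
    "Vector databases perform nearest-neighbor search",
]
index = [(d, embed(d)) for d in docs]

def retrieve(query, k=2):
    # 2. Retrieval: rank stored chunks by similarity to the query.
    qv = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(qv, pair[1]), reverse=True)
    return [d for d, _ in ranked[:k]]

def build_prompt(query):
    # 3. Augmentation: splice the retrieved passages into the prompt.
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(retrieve(query)))
    return ("Answer using only the information below; cite source IDs.\n"
            f"{context}\n\nQuestion: {query}")

# 4. Generation: build_prompt(...) would be sent to the LLM.
```

Swapping the toy pieces for a real embedding model, a vector store, and an LLM client changes the plumbing but not the shape of the flow.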

Why RAG matters

A 2025 Stanford study found that adding RAG to a frontier LLM cut hallucination rates on factual questions from 27% to under 4%, roughly a sevenfold reduction, without any change to the underlying model. For enterprise use cases (customer support, document Q&A, legal/medical search), that delta is the difference between "interesting demo" and "production-ready."

RAG is also the answer to a major data-governance problem: you don't need to send your private corpus to OpenAI or fine-tune a model on it. Documents stay in your vector store; only relevant snippets are sent to the LLM at query time. This unlocks AI for regulated industries (finance, healthcare, legal) that can't expose proprietary data to third parties.

Examples of RAG in production

  1. Perplexity — Web-scale RAG; retrieves search results and synthesizes cited answers.
  2. NotebookLM (Google) — RAG over user-uploaded documents using Gemini for synthesis.
  3. GitHub Copilot Chat — RAG over your codebase; retrieves relevant files before generating code.
  4. Glean — Enterprise search RAG across Slack, Drive, Notion, Salesforce, etc.
  5. ChatGPT with web browsing — RAG over real-time search results.

How PostKit relates to RAG

PostKit doesn't currently use a vector-store RAG architecture — its pipeline runs over a small, structured input (one brand profile + platform rules + chosen marketing pipeline) that fits easily in a single prompt without retrieval.

However, three potential PostKit features map cleanly to RAG:

  • Past-post recall — Retrieve a brand's previously high-performing posts as inspiration for new content (RAG over the user's own post history).
  • Trending topic ingestion — Retrieve current platform trends and weave them into generated content (RAG over a trends index).
  • Competitor inspiration — Retrieve high-engagement posts from a brand's competitive set for stylistic reference (RAG over a curated competitive corpus).

Each of these is on the long-term roadmap. The reason PostKit hasn't shipped them yet is the same reason most product teams should start without RAG: it adds operational complexity (vector DB, embedding pipeline, eval rigor) that's only justified once the simpler prompt-only system hits a clear ceiling.

Frequently asked questions

Is RAG better than fine-tuning? Different problems. RAG is better for changing knowledge (real-time data, private docs). Fine-tuning is better for changing behavior (style, format, tone). Often used together.

What's a vector database? A database optimized for nearest-neighbor search over high-dimensional vectors. Popular options in 2026: Pinecone, Weaviate, Qdrant, Milvus, and pgvector (Postgres extension).
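The core operation a vector database performs can be shown in miniature with brute-force search; the real engines get the same ranking from approximate nearest-neighbor indexes (HNSW, IVF) so they can skip scanning every row at scale. A NumPy sketch:

```python
import numpy as np

def top_k(query, vectors, k=3):
    # Cosine similarity is the dot product of L2-normalized vectors,
    # so normalize once and rank by a single matrix-vector product.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q
    # Indices of the k most similar rows, best first.
    return np.argsort(-sims)[:k]
```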

What's an embedding? A vector (typically 768–3072 dimensions) that represents the meaning of a piece of text. Semantically similar texts have nearby embeddings. Generated by embedding models trained alongside or separately from LLMs.

How big should chunks be? Typical: 200–800 tokens. Smaller chunks improve retrieval precision but lose context. Larger chunks preserve context but dilute retrieval signal. Contextual chunking (adding chunk-level summaries) helps both.
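A common baseline is a fixed-size sliding window with overlap; the sizes below are illustrative defaults, not recommendations.

```python
def chunk(tokens, size=400, overlap=50):
    # Fixed-size sliding window. The overlap means a sentence cut at a
    # chunk boundary still appears whole in at least one chunk.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]
```

Production systems usually refine this with sentence- or section-aware boundaries and, per contextual chunking, prepend a short document-level summary to each chunk.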

What's "agentic RAG"? RAG where the LLM — acting as an AI agent — decides what to retrieve, reformulates queries, performs multi-step retrieval, and reasons over results before answering. Significantly higher quality than single-shot RAG on complex queries.
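The control flow of agentic RAG can be sketched as a loop; `search` and `agent` here are hypothetical stand-ins for the vector-store query and an LLM call that returns a structured decision.

```python
def agentic_answer(question, search, agent, max_hops=3):
    # Multi-hop loop: after each retrieval pass the agent inspects the
    # accumulated context and either answers or emits a reformulated
    # query for another hop.
    context, query = [], question
    for _ in range(max_hops):
        context.extend(search(query))
        step = agent(question, context)
        if step["done"]:
            return step["answer"]
        query = step["next_query"]
    # Hop budget exhausted: force an answer from whatever was gathered.
    return agent(question, context)["answer"]
```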

What's RAG vs prompt-stuffing? Prompt-stuffing puts all your data in every prompt regardless of relevance — wasteful and slow. RAG selects only the relevant subset per query — efficient and scalable.

Does RAG work with very long context windows? Yes, but the value diminishes as context grows. With 1M-token context, you can stuff a whole document set; with 8k context, RAG is essential. RAG also reduces cost (smaller prompts) and improves quality (focused context).

Related terms

  • LLM (Large Language Model)
  • Prompt engineering
  • Fine-tuning
  • Hallucination (AI)
  • Generative AI
  • AI agent
  • Few-shot learning

Sources

  • Lewis et al. — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Meta AI, 2020)
  • Menlo Ventures — State of Generative AI in the Enterprise (2025)
  • Stanford HAI — Foundation Model Transparency Index (2026)
