Metadata
- Title: REFRAG: Rethinking RAG based Decoding
- Authors: Xiaoqiang Lin, Aritra Ghosh, Bryan Kian Hsiang Low, Anshumali Shrivastava, Vijai Mohan
- Affiliations: Meta Superintelligence Labs; National University of Singapore; Rice University
- Date: September 3, 2025
- arXiv: 2509.01092v1
- Correspondence: Aritra Ghosh (arighosh@meta.com)
- Code: Will be available at https://github.com/facebookresearch/refrag
Executive Summary
- What it is: REFRAG is an efficient decoding framework for Retrieval-Augmented Generation (RAG) that replaces most token-level context with precomputed, compressed chunk embeddings, selectively expanding only important chunks during decoding.
- Why it matters: It dramatically reduces time-to-first-token (TTFT), memory, and cost for long-context LLM inference common in RAG, multi-turn, and long-document tasks, while preserving accuracy.
- Headline results:
- Up to 30.85× TTFT acceleration (3.75× over the prior state of the art, CEPE), with no loss in perplexity.
- Extends effective context window by up to 16× via compression, enabling better accuracy at similar or lower latency.
- With k=16 compression at 16K context: 16.53× TTFT speedup with cached embeddings (8.59× without cache), up to 6.78× throughput acceleration.
- Average 9.3% perplexity improvement over CEPE at k=16 across four datasets; parity with CEPE at k=32 with higher speed.
Main Themes
- Specialized optimization for RAG: RAG contexts are dominated by retrieved passages, many of which are irrelevant or mutually independent, yielding block-diagonal attention structures where most cross-passage attention is negligible.
- Compute and memory are wasted on irrelevant context: REFRAG exploits this sparsity by compressing context chunks into precomputed embeddings and only expanding when necessary.
- No architecture surgery to the decoder: Works with existing decoder-only LLMs (e.g., LLaMA) and lightweight encoders (e.g., RoBERTa), using a projection layer to align spaces.
How REFRAG Works
- Compress, sense, expand pipeline:
1) Compress: Split the context into chunks of k tokens; encode each chunk with a lightweight encoder (e.g., RoBERTa) to produce one chunk embedding per chunk; project each embedding into the decoder's token-embedding space.
2) Sense (selective compression): A reinforcement learning (RL) policy decides which chunks to expand back to full tokens and which to keep as compressed embeddings.
3) Expand: Chunks selected for expansion are fed to the decoder as their original tokens. They can sit at any position in the prompt, and autoregressive causality is preserved (“compress anywhere”).
- Inputs to the decoder: Question tokens + a sequence that mixes token embeddings (expanded chunks) and projected chunk embeddings (compressed chunks).
- Key properties:
- Precomputation and caching: Chunk embeddings are precomputable and reusable across inferences.
- Complexity scaling: For compressed context, attention cost scales with the number of chunks rather than the number of tokens.
- Autoregressive preservation: Unlike CEPE (prefix-only), REFRAG supports compression anywhere in the prompt, enabling multi-turn and agentic workflows.
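The compress and expand steps above can be illustrated with a minimal sketch. It only shows how the decoder input sequence is assembled. The helper names (`split_into_chunks`, `build_decoder_inputs`) and the toy embedding functions are hypothetical stand-ins, not the authors' code; a real system would use the decoder's embedding table and a trained encoder plus projection layer, and the set of chunks to expand would come from the RL policy.

```python
from typing import Callable, List, Sequence, Set

Vector = List[float]


def split_into_chunks(tokens: Sequence[int], k: int) -> List[List[int]]:
    """Split tokens into consecutive chunks of k tokens (the last may be shorter)."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return [list(tokens[i:i + k]) for i in range(0, len(tokens), k)]


def build_decoder_inputs(
    question_tokens: Sequence[int],
    context_tokens: Sequence[int],
    k: int,
    expand: Set[int],
    embed_token: Callable[[int], Vector],
    encode_chunk: Callable[[List[int]], Vector],
) -> List[Vector]:
    """Question token embeddings first, then, in original chunk order, either
    one vector per compressed chunk or one vector per token of an expanded chunk."""
    inputs = [embed_token(t) for t in question_tokens]
    for index, chunk in enumerate(split_into_chunks(context_tokens, k)):
        if index in expand:
            # Expanded chunk: keep every token embedding.
            inputs.extend(embed_token(t) for t in chunk)
        else:
            # Compressed chunk: a single (projected) chunk embedding.
            inputs.append(encode_chunk(chunk))
    return inputs


# Toy stand-ins (assumptions for illustration only).
def toy_embed_token(token: int) -> Vector:
    return [float(token), 0.0]


def toy_encode_chunk(chunk: List[int]) -> Vector:
    return [sum(chunk) / len(chunk), 1.0]


question = list(range(4))          # 4 question tokens
context = list(range(100, 164))    # 64 context tokens -> 4 chunks at k=16
mixed = build_decoder_inputs(
    question, context, 16, {1}, toy_embed_token, toy_encode_chunk
)
print(len(mixed), "decoder positions instead of", len(question) + len(context))
```

In this toy setting, expanding one of four chunks leaves 4 + 16 + 3 = 23 decoder positions instead of 68, which is where the attention and KV-cache savings come from.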
Training and Optimization
- Continual pre-training (CPT) to align encoder and decoder:
- Reconstruction phase (decoder frozen): Learn encoder+projection to reconstruct original tokens from chunk embeddings, encouraging reliance on contextual rather than parametric memory.
- Next paragraph prediction: Unfreeze decoder to predict next tokens with compressed context.
- Curriculum learning: Gradually increase chunk count/length; ablations show curriculum is essential.
- RL-based selective compression:
- Policy takes chunk embeddings; reward is negative perplexity on next paragraph prediction.
- Sequential selection with masking; GRPO/PPO-style training with normalized grouped rewards.
- Outperforms random and perplexity-heuristic selection; enables dynamic, on-the-fly compression rate adjustment without recomputing embeddings.
- Fine-tuning:
- Supervised instruction tuning for downstream tasks (RAG QA, multi-turn conversations, summarization).
- Training setup: CPT on SlimPajama (Books + ArXiv; 20B tokens); LLaMA-2-7B as the primary decoder; RoBERTa encoders.
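As a rough illustration of the RL-based selective compression described above, the sketch below shows three pieces: a reward defined as negative perplexity, sequential chunk selection with masking, and group-normalized advantages. The function names are hypothetical, the mean/standard-deviation normalization is the common GRPO-style form and is assumed here rather than taken from the paper, and the selection step reuses fixed logits instead of querying a learned policy at each step.

```python
import math
import random
from statistics import mean, pstdev
from typing import List, Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(-mean natural-log probability of the predicted tokens)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def sample_expansion_set(
    logits: Sequence[float], n_expand: int, rng: random.Random
) -> List[int]:
    """Pick chunks to expand one at a time; chunks already picked are masked out."""
    available = list(range(len(logits)))
    chosen: List[int] = []
    for _ in range(min(n_expand, len(logits))):
        top = max(logits[i] for i in available)
        # Softmax weights over the chunks that are still unmasked.
        weights = [math.exp(logits[i] - top) for i in available]
        pick = rng.choices(available, weights=weights, k=1)[0]
        chosen.append(pick)
        available.remove(pick)  # mask the chosen chunk for later steps
    return chosen


def group_normalized_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """Normalize rewards within one group of rollouts (zero mean, unit scale)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy example: three rollouts for the same prompt, each with made-up
# next-paragraph token log-probabilities.
rollout_log_probs = [
    [math.log(0.5)] * 4,
    [math.log(0.25)] * 4,
    [math.log(0.125)] * 4,
]
rewards = [-perplexity(lp) for lp in rollout_log_probs]
advantages = group_normalized_advantages(rewards)
picked = sample_expansion_set([0.1, 2.0, -1.0, 0.5], 2, random.Random(0))
print(rewards, advantages, picked)
```

Rollouts with lower perplexity receive higher advantages, so a policy-gradient update would push the policy toward the expansion choices that produced them.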
Performance Highlights
- Latency and throughput
- TTFT: Up to 30.85× acceleration (3.75× over CEPE). At 16K context and k=16: 16.53× with cached embeddings, 8.59× without cache.
- TTIT (time-to-iterative-token): Up to ~3× acceleration in long-context scenarios.
- Throughput: Up to 6.78× over baseline LLaMA; CEPE can degrade TTIT due to extra projections.
- Memory: KV cache reduced roughly by factor k on context-dominant inputs; attention and KV costs scale with chunk count.
- Perplexity/accuracy
- No loss in perplexity vs full context in many settings; matches CEPE at higher compression (k=32).
- k=16: 9.3% average perplexity gain over CEPE across four datasets.
- Context extension
- Effective window extended up to 16× via compression, improving downstream accuracy without higher latency.
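To make the memory claim concrete, here is a back-of-the-envelope sketch of KV cache size before and after compression. The formula is the standard one for a decoder-only transformer (keys plus values, per layer, per KV head). The default values assume LLaMA-2-7B's published configuration (32 layers, 32 KV heads, head dimension 128) with bfloat16 storage and batch size 1; the function names are illustrative, and the compressed case assumes every context chunk stays compressed.

```python
def kv_cache_bytes(
    seq_len: int,
    n_layers: int = 32,
    n_kv_heads: int = 32,
    head_dim: int = 128,
    bytes_per_value: int = 2,  # bfloat16
) -> int:
    """Bytes needed to cache keys and values for seq_len positions (batch size 1)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len


def compressed_positions(context_tokens: int, k: int, question_tokens: int = 0) -> int:
    """Decoder positions when every context chunk of k tokens becomes one embedding."""
    n_chunks = -(-context_tokens // k)  # ceiling division
    return question_tokens + n_chunks


full = kv_cache_bytes(16384)
compressed = kv_cache_bytes(compressed_positions(16384, 16))
print(f"full context: {full / 2**30:.2f} GiB")
print(f"k=16 compressed: {compressed / 2**30:.2f} GiB")
```

Under these assumptions a 16K-token context needs 8 GiB of KV cache uncompressed and 0.5 GiB at k=16, a factor-of-k reduction. Expanded chunks and question tokens reduce the saving accordingly.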
RAG Results (Business-Relevant)
- Setup: 400M-passage corpus (Wikipedia + CommonCrawl), DRAGON+ retriever; evaluation across 16 RAG tasks (e.g., NQ, FEVER, MMLU, BoolQ, SIQA, PIQA, HellaSwag, Winogrande).
- Equal passages (strong retriever):
- With up to 10 passages, REFRAG matches LLaMA accuracy while delivering ~5.26× TTFT speedup.
- Equal latency:
- 8 passages in REFRAG vs 1 passage in LLaMA yields +1.22% average improvement across 16 tasks.
- Weak retriever (realistic noise):
- With 10 passages: +0.71% accuracy and ~5.26× TTFT speedup vs LLaMA.
- Equal latency (8 vs 1 passage): +1.93% average gain across 16 tasks.
- Multi-choice tasks: REFRAG shows especially strong gains under both short- and long-context regimes.
Multi-Turn Conversation with RAG
- Datasets: TopiOCQA, ORConvQA, QReCC.
- Outcome: REFRAG consistently outperforms fine-tuned LLaMA baselines as the number of turns and passages increases, because compression lets it retain more conversational history and evidence within a fixed latency budget.
Long-Document Summarization
- Datasets: ArXiv and PubMed long-form summarization (article-to-abstract).
- Same decoder-token budget: REFRAG achieves the best ROUGE scores; higher compression (e.g., k=16) can outperform lower compression at small decoder budgets by packing more content.
Comparisons to Alternatives
- CEPE (parallel context encoder):
- Pros for REFRAG: Larger speedups (TTFT, throughput), better perplexity at common settings, supports compression anywhere (CEPE is prefix-only and less suitable for multi-turn/agentic uses).
- REPLUG:
- REFRAG matches or exceeds performance at equal latency with more passages included.
- LLaMA-32K:
- REFRAG avoids OOM failures at high passage counts and delivers higher accuracy under equal latency/token budgets.
Engineering and Integration Notes
- Models: Decoder-only LLM (e.g., LLaMA-2-7B, validated also with LLaMA-3.1-8B and LLaMA-3.2-3B); lightweight encoder (RoBERTa-Base/Large).
- Caching: Store and reuse chunk embeddings for frequently seen or persistent corpus passages.
- Policy: RL-based selective expansion; the effective compression rate is k/(1 − p + k p), where p is the fraction of chunks expanded to full tokens.
- Hardware: Training used FSDP on H100 clusters; inference latency measured on A100; bfloat16 throughout.
- Retriever: DRAGON+; standard 400M-passage store with passages <200 words.
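The effective compression rate k/(1 − p + k p) quoted above follows from counting decoder positions: with T context tokens there are T/k chunks; a fraction p of them is expanded to k tokens each (p·T positions) and the rest stay as single embeddings ((1 − p)·T/k positions), so the ratio of T to the total is k/(1 − p + k p). The sketch below evaluates the formula and inverts it; the function names are illustrative, not from the paper.

```python
def effective_compression_rate(k: int, p: float) -> float:
    """Context tokens per decoder position when a fraction p of chunks is expanded."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return k / (1.0 - p + k * p)


def expansion_fraction_for_rate(k: int, target_rate: float) -> float:
    """Invert the formula: the fraction p that yields the requested rate."""
    if k <= 1:
        raise ValueError("k must be greater than 1 to invert the formula")
    if not 1.0 <= target_rate <= k:
        raise ValueError("target_rate must lie in [1, k]")
    return (k / target_rate - 1.0) / (k - 1.0)


for p in (0.0, 0.25, 0.5, 1.0):
    print(f"k=16, p={p:.2f} -> rate {effective_compression_rate(16, p):.2f}")
```

The two endpoints behave as expected: p = 0 gives the full rate k, and p = 1 gives a rate of 1 (no compression). Because the rate depends only on p, it can be adjusted at inference time without recomputing chunk embeddings.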
Business Applications and Impact
- High-throughput search and question answering: Substantially lower latency per query with equal or better answer quality; reinvest saved latency to consider more retrieved passages.
- Customer support and agent workflows: Keep longer dialog and richer evidence within the same latency/compute budget; improves response quality and continuity.
- Knowledge management and long-doc analytics: Summarize or reason over long documents within standard context limits by compressing context.
- Cost efficiency at scale: Reduced TTFT and KV cache memory enable higher QPS, fewer GPUs, or both; caching of chunk embeddings amortizes costs across repeated content.
- Platform differentiation: “Compress anywhere” lets you add or move evidence across the prompt (not just prefixes), enabling flexible agentic orchestration.
Practical Adoption Guidance
- Start where retrieval dominates: Deploy REFRAG on RAG pipelines where most context tokens are from retrieved passages with low cross-passage interaction.
- Precompute aggressively: Build an embedding cache tied to your passage store; update as the corpus refreshes.
- Tune compression rates: Begin with k=8–16; use the RL policy to expand critical chunks dynamically. Monitor accuracy–latency trade-offs.
- Maintain causality for agents: Use compress-anywhere to keep tool outputs and memory inserted at arbitrary turns without breaking autoregression.
- Evaluate under your retriever: Benefits are larger with weaker or noisy retrievers; reinvest latency savings in more passages.
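One possible shape for the precomputed embedding cache mentioned above is sketched below. It is an assumption-laden illustration, not part of REFRAG: the class name, the in-memory dictionary, and the keying scheme (SHA-256 over an encoder version string plus the chunk's token ids) are all hypothetical choices. Including the encoder version in the key is one simple way to invalidate stale embeddings when the encoder or projection layer is retrained.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

Vector = List[float]


class ChunkEmbeddingCache:
    """In-memory cache of chunk embeddings keyed by encoder version and content."""

    def __init__(
        self, encode_chunk: Callable[[List[int]], Vector], encoder_version: str
    ) -> None:
        self._encode_chunk = encode_chunk
        self._encoder_version = encoder_version
        self._store: Dict[str, Vector] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, chunk_tokens: Sequence[int]) -> str:
        payload = self._encoder_version + ":" + ",".join(str(t) for t in chunk_tokens)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, chunk_tokens: Sequence[int]) -> Vector:
        """Return the cached embedding, computing and storing it on a miss."""
        key = self._key(chunk_tokens)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._encode_chunk(list(chunk_tokens))
        return self._store[key]


# Toy encoder (stand-in) that records how often it is actually called.
calls: List[List[int]] = []


def toy_encoder(chunk: List[int]) -> Vector:
    calls.append(chunk)
    return [float(sum(chunk))]


cache = ChunkEmbeddingCache(toy_encoder, "toy-encoder-v1")
cache.get([1, 2, 3])
cache.get([1, 2, 3])  # served from the cache, no encoder call
cache.get([4, 5, 6])
print(f"encoder calls: {len(calls)}, hits: {cache.hits}, misses: {cache.misses}")
```

A production deployment would more likely persist embeddings alongside the passage store and refresh entries as the corpus changes; the keying idea carries over.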
Limitations and Open Questions
- Dependence on RAG structure: Gains rely on block-diagonal attention typical of concatenated, diverse passages; benefits for dense cross-attending contexts may be smaller.
- Training recipe complexity: The multi-stage recipe (reconstruction, curriculum-based CPT, RL policy training) adds training complexity and requires continual pre-training resources.
- Embedding freshness and storage: Precompute/store lifecycle and cache invalidation for dynamic corpora must be engineered.
- Extreme compression: Very high compression (e.g., k=64) degrades performance; practical ceiling at k≈32 per reported results.
Notable Quotes from the Text
- “In RAG, much of the LLM context consists of concatenated passages from retrieval, with only a small subset directly relevant to the query.”
- “We demonstrate a 30.85× the time-to-first-token acceleration (3.75× improvement to previous work) without loss in perplexity.”
- “REFRAG supports compression of token chunks at arbitrary positions … while preserving the autoregressive nature of the decoder.”
- “This ‘compress anywhere’ capability is further enhanced by a lightweight reinforcement learning (RL) policy that selectively determines when full chunk token input is necessary.”
- “With k = 32, TTFT acceleration reaches 32.99× compared to LLaMA … while maintaining similar performance to CEPE.”
- “With a compression rate of 16, we achieve a 9.3% average perplexity improvement over CEPE across four datasets.”
Key Numbers at a Glance
- TTFT speedup: Up to 30.85× (3.75× over CEPE)
- Throughput speedup: Up to 6.78× over LLaMA
- TTIT: Up to ~3× faster in long context scenarios
- Context extension: Up to 16× effective window
- Perplexity: 9.3% average improvement over CEPE at k=16; parity with CEPE at k=32
- Equal latency RAG (strong retriever): +1.22% avg across 16 tasks
- Equal latency RAG (weak retriever): +1.93% avg across 16 tasks
Data and Benchmarks
- Pretraining data: SlimPajama subset (Books, ArXiv), 20B tokens
- RAG evaluation: NQ, FEVER, TQA, WebQA, FreebaseQA, MS MARCO, MMLU, BoolQ, SIQA, PIQA, HellaSwag, Winogrande, ECQA, StrategyQA, etc.
- Multi-turn: TopiOCQA, ORConvQA, QReCC
- Summarization: ArXiv and PubMed long-document summarization
- Baselines: LLaMA (No/Full context), LLaMA-32K, CEPE/CEPED, REPLUG
Why It Matters
- RAG is a widely used pattern for knowledge-intensive LLM applications; REFRAG turns retrieval-time structure into decoding-time efficiency.
- Enables latency-sensitive, scalable deployments (search, support, analytics) without sacrificing quality and while supporting multi-turn agentic use cases.
Links
- Paper: arXiv:2509.01092v1
- Code (when released): https://github.com/facebookresearch/refrag