Srikanth Bhakthan
AI on AI Podcast
REFRAG — Rethinking RAG-based Decoding

#RAG #meta #superintelligence

Metadata

- Title: REFRAG: Rethinking RAG based Decoding

- Authors: Xiaoqiang Lin, Aritra Ghosh, Bryan Kian Hsiang Low, Anshumali Shrivastava, Vijai Mohan

- Affiliations: Meta Superintelligence Labs; National University of Singapore; Rice University

- Date: September 3, 2025

- arXiv: 2509.01092v1

- Correspondence: Aritra Ghosh (arighosh@meta.com)

- Code: Will be available at https://github.com/facebookresearch/refrag


Executive Summary

- What it is: REFRAG is an efficient decoding framework for Retrieval-Augmented Generation (RAG) that replaces most token-level context with precomputed, compressed chunk embeddings, selectively expanding only important chunks during decoding.

- Why it matters: It dramatically reduces time-to-first-token (TTFT), memory, and cost for long-context LLM inference common in RAG, multi-turn, and long-document tasks, while preserving accuracy.

- Headline results:

  - Up to 30.85× TTFT acceleration (3.75× over the prior state of the art, CEPE), with no loss in perplexity.

  - Extends the effective context window by up to 16× via compression, enabling better accuracy at similar or lower latency.

  - With k=16 compression at 16K context: 16.53× TTFT speedup with cached embeddings (8.59× without cache) and up to 6.78× throughput acceleration.

  - Average 9.3% perplexity improvement over CEPE at k=16 across four datasets; parity with CEPE at k=32 at higher speed.

Main Themes

- Specialized optimization for RAG: RAG contexts are dominated by retrieved passages, many of which are irrelevant or mutually independent, yielding block-diagonal attention structures where most cross-passage attention is negligible.

- Compute and memory are wasted on irrelevant context: REFRAG exploits this sparsity by compressing context chunks into precomputed embeddings and only expanding when necessary.

- No architecture surgery to the decoder: Works with existing decoder-only LLMs (e.g., LLaMA) and lightweight encoders (e.g., RoBERTa), using a projection layer to align spaces.

How REFRAG Works

- Compress, sense, expand pipeline:

  1) Compress: Split the context into chunks of k tokens; encode each chunk with a lightweight encoder (e.g., RoBERTa) to produce a chunk embedding; project it into the decoder's token-embedding space.

  2) Sense (selective compression): A reinforcement learning (RL) policy decides which chunks to expand back to full tokens and which to keep as compressed embeddings.

  3) Expand: Chunks selected for expansion are fed to the decoder as full tokens. They can sit at arbitrary positions in the prompt without breaking autoregressive causality (“compress anywhere”).

- Inputs to the decoder: Question tokens + a sequence that mixes token embeddings (expanded chunks) and projected chunk embeddings (compressed chunks).

- Key properties:

  - Precomputation and caching: Chunk embeddings are precomputable and reusable across inferences.

  - Complexity scaling: Attention costs scale with the number of chunks instead of the number of tokens.

  - Autoregressive preservation: Unlike CEPE (prefix-only), REFRAG supports compression anywhere in the prompt, enabling multi-turn and agentic workflows.
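The compress, sense, expand steps above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `split_into_chunks` and `build_decoder_input`, the `("tok", ...)` / `("emb", ...)` slot tags, and the hand-written `expand` decisions are all assumptions made for the example. A real system would replace each `("emb", i)` slot with the projected encoder embedding of chunk `i` and would obtain `expand` from the RL policy.

```python
from typing import List, Sequence, Tuple, Union

# One decoder input position: either a raw token from an expanded chunk,
# or a placeholder for one projected chunk embedding (hypothetical tags).
Slot = Tuple[str, Union[str, int]]


def split_into_chunks(tokens: Sequence[str], k: int) -> List[List[str]]:
    """Split the context into consecutive chunks of at most k tokens."""
    if k <= 0:
        raise ValueError("k must be a positive chunk size")
    return [list(tokens[i:i + k]) for i in range(0, len(tokens), k)]


def build_decoder_input(
    question: Sequence[str],
    chunks: Sequence[Sequence[str]],
    expand: Sequence[bool],
) -> List[Slot]:
    """Mix expanded tokens and compressed chunk slots, keeping chunk order."""
    if len(chunks) != len(expand):
        raise ValueError("need exactly one expand decision per chunk")
    slots: List[Slot] = [("tok", t) for t in question]
    for index, (chunk, keep_tokens) in enumerate(zip(chunks, expand)):
        if keep_tokens:
            # Expanded chunk: every token occupies its own position.
            slots.extend(("tok", t) for t in chunk)
        else:
            # Compressed chunk: one position stands in for the whole chunk.
            slots.append(("emb", index))
    return slots


question = ["q0", "q1"]
context = [f"c{i}" for i in range(64)]
chunks = split_into_chunks(context, k=16)
expand = [False, True, False, False]  # stand-in for the policy's decisions
slots = build_decoder_input(question, chunks, expand)
print(len(question) + len(context), "positions uncompressed,", len(slots), "after mixing")
```

In this toy case the decoder sees 21 positions instead of 66: two question tokens, sixteen tokens from the one expanded chunk, and three single-slot compressed chunks.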

Training and Optimization

- Continual pre-training (CPT) to align encoder and decoder:

  - Reconstruction phase (decoder frozen): Train the encoder and projection layer to reconstruct the original tokens from chunk embeddings, encouraging reliance on contextual rather than parametric memory.

  - Next-paragraph prediction (decoder unfrozen): Train the decoder to predict the following tokens from compressed context.

  - Curriculum learning: Gradually increase chunk count/length; ablations show the curriculum is essential.

- RL-based selective compression:

  - The policy takes chunk embeddings as input; the reward is the negative perplexity on next-paragraph prediction.

  - Sequential selection with masking; GRPO/PPO-style training with normalized grouped rewards.

  - Outperforms random and perplexity-heuristic selection; enables dynamic, on-the-fly adjustment of the compression rate without recomputing embeddings.

- Fine-tuning:

  - Supervised instruction tuning for downstream tasks (RAG QA, multi-turn conversations, summarization).

- Training setup: CPT used a SlimPajama subset (Books + ArXiv; 20B tokens); LLaMA-2-7B is the primary decoder, with RoBERTa encoders.
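To make the RL-based selective compression bullets more concrete, here is a rough sketch of two ingredients they mention: group-normalized rewards and sequential selection with masking. Everything here is an assumption for illustration: the function names, the greedy arg-max (a real policy would typically sample from its distribution during training), and the example perplexities and logits are made up, not taken from the paper.

```python
import math
from typing import List, Sequence


def group_normalized_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of rollouts: (r - mean) / std."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def select_chunks_sequentially(logits: Sequence[float], num_expand: int) -> List[int]:
    """Pick chunks one at a time, masking each pick so it cannot repeat."""
    masked = list(logits)
    chosen: List[int] = []
    for _ in range(min(num_expand, len(masked))):
        best = max(range(len(masked)), key=lambda i: masked[i])
        chosen.append(best)
        masked[best] = float("-inf")  # mask the chunk that was just selected
    return chosen


# Hypothetical perplexities from four rollouts of the same prompt.
perplexities = [12.0, 10.0, 14.0, 10.0]
rewards = [-ppl for ppl in perplexities]  # reward = negative perplexity
advantages = group_normalized_advantages(rewards)

# Hypothetical policy scores for four chunks; expand the two best.
picked = select_chunks_sequentially([0.1, 2.0, -1.0, 0.7], num_expand=2)
print(advantages, picked)
```

Rollouts with lower perplexity than the group average receive positive advantages, so the policy is pushed toward expansion choices that help next-paragraph prediction.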

Performance Highlights

- Latency and throughput

  - TTFT: Up to 30.85× acceleration (3.75× over CEPE). At 16K context and k=16: 16.53× with cached embeddings, 8.59× without cache.

  - TTIT (time-to-iterative-token): Up to ~3× acceleration in long-context scenarios.

  - Throughput: Up to 6.78× over baseline LLaMA; CEPE can degrade TTIT due to extra projections.

  - Memory: KV cache reduced roughly by a factor of k on context-dominant inputs; attention and KV costs scale with chunk count.

- Perplexity/accuracy

  - No loss in perplexity vs full context in many settings; matches CEPE at higher compression (k=32).

  - k=16: 9.3% average perplexity improvement over CEPE across four datasets.

- Context extension

  - Effective window extended up to 16× via compression, improving downstream accuracy without higher latency.
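The "KV cache reduced roughly by a factor of k" claim above can be checked with back-of-the-envelope arithmetic. The sketch below is an estimate under stated assumptions, not a measurement: the layer count (32) and hidden size (4096) are the publicly documented LLaMA-2-7B configuration rather than figures from this summary, the helper names are invented, and question tokens, expanded chunks, and encoder cost are ignored.

```python
def kv_cache_bytes(positions: int, num_layers: int, hidden_size: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes of keys plus values cached for `positions` decoder inputs."""
    # Each layer stores one key and one value vector of hidden_size values
    # per position; bfloat16 uses 2 bytes per value.
    return 2 * num_layers * hidden_size * bytes_per_value * positions


def compressed_positions(context_tokens: int, k: int) -> int:
    """Decoder positions left when every k-token chunk becomes one slot."""
    return -(-context_tokens // k)  # ceiling division


LAYERS, HIDDEN = 32, 4096  # assumed LLaMA-2-7B configuration
context_tokens, k = 16_384, 16

full_bytes = kv_cache_bytes(context_tokens, LAYERS, HIDDEN)
refrag_bytes = kv_cache_bytes(compressed_positions(context_tokens, k), LAYERS, HIDDEN)
print(full_bytes / 2**30, "GiB vs", refrag_bytes / 2**30, "GiB")
```

Under these assumptions a fully compressed 16K-token context drops from 8 GiB to 0.5 GiB of KV cache per sequence, which is the factor-of-k reduction; any chunk the policy expands moves the real number back toward the uncompressed figure.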

RAG Results (Business-Relevant)

- Setup: 400M-passage corpus (Wikipedia + CommonCrawl), DRAGON+ retriever; evaluation across 16 RAG tasks (e.g., NQ, FEVER, MMLU, BoolQ, SIQA, PIQA, HellaSwag, Winogrande).

- Equal passages (strong retriever):

  - With up to 10 passages, REFRAG matches LLaMA accuracy while delivering ~5.26× TTFT speedup.

- Equal latency (strong retriever):

  - 8 passages in REFRAG vs 1 passage in LLaMA yields a +1.22% average improvement across 16 tasks.

- Weak retriever (realistic noise):

  - With 10 passages: +0.71% accuracy and ~5.26× TTFT speedup vs LLaMA.

  - Equal latency (8 vs 1 passage): +1.93% average gain across 16 tasks.

- Multi-choice tasks: REFRAG shows especially strong gains under both short- and long-context regimes.

Multi-Turn Conversation with RAG

- Datasets: TopiOCQA, ORConvQA, QReCC.

- Outcome: REFRAG consistently outperforms LLaMA fine-tunes as turns/passages increase because compression enables retaining more conversational history and evidence within fixed latency budgets.

Long-Document Summarization

- Datasets: ArXiv and PubMed long-form summarization (article-to-abstract).

- Same decoder-token budget: REFRAG achieves the best ROUGE scores; higher compression (e.g., k=16) can outperform lower compression at small decoder budgets by packing more content.

Comparisons to Alternatives

- CEPE (parallel context encoder):

  - Pros for REFRAG: Larger speedups (TTFT, throughput), better perplexity at common settings, supports compression anywhere (CEPE is prefix-only and less suitable for multi-turn/agentic uses).

- REPLUG:

  - REFRAG matches or exceeds performance at equal latency with more passages included.

- LLaMA-32K:

  - REFRAG avoids OOM failures at high passage counts and delivers higher accuracy under equal latency/token budgets.

Engineering and Integration Notes

- Models: Decoder-only LLM (e.g., LLaMA-2-7B, validated also with LLaMA-3.1-8B and LLaMA-3.2-3B); lightweight encoder (RoBERTa-Base/Large).

- Caching: Store and reuse chunk embeddings for frequently seen or persistent corpus passages.

- Policy: RL-based selective expansion; the effective compression rate is k/(1 − p + kp), where p is the fraction of chunks expanded back to full tokens.

- Hardware: Training used FSDP on H100 clusters; inference latency measured on A100; bfloat16 throughout.

- Retriever: DRAGON+; standard 400M-passage store with passages <200 words.
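The effective compression rate in the Policy note follows from counting decoder positions. With n chunks of k tokens each, expanding a fraction p of them leaves p·n·k token positions plus (1 − p)·n embedding positions, so the ratio of original to remaining positions is n·k / (p·n·k + (1 − p)·n) = k/(1 − p + kp). A small sketch, where the function name is an assumption and p is read as the fraction of chunks expanded:

```python
def effective_compression_rate(k: int, p: float) -> float:
    """Original positions divided by positions left after selective expansion."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a fraction between 0 and 1")
    return k / (1 - p + k * p)


for p in (0.0, 0.25, 0.5, 1.0):
    print(f"k=16, p={p}: {effective_compression_rate(16, p):.2f}x")
```

At p = 0 the rate equals k and at p = 1 it falls to 1 (no compression). The drop is steep: expanding a quarter of the chunks at k=16 already reduces the effective rate to about 3.4×, which is why the expansion budget matters as much as k itself.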

Business Applications and Impact

- High-throughput search and question answering: Substantially lower latency per query with equal or better answer quality; reinvest saved latency to consider more retrieved passages.

- Customer support and agent workflows: Keep longer dialog and richer evidence within the same latency/compute budget; improves response quality and continuity.

- Knowledge management and long-doc analytics: Summarize or reason over long documents within standard context limits by compressing context.

- Cost efficiency at scale: Reduced TTFT and KV cache memory enable higher QPS, fewer GPUs, or both; caching of chunk embeddings amortizes costs across repeated content.

- Platform differentiation: “Compress anywhere” lets you add or move evidence across the prompt (not just prefixes), enabling flexible agentic orchestration.

Practical Adoption Guidance

- Start where retrieval dominates: Deploy REFRAG on RAG pipelines where most context tokens are from retrieved passages with low cross-passage interaction.

- Precompute aggressively: Build an embedding cache tied to your passage store; update as the corpus refreshes.

- Tune compression rates: Begin with k=8–16; use the RL policy to expand critical chunks dynamically. Monitor accuracy–latency trade-offs.

- Maintain causality for agents: Use compress-anywhere to keep tool outputs and memory inserted at arbitrary turns without breaking autoregression.

- Evaluate under your retriever: Benefits are larger with weaker or noisy retrievers; reinvest latency savings in more passages.
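For the "precompute aggressively" advice above, one possible shape for the embedding cache is sketched below. This is an illustrative design, not something the paper prescribes: the class name `ChunkEmbeddingCache`, the in-memory dict, the toy encoder, and the choice to key entries on an encoder version plus a SHA-256 hash of the chunk text are all assumptions. Keying on content means an edited passage misses the cache automatically, and including the encoder version means retraining the encoder invalidates old entries.

```python
import hashlib
from typing import Callable, Dict, List

Embedding = List[float]


class ChunkEmbeddingCache:
    """In-memory cache of chunk embeddings, keyed by encoder version and content."""

    def __init__(self, encode: Callable[[str], Embedding], encoder_version: str):
        self._encode = encode
        self._version = encoder_version
        self._store: Dict[str, Embedding] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, chunk_text: str) -> str:
        digest = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
        return f"{self._version}:{digest}"

    def get(self, chunk_text: str) -> Embedding:
        """Return the cached embedding, computing and storing it on a miss."""
        key = self._key(chunk_text)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._encode(chunk_text)
        return self._store[key]


def toy_encoder(text: str) -> Embedding:
    """Stand-in for a real chunk encoder; returns a one-dimensional vector."""
    return [float(len(text))]


cache = ChunkEmbeddingCache(toy_encoder, encoder_version="v1")
cache.get("passage about retrieval")
cache.get("passage about retrieval")  # served from cache
cache.get("passage about decoding")
print("hits:", cache.hits, "misses:", cache.misses)
```

A production version would likely persist entries alongside the passage store and evict entries for passages that are deleted from the corpus.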

Limitations and Open Questions

- Dependence on RAG structure: Gains rely on block-diagonal attention typical of concatenated, diverse passages; benefits for dense cross-attending contexts may be smaller.

- Training recipe complexity: Reconstruction + curriculum + CPT + RL adds training complexity; pretraining resources required.

- Embedding freshness and storage: Precompute/store lifecycle and cache invalidation for dynamic corpora must be engineered.

- Extreme compression: Very high compression (e.g., k=64) degrades performance; practical ceiling at k≈32 per reported results.

Notable Quotes from the Text

- “In RAG, much of the LLM context consists of concatenated passages from retrieval, with only a small subset directly relevant to the query.”

- “We demonstrate a 30.85× the time-to-first-token acceleration (3.75× improvement to previous work) without loss in perplexity.”

- “REFRAG supports compression of token chunks at arbitrary positions … while preserving the autoregressive nature of the decoder.”

- “This ‘compress anywhere’ capability is further enhanced by a lightweight reinforcement learning (RL) policy that selectively determines when full chunk token input is necessary.”

- “With k = 32, TTFT acceleration reaches 32.99× compared to LLaMA … while maintaining similar performance to CEPE.”

- “With a compression rate of 16, we achieve a 9.3% average perplexity improvement over CEPE across four datasets.”

Key Numbers at a Glance

- TTFT speedup: Up to 30.85× (3.75× over CEPE)

- Throughput speedup: Up to 6.78× over LLaMA

- TTIT: Up to ~3× faster in long context scenarios

- Context extension: Up to 16× effective window

- Perplexity: 9.3% average improvement over CEPE at k=16; parity at k=32

- Equal latency RAG (strong retriever): +1.22% avg across 16 tasks

- Equal latency RAG (weak retriever): +1.93% avg across 16 tasks

Data and Benchmarks

- Pretraining data: SlimPajama subset (Books, ArXiv), 20B tokens

- RAG evaluation: NQ, FEVER, TQA, WebQA, FreebaseQA, MS MARCO, MMLU, BoolQ, SIQA, PIQA, HellaSwag, Winogrande, ECQA, StrategyQA, etc.

- Multi-turn: TopiOCQA, ORConvQA, QReCC

- Summarization: ArXiv and PubMed long-document summarization

- Baselines: LLaMA (No/Full context), LLaMA-32K, CEPE/CEPED, REPLUG

Why It Matters

- RAG is a dominant pattern for knowledge-intensive LLM applications; REFRAG turns retrieval-time structure into decoding-time efficiency.

- Enables latency-sensitive, scalable deployments (search, support, analytics) without sacrificing quality and while supporting multi-turn agentic use cases.

Links

- Paper: arXiv: 2509.01092v1

- Code (when released): https://github.com/facebookresearch/refrag


Created with AI
