Exploring Optimal Prefetching Sizes in RaLMSpec to Enhance Retrieval-Augmented Generation Efficiency

Authors

  • Shuxin Liu

DOI:

https://doi.org/10.54097/hhff9g78

Keywords:

Retrieval-Augmented Generation, RaLMSpec, Prefetching Optimization, Cache Utilization, Model Efficiency.

Abstract

Retrieval-augmented generation (RAG) frameworks like RaLMSpec enhance language model performance by integrating external knowledge. A key lever for accelerating RaLMSpec is prefetching, which determines how many documents to retrieve in advance so as to balance retrieval speed against cache utilization. This study introduces and evaluates both static and dynamic prefetching strategies for optimizing RaLMSpec performance. Static prefetching uses a fixed prefetch size, while dynamic prefetching adjusts the size at runtime based on factors including task complexity, cache hit rate, and retrieval latency. Experiments across multiple datasets, retrievers, and language models show that dynamic prefetching reduces latency by 18% on average, outperforming static strategies. By adapting to varying task demands, dynamic prefetching strikes a better balance between retrieval and caching efficiency. Among static strategies, a prefetch size of 64 offers the best trade-off between latency reduction and cache utilization. The results indicate that dynamic prefetching is preferable in environments with fluctuating task complexity, while static prefetching with a size of 64 is effective for predictable workloads. This study provides practical insights for improving RAG system efficiency and suggests future directions, including machine learning-based adaptation and hardware optimizations.
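The abstract contrasts a fixed prefetch size with a dynamic policy driven by signals such as cache hit rate and retrieval latency. The sketch below illustrates, in Python, one way such a policy could be expressed; the signal names, thresholds, and size bounds are illustrative assumptions for exposition, not the paper's actual implementation.

# Minimal sketch of a dynamic prefetch-size policy for a RaLMSpec-style cache.
# The signals (cache_hit_rate, retrieval_latency_ms) and the thresholds below
# are hypothetical placeholders, not values reported in the study.

def choose_prefetch_size(cache_hit_rate, retrieval_latency_ms,
                         current_size=64, min_size=16, max_size=256):
    """Adjust the number of documents prefetched per retrieval step."""
    # Low hit rate: the speculative cache misses often, so prefetch more documents.
    if cache_hit_rate < 0.5:
        current_size = min(current_size * 2, max_size)
    # High hit rate but slow retrieval: shrink the batch to cut per-step latency.
    elif cache_hit_rate > 0.8 and retrieval_latency_ms > 50:
        current_size = max(current_size // 2, min_size)
    return current_size

# Example: a 40% hit rate doubles the prefetch size from 64 to 128.
print(choose_prefetch_size(cache_hit_rate=0.4, retrieval_latency_ms=30))

A static strategy corresponds to calling the retriever with a constant size (e.g., the 64 reported as the best static trade-off), whereas the dynamic variant re-evaluates the size each retrieval step from runtime feedback.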



Published

11-05-2025

How to Cite

Liu, S. (2025). Exploring Optimal Prefetching Sizes in RaLMSpec to Enhance Retrieval-Augmented Generation Efficiency. Highlights in Science, Engineering and Technology, 138, 24-31. https://doi.org/10.54097/hhff9g78