Developing a RAG-PDF Reader Using Instructor XL and Falcon 7B


Notice

This is an unedited manuscript accepted for publication and provided as an Article in Press for early access at the author’s request. The article will undergo copyediting, typesetting, and galley proof review before final publication. Please be aware that errors may be identified during production that could affect the content. All legal disclaimers of the journal apply.

Year: 2025 | Volume: 16 | Issue: 01 | Page: –
By Daisy Vyas, Prachi Sharma, Saroj Prajapat, Saumya Mittal

  1. Student, Department of Computer Science & Engineering, Mody University of Science & Technology, Laxmangarh, Rajasthan, India
  2. Student, Department of Computer Science & Engineering, Mody University of Science & Technology, Laxmangarh, Rajasthan, India
  3. Student, Department of Computer Science & Engineering, Mody University of Science & Technology, Laxmangarh, Rajasthan, India
  4. Student, Department of Computer Science & Engineering, Mody University of Science & Technology, Laxmangarh, Rajasthan, India

Abstract

This paper describes the development of a Retrieval-Augmented Generation (RAG) application that extracts text from complex PDF documents, retrieves the passages relevant to a user's query, and synthesizes a response from them. The system uses Instructor XL to generate semantic embeddings and Falcon 7B for language generation, and is intended for document comprehension in academic, research, and professional settings. Its pipeline combines PDF text processing, embedding storage in FAISS for fast similarity-based retrieval, and real-time response generation, turning unstructured document text into accessible, contextually accurate information. A Streamlit interface lets users upload PDFs, submit queries, and receive coherent responses in real time. The paper also discusses potential enhancements, including multilingual support and scaling across multiple domains, that would make the system useful for research, education, and industry applications. Its key features are interactive document querying, efficient retrieval, and response generation, which are most valuable in data-rich environments where precise access to information is critical.

Keywords: Application development, RAG, NLP, LLMs, Instructor XL, Falcon 7B
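
The pipeline summarized in the abstract (split the extracted text into chunks, embed them, retrieve by similarity, then prompt the generator) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the bag-of-words `embed` function stands in for Instructor XL, the brute-force `retrieve` loop stands in for a FAISS index, and the string returned by `build_prompt` is what would be passed to Falcon 7B. All function names, the chunk sizes, and the sample sentences are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative)."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def embed(text):
    """Toy bag-of-words vector; an assumption standing in for Instructor XL."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Brute-force top-k search; a vector index such as FAISS would replace this."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, context_chunks):
    """Assemble the retrieved context and the question into one prompt string."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


# Hypothetical chunks, as if already extracted from an uploaded PDF.
doc_chunks = [
    "FAISS stores embeddings and supports fast similarity search.",
    "Streamlit provides the upload form and the query box.",
    "Falcon 7B generates the final answer from retrieved context.",
]
question = "Which component stores embeddings?"
top = retrieve(question, doc_chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a full system the prompt would be sent to the language model and the generated text returned to the user; the sketch stops at prompt construction so that it runs without any model weights or third-party packages.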

[This article belongs to Journal of Computer Technology & Applications (jocta)]

How to cite this article:
Daisy Vyas, Prachi Sharma, Saroj Prajapat, Saumya Mittal. Developing a RAG-PDF Reader Using Instructor XL and Falcon 7B. Journal of Computer Technology & Applications. 2024; 16(01):-.
How to cite this URL:
Daisy Vyas, Prachi Sharma, Saroj Prajapat, Saumya Mittal. Developing a RAG-PDF Reader Using Instructor XL and Falcon 7B. Journal of Computer Technology & Applications. 2024; 16(01):-. Available from: https://journals.stmjournals.com/jocta/article=2024/view=191735


Regular Issue | Subscription | Original Research
Volume 16
Issue 01
Received 13/11/2024
Accepted 10/12/2024
Published 31/12/2024