Developing a RAG-PDF Reader Using Instructor XL and Falcon 7B

Year : 2025 | Volume : 16 | Issue : 01 | Page : 45-49
By Daisy Vyas, Prachi Sharma, Saroj Prajapat, Saumya Mittal

  1. Student, Department of Computer Science and Engineering, Mody University of Science & Technology, Laxmangarh, Rajasthan, India
  2. Student, Department of Computer Science and Engineering, Mody University of Science & Technology, Laxmangarh, Rajasthan, India
  3. Student, Department of Computer Science and Engineering, Mody University of Science & Technology, Laxmangarh, Rajasthan, India
  4. Student, Department of Computer Science and Engineering, Mody University of Science & Technology, Laxmangarh, Rajasthan, India

Abstract

This study describes the development of a Retrieval-Augmented Generation (RAG) application that extracts text from complex PDF documents, retrieves the passages relevant to a query, and synthesizes responses from them. The system uses Instructor XL to generate semantic embeddings and Falcon 7B for language generation, and it targets document comprehension in academic, research, and professional environments. The pipeline combines PDF text processing, embedding storage in FAISS for rapid similarity-based retrieval, and real-time response generation, turning unstructured data into accessible, contextually accurate information. A Streamlit interface lets users upload PDFs, submit queries, and receive coherent responses in real time. The study also discusses potential enhancements for scalability across multiple domains, including multilingual support, which would make the system suitable for research, education, and industry applications. Key features are interactive document querying and efficient information retrieval and generation, which are valuable in data-rich environments where precise information access is critical.
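The retrieval flow summarized in the abstract (chunk the document text, embed each chunk, search by similarity, then assemble a prompt for the generator) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only the Python standard library, a toy bag-of-words embedding stands in for Instructor XL, a brute-force search stands in for FAISS, and the assembled prompt is what a generator such as Falcon 7B would receive. All names (`chunk_text`, `embed`, `ToyIndex`, `build_prompt`) and the chunk sizes are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40, overlap=10):
    """Split text into overlapping word windows (hypothetical chunking policy)."""
    words = text.split()
    step = max(1, max_words - overlap)
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)]


def embed(text):
    """Toy embedding: unit-length bag-of-words counts.

    Assumption: a stand-in for a real embedding model such as Instructor XL.
    """
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b.get(token, 0.0) for token, weight in a.items())


class ToyIndex:
    """Brute-force similarity index; a stand-in for a FAISS index."""

    def __init__(self):
        self.chunks = []
        self.vectors = []

    def add(self, chunks):
        for chunk in chunks:
            self.chunks.append(chunk)
            self.vectors.append(embed(chunk))

    def search(self, query, k=2):
        """Return the k chunks most similar to the query, best first."""
        q = embed(query)
        scored = sorted(
            ((cosine(q, v), i) for i, v in enumerate(self.vectors)),
            reverse=True,
        )
        return [(self.chunks[i], score) for score, i in scored[:k]]


def build_prompt(question, retrieved):
    """Assemble the text a language model would receive (hypothetical template)."""
    context = "\n".join(f"- {chunk}" for chunk, _ in retrieved)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    index = ToyIndex()
    index.add([
        "FAISS stores embeddings for similarity search.",
        "Streamlit renders the upload form.",
        "Falcon 7B generates the final answer.",
    ])
    question = "Which component stores embeddings?"
    print(build_prompt(question, index.search(question, k=1)))
```

In a real deployment the toy pieces would be replaced by an embedding model and a vector index, but the control flow (index once per uploaded document, search once per query, then generate) would likely stay the same.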

Keywords: Application, development, RAG, NLP, LLMs, Instructor XL, Falcon 7B

[This article belongs to Journal of Computer Technology & Applications]

How to cite this article:
Daisy Vyas, Prachi Sharma, Saroj Prajapat, Saumya Mittal. Developing a RAG-PDF Reader Using Instructor XL and Falcon 7B. Journal of Computer Technology & Applications. 2024; 16(01):45-49.
How to cite this URL:
Daisy Vyas, Prachi Sharma, Saroj Prajapat, Saumya Mittal. Developing a RAG-PDF Reader Using Instructor XL and Falcon 7B. Journal of Computer Technology & Applications. 2024; 16(01):45-49. Available from: https://journals.stmjournals.com/jocta/article=2024/view=191735



Regular Issue | Subscription | Original Research
Received: 13/11/2024
Accepted: 10/12/2024
Published: 31/12/2024

