Daisy Vyas, Prachi Sharma, Saroj Prajapat, Saumya Mittal
- Students, Department of Computer Science and Engineering, Mody University of Science & Technology, Laxmangarh, Rajasthan, India
Abstract
This study describes the development of a Retrieval-Augmented Generation (RAG) application that extracts text from complex PDF documents, retrieves the passages relevant to a query, and synthesizes a response from them. The system uses Instructor XL to generate semantic embeddings and Falcon 7B for language generation, and targets document comprehension in academic, research, and professional settings. It combines PDF text processing, embedding storage in FAISS for fast similarity-based retrieval, and real-time response generation to turn unstructured document content into accessible, contextually accurate information. A Streamlit interface lets users upload PDFs, submit queries, and receive coherent responses in real time. The study also discusses potential enhancements, including multilingual support and scalability across multiple domains, that would make the system a versatile tool for research, education, and industry applications. Key features are enhanced document interaction and efficient information retrieval and generation, which are valuable in data-rich environments where precise access to information is critical.
Keywords: Application development, RAG, NLP, LLMs, Instructor XL, Falcon 7B
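The retrieve-then-generate flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for Instructor XL embeddings, the linear scan in `retrieve` stands in for a FAISS index, and the string returned by `build_prompt` is what would be passed to a generator such as Falcon 7B. All function names, the chunk sizes, and the prompt wording are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative)."""
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, stop, step)]


def embed(text):
    """Toy stand-in for an embedding model: a term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query (linear scan, no index)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, context_chunks):
    """Assemble the text that would be sent to the language model."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


# Hypothetical chunks, as if extracted from an uploaded PDF.
chunks = [
    "FAISS stores dense vectors and supports fast nearest-neighbour search.",
    "Streamlit lets users upload a PDF and type a question.",
    "Falcon 7B generates the final answer from the retrieved passages.",
]
query = "Which component generates the answer?"
top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top)
```

In a real system the term-count vectors would be replaced by dense embeddings and the sorted scan by an approximate or exact nearest-neighbour index, but the order of the steps (chunk, embed, retrieve, prompt) stays the same.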
[This article belongs to Journal of Computer Technology & Applications]
Daisy Vyas, Prachi Sharma, Saroj Prajapat, Saumya Mittal. Developing a RAG-PDF Reader Using Instructor XL and Falcon 7B. Journal of Computer Technology & Applications. 2024; 16(01):45-49. Available from: https://journals.stmjournals.com/jocta/article=2024/view=191735

Journal of Computer Technology & Applications
| Volume | 16 |
| Issue | 01 |
| Received | 13/11/2024 |
| Accepted | 10/12/2024 |
| Published | 31/12/2024 |