WEBPAGE EXTRACTION AND RETRIEVAL CHATBOT

Year : 2025 | Volume : 03 | Issue : 02 | Page : 1-7
By P. V. Kavitha, S Vishal Kumar, S Anush Kumar

  1. Assistant Professor, Department of Artificial Intelligence and Data Science, Sri Ramakrishna Engineering College, Coimbatore, Tamil Nadu, India
  2. Student, Department of Artificial Intelligence and Data Science, Sri Ramakrishna Engineering College, Coimbatore, Tamil Nadu, India
  3. Student, Department of Artificial Intelligence and Data Science, Sri Ramakrishna Engineering College, Coimbatore, Tamil Nadu, India

Abstract

Web scraping is a fundamental technique for automating data extraction in big data applications. While multiple implementations exist, few leverage Python’s Beautiful Soup library for efficient and structured data retrieval. This project develops a web scraper and retrieval system that extracts relevant information from web pages, stores it in a vector database (Milvus), and enables intelligent querying through semantic search and generative AI. The system scrapes Wikipedia articles, extracts their textual content, and transforms it into high-dimensional embeddings using a pre-trained Sentence-BERT model. These embeddings are indexed and stored in Milvus, a vector database optimized for similarity search. To retrieve information, a user query is embedded with the same model and matched against the stored vectors, returning the most relevant content. The retrieved text is then passed to Google’s Gemini API, which generates a refined, context-aware response. A FastAPI-based interface handles data loading, querying, and response generation. The system was tested by scraping Wikipedia pages, storing the data, and retrieving meaningful insights; the entire process, from extraction to response generation, completed within seconds. The primary limitation is reliance on Wikipedia’s page structure, which limits adaptability to highly dynamic web pages. Additionally, Beautiful Soup parses only the static HTML it receives, so it does not extract content that is loaded dynamically by JavaScript. Future enhancements could include expanding to a broader range of websites, incorporating real-time data updates, and integrating more advanced NLP models for improved comprehension. This implementation is useful for developers exploring web scraping, AI-driven search applications, and small-scale data analytics projects.
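
The extract, embed, and retrieve steps described above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: `html.parser` stands in for Beautiful Soup, a toy bag-of-words vector stands in for Sentence-BERT embeddings, and an in-memory list stands in for Milvus. All names here (`ParagraphExtractor`, `embed`, `cosine`, `search`, `SAMPLE_HTML`) are hypothetical and chosen for illustration only.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside <p> tags (assumed stand-in for Beautiful Soup)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Collapse runs of whitespace and keep non-empty paragraphs only.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


def embed(text):
    """Toy bag-of-words vector (assumed stand-in for a Sentence-BERT embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, index, top_k=1):
    """Return the top_k stored passages most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


SAMPLE_HTML = """
<html><body>
<p>Milvus is a vector database built for similarity search.</p>
<p>Beautiful Soup parses HTML and XML documents in Python.</p>
<div>Navigation text that is not in a paragraph.</div>
</body></html>
"""

passages = extract_paragraphs(SAMPLE_HTML)
index = [(p, embed(p)) for p in passages]  # in-memory stand-in for Milvus
best = search("Which library parses HTML?", index)
```

Here `best` holds the Beautiful Soup passage, since it is the only one sharing terms with the query. In a system like the one described, the retrieved passage would then be sent to a generative model together with the user's question; that step is omitted because it depends on an external API.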

Keywords: Web Scraping, Data Extraction, Beautiful Soup, Semantic Search, Natural Language Processing (NLP), Sentence-BERT, Vector Database, Milvus, FastAPI, Generative AI, Google Gemini API, Information Retrieval, Knowledge Discovery, Big Data Analytics, Data Visualization, Content Embedding, AI-driven Search.

[This article belongs to International Journal of Satellite Remote Sensing]

How to cite this article:
P. V. Kavitha, S Vishal Kumar, S Anush Kumar. WEBPAGE EXTRACTION AND RETRIEVAL CHATBOT. International Journal of Satellite Remote Sensing. 2025; 03(02):1-7.
How to cite this URL:
P. V. Kavitha, S Vishal Kumar, S Anush Kumar. WEBPAGE EXTRACTION AND RETRIEVAL CHATBOT. International Journal of Satellite Remote Sensing. 2025; 03(02):1-7. Available from: https://journals.stmjournals.com/ijsrs/article=2025/view=235430


Regular Issue | Subscription | Original Research
Received : 10/07/2025
Accepted : 13/08/2025
Published : 31/12/2025
Publication Time : 174 Days

