Detection of Phishing Website URLs and Email/SMS Using Random Forest and Multinomial Naive Bayes


Notice

This is an unedited manuscript accepted for publication and provided as an Article in Press for early access at the author’s request. The article will undergo copyediting, typesetting, and galley proof review before final publication. Please be aware that errors may be identified during production that could affect the content. All legal disclaimers of the journal apply.

Year : 2025 | Volume : 16 | Issue : 01 | Page : 22-30
By

  • Sridevi Ravada
  • Harika Kasukurthi
  • Deekshitha Govindu
  • Sowmya Vara
  • Akshaya Enumukkala

  1. Assistant Professor, Department of Information Technology, Gayatri Vidya Parishad College of Engineering for Women, Visakhapatnam, Andhra Pradesh, India
  2. Student, Department of Information Technology, Gayatri Vidya Parishad College of Engineering for Women, Visakhapatnam, Andhra Pradesh, India
  3. Student, Department of Information Technology, Gayatri Vidya Parishad College of Engineering for Women, Visakhapatnam, Andhra Pradesh, India
  4. Student, Department of Information Technology, Gayatri Vidya Parishad College of Engineering for Women, Visakhapatnam, Andhra Pradesh, India
  5. Student, Department of Information Technology, Gayatri Vidya Parishad College of Engineering for Women, Visakhapatnam, Andhra Pradesh, India

Abstract

Phishing attacks delivered through SMS, email, and URLs have become a significant cybersecurity threat to individuals and organizations alike. These attacks typically involve creating fraudulent websites or sending deceptive emails and SMS messages to trick users into disclosing sensitive information such as passwords, credit card numbers, or personal details. To counter these attacks, we develop a robust system that detects phishing URLs and phishing SMS/email messages using machine learning techniques. For phishing website URL detection, we extract features from HTML content using Beautiful Soup and then apply supervised learning algorithms (Decision Tree, Random Forest, XGBoost, and Multinomial Naïve Bayes) to classify URLs as legitimate or phishing. We achieved promising results with Random Forest, which showed high accuracy in distinguishing legitimate URLs from phishing URLs. For email/SMS detection, we apply natural language preprocessing and TF-IDF vectorization and then train supervised learning algorithms: Multinomial Naïve Bayes, support vector classifier (SVC), Random Forest, Decision Tree, AdaBoost, and XGBoost. We achieved promising results with Multinomial Naïve Bayes, which showed high accuracy in distinguishing spam from non-spam messages.

Keywords: Phishing attacks, Multinomial Naïve Bayes, TF-IDF vectorization, natural language preprocessing, Random Forest, Beautiful Soup
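
The email/SMS pipeline named in the abstract (text preprocessing, TF-IDF weighting, then Multinomial Naïve Bayes) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state which library, preprocessing steps, or dataset they used. The sketch below relies only on the Python standard library, and the class name `TinyTfidfNB`, the smoothing constant `alpha=1.0`, the smoothed IDF formula, and the six toy messages are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyTfidfNB:
    """Multinomial Naive Bayes over TF-IDF weights (hypothetical helper)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive (Laplace) smoothing constant, assumed

    def _tfidf(self, tokens):
        # Term count times IDF; tokens unseen during fit() are ignored.
        counts = Counter(t for t in tokens if t in self.idf)
        return {t: c * self.idf[t] for t, c in counts.items()}

    def fit(self, texts, labels):
        docs = [tokenize(t) for t in texts]
        n_docs = len(docs)
        # Document frequency: number of documents containing each term.
        doc_freq = Counter(t for d in docs for t in set(d))
        # Smoothed IDF (one common variant, chosen here as an assumption).
        self.idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
                    for t, df in doc_freq.items()}
        # Accumulate TF-IDF weight per class and term.
        weight = defaultdict(lambda: defaultdict(float))
        for doc, label in zip(docs, labels):
            for term, w in self._tfidf(doc).items():
                weight[label][term] += w
        class_counts = Counter(labels)
        self.log_prior = {y: math.log(c / n_docs)
                          for y, c in class_counts.items()}
        vocab_size = len(self.idf)
        self.log_lik = {}
        for y in class_counts:
            total = sum(weight[y].values()) + self.alpha * vocab_size
            self.log_lik[y] = {
                t: math.log((weight[y].get(t, 0.0) + self.alpha) / total)
                for t in self.idf
            }
        return self

    def predict(self, text):
        feats = self._tfidf(tokenize(text))
        # Log posterior (up to a constant) for each class.
        scores = {
            y: self.log_prior[y]
            + sum(w * self.log_lik[y][t] for t, w in feats.items())
            for y in self.log_prior
        }
        return max(scores, key=scores.get)


# Toy training data, invented purely for illustration.
messages = [
    "Your account is suspended, verify your password now",
    "Win a free prize, click this link now",
    "Urgent: confirm your bank details to claim reward",
    "Are we still meeting for lunch tomorrow",
    "Please find the project report attached",
    "Can you call me when you get home",
]
labels = ["spam", "spam", "spam", "not spam", "not spam", "not spam"]

model = TinyTfidfNB().fit(messages, labels)
print(model.predict("verify your bank password now"))
print(model.predict("lunch meeting tomorrow"))
```

A real system would train on a labelled corpus with a held-out test split and would normally use an established library rather than a hand-written classifier. The sketch only shows how TF-IDF weights feed the per-class term likelihoods that Multinomial Naïve Bayes compares.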

[This article belongs to Journal of Computer Technology & Applications (jocta)]

How to cite this article:
Sridevi Ravada, Harika Kasukurthi, Deekshitha Govindu, Sowmya Vara, Akshaya Enumukkala. Detection of Phishing Website URLs and Email/SMS Using Random Forest and Multinomial Naive Bayes. Journal of Computer Technology & Applications. 2024; 16(01):22-30.
How to cite this URL:
Sridevi Ravada, Harika Kasukurthi, Deekshitha Govindu, Sowmya Vara, Akshaya Enumukkala. Detection of Phishing Website URLs and Email/SMS Using Random Forest and Multinomial Naive Bayes. Journal of Computer Technology & Applications. 2024; 16(01):22-30. Available from: https://journals.stmjournals.com/jocta/article=2024/view=191737



Regular Issue Subscription Review Article
Volume 16
Issue 01
Received 25/10/2024
Accepted 19/12/2024
Published 31/12/2024

