Decoding Big Data: A Practical Comparison Between Hadoop and Spark

Year: 2024 | Volume: 11 | Issue: 03 | Pages: 15–23
    By Saunved Palve, Ammar Abdulhussain, Om Sawant, Ayush Yadav, Aditya Kasar

  1. Student, Department of Computer Engineering, Shri Vile Parle Kelavani Mandal’s Narsee Monjee Institute of Management Studies (SVKM’S NMIMS), Navi Mumbai, Maharashtra, India
  2. Student, Department of Computer Engineering, Shri Vile Parle Kelavani Mandal’s Narsee Monjee Institute of Management Studies (SVKM’S NMIMS), Navi Mumbai, Maharashtra, India
  3. Student, Department of Computer Engineering, Shri Vile Parle Kelavani Mandal’s Narsee Monjee Institute of Management Studies (SVKM’S NMIMS), Navi Mumbai, Maharashtra, India
  4. Student, Department of Computer Engineering, Shri Vile Parle Kelavani Mandal’s Narsee Monjee Institute of Management Studies (SVKM’S NMIMS), Navi Mumbai, Maharashtra, India
  5. Assistant Professor, School of Technology, Shri Vile Parle Kelavani Mandal’s Narsee Monjee Institute of Management Studies (SVKM’S NMIMS), Navi Mumbai, Maharashtra, India

Abstract

This paper presents a comprehensive comparison of Apache Hadoop and Apache Spark, two essential frameworks in the big data era. The rapid expansion of data poses challenges in terms of volume, variety, and velocity, which call for advanced processing solutions. Hadoop, with its MapReduce paradigm, provides scalable and fault-tolerant storage and processing, whereas Spark, which can run on top of Hadoop, introduces in-memory processing to increase speed and flexibility. The study examines the features, strengths, and limitations of both frameworks and reports experiments in statistical analysis, machine learning, and database operations. The results are intended to help practitioners and researchers make informed decisions in the changing landscape of big data analytics. Our experiments show that Spark achieves an average speedup of 41.57% over Hadoop in statistical and machine learning applications, underscoring its advantage for these analytics workloads. Hadoop, however, performs better in database management workloads, demonstrating superior performance in particular scenarios.

Keywords: Hadoop, MapReduce, Spark, Sqoop, average word length, machine learning, database management (DBMS)

[This article belongs to Recent Trends in Parallel Computing]

How to cite this article:
Saunved Palve, Ammar Abdulhussain, Om Sawant, Ayush Yadav, Aditya Kasar. Decoding Big Data: A Practical Comparison Between Hadoop and Spark. Recent Trends in Parallel Computing. 2024; 11(03):15-23.
How to cite this URL:
Saunved Palve, Ammar Abdulhussain, Om Sawant, Ayush Yadav, Aditya Kasar. Decoding Big Data: A Practical Comparison Between Hadoop and Spark. Recent Trends in Parallel Computing. 2024; 11(03):15-23. Available from: https://journals.stmjournals.com/rtpc/article=2024/view=172338


Regular Issue Subscription Review Article
Volume 11
Issue 03
Received 24/06/2024
Accepted 08/08/2024
Published 16/09/2024

