Decoding Big Data: A Practical Comparison between Hadoop and Spark

Year: 2024 | Volume: 11 | Issue: 03 | Page: –
By Saunved Palve, Ammar Abdulhussain, Om Sawant, Ayush Yadav, Aditya Kasar

  1. Student, Department of Computer Engineering, SVKM’S NMIMS, Navi Mumbai, Maharashtra, India
  2. Student, Department of Computer Engineering, SVKM’S NMIMS, Navi Mumbai, Maharashtra, India
  3. Student, Department of Computer Engineering, SVKM’S NMIMS, Navi Mumbai, Maharashtra, India
  4. Student, Department of Computer Engineering, SVKM’S NMIMS, Navi Mumbai, Maharashtra, India
  5. Assistant Professor, School of Technology, SVKM’S NMIMS, Navi Mumbai, Maharashtra, India

Abstract

This paper presents a comprehensive comparison of Apache Hadoop and Apache Spark, two essential frameworks in the big data era. The rapid expansion of data poses challenges in terms of volume, variety, and velocity, which call for advanced processing solutions. Hadoop, with its MapReduce paradigm, provides scalable and fault-tolerant storage and processing, whereas Spark, which can run on top of Hadoop, introduces in-memory processing to increase speed and flexibility. This study examines their features, strengths, and limitations in detail and offers insights through experiments in statistical analysis, machine learning, and database operations. The research offers useful perspectives for both practitioners and researchers, helping them make informed decisions in the changing landscape of big data analytics. Our experiments show that Apache Spark achieves an average speedup of 41.57% over Apache Hadoop in statistical and machine learning applications, underscoring its advantage for these workloads. Hadoop, however, performs better in database management tasks in particular scenarios.

Keywords: Hadoop, MapReduce, Spark, Sqoop, Average Word Length, Machine Learning, DBMS

[This article belongs to Recent Trends in Parallel Computing (rtpc)]

How to cite this article: Saunved Palve, Ammar Abdulhussain, Om Sawant, Ayush Yadav, Aditya Kasar. Decoding Big Data: A Practical Comparison between Hadoop and Spark. Recent Trends in Parallel Computing. 2024; 11(03):-.
How to cite this URL: Saunved Palve, Ammar Abdulhussain, Om Sawant, Ayush Yadav, Aditya Kasar. Decoding Big Data: A Practical Comparison between Hadoop and Spark. Recent Trends in Parallel Computing. 2024; 11(03):-. Available from: https://journals.stmjournals.com/rtpc/article=2024/view=172338




Regular Issue | Subscription | Review Article
Received June 24, 2024
Accepted August 8, 2024
Published September 16, 2024
