This is an unedited manuscript accepted for publication and provided as an Article in Press for early access at the author’s request. The article will undergo copyediting, typesetting, and galley proof review before final publication. Please be aware that errors may be identified during production that could affect the content. All legal disclaimers of the journal apply.
Sanjeev Sharma
- Head, Department of Multimedia, BBK DAV College For Women, Amritsar, Punjab, India
Abstract
The rapid advancement of artificial intelligence (AI) has significantly reshaped the field of audio signal processing, with sound spectrogram analysis emerging as a central research focus. Spectrograms provide a rich time–frequency representation of audio signals, making them particularly suitable for data-driven learning approaches. This paper presents an in-depth and original review of modern AI-based techniques applied to spectrogram analysis, highlighting their growing impact across critical application areas such as healthcare diagnostics, security systems, environmental surveillance, and intelligent multimedia processing. We examine a range of state-of-the-art methodologies, including convolutional neural networks (CNNs), transformer-based models, and adaptive spectral estimation strategies, emphasizing how each approach leverages spectro-temporal patterns for improved feature extraction and classification. Through multiple case studies covering deepfake audio detection, disease diagnosis from biomedical sounds, and acoustic scene classification, we demonstrate that AI-driven models consistently outperform conventional signal processing methods in terms of accuracy, generalization, and noise robustness. A key contribution of this work is the discussion of explainable artificial intelligence (XAI) techniques, which enhance model transparency and trust by revealing how spectral features influence decision-making. Experimental evaluations show that transformer architectures equipped with spectrogram-aware attention mechanisms achieve up to 92.4% accuracy on standard benchmark datasets, exceeding CNN-based models by 6.8%. Finally, the paper identifies current technical challenges and outlines future research directions, including multimodal data fusion, efficient edge deployment, and the potential role of quantum-enhanced spectral analysis in next-generation audio intelligence systems.
Keywords: Spectrogram Analysis, Audio AI, Deep Learning Architectures, Transformer Models, Multimodal Fusion
Sanjeev Sharma. Advancements in AI-Driven Sound Spectrogram Analysis: From Deep Learning to Quantum and Neuromorphic Processing. Journal of Multimedia Technology & Recent Advancements. 2025; 13(01):-. Available from: https://journals.stmjournals.com/jomtra/article=2025/view=242091

Journal of Multimedia Technology & Recent Advancements
| Volume | 13 |
| Issue | 01 |
| Received | 11/04/2025 |
| Accepted | 07/10/2025 |
| Published | 20/12/2025 |
| Publication Time | 253 Days |