Yogesh J.L.,
Shravya C.,
Shreshta Shivakumara,
Soundarya Siddu Pujari,
Pavithra B.,
- Student, Department of Electronics and Communication Engineering, Sai Vidya Institute of Technology, Rajanukunte, Karnataka, India
- Student, Department of Electronics and Communication Engineering, Sai Vidya Institute of Technology, Rajanukunte, Karnataka, India
- Student, Department of Electronics and Communication Engineering, Sai Vidya Institute of Technology, Rajanukunte, Karnataka, India
- Student, Department of Electronics and Communication Engineering, Sai Vidya Institute of Technology, Rajanukunte, Karnataka, India
- Assistant Professor, Department of Electronics and Communication Engineering, Sai Vidya Institute of Technology, Rajanukunte, Karnataka, India
Abstract
This program uses the .NET Framework to convert text to speech and play back the resulting audio. On launch, the user enters text through a graphical user interface (GUI), and the program converts it to speech using the ‘SpeechSynthesizer’ class from the ‘System.Speech.Synthesis’ namespace. The synthesized audio is first held in a memory stream and then written, using the ‘FileStream’ class, to a WAV file named ‘output.wav’ for later playback; the WAV format is chosen for its wide compatibility with audio systems. The program then loads the file into an ‘AxWindowsMediaPlayer’ control for multimedia playback. In addition to playback, the application applies signal processing techniques to visualize the audio frequencies, giving users a graphical representation of the audio spectrum. The methodology is designed to streamline both text-to-audio conversion and the management of audio files. It imports packages such as ‘System.IO’, ‘System.Media’, and ‘System.Speech.AudioFormat’, which together cover tasks ranging from file handling to audio format management. By combining audio playback with a visual representation of frequency, the program aims to give users a friendly and more useful experience. The project demonstrates a practical approach to integrating text-to-speech conversion, file storage, and multimedia playback with the tools available in the .NET Framework.
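The storage and visualization steps described in the abstract can be illustrated with a minimal, language-neutral sketch. It is written in Python rather than .NET and is not the authors' implementation. A generated sine tone stands in for the synthesized speech, and the helper names (`synth_tone_wav`, `read_samples`, `magnitude_spectrum`), the 8 kHz sample rate, and the naive discrete Fourier transform are all assumptions made for illustration. The abstract does not say which signal processing technique the program actually uses.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 8000  # Hz; assumed value for this sketch


def synth_tone_wav(freq_hz, seconds=0.1, rate=SAMPLE_RATE):
    """Stand-in for the speech synthesizer: return WAV bytes of a sine tone."""
    n = int(rate * seconds)
    frames = b"".join(
        struct.pack("<h", int(12000 * math.sin(2 * math.pi * freq_hz * i / rate)))
        for i in range(n)
    )
    buf = io.BytesIO()  # audio is held in memory before it is saved to disk
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)   # mono
        w.setsampwidth(2)   # 16-bit PCM
        w.setframerate(rate)
        w.writeframes(frames)
    return buf.getvalue()


def read_samples(path):
    """Load a 16-bit mono WAV file; return (sample_rate, samples)."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        raw = w.readframes(w.getnframes())
    return rate, struct.unpack("<%dh" % (len(raw) // 2), raw)


def magnitude_spectrum(samples, rate):
    """Naive DFT; return (frequency_hz, magnitude) pairs below Nyquist."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2):
        re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(samples))
        im = -sum(s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(samples))
        spectrum.append((k * rate / n, math.hypot(re, im) / n))
    return spectrum


if __name__ == "__main__":
    # Save the in-memory audio to 'output.wav', then analyze the saved file.
    with open("output.wav", "wb") as f:
        f.write(synth_tone_wav(440))
    rate, samples = read_samples("output.wav")
    peak = max(magnitude_spectrum(samples, rate), key=lambda p: p[1])
    print("dominant frequency (Hz):", peak[0])
```

The (frequency, magnitude) pairs are the data a spectrum display would plot. A real application would more likely use a fast Fourier transform over short, windowed frames than a single naive DFT over the whole signal.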
Keywords: Text-to-audio conversion, speech synthesizer, deep learning, audio waveform
[This article belongs to Journal of Artificial Intelligence Research & Advances]
Yogesh J.L., Shravya C., Shreshta Shivakumara, Soundarya Siddu Pujari, Pavithra B. Deep Learning Approach to Produce Artificial Speech (Text-To-Audio). Journal of Artificial Intelligence Research & Advances. 2024; 12(01):28-33. Available from: https://journals.stmjournals.com/joaira/article=2024/view=191591

Journal of Artificial Intelligence Research & Advances
| Volume | 12 |
| Issue | 01 |
| Received | 03/05/2024 |
| Accepted | 24/12/2024 |
| Published | 30/12/2024 |