AI Chatbot for Expressing Visual Content

Year: 2024 | Volume: 11 | Issue: 02 | Page: 11-19
By Ashwini Garole, Sneha Nath, Dhruv Patil, Hindavi Mhatre, Shivtej Kadam

  1. Assistant Professor, Department of Computer Science & Engineering (Artificial Intelligence & Machine Learning), Vishwaniketan’s Institute of Management Entrepreneurship and Engineering Technology (ViMEET), Kumbhivali, Maharashtra, India
  2–5. Student, Department of Computer Science & Engineering (Artificial Intelligence & Machine Learning), Vishwaniketan’s Institute of Management Entrepreneurship and Engineering Technology (ViMEET), Kumbhivali, Maharashtra, India

Abstract

Recently, the artificial intelligence (AI) chatbot for expressing visual content has demonstrated remarkable multi-modal capabilities: it can identify humorous elements in images and generate webpages directly from handwritten drafts, abilities that are rare in earlier vision-language models. We believe that the use of a more sophisticated large language model (LLM) is the main factor behind vision verbalizer’s superior multi-modal generation capabilities. To explore this phenomenon, we introduce vision verbalizer, which employs a single projection layer to align a frozen visual encoder with a frozen language model, Vicuna. Our study shows that vision verbalizer can perform many tasks, such as producing detailed image descriptions and creating websites from handwritten drafts. We also observe that vision verbalizer exhibits additional emerging abilities, such as writing poems and stories inspired by given images, solving problems depicted in images, and teaching users how to cook from food photos. In our experiments we found that pretraining on raw image-text pairs alone can produce inconsistent, repetitive, and fragmented language outputs. To address this issue, in a second stage we curate a high-quality, well-aligned dataset using a conversational template and fine-tune the model on it. This step proved essential for improving the model’s overall usability and generation reliability.
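A minimal sketch of the architecture described above, assuming a PyTorch setting: a single trainable linear projection maps features from a frozen visual encoder into the embedding space of a frozen language model (Vicuna). The class name, the placeholder encoder and LLM modules, and the feature dimensions (1408 and 4096) are illustrative assumptions, not the authors’ implementation.

import torch
import torch.nn as nn

class VisionVerbalizerSketch(nn.Module):
    def __init__(self, visual_encoder: nn.Module, language_model: nn.Module,
                 vis_dim: int = 1408, llm_dim: int = 4096):
        super().__init__()
        self.visual_encoder = visual_encoder   # frozen vision transformer (placeholder)
        self.language_model = language_model   # frozen LLM such as Vicuna (placeholder)
        for module in (self.visual_encoder, self.language_model):
            for p in module.parameters():
                p.requires_grad = False        # only the projection below is trained
        # The single trainable component: a linear projection that maps visual
        # features into the language model's token-embedding space.
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            vis_feats = self.visual_encoder(image)    # (batch, tokens, vis_dim)
        vis_tokens = self.proj(vis_feats)             # (batch, tokens, llm_dim)
        # Projected visual tokens are prepended to the text-prompt embeddings;
        # the frozen LLM then generates the description autoregressively.
        return self.language_model(torch.cat([vis_tokens, text_embeds], dim=1))

For the second training stage, each curated image-description pair would be wrapped in a conversational template (for example, a prompt such as "###Human: <Img><ImageFeature></Img> Describe this image in detail. ###Assistant:") before fine-tuning the same projection layer; the template wording here is illustrative.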

Keywords: Artificial intelligence (AI) chatbot, visual content, sophisticated large language model, vision verbalizer, natural language processing

[This article belongs to Journal of Multimedia Technology & Recent Advancements (jomtra)]

How to cite this article: Ashwini Garole, Sneha Nath, Dhruv Patil, Hindavi Mhatre, Shivtej Kadam. AI Chatbot for Expressing Visual Content. Journal of Multimedia Technology & Recent Advancements. 2024; 11(02):11-19.
How to cite this URL: Ashwini Garole, Sneha Nath, Dhruv Patil, Hindavi Mhatre, Shivtej Kadam. AI Chatbot for Expressing Visual Content. Journal of Multimedia Technology & Recent Advancements. 2024; 11(02):11-19. Available from: https://journals.stmjournals.com/jomtra/article=2024/view=160779




Regular Issue | Subscription | Review Article
Volume 11
Issue 02
Received May 3, 2024
Accepted July 15, 2024
Published August 2, 2024
