Open Access
Ashwini Garole, Sneha Nath, Dhruv Patil, Hindavi Mhatre, Shivtej Kadam
Assistant Professor and Students, Department of Computer Science & Engineering (Artificial Intelligence & Machine Learning), Vishwaniketan's Institute of Management Entrepreneurship and Engineering Technology (ViMEET), Kumbhivali, Maharashtra, India
Abstract
Recently, the AI Chatbot for Expressing Visual Content has shown remarkable multi-modal capabilities: it can recognise humorous elements in photos and create webpages directly from handwritten text. These capabilities are uncommon in earlier vision-language models, and we believe the use of a more sophisticated large language model (LLM) is the main factor behind this superior multi-modal generation. To explore this phenomenon, we introduce Vision Verbalizer, which employs a single projection layer to align a frozen visual encoder with a frozen language model, Vicuna. Our experiments show that Vision Verbalizer can perform many of these tasks, such as generating detailed image descriptions and creating websites from handwritten drafts. We also observe that Vision Verbalizer acquires further abilities, such as writing poems and stories inspired by given images, solving problems depicted in images, and teaching users to cook from food photos. We discovered that pretraining on raw image-text pairs alone can produce language outputs that are inconsistent, repetitive, and fragmented. To address this issue, in a second stage we curate a high-quality, well-aligned dataset using a conversational template and refine our model on it. This step proved essential for improving the model's generation reliability and overall usability.
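To make the architecture described above concrete, here is a minimal sketch of the single projection layer bridging a frozen visual encoder and a frozen Vicuna model, written in PyTorch. The class name and the feature widths (1408 for the encoder output, 4096 for Vicuna's embeddings) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class VisionVerbalizerProjection(nn.Module):
    """The only trainable component in the setup described above: a linear
    layer mapping frozen visual-encoder features into the input embedding
    space of the frozen LLM. All dimensions here are hypothetical."""

    def __init__(self, vision_dim: int = 1408, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the frozen
        # encoder; the projected tokens are fed to the frozen LLM alongside
        # the text tokens.
        return self.proj(image_features)

# Both backbone models stay frozen; only the projection receives gradients:
#   for p in encoder.parameters(): p.requires_grad_(False)
#   for p in llm.parameters():     p.requires_grad_(False)
```

Likewise, the second-stage refinement could format each curated image-description pair with a conversational template along these lines; the prompt wording and the placeholder token are assumptions for illustration only:

```python
def build_stage2_example(description: str) -> str:
    """Wrap one curated image description in a dialogue-style template so
    the model learns coherent, conversational outputs (wording assumed)."""
    image_slot = "<Img><ImageFeatures></Img>"  # stands in for projected image tokens
    return (
        f"###Human: {image_slot} Describe this image in detail.\n"
        f"###Assistant: {description}"
    )
```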
Keywords: AI chatbot, visual content, large language model (LLM), Vision Verbalizer, natural language processing
This article belongs to the Journal of Multimedia Technology & Recent Advancements (jomtra).
Journal of Multimedia Technology & Recent Advancements
| Volume | 11 |
| Issue | 02 |
| Received | May 3, 2024 |
| Accepted | July 15, 2024 |
| Published | August 2, 2024 |