Devdatta Jahirabadkar, Sankarshan Joshi, Sakshi Kulkarni, Sharvani Kulkarni
- Students, Department of Computer Science and Engineering, Government College of Engineering, Aurangabad, Chhatrapati Sambhajinagar, Maharashtra, India
Abstract
Deepfake technologies have become a major risk to the credibility and trustworthiness of digital visual information. Powered by generative models such as GANs and autoencoders, deepfakes can produce highly realistic fake videos and images, enabling misinformation, identity theft, and an erosion of public trust in digital media. Classic Convolutional Neural Networks (CNNs), while highly effective in early-stage deepfake detection, are limited by their local receptive fields and reliance on spatial hierarchies when faced with increasingly refined and advanced manipulations. To address this, this research proposes a robust detection approach based on Vision Transformers (ViT), a transformer architecture that applies global self-attention over patch-based embeddings for image classification. The proposed ViT model employs multi-head self-attention to capture long-range dependencies between image patches and thereby detect the fine-grained inconsistencies introduced by deepfakes. Experiments were performed on the Deepfake Detection Challenge dataset hosted on Kaggle, where the ViT model achieved a classification accuracy of 93.7%, a precision of 92.3%, and a recall of 94.1%. Comparison against baseline CNN and ResNet architectures demonstrated the advantage of ViT in precision and robustness. In addition, the model was evaluated with a confusion matrix, which showed high true positive and true negative rates, confirming its suitability for practical use. Future research directions include optimizing the ViT architecture for real-time inference on edge devices, integrating detection systems with IoT-enabled surveillance networks, and applying few-shot learning or self-supervised methods to improve performance in low-data settings.
This work highlights the promise of transformer-based methods in addressing the changing nature of deepfake threats and maintaining media integrity in the digital era.
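The abstract describes the core ViT pipeline: the input image is split into fixed-size patches, each patch is linearly embedded into a token, multi-head self-attention relates all tokens globally, and a classification head scores the image as real or fake. That data flow can be sketched in NumPy. This is an illustrative sketch only, not code from the paper: all weights are random stand-ins for learned parameters, and the 16×16 patch size and mean-pooled sigmoid head are assumptions following standard ViT practice (Dosovitskiy et al.).

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = img.shape
    rows, cols = H // patch, W // patch
    p = img[:rows * patch, :cols * patch]
    p = p.reshape(rows, patch, cols, patch, C).transpose(0, 2, 1, 3, 4)
    return p.reshape(rows * cols, patch * patch * C)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, heads, rng):
    """Scaled dot-product attention over patch tokens; projections are random stand-ins."""
    tokens, dim = x.shape
    hd = dim // heads
    Wq, Wk, Wv = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))
    split = lambda t: t.reshape(tokens, heads, hd).transpose(1, 0, 2)
    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(hd))    # (heads, tokens, tokens)
    return (attn @ v).transpose(1, 0, 2).reshape(tokens, dim)  # merge heads back

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))              # stand-in for a face crop
patches = patchify(img, 16)                  # (196, 768): 14x14 grid of 16x16 patches
W_embed = rng.standard_normal((patches.shape[1], 64)) / np.sqrt(patches.shape[1])
tokens = patches @ W_embed                   # linear patch embedding -> (196, 64)
tokens = multi_head_self_attention(tokens, heads=4, rng=rng)
w_cls = rng.standard_normal(64) / 8.0
p_fake = 1.0 / (1.0 + np.exp(-(tokens.mean(axis=0) @ w_cls)))  # mean-pool + sigmoid head
print(patches.shape, tokens.shape)
```

Because every attention weight relates each patch to every other patch, even widely separated blending artifacts contribute to the same score, which is the property the abstract contrasts with a CNN's local receptive fields.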
Keywords: Deepfake detection, vision transformer, machine learning, image classification, transformer architecture
[This article belongs to the Journal of Image Processing & Pattern Recognition Progress]
Devdatta Jahirabadkar, Sankarshan Joshi, Sakshi Kulkarni, Sharvani Kulkarni. Secure Forge: Deepfake Image Detection Using Vision Transformers. Journal of Image Processing & Pattern Recognition Progress. 2025; 12(03):32-45. Available from: https://journals.stmjournals.com/joipprp/article=2025/view=210180
References
- Chollet F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2017; 1251–1258.
- He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2016; 770–778.
- Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. 2020 Oct 22.
- Sumathi D, Singh A, Sinha A, Aditya D, KF MR. The Deepfake Dilemma: Enhancing Deepfake Detection with Vision Transformers. In 2025 IEEE International Conference on Intelligent and Innovative Technologies in Computing, Electrical and Electronics (IITCEE). 2025 Jan 16; 1–7.
- Li Y, Lyu S. Exposing deepfake videos by detecting face warping artifacts. arXiv preprint arXiv:1811.00656. 2018 Nov 1.
- Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. 2014 Sep 4.
- Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2016; 779–788.
- Kaggle. Deepfake Detection Challenge. 2025. [Online]. Available from: https://www.kaggle.com/competitions/deepfake-detection-challenge
- Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision. 2017; 2980–2988.
- Ramesh A, Pavlov M, Goh G, Gray S, Voss C, Radford A, Chen M, Sutskever I. Zero-shot text-to-image generation. In International conference on machine learning, PMLR. 2021 Jul 1; 8821–8831.

Journal of Image Processing & Pattern Recognition Progress
| Volume | 12 |
| Issue | 03 |
| Received | 02/05/2025 |
| Accepted | 07/05/2025 |
| Published | 15/05/2025 |
| Publication Time | 13 Days |