Secure Forge: Deepfake Image Detection Using Vision Transformers

Year: 2025 | Volume: 12 | Issue: 02 | Page: –
    By

  • Devdatta Atul Jahirabadkar,
  • Sankarshan Purushottam Joshi,
  • Sakshi Avinash Kulkarni,
  • Sharvani Nachiket Kulkarni,
  • V. A. Chakkarwar,

  1. Student, Government College of Engineering Aurangabad, Chhatrapati Sambhajinagar, Maharashtra, India
  2. Student, Government College of Engineering Aurangabad, Chhatrapati Sambhajinagar, Maharashtra, India
  3. Student, Government College of Engineering Aurangabad, Chhatrapati Sambhajinagar, Maharashtra, India
  4. Student, Government College of Engineering Aurangabad, Chhatrapati Sambhajinagar, Maharashtra, India
  5. Professor, Government College of Engineering Aurangabad, Chhatrapati Sambhajinagar, Maharashtra, India

Abstract


Deepfake technologies have become a major risk to the credibility and trustworthiness of digital visual information. Using powerful generative models such as GANs and autoencoders, deepfakes can produce highly realistic fake videos and images, fueling misinformation, identity theft, and public loss of trust in digital media. Classic Convolutional Neural Networks (CNNs), while highly effective in early-stage deepfake detection, are limited by their local receptive fields and dependence on spatial hierarchies when confronting increasingly refined manipulations. To address this, this research proposes a robust detection approach based on the Vision Transformer (ViT), a transformer architecture that applies global self-attention over patch-based embeddings for image classification.
The proposed ViT model employs multi-head self-attention to capture long-range dependencies between image patches and thereby detect the fine-grained inconsistencies introduced by deepfakes. Experiments were performed on the Deepfake Detection Challenge dataset hosted on Kaggle, where the ViT model achieved a classification accuracy of 93.7%, a precision of 92.3%, and a recall of 94.1%. Comparison against baseline CNN and ResNet architectures showed the advantage of ViT in precision and robustness.
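The core ViT mechanics described above, splitting an image into non-overlapping patches and relating them with scaled dot-product self-attention, can be sketched in a few lines of numpy. This is a minimal single-head illustration with randomly initialized weights, not the paper's trained model; the function names (`patchify`, `self_attention`) and the 224×224 input with 16×16 patches are assumptions chosen to match the standard ViT configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, patch):
    """Split an image (H, W, C) into flattened non-overlapping patches."""
    H, W, C = img.shape
    ph, pw = H // patch, W // patch
    patches = img.reshape(ph, patch, pw, patch, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(ph * pw, patch * patch * C)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over patch tokens.

    Every patch attends to every other patch, which is what lets a ViT
    relate globally distant regions of an image in one layer.
    """
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

img = rng.random((224, 224, 3))
tokens = patchify(img, 16)              # (196, 768): 14x14 patches of 16x16x3
d = tokens.shape[1]
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.01 for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv)
print(out.shape)                        # (196, 768)
```

A full ViT stacks many such layers with multiple heads, adds position embeddings and a classification token, and feeds that token to a classifier head; the sketch isolates only the patch-embedding and attention steps.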
In addition, the model was evaluated with a confusion matrix, which showed high true-positive and true-negative rates, confirming its suitability for practical use. Future research directions include optimizing the ViT architecture for real-time inference on edge devices, integrating detection systems with IoT-enabled surveillance networks, and applying few-shot learning or self-supervised methods to improve performance in low-data settings. This work highlights the promise of transformer-based methods in addressing the evolving nature of deepfake threats and preserving media integrity in the digital era.
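The reported accuracy, precision, and recall all derive from the four confusion-matrix counts in the standard way. The sketch below shows those formulas; the counts are hypothetical illustration values, not the paper's data.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # of images flagged fake, fraction truly fake
    recall = tp / (tp + fn)      # of actual fakes, fraction caught
    return accuracy, precision, recall

# Hypothetical counts for illustration only (not the paper's data)
acc, prec, rec = confusion_metrics(tp=90, fp=10, fn=5, tn=95)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
# accuracy=0.925 precision=0.900 recall=0.947
```

Precision and recall pull against each other via the detection threshold, which is why the paper reports both alongside accuracy.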

Keywords: Deepfake Detection, Vision Transformer, Machine Learning, Image Classification, Transformer Architecture

How to cite this article:
Devdatta Atul Jahirabadkar, Sankarshan Purushottam Joshi, Sakshi Avinash Kulkarni, Sharvani Nachiket Kulkarni, V. A. Chakkarwar. Secure Forge: Deepfake Image Detection Using Vision Transformers. Journal of Image Processing & Pattern Recognition Progress. 2025; 12(02):-.
How to cite this URL:
Devdatta Atul Jahirabadkar, Sankarshan Purushottam Joshi, Sakshi Avinash Kulkarni, Sharvani Nachiket Kulkarni, V. A. Chakkarwar. Secure Forge: Deepfake Image Detection Using Vision Transformers. Journal of Image Processing & Pattern Recognition Progress. 2025; 12(02):-. Available from: https://journals.stmjournals.com/joipprp/article=2025/view=0



References

  1. Chollet, F. (2017). Xception: Deep Learning with Depthwise Separable Convolutions. CVPR.
  2. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR.
  3. Dosovitskiy, A., et al. (2021). An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. ICLR.
  4. Wang, S., & Yu, F. (2022). Deepfake Detection with Vision Transformers. NeurIPS Workshops.
  5. Li, Y., & Lyu, S. (2020). Exposing DeepFake Videos by Detecting Face Warping Artifacts. CVPR Workshops.
  6. Kaggle. (2020). Deepfake Detection Challenge Dataset. Retrieved from: https://www.kaggle.com/c/deepfake-detection-challenge
  7. Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollar, P. (2017). Focal Loss for Dense Object Detection. ICCV.
  8. Simonyan, K., & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR.
  9. Ramesh, A., et al. (2021). Zero-Shot Text-to-Image Generation. ICML.
  10. Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You Only Look Once: Unified, Real-Time Object Detection. CVPR.

Ahead of Print | Subscription | Review Article
Volume 12
02
Received 02/05/2025
Accepted 07/05/2025
Published 15/05/2025
Publication Time 13 Days
