Image Classification Method Using a CNN-Transformer Hybrid Architecture that Integrates Local and Global Features
DOI: https://doi.org/10.62517/jike.202504316
Author(s)
Wei Lu*
Affiliation(s)
The National University of Malaysia, Bangi 43600, Selangor, Malaysia. *Corresponding Author
Abstract
In recent years, vision Transformers have achieved remarkable success in computer vision, but they remain limited in local feature extraction. This paper proposes a CNN-Transformer hybrid architecture that combines the local feature extraction capability of convolutional neural networks with the global modelling capability of Transformers to achieve efficient image classification on the CIFAR-100 dataset. The architecture first uses multi-scale CNN modules to extract hierarchical feature representations from images, then employs a Transformer encoder with positional encoding and multi-head self-attention to capture long-range dependencies between features, enhancing the model's ability to understand complex image content. Various data augmentation strategies and regularisation techniques, including random cropping, colour jittering, and Dropout, are applied to further improve the model's generalisation performance. Experimental results show that the proposed hybrid architecture achieves strong classification accuracy on CIFAR-100, with a stable training process and fast convergence. Ablation experiments validate the effectiveness of each module, with the number of Transformer layers and attention heads having the most significant impact on performance. This study provides new insights into hybrid architecture design for computer vision, offering theoretical value and practical application prospects.
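To make the described pipeline concrete, the following is a minimal sketch in PyTorch of a CNN-Transformer hybrid of this kind, together with the augmentation transforms named in the abstract (random cropping and colour jittering). The layer sizes, number of encoder layers, and head count are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import transforms

class HybridCNNTransformer(nn.Module):
    """CNN stem for local features followed by a Transformer encoder for global context."""
    def __init__(self, num_classes=100, embed_dim=256, num_heads=8, num_layers=4):
        super().__init__()
        # Multi-scale CNN stem: progressively downsamples 32x32 CIFAR-100 images to an 8x8 feature map
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, embed_dim, 3, stride=2, padding=1), nn.BatchNorm2d(embed_dim), nn.ReLU(),
        )
        # Learnable positional encoding over the 8x8 = 64 spatial tokens
        self.pos_embed = nn.Parameter(torch.zeros(1, 64, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=embed_dim * 4,
            dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.head = nn.Sequential(nn.LayerNorm(embed_dim), nn.Linear(embed_dim, num_classes))

    def forward(self, x):
        feats = self.cnn(x)                         # (B, C, 8, 8) local features
        tokens = feats.flatten(2).transpose(1, 2)   # (B, 64, C) token sequence
        tokens = tokens + self.pos_embed            # inject positional information
        tokens = self.encoder(tokens)               # global multi-head self-attention
        return self.head(tokens.mean(dim=1))        # average pooling + classifier

# Training-time augmentation corresponding to the strategies listed in the abstract
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),           # random cropping
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4),          # colour jittering
    transforms.ToTensor(),
])

model = HybridCNNTransformer()
logits = model(torch.randn(2, 3, 32, 32))           # (2, 100) class scores for CIFAR-100
```

In this sketch the CNN handles fine-grained local patterns while the encoder attends across all 64 spatial tokens, which is the division of labour the paper attributes to the hybrid design; Dropout appears inside the encoder layers as the regularisation mentioned in the abstract.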
Keywords
CNN-Transformer Hybrid Architecture; Image Classification; Self-Attention Mechanism; Feature Fusion; CIFAR-100