Research on Lightweight and NPU Hardware Acceleration of ViT Model Based on Pruning-Distillation-Quantization
DOI: https://doi.org/10.62517/jbdc.202601205
Author(s)
Meilin Deng
Affiliation(s)
Shenyang University of Technology, Shenyang, Liaoning, China
Abstract
This study addresses the core challenges of high computational complexity, large memory footprint, and poor hardware adaptability in visual Transformer-based whole-slide pathology image analysis. We propose a five-stage collaborative optimization architecture tailored for Ascend AI processors (pruning, distillation, quantization, hardware acceleration, and whole-slide deployment) that balances model lightweighting against hardware efficiency, providing a viable deployment path for real-time clinical pathology-assisted diagnosis. We introduce a “hardware–algorithm–domain co-optimization paradigm” that integrates Ascend NPU physical constraints with high-level pathological semantics to construct an end-to-end software–hardware acceleration pipeline. First, “model restructuring” is performed through hardware-aware design. During pruning, a multidimensional sensitivity evaluation function, incorporating medical relevance, NPU computational efficiency, and memory efficiency, guides structured pruning to remove low-contribution and hardware-inefficient parameters. After pruning, dimension alignment and reparameterization adapt the model to Ascend NPU computing units, improving computational density and instruction efficiency. The preprocessing pipeline is integrated into Ascend AIPP hardware units, with a learnable staining-enhancement matrix compensating for scanner variability, which improves preprocessing speed and cross-site robustness. Second, “knowledge transfer” in pathological image perception is achieved via a multi-level distillation strategy with lesion-aware weighting: the lightweight student model replicates not only the teacher’s final output but also its attention distribution and multi-scale feature representations in lesion areas, with added emphasis on diagnostically ambiguous samples.
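The pruning criterion described above can be sketched as a weighted combination of the three per-head statistics, followed by head selection aligned to a hardware-friendly multiple. This is a minimal illustrative sketch: the weights, the alignment granularity, and the random statistics below are assumptions, not values published by the paper.

```python
import numpy as np

def sensitivity_scores(medical_relevance, npu_efficiency, memory_efficiency,
                       weights=(0.5, 0.3, 0.2)):
    """Combine the three per-head statistics into one pruning score.
    The weights are illustrative assumptions, not the paper's values."""
    w_med, w_npu, w_mem = weights
    return (w_med * medical_relevance
            + w_npu * npu_efficiency
            + w_mem * memory_efficiency)

def select_heads(scores, keep_ratio=0.75, align=16):
    """Keep the highest-scoring attention heads, then round the kept count
    down to a multiple of `align` (dimension alignment for NPU units)."""
    n_keep = max(align, int(len(scores) * keep_ratio) // align * align)
    order = np.argsort(scores)[::-1]  # highest sensitivity first
    return np.sort(order[:n_keep])    # ascending indices of surviving heads

# 64 heads with random illustrative statistics
scores = sensitivity_scores(np.random.rand(64), np.random.rand(64),
                            np.random.rand(64))
kept = select_heads(scores, keep_ratio=0.6)
print(len(kept))  # kept count is a multiple of 16
```

In this sketch, alignment to a multiple of 16 stands in for matching the shape expectations of the Ascend compute units; the actual alignment rules depend on the target NPU generation.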
Quantization employs a task-aware heterogeneous-precision strategy, dynamically allocating per-layer precision to preserve diagnostic semantics while maximizing computational and storage compression. These techniques are integrated through an iterative closed-loop co-optimization process, enabling efficient deployment on Ascend NPUs. Experiments on the public BACH breast cancer dataset demonstrate significant efficiency gains with preserved accuracy: overall classification accuracy reaches 87.76%, the F1 score for in situ carcinoma identification is 92.8%, model size is reduced by more than 50%, single-WSI inference latency drops by more than 60%, and accuracy loss is kept within 1%. System monitoring confirms efficient resource utilization and stable CPU and I/O performance, validating hardware–software optimizations such as “whole-image sinking.”
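The heterogeneous-precision allocation can be pictured as a greedy budgeted assignment: every layer starts at the lowest bit-width and the most sensitive layers are promoted until an average bit-width budget is reached. The bit-widths, sensitivity threshold, and budget below are illustrative assumptions; the paper's actual allocation rule is not reproduced here.

```python
def assign_precision(layer_sensitivity, budget_bits=8.0):
    """Greedy heterogeneous-precision allocation (illustrative sketch).

    Start everything at INT4, then promote layers in descending order of
    task sensitivity to FP16 (very sensitive) or INT8, accepting each
    promotion only while the average bit-width stays within budget."""
    n = len(layer_sensitivity)
    bits = [4] * n
    order = sorted(range(n), key=lambda i: -layer_sensitivity[i])
    for i in order:
        trial = bits[:]
        trial[i] = 16 if layer_sensitivity[i] > 0.9 else 8
        if sum(trial) / n <= budget_bits:
            bits = trial
    return bits

# Six layers with illustrative sensitivity scores in [0, 1]
sens = [0.95, 0.2, 0.6, 0.85, 0.1, 0.4]
print(assign_precision(sens))  # → [16, 4, 8, 8, 4, 8]
```

The greedy pass keeps the diagnostically critical layers at high precision while the remaining layers absorb the compression, which mirrors the "preserve diagnostic semantics under a compression budget" trade-off described in the abstract.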
Keywords
Whole Slide Image; Model Compression; Model Pruning; Knowledge Distillation; Model Quantization; Hardware Acceleration; Breast Cancer Classification; Real-Time Inference