Comparative Analysis of CNNs and Vision Transformers for Age Estimation

Waqar Tanveer; Laura Fernández Robles; Eduardo Fidalgo Fernández; Víctor González Castro; Enrique Alegre Gutiérrez; Milad Mirjalili

Tanveer, Waqar [1]; Fernández-Robles, Laura [1]; Fidalgo, Eduardo [1]; González-Castro, Víctor [1]; Alegre, Enrique [1]; Mirjalili, Milad [1]
[1] Universidad de León, León, Spain
Published in: Jornadas de Automática, e-ISSN 3045-4093, No. 46, 2025
Language: English
DOI: 10.17979/ja-cea.2025.46.12251
Abstract
  Vision Transformers have recently gained significant importance in computer vision tasks due to their self-attention mechanisms. Previously, CNNs dominated the computer vision field, achieving remarkable results in applications such as image classification and object recognition, among others. With the arrival of Vision Transformers, however, intense competition has emerged between the two families of models. This paper presents a comparative analysis of the performance of CNNs and Vision Transformers on the task of age estimation using the FG-NET and UTKFace datasets. We performed age estimation with six models: three CNNs (VGG-16, ResNet-50, EfficientNet-B0) and three Vision Transformers (ViT, CaiT, Swin). Our experimental results show that the Swin Transformer outperformed both the CNNs and the other Vision Transformers, achieving a mean absolute error (MAE) of 2.79 years on FG-NET and 4.37 years on UTKFace.
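The abstract reports results as mean absolute error (MAE), that is, the average absolute difference in years between predicted and true ages. A minimal sketch of that metric follows. The function name `mean_absolute_error` and the sample ages are hypothetical and chosen only for illustration; they are not the paper's code or data.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference, in years, between true and predicted ages."""
    if len(true_ages) != len(predicted_ages):
        raise ValueError("inputs must have the same length")
    if not true_ages:
        raise ValueError("inputs must not be empty")
    total = sum(abs(t - p) for t, p in zip(true_ages, predicted_ages))
    return total / len(true_ages)


# Hypothetical values for illustration only; not data from the paper.
true_ages = [23, 41, 8, 65]
predicted_ages = [25.0, 38.5, 10.0, 61.5]

mae = mean_absolute_error(true_ages, predicted_ages)
print(f"MAE: {mae:.2f} years")  # prints "MAE: 2.50 years"
```

A lower MAE means predictions are closer to the true ages on average, which is the sense in which the abstract ranks Swin ahead of the other five models.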