Improving Aphasic Communication Using Multimodal AI Systems

Isabel Ferri Molla; Jordi Linares Pellicer; Juan Izquierdo Doménech

Isabel Ferri-Molla [1]; Jordi Linares-Pellicer [1]; Juan Izquierdo-Domenech [1]
[1] Universidad Politécnica de Valencia, Valencia, Spain
Published in: IJIMAI, ISSN-e 1989-1660, Vol. 9, No. 7, 2026, pp. 67-77
Language: English
DOI: 10.9781/ijimai.2026.2215
Abstract
- Aphasia, often resulting from brain injuries, significantly impairs individuals’ language abilities and creates substantial challenges for verbal communication. Existing assistive technologies frequently fall short of these specialised communication needs, which underscores the demand for adaptive, intelligent support systems. This research proposes a dual approach: an Automatic Speech Recognition (ASR) module fine-tuned on aphasic speech, and a multimodal component that integrates visual context to infer the speaker’s intended meaning. The ASR system uses versions of Whisper and Wav2Vec 2.0 fine-tuned on data from the AphasiaBank corpus. Compared with the base pre-trained ASR models, the fine-tuned versions reduce Word Error Rate (WER) from 70.36% to 31.53% in a context-independent setting and from 61.25% to 35.60% in a speaker-independent evaluation, demonstrating robustness across different scenarios. Unlike the ASR module, the multimodal component does not aim to produce a literal word-by-word transcription; its goal is to reconstruct the speaker’s communicative intent using contextual information. To evaluate this capability, we conducted a human study assessing the system’s ability to interpret what the speaker truly meant. The results confirmed that outputs combining visual cues with language model reasoning captured communicative intent more reliably than audio-only transcriptions.
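
The WER figures in the abstract can be read with the standard definition of the metric: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch of that definition, not the authors' evaluation code. The function names and the example sentences are hypothetical, and the record does not say whether the paper averages WER per utterance or pools it over the corpus.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix processed so far
    # and the first j hypothesis words (row-by-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error that was removed."""
    return (before - after) / before


if __name__ == "__main__":
    # Hypothetical sentence pair, for illustration only (not from AphasiaBank).
    wer = word_error_rate("i want a glass of water", "i want glass water")
    print(f"example WER: {wer:.2%}")

    # WER values reported in the abstract (base vs. fine-tuned models).
    print(f"first setting:  {relative_reduction(70.36, 31.53):.1%} relative reduction")
    print(f"second setting: {relative_reduction(61.25, 35.60):.1%} relative reduction")
```

Because WER is normalised by the reference length, it can exceed 100% when the hypothesis contains many insertions, so it is an error ratio and not a bounded accuracy score.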