Valencia, Spain
Aphasia, often resulting from brain injuries, significantly impairs individuals' language abilities and creates substantial challenges for verbal communication. Existing assistive technologies frequently fall short of these specialised communication needs, which underscores the demand for adaptive, intelligent support systems. This research proposes a dual approach: an Automatic Speech Recognition (ASR) module fine-tuned on aphasic speech, and a multimodal component that integrates visual context to infer the speaker's intended meaning. The ASR module fine-tunes Whisper and Wav2Vec 2.0 on data from the AphasiaBank corpus. Compared with the base pre-trained ASR models, the fine-tuned versions show a notable reduction in Word Error Rate (WER): from 70.36% to 31.53% in a context-independent setting, and from 61.25% to 35.60% in a speaker-independent evaluation, which indicates robustness across different scenarios. In contrast to the ASR module, the goal of the multimodal component is not to produce a literal word-by-word transcription, but to reconstruct the speaker's communicative intent using contextual information. To evaluate this capability, we conducted a human study assessing the system's ability to interpret what the speaker truly meant. The results confirmed that outputs combining visual cues with language model reasoning captured communicative intent more reliably than audio-only transcriptions.
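For readers unfamiliar with the metric, the following is a minimal sketch of how WER is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the ASR hypothesis, divided by the number of reference words. The function names and the example sentence are illustrative assumptions and are not taken from the study, which may have applied additional text normalisation before scoring. The helper `relative_reduction` simply restates the reported absolute WER figures as relative reductions, roughly 55% and 42% respectively.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate that was removed."""
    return (before - after) / before


if __name__ == "__main__":
    # Invented example: one reference word is missing from the hypothesis.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))

    # WER figures reported in the abstract (percent).
    print(round(relative_reduction(70.36, 31.53), 3))  # context-independent
    print(round(relative_reduction(61.25, 35.60), 3))  # speaker-independent
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference, which is worth keeping in mind when reading scores on disfluent speech.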