Abstract of Overview of PastReader at IberLEF 2025: transcribing texts from the past

Arturo Montejo Ráez, Elena Sánchez Nogales, Gloria Expósito Álvarez, Luis Alfonso Ureña López, María Teresa Martín Valdivia, Jaime Collado Montañez, Manuel Carlos Díaz Galiano, Isabel Cabrera de Castro, M.ª Victoria Cantero Romero, Rocío Ortuño Casanova

Abstract
The PastReader 2025 task, held as part of IberLEF 2025, focuses on the automatic transcription of digitized historical Spanish press. It draws on the Hemeroteca Digital of the Biblioteca Nacional de España, a collection that belongs to the Biblioteca Digital Hispánica project and gathers millions of pages of newspapers and magazines representative of the thematic and stylistic diversity of the Hispanic press. Although the documents are available as PDFs with OCR, the quality of the extracted text is often poor because of deteriorated scans, irregular page layouts, archaic spelling, and other visual problems. To advance the automation of this process, the task proposes two challenges: correcting OCR errors, and generating curated text from scanned images using multimodal models. The main objective is to reduce the need for human intervention in mass digitization, promoting systems that improve the accessibility, retrieval, and preservation of Spanish periodical heritage through robust and efficient technological solutions.
