Creación de un Modelo de Descripciones de Imágenes Especializado en Arqueología Griega

Enrique Garcia Arias; Ana M. García Serrano

Source: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 75, 2025 (Issue dedicated to: Procesamiento del Lenguaje Natural, Journal No. 75, September 2025), pp. 161-172
Language: Spanish
Parallel titles:
- Creating an Image Description Model Specialized in Greek Archaeology
Abstract
- English
  The automated generation of image descriptions (IM, Image Captioning) has advanced considerably in recent years with the incorporation of LLMs (Large Language Models). In general-purpose contexts the results are fairly accurate; in specialized domains, however, significant challenges remain, as in the Arqueogriegos project. The multimodal corpus of this work consists of photos, plans, and texts from an archaeological context, covering sites, artifacts, and their historical setting. This domain is particularly complex because the images are decontextualized and lack an adequate descriptive text (caption). The main objective of this study is to generate optimized automatic descriptions that bridge this disconnect between images and texts, addressing the limitations of isolated archaeological images. To that end, rather than relying on direct or API-based solutions, which proved insufficient for the complexity of the problem, an innovative methodology was designed that divides the key components into phases and, in each phase, evaluates and implements the most effective solution. This methodology is the main contribution of the work, as it overcomes the shortcomings of existing IM models and multimodal LLMs.