An effective and efficient web news extraction technique for an operational newsIR system

Authors: Javier Parapar, Álvaro Barreiro
Published in: XII Conferencia de la Asociación Española para la Inteligencia Artificial (CAEPIA 2007). Proceedings / coordinated by Daniel Borrajo Millán, Luis Castillo Vidal, Juan Manuel Corchado Rodríguez, Vol. 2, 2007, ISBN 978-84-611-8848-2, pp. 319-329
Language: English
Abstract
- Web information extraction, and web news extraction in particular, is an open research problem and a key component of NewsIR systems. Current techniques fall short in the quality of their results, their high computational cost, or their need for human intervention, all of which are critical issues in a real system. We present an automated approach to news recognition and extraction, based on a set of heuristics about article structure, that is currently applied in an operational system. We also built a data set to evaluate web news extraction methods. On this collection of international news, composed of 4869 web pages from 15 different online sources, our approach achieved 97% precision and 94% recall on the news recognition and extraction task.
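
The abstract reports precision and recall without defining them. As a minimal sketch, assuming the standard set-based definitions (the record does not say how the authors computed their figures), the two measures can be calculated as follows. The function name `precision_recall` and the page identifiers are hypothetical and invented for illustration; they are not taken from the paper or its data set.

```python
def precision_recall(extracted, relevant):
    """Return (precision, recall) for a set of extracted items.

    extracted: items the system labelled as news articles
    relevant:  items that really are news articles (the gold standard)
    """
    extracted, relevant = set(extracted), set(relevant)
    true_positives = len(extracted & relevant)
    # Precision: share of extracted items that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of relevant items that were extracted.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy example: identifiers are made up for illustration.
gold = {"a1", "a2", "a3", "a4"}
system = {"a1", "a2", "a3", "x9"}
p, r = precision_recall(system, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case three of the four extracted pages are correct and three of the four true articles are found, so both measures come out at 0.75.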