Discovering data dependencies in web content mining

  • Cortizo, José C.; Giráldez, Ignacio [1]
    1. [1] Universidad Europea de Madrid, Madrid, Spain

  • Published in: Proceedings of the IADIS International Conference WWW/INTERNET 2004: Madrid, Spain, October 6-9, 2004 / edited by Pedro Isaías, Nitya Karmakar, Vol. 2, 2004 (Short Papers-Posters), ISBN 972-99353-0-0, pp. 881-884
  • Language: English
  • Full text not available
  • Abstract
    • Web content mining makes it possible to use the data presented in web pages to discover interesting and useful patterns. Our web mining tool, FBL (Filtered Bayesian Learning), works in two stages: it first analyzes the data present in a web page and then, using information about the data dependencies it finds, performs the mining phase with Bayesian learning. The naive Bayes classifier assumes that attribute values are conditionally independent given the class. This assumption lets it perform very well in some data domains, but it performs poorly when attributes are dependent. In this paper, we try to identify those dependencies using linear regression on the attribute values, and then eliminate the attributes that are a linear combination of one or two others. We tested this system on six web domains (extracting the data by parsing the HTML), to each of which we added a synthetic attribute that is a linear combination of two of the original ones. The system detects those synthetic attributes perfectly, as well as some "natural" dependent attributes, and obtains a more accurate classifier.
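
The filtering step described in the abstract can be illustrated with a small sketch. This is not the FBL implementation, whose details the record does not give. It assumes ordinary least squares with an intercept, an R² threshold of 0.999 as the test for "is a linear combination", and a last-to-first scan so that a derived attribute is dropped before the attributes it was built from. The function names and the sample data are hypothetical.

```python
from itertools import combinations


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination; return None if singular."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return None
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(columns, target, predictors):
    """R^2 of a least-squares fit of one attribute on others plus an intercept."""
    y = columns[target]
    xs = [[1.0] * len(y)] + [columns[p] for p in predictors]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(u * v for u, v in zip(xi, xj)) for xj in xs] for xi in xs]
    xty = [sum(u * v for u, v in zip(xi, y)) for xi in xs]
    beta = solve_linear_system(xtx, xty)
    if beta is None:
        return 0.0  # predictors are collinear among themselves: skip this fit
    mean_y = sum(y) / len(y)
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    if ss_tot == 0.0:
        return 0.0  # constant attribute: not handled by this sketch
    ss_res = sum(
        (v - sum(coef * x[i] for coef, x in zip(beta, xs))) ** 2
        for i, v in enumerate(y)
    )
    return 1.0 - ss_res / ss_tot


def filter_dependent_attributes(columns, threshold=0.999):
    """Drop attributes that are (nearly) linear in one or two kept attributes.

    Assumption: attributes are scanned from last to first, and a removed
    attribute is no longer available as a predictor.
    """
    kept = list(range(len(columns)))
    removed = []
    for target in reversed(range(len(columns))):
        others = [c for c in kept if c != target]
        candidates = [(p,) for p in others] + list(combinations(others, 2))
        if any(r_squared(columns, target, c) >= threshold for c in candidates):
            kept.remove(target)
            removed.append(target)
    return kept, removed


# Hypothetical data: three original attributes plus a synthetic fourth one
# built as 2*a - 3*b, mirroring the experiment described in the abstract.
a = [1, 2, 3, 4, 5, 6]
b = [2, 1, 4, 3, 6, 5]
c = [5, 3, 8, 1, 9, 2]
synthetic = [2 * x - 3 * y for x, y in zip(a, b)]
kept, removed = filter_dependent_attributes([a, b, c, synthetic])
```

The attributes whose indices end up in `kept` would then be passed to the naive Bayes learner. The scan order matters: if `synthetic` were tested last, `a` or `b` could be discarded instead, since each of them is also a linear combination of the other two.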

