Discovering data dependencies in web content mining

José Carlos Cortizo Pérez; Ignacio Giráldez

Cortizo, José C.; Giráldez, Ignacio [1]
[1] Universidad Europea de Madrid, Madrid, Spain
Published in: Proceedings of the IADIS International Conference WWW/INTERNET 2004: Madrid, Spain, October 6-9, 2004 / coordinated by Pedro Isaías, Nitya Karmakar, Vol. 2, 2004 (Short Papers-Posters), ISBN 972-99353-0-0, pp. 881-884
Language: English
Abstract
- Web content mining makes it possible to use the data presented in web pages to discover interesting and useful patterns. Our web mining tool, FBL (Filtered Bayesian Learning), works in two stages: it first analyzes the data present in a web page and then, using information about the data dependencies it has found, performs the mining phase based on Bayesian learning. The Naive Bayes classifier assumes that the attribute values are conditionally independent given the class. This makes it perform very well in some data domains, but it performs poorly when attributes are dependent. In this paper, we try to identify those dependencies using linear regression on the attribute values and then eliminate the attributes that are a linear combination of one or two others. We tested the system on six web domains (extracting the data by parsing the HTML), to each of which we added a synthetic attribute that is a linear combination of two of the original ones. The system detects those synthetic attributes perfectly, as well as some "natural" dependent attributes, and obtains a more accurate classifier.
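The filtering stage described in the abstract can be illustrated with a minimal sketch. It is not the FBL implementation, which this record does not include. It assumes numeric attributes held in a NumPy array, an ordinary least-squares fit with an intercept, and an R² cutoff of 0.999 for calling an attribute "a linear combination of one or two others". The names `r_squared` and `filter_dependent`, the threshold, and the choice to examine later columns first are all hypothetical.

```python
from itertools import combinations

import numpy as np


def r_squared(X, y):
    """R^2 of a least-squares fit of y on the columns of X plus an intercept."""
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    centered = y - y.mean()
    ss_tot = float(centered @ centered)
    if ss_tot == 0.0:
        # A constant attribute has no variance to explain; do not flag it here.
        return 0.0
    return 1.0 - float(residual @ residual) / ss_tot


def filter_dependent(data, threshold=0.999):
    """Return the indices of the columns kept after dropping dependent ones.

    A column is dropped when one or two of the columns still kept explain it
    with R^2 >= threshold (an assumed cutoff, not taken from the paper).
    """
    kept = list(range(data.shape[1]))
    # Examine later columns first: an arbitrary ordering chosen for this sketch.
    for target in reversed(range(data.shape[1])):
        others = [j for j in kept if j != target]
        predictor_sets = [p for k in (1, 2) for p in combinations(others, k)]
        if any(
            r_squared(data[:, list(p)], data[:, target]) >= threshold
            for p in predictor_sets
        ):
            kept.remove(target)
    return kept


# Usage: three independent attributes plus a synthetic linear combination.
rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 200))
synthetic = 2.0 * a - 3.0 * b + 1.0
data = np.column_stack([a, b, c, synthetic])
print(filter_dependent(data))
```

Because each attribute is tested only against attributes that are still kept, one member of a dependent group is removed rather than the whole group. The surviving columns could then be passed to a Naive Bayes learner.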