Documentos duplicados y casi duplicados en el Web: detección con técnicas de hashing borroso [Duplicate and near-duplicate documents on the Web: detection with fuzzy hashing techniques]

Luis Carlos García de Figuerola Paniagua; Raquel Gómez Díaz; José Luis Alonso Berrocal; Ángel Francisco Zazo Rodríguez

Published in: Scire: Representación y organización del conocimiento, ISSN 1135-3716, Vol. 17, No. 1, 2011, pp. 49-54
Language: Spanish
DOI: 10.54886/scire.v17i1.3895

References
- Bar-Ilan, J. (2005). Expectations versus reality: search engine features needed for web research at mid 2005. // Cybermetrics 9:1 (2005).
- Bharat, K.; Broder, A. (1999). Mirror, mirror on the web: A study of host pairs with replicated content. // Computer Networks. 31:11-16 (1999)...
- Chowdhury, A. (2004). Duplicate data detection. http://gogamza.mireene.co.kr/wpcontent/uploads/1/XbsrPeUgh6.pdf (2011-01-13).
- Chowdhury, A.; Frieder, O.; Grossman, D.; McCabe, M. (2002). Collection statistics for fast duplicate document detection. // ACM Transactions...
- Clarke, C. L.; Craswell, N.; Soboroff, I. (2009). Overview of the TREC 2009 Web Track // Proceedings of the 18th Text REtrieval Conference,...
- Damerau, F. (1964). A technique for computer detection and correction of spelling errors. // Communications of the ACM. 3, 171-176.
- Figuerola, C. G.; Alonso Berrocal, J. L.; Zazo Rodríguez, A. F.; Rodriguez Vázquez de Aldana, E. (2006). Diseño de spiders. // Tech. Rep....
- Figuerola, C. G.; Gómez Díaz, R.; Alonso Berrocal, J. L.; Zazo Rodríguez, A. F. (2010). Proyecto 7: un motor de recuperación web colaborativo....
- Hamming, R. (1950). Error detecting and error correcting codes. // Bell System Technical Journal. 29:2, 147-160.
- Kornblum, J. (2006). Identifying almost identical files using context triggered piecewise hashing. // Digital investigation. 3, 91-97.
- Kornblum, J. (2010). Beyond fuzzy hash. // US Digital Forensic and Incident Response Summit 2010 (2010). http://computer-forensics.sans.org/...
- Kornblum, J. (2010). Fuzzy hashing and ssdeep. http://ssdeep.sourceforge.net/ (2011-01-13).
- Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions, and reversals. // Soviet Physics Doklady. 10:8, 707-710.
- Milenko, D. (2010). ssdeep 2.5. python wrapper for ssdeep library. http://pypi.python.org/pypi/ssdeep (2011-01-13).
- Navarro, G. (2001). A guided tour to approximate string matching. // ACM Computing Surveys (CSUR). 33:1, 31-88.
- Pugh, W.; Henzinger, M. H. (2003). Detecting Duplicate and Near Duplicate Files. United States Patent 6,658,423.
- Soukoreff, R.; MacKenzie, I. (2001). Measuring errors in text entry tasks: an application of the Levenshtein string distance statistic. //...
- Tan, P.; Steinbach, M.; Kumar, V.; et al. (2006). Introduction to data mining. Pearson Addison Wesley: Boston (2006).
- Tridgell, A. (2002). Spamsum overview and code. http://samba.org/ftp/unpacked/junkcode/spamsum (2011-01-13).
- Tridgell, A.; Mackerras, P. (2004). The rsync algorithm. http://dspace-prod1.anu.edu.au/bitstream/1885/40765/2/TR-CS-96-05.pdf (2011-01-13).
- Yahoo! (2011). Yahoo Developer Network. http://developer.yahoo.com (2011-01-13).
- Yerra, R.; Ng, Y. (2005). Detecting similar html documents using a fuzzy set information retrieval approach. // 2005 IEEE International Conference...