Metodología de prevención del cibercrimen mediante Web Scraping y procesamiento del lenguaje natural para la detección de filtraciones de datos en la Dark Web

Noelia Rico Pachón; Facundo Gallo-Serpillo; Raquel Barroso Reyes

Rico Pachón, Noelia ^[1] ; Gallo-Serpillo, Facundo ^[2] ; Barroso, Raquel ^[3]
1. [1] Universidad de Oviedo, Oviedo, Spain
2. [2] Universidad Internacional de La Rioja, Logroño, Spain
3. [3] Ewala IT Services
Source: Ciencia policial, ISSN 1886-5577, No. 184, 2025, pp. 87-113
Language: Spanish
DOI: 10.14201/cp.32279
Parallel titles:
- Cybercrime Prevention Methodology Using Web Scraping and Natural Language Processing for the Detection of Data Leaks in the Dark Web
Abstract
This paper sets out a methodology for capturing and processing leaked data that has been put up for sale on the Dark Web, a part of the internet with a high prevalence of criminal content. A major challenge in collecting and processing critical data sold on the Dark Web is the volatility of the content and its lack of structure. To address this, the paper proposes a methodology based on Web Scraping and Natural Language Processing (NLP) techniques for detecting sensitive data published on the Internet black market, with the aim of preventing its use in extortion, identity theft, disclosure of secrets, and other types of cybercrime.

According to the authors, the software developed represents a significant advance for social welfare, since it enables automated monitoring and interpretation of data leaked on the Dark Web (e.g. credit card details, personal identification numbers, etc.), thereby helping to safeguard the privacy of the individuals and organisations affected. Moreover, by providing a faster and more precise approach to these threats, adoption of the architecture proposed in this work promotes a safer online environment for all users.
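The abstract does not describe how the authors' software recognises leaked items, so the following is only a minimal sketch of one common way to flag candidate credit card numbers in scraped, unstructured text. The regular expression, the function names `luhn_valid` and `find_card_numbers`, and the use of the Luhn checksum are all assumptions made for illustration; they are not taken from the paper.

```python
import re

# Assumption: a candidate card number is 13 to 19 digits, optionally
# separated by single spaces or hyphens, and not embedded in a longer digit run.
CARD_CANDIDATE = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return normalised digit strings that look like card numbers and pass Luhn."""
    found = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            found.append(digits)
    return found


if __name__ == "__main__":
    # 4111 1111 1111 1111 is a widely used test number, not a real card.
    sample = "dump: 4111-1111-1111-1111 | exp 12/27"
    print(find_card_numbers(sample))
```

The checksum step filters out digit runs such as order identifiers that merely have the right length. A real pipeline would still need further context checks, since a string passing Luhn is not necessarily a payment card.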