Análisis y Diseño de un Modelo Predictivo para Detección de Phishing Basado en Url y Corpus del Correo Electrónico

Dolores Fernanda Albán Toapanta; Ménthor Oswaldo Urvina Mayorga; Roberto Omar Andrade Paredes

Published in: Revista Politécnica, e-ISSN 2477-8990, Vol. 50, No. 3, 2022 (issue dedicated to: Revista Politécnica), pp. 27-42
Language: Spanish
DOI: 10.33333/rp.vol50n3.03
Parallel titles:
- Analysis and Design of a Predictive Model for Phishing Detection Based on Url and Email Corpus
Abstract
  Phishing is one of the most reported cybercrimes worldwide, and various anti-phishing systems (APS) are currently being developed to identify this type of attack on communication systems in real time. Despite the efforts of organizations, the attack continues to grow. The causes include erroneous detection of zero-day attacks, high computational cost, and high rates of forgery. Although the Machine Learning (ML) approach has achieved a favorable accuracy rate, the choice and performance of the feature vector is key to obtaining an adequate level of accuracy. This work proposes a predictive model based on ML and on an analysis of the efficiency of several anti-phishing schemes, which served to understand the problem. The proposed model includes a feature selection module that is used to build the final feature vector. The features are extracted from the URL, the properties of the web page, and the email corpus. The system uses the Random Forest (RF) and Naïve Bayes (NB) classification models, which were trained on the feature vector. The experiments were based on datasets composed of phishing and benign instances. Using cross-validation, the experimental results indicate a precision of 97.5% for the datasets used, while a precision of 96.5% was obtained for the local-level approach of this research.
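The feature-vector construction mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not list the features actually used, so every feature below, the keyword list `SUSPICIOUS_WORDS`, and the function names are assumptions chosen for illustration only; the web-page properties and the RF/NB classifiers are left out to keep the example dependency-free.

```python
import ipaddress
import re
from urllib.parse import urlparse

# Hypothetical keyword list; the vocabulary used in the paper is not given.
SUSPICIOUS_WORDS = ("verify", "account", "password", "urgent", "login")


def host_is_ip(host):
    """Return True when the host part of a URL is a literal IP address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_features(url):
    """Lexical URL features (illustrative choices, not the paper's list)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return [
        len(url),                       # overall URL length
        host.count("."),                # number of dots in the host
        int("@" in url),                # '@' can hide the real destination
        int("-" in host),               # hyphenated host name
        int(host_is_ip(host)),          # raw IP address instead of a domain
        int(parsed.scheme == "https"),  # whether HTTPS is used
    ]


def email_features(body):
    """Simple email-corpus features: keyword hits and number of links."""
    tokens = re.findall(r"[a-z]+", body.lower())
    links = re.findall(r"https?://\S+", body)
    return [
        sum(tokens.count(word) for word in SUSPICIOUS_WORDS),
        len(links),
    ]


def feature_vector(url, body):
    """Concatenate URL and email features into one flat vector."""
    return url_features(url) + email_features(body)


if __name__ == "__main__":
    print(feature_vector(
        "http://192.168.0.1/login",
        "Urgent: verify your account at http://192.168.0.1/login",
    ))
```

Vectors of this shape, one per labeled instance, are what a classifier such as Random Forest or Naïve Bayes would be trained on and evaluated against with cross-validation.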