Classifying Pastebin Content Through the Generation of PasteCC Labeled Dataset

Riesco, Adrián; Fidalgo, Eduardo; Al-Nabki, Mhd Wesam; Jáñez-Martino, Francisco; Alegre, Enrique

doi:10.1007/978-3-030-29859-3_39

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11734)

Included in the following conference series:

International Conference on Hybrid Artificial Intelligence Systems

Abstract

Online notepad services allow users to upload and share free text anonymously. On Pastebin, one of the most popular online notepad services, it is possible to find textual content that could be related to illegal activities, such as leaks of personal information or hyperlinks to multimedia files containing child sexual abuse images or videos. An automatic approach to monitoring and detecting these activities in such an active and dynamic environment could help Law Enforcement Agencies fight cybercrime. In this work, we present Pastes Content Classification 17K (PasteCC_17K), a dataset of 17640 textual samples crawled from Pastebin and classified into 15 categories, 6 of which are suspected of being related to illegal activities. We used PasteCC_17K to evaluate two well-known text representation techniques, each combined with three different supervised classifiers, to classify the pastes of the Pastebin website. We found that the best performance is achieved by combining TF-IDF encoding with Logistic Regression, which obtains an accuracy of 98.63%. The proposed model could assist the authorities in detecting suspicious content shared on Pastebin.
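To make the best-performing combination concrete, the following is a minimal sketch of TF-IDF encoding followed by Logistic Regression. It is not the authors' implementation: it uses only the Python standard library, a binary suspicious/benign split instead of the 15 PasteCC_17K categories, and four invented toy pastes. The function names, the smoothed IDF variant, the learning rate and the epoch count are all assumptions made for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer (an assumption; real pastes need more care).
    return text.lower().split()

def fit_idf(docs):
    """Smoothed inverse document frequency for every term seen in docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(doc, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen terms are dropped."""
    tf = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(vectors, labels, epochs=200, lr=0.5):
    """Binary logistic regression fitted with plain stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            z = bias + sum(weights.get(t, 0.0) * v for t, v in vec.items())
            error = sigmoid(z) - y  # gradient of the log-loss with respect to z
            for t, v in vec.items():
                weights[t] = weights.get(t, 0.0) - lr * error * v
            bias -= lr * error
    return weights, bias

def predict(doc, idf, weights, bias):
    vec = tfidf(doc, idf)
    z = bias + sum(weights.get(t, 0.0) * v for t, v in vec.items())
    return 1 if sigmoid(z) >= 0.5 else 0

# Hypothetical toy pastes: 1 = suspicious (leak-like), 0 = benign (source code).
train_docs = [
    "email password dump leaked accounts",
    "leaked credentials email password list",
    "def main print hello world",
    "import sys def parse args return",
]
train_labels = [1, 1, 0, 0]

idf = fit_idf(train_docs)
vectors = [tfidf(d, idf) for d in train_docs]
weights, bias = train_logreg(vectors, train_labels)

print(predict("leaked password dump", idf, weights, bias))
print(predict("def print return", idf, weights, bias))
```

In practice a library such as scikit-learn would replace the hand-written vectorizer and optimizer, and a multi-class setting would be needed to cover all 15 categories.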

Notes

1. http://piratepad.net
2. http://codepad.org/
3. http://pastebin.com
4. https://github.com/kevthehermit/PasteHunter
5. https://github.com/CIRCL/AIL-framework
6. https://github.com/isuru-c/LeakHawk
7. http://gvis.unileon.es/dataset/paste-bin
8. http://gvis.unileon.es/dataset/duta-darknet-usage-text-addresses/
9. Machine Learning library for Python (source: http://scikit-learn.org/stable).

Acknowledgements

This research is supported by the INCIBE grant “INCIBEI-2015-27359”, corresponding to the “Ayudas para la Excelencia de los Equipos de Investigación avanzada en ciberseguridad” and also by the framework agreement between the University of León and INCIBE (Spanish National Cybersecurity Institute) under Addendum 22 and 01.

Author information

Authors and Affiliations

Summer Internship at the Universidad de León with the VARP Research Group, León, Spain
Adrián Riesco
Department of Electrical, Systems and Automation, Universidad de León, León, Spain
Eduardo Fidalgo, Mhd Wesam Al-Nabki & Enrique Alegre
Researcher at INCIBE (Spanish National Cybersecurity Institute), León, Spain
Eduardo Fidalgo, Mhd Wesam Al-Nabki, Francisco Jáñez-Martino & Enrique Alegre

Corresponding authors

Correspondence to Adrián Riesco, Eduardo Fidalgo, Mhd Wesam Al-Nabki, Francisco Jáñez-Martino or Enrique Alegre.

Editor information

Editors and Affiliations

University of León, León, Spain
Hilde Pérez García
University of León, León, Spain
Lidia Sánchez González
University of León, León, Spain
Manuel Castejón Limas
University of A Coruña, Ferrol, Spain
Héctor Quintián Pardo
University of Salamanca, Salamanca, Spain
Emilio Corchado Rodríguez

Cite this paper

Riesco, A., Fidalgo, E., Al-Nabki, M.W., Jáñez-Martino, F., Alegre, E. (2019). Classifying Pastebin Content Through the Generation of PasteCC Labeled Dataset. In: Pérez García, H., Sánchez González, L., Castejón Limas, M., Quintián Pardo, H., Corchado Rodríguez, E. (eds) Hybrid Artificial Intelligent Systems. HAIS 2019. Lecture Notes in Computer Science, vol 11734. Springer, Cham. https://doi.org/10.1007/978-3-030-29859-3_39

DOI: https://doi.org/10.1007/978-3-030-29859-3_39
Published: 26 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-29858-6
Online ISBN: 978-3-030-29859-3
eBook Packages: Computer Science, Computer Science (R0)
