Classifying Pastebin Content Through the Generation of PasteCC Labeled Dataset

  • Conference paper
Hybrid Artificial Intelligent Systems (HAIS 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11734)

Abstract

Online notepad services allow users to upload and share free text anonymously. On Pastebin, one of the most popular online notepad services, it is possible to find textual content that could be related to illegal activities, such as leaks of personal information or hyperlinks to multimedia files containing child sexual abuse images or videos. An automatic approach to monitor and detect these activities in such an active and dynamic environment could help Law Enforcement Agencies fight cybercrime. In this work, we present Pastes Content Classification 17K (PasteCC_17K), a dataset of 17640 textual samples crawled from Pastebin and classified into 15 categories, 6 of which are suspected of being related to illegal activities. We used PasteCC_17K to evaluate two well-known text representation techniques, each combined with three different supervised approaches, to classify the pastes of the Pastebin website. We found that the best performance is achieved by combining TF-IDF encoding with Logistic Regression, which obtains an accuracy of 98.63%. The proposed model could assist the authorities in detecting suspicious content shared on Pastebin.
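
To make the TF-IDF encoding named in the abstract concrete, the following is a minimal sketch using only the Python standard library. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the L2 normalization, the function names, and the three toy pastes are all assumptions chosen for illustration, since the abstract does not state the exact weighting scheme.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the paper's
    # preprocessing is not described in the abstract.
    return text.lower().split()


def fit_idf(docs):
    # Document frequency: in how many documents each term appears.
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    # Assumption: a common smoothed variant, idf = ln((1 + N) / (1 + df)) + 1.
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def tfidf_vector(doc, idf):
    # Raw term counts, restricted to the vocabulary seen when fitting.
    counts = Counter(t for t in tokenize(doc) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    # L2-normalize so that document length does not dominate.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


# Hypothetical toy pastes, not samples from PasteCC_17K.
pastes = [
    "password dump email password",
    "email list dump",
    "python function def return",
]
idf = fit_idf(pastes)
vec = tfidf_vector(pastes[0], idf)
```

Terms that are frequent in one paste but rare across the collection (here "password") receive the largest weights. In a full pipeline, sparse vectors of this kind would be passed to a supervised classifier such as Logistic Regression.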

Notes

  1. http://piratepad.net

  2. http://codepad.org/

  3. http://pastebin.com

  4. https://github.com/kevthehermit/PasteHunter

  5. https://github.com/CIRCL/AIL-framework

  6. https://github.com/isuru-c/LeakHawk

  7. http://gvis.unileon.es/dataset/paste-bin

  8. http://gvis.unileon.es/dataset/duta-darknet-usage-text-addresses/

  9. Machine Learning library for Python (source: http://scikit-learn.org/stable).

Acknowledgements

This research is supported by the INCIBE grant “INCIBEI-2015-27359”, corresponding to the “Ayudas para la Excelencia de los Equipos de Investigación avanzada en ciberseguridad” and also by the framework agreement between the University of León and INCIBE (Spanish National Cybersecurity Institute) under Addendum 22 and 01.

Author information

Corresponding authors

Correspondence to Adrián Riesco, Eduardo Fidalgo, Mhd Wesam Al-Nabki, Francisco Jáñez-Martino or Enrique Alegre.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Riesco, A., Fidalgo, E., Al-Nabki, M.W., Jáñez-Martino, F., Alegre, E. (2019). Classifying Pastebin Content Through the Generation of PasteCC Labeled Dataset. In: Pérez García, H., Sánchez González, L., Castejón Limas, M., Quintián Pardo, H., Corchado Rodríguez, E. (eds) Hybrid Artificial Intelligent Systems. HAIS 2019. Lecture Notes in Computer Science, vol 11734. Springer, Cham. https://doi.org/10.1007/978-3-030-29859-3_39

  • DOI: https://doi.org/10.1007/978-3-030-29859-3_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-29858-6

  • Online ISBN: 978-3-030-29859-3

  • eBook Packages: Computer Science, Computer Science (R0)
