

Adapting support vector optimisation algorithms to textual gender classification

  • Javier Gomez [1] ; Cesar Alfaro [1] ; Felipe Ortega [1] ; Javier M. Moguerza [1] ; Maria Jesus Algar [1] ; Raul Moreno [2]
    1. [1] Research Centre for Intelligent Information Technologies (CETINIA-DSLAB), Rey Juan Carlos University, Calle Tulipán, Móstoles 28933, Madrid, Spain
    2. [2] Doctorate Programme in Information Technologies and Communications, Rey Juan Carlos University, Calle Tulipán, Móstoles 28933, Madrid, Spain
  • Published in: Top, e-ISSN 1863-8279, ISSN 1134-5764, Vol. 32, Extra No. 3, 2024 (Issue dedicated to: Mathematical Optimization and Machine Learning), pp. 463-488
  • Language: English
  • DOI: 10.1007/s11750-024-00671-1
  • Abstract
    • In this paper, we focus on the problem of determining the gender of the person described in a biographical text. Since support vector machine classifiers are well suited for text classification tasks, we present a new stopping criterion for support vector optimisation algorithms tailored to this problem. This new approach exploits the geometric properties of the vector representation of such content. An experiment on a set of English and Spanish biographical articles retrieved from Wikipedia illustrates this approach and compares it to other machine learning classification algorithms. The proposed method allows real-time classification algorithm training. Moreover, these results confirm the advantage of leveraging additional gender information in strongly inflected languages, like Spanish, for this task.

