Predicting the demographics of Twitter users with programmatic weak supervision

  • Jonathan Tonglet [1] ; Astrid Jehoul [2] ; Manon Reusens [1] ; Michael Reusens [1] ; Bart Baesens [1]
    1. [1] KU Leuven, Arrondissement Leuven, Belgium
    2. [2] Datashift, Oude Brusselsestraat, 14, 2800, Mechelen, Belgium
  • Location: Top, e-ISSN 1863-8279, ISSN 1134-5764, Vol. 32, Extra No. 3, 2024 (Issue devoted to: Mathematical Optimization and Machine Learning), pp. 354-390
  • Language: English
  • DOI: 10.1007/s11750-024-00666-y
  • Abstract
    • Predicting the demographics of Twitter users has become a problem with a large interest in computational social sciences. However, the limited amount of public datasets with ground truth labels and the tremendous costs of hand-labeling make this task particularly challenging. Recently, programmatic weak supervision has emerged as a new framework to train classifiers on noisy data with minimal human labeling effort. In this paper, demographic prediction is framed for the first time as a programmatic weak supervision problem. A new three-step methodology for gender, age category, and location prediction is provided, which outperforms traditional programmatic weak supervision and is competitive with the state-of-the-art deep learning model. The study is performed in Flanders, a small Dutch-speaking European region, characterized by a limited number of user profiles and tweets. An evaluation conducted on an independent hand-labeled test set shows that the proposed methodology can be generalized to unseen users within the geographic area of interest.

