Learning to build statistical indicators from open data sources


Pilar Rey-del-Castillo
Institute of Fiscal Studies, Madrid, Spain.
mpilar.rey.castillo@ief.hacienda.gob.es

Abstract

One of the biggest challenges facing official statistics today is the use of the massive amount of data generated on the web or by sensors and other electronic devices for the production of statistical figures. This paper presents the construction of several statistical indicators from different Open Data sources, all built using a common methodological approach to estimate changes across time. The purpose of the paper is to show the different problems that must be addressed when using these data sources and to learn about the ways to cope with them. The first Open Data source is traffic sensor data, where information about the geographical location of the sensors makes it possible to compute traffic intensity indicators at a detailed geographical level. Apart from serving as proxies or lead indicators for economic activity, the figures can be used to measure the impact of different traffic arrangements in specific areas. For the second source, call records from a multichannel citizen attention service, the data have first been analyzed using Natural Language Processing tools to identify several categories of topics for the requests received. The remaining Open Data sources, Twitter messages and data scraped from a digital newspaper library website, are studied using similar tools. From the Twitter messages, a rough picture of the evolution of the general sentiment in Spain is obtained; from the scraped data, the evolution of the average opinions and sentiments in the country's newspapers is computed in a similar way. It is usually accepted that the ideas expressed in newspapers are relevant in shaping public opinion. An interesting result of this research, however, is that individuals react more strongly and more quickly than newspapers to certain social, political or economic events.

Keywords: Big data, Official statistics, Open Data, Periodic indicators.

Introduction

The advent of what is known as Big Data and other new data sources has brought a number of new opportunities for the production of official statistics. National statistical offices and other bodies are currently involved in different projects that try to use these data efficiently while maintaining the required levels of relevance and quality (Brakel 2022). Many of these new data sources are privately owned and controlled, which makes it particularly difficult to access them in order to explore and evaluate their possible use as data for the production of statistical figures. As they are not designed for statistical exploitation, prior exploration and study are needed to discover their unknown data structure. The aim of this paper is precisely to take advantage of the worldwide movement for opening up information so that it can be accessed, exploited and shared by anyone for any purpose. The idea is to present the construction of several statistical indicators from different Open Data sources, showing some of the problems that must be addressed and the different ways to cope with them, according to the type of information, the data available and the aim of the specific indicator. All the indicators have been built using a common methodological approach, seeking new statistical figures that can serve as proxies for the evolution of social or economic phenomena of interest.

The rest of the paper is organized as follows. Section 2 describes the common general methodological approach; Section 3 then sets out the results obtained using each of the proposed Open Data sources; and finally, some comments and conclusions are presented in Section 4.

Methodological approach

The first aim of the indicators to be built is to estimate changes across time. Statistical theory tells us that it is more accurate to estimate changes (over time or space) than absolute figures, because some of the possible biases and errors cancel out when computing the change from repeated measures taken with the same instrument each time.

As for their periodicity, daily indicators have been constructed in all cases. The data available would also allow a finer time granularity, but the daily level is sufficient and appropriate to produce quick assessments of the corresponding evolution. The progress over time is computed as the variation of each period relative to a common period (known as the base period). That is, the evolution between base time (\(0\)) and time \(t\) is calculated as a simple index (Stone and Prais 1952), \(I_t = X_t /X_0\), where \(X_0\) and \(X_t\) are the values of the variable in question at base time and time \(t\), respectively. When the variable of interest is an aggregate (a total or a mean value), the evolution can also be computed as an aggregate index, using chain-linked indices, \[I_t = I_{t-1} \frac{\sum_i x_{it}}{\sum_i x_{i,t-1}}.\qquad(1)\]

Sometimes, to retain some idea of the level of the aggregate, the value of the index at base time can be set to the value of the aggregate, \(I_0 = A_0\), and the remaining periods are computed recursively using the previous expression, i.e., \[A_t = A_{t-1} \frac{ \sum_i x_{it}}{ \sum_i x_{i,t-1}}.\qquad(2)\]
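As an illustration, the following minimal Python sketch computes a chain-linked index as in equations (1) and (2); the DataFrame column names 'date', 'unit' and 'value' are assumed names for the period, the individual series and its observed value, not the actual variables used in this work.

```python
import pandas as pd

def chain_linked_index(df, base_value=100.0):
    """Chain-linked index per equations (1)-(2):
    I_t = I_{t-1} * (sum_i x_{it} / sum_i x_{i,t-1})."""
    # Wide table: one row per day, one column per unit (e.g., per sensor).
    daily = df.pivot_table(index="date", columns="unit", values="value").sort_index()
    dates = daily.index
    values = [base_value]
    for prev, curr in zip(dates[:-1], dates[1:]):
        # Only units observed on both days enter each link, so changes in
        # the set of reporting units do not distort the index.
        pair = daily.loc[[prev, curr]].dropna(axis=1)
        values.append(values[-1] * pair.loc[curr].sum() / pair.loc[prev].sum())
    return pd.Series(values, index=dates, name="chain_linked_index")
```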

Another issue concerns the volume of the information. In general terms, the data gathered from the different sources cannot be processed using traditional statistical software and require tools especially developed for this purpose. Apache Spark (Zaharia et al. 2016), an open-source analytics engine for Big Data processing, has been used for the collection and first steps of the processing. In addition to Apache Spark, Python (Python Core Team 2015) has been used for the other processing and programming steps in this work.
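As a hedged illustration of this first processing step, the sketch below collapses raw 15-minute records into daily totals with PySpark; the file path and the column names ('id', 'fecha', 'intensidad') are assumptions about the source files, not a documented schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("open-data-indicators").getOrCreate()

# Read a batch of raw 15-minute records; path and schema are illustrative.
raw = spark.read.csv("data/raw/*.csv", header=True, sep=";", inferSchema=True)

# Collapse the sub-daily records into one total per unit and day, the level
# at which the daily indices are then chain-linked.
daily = (raw
         .withColumn("day", F.to_date("fecha"))
         .groupBy("id", "day")
         .agg(F.sum("intensidad").alias("total")))

daily.write.mode("overwrite").parquet("data/daily_totals")
```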

Results using different open data sources

The next subsections describe the results for each one of the proposed Open Data sources.

Traffic sensors data

Users can explore and download the data from the open data portal offered by the local government of Madrid City [1]. Some of the datasets available on this site contain pre-processed data from traffic sensors located at specific points on the roads and streets of the city. Apart from other information, these traffic sensors provide data on their geographical location and on the number of vehicles passing by. After the end of each month, the data measured in 15-minute intervals, for each of the more than 4,000 sensors, are released. The purpose of this work is to monitor traffic in Madrid City by computing indicators of its evolution. The intensity in an area, defined as the average number of vehicles moving on all its roads and streets, can be approximated by the average number of vehicles passing all the sensors located in the area. After a first pre-processing stage to standardize the data, the total number of vehicles by sensor and day is computed. Sometimes the sensors may be blocked due to power supply failures, environmental interference or other problems; in such cases, changes in the averages could be due to changes in the sensors' location and/or activity rather than to changes in the area's traffic. To ensure validity and completeness, data editing is carried out: missing and invalid data are imputed by a procedure specifically developed for this purpose and, subsequently, chain-linked indices are calculated to monitor the evolution over time in each area (Rey-del-Castillo 2019a). The indicators are computed for the whole city, the M30 ring road and the Urban area. Figures 1 (a) and 1 (b) exhibit the three daily series obtained for the period between January 2016 and March 2022.

Figure 1: (a) Global and (b) M30 and Urban intensity indicators.

A significant feature in all series is the fall during the restrictions due to the coronavirus pandemic. The effects of storm Filomena in January 2021 can also be seen.

As an example of the utility of the time series produced, the geographical information available makes it possible to evaluate the effects of some traffic measures implemented in different areas. The analysis can also be extended to other variables obtained from the traffic sensors and available at the portal.
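The area-level indicators rely on this geographical information. As a minimal sketch of how sensors can be assigned to areas, the point-in-polygon test below uses shapely with a purely illustrative rectangle, not the actual boundaries of the M30 or Urban areas.

```python
from shapely.geometry import Point, Polygon

# Illustrative rectangle around central Madrid; the real area boundaries
# would come from the city's geographical definitions.
urban_area = Polygon([(-3.74, 40.38), (-3.65, 40.38),
                      (-3.65, 40.46), (-3.74, 40.46)])

def assign_area(lon, lat):
    """Label a sensor by the area containing its coordinates."""
    return "Urban" if urban_area.contains(Point(lon, lat)) else "Other"

print(assign_area(-3.70, 40.42))  # -> "Urban"
```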


Call records from a citizen attention service

The local government of Madrid City also offers access in its open data portal [2] to the Personalized Attention Records (PARs) of Linea Madrid. Several communication channels provide the primary information: 26 citizen attention offices distributed across the boroughs, the 010 phone number, the website chat, the Facebook account and the Twitter account @lineamadrid. Each record contains data that can be studied to improve the customer attention services. In particular, we are interested in studying variables such as the channels dealing with the services and the topics that attract the attention of citizens. Daily time series counting the number of requests or questions by channel and topic are calculated to give an idea of the way in which citizens' interests are handled by the municipal services.

Figure 2: Word-cloud for words used to describe topics.

The processing of the PARs is carried out in several steps, the most challenging being the use of Natural Language Processing (Jurafsky and Martin 2008) to derive hands-on categories with which to classify the topics the requests refer to, from the text variables that provide information on them. Figure 2 shows the word-cloud of the words used to describe the topics. Once the requests have been classified into 15 topic categories, time series indices have been computed for channels and topics. These indicators have different rhythms, trends and seasonal behaviors and may help to improve the attention to citizens: for example, by comparing the average workload of the channels, studying the seasonal behavior by topic, or executing other natural language analysis procedures to automate the citizen service by creating online chatbots.
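The specific NLP procedure used to derive the 15 categories is not detailed here; one possible unsupervised approach, sketched below under that assumption, clusters TF-IDF vectors of the request descriptions with k-means.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def classify_topics(descriptions, n_topics=15):
    """Cluster raw request texts into topic categories.

    `descriptions` is a list of request description strings; n_topics=15
    mirrors the number of categories mentioned in the text.
    """
    # TF-IDF representation; a Spanish stop-word list could be supplied here.
    X = TfidfVectorizer(max_features=5000).fit_transform(descriptions)
    # Unsupervised grouping; the resulting clusters still need manual labels.
    return KMeans(n_clusters=n_topics, random_state=0, n_init=10).fit_predict(X)
```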

Sentiment Index Based on Spanish Tweets

Microblogs such as Twitter are nowadays often studied using a diversity of methods from the Natural Language Processing (NLP) and text mining fields (Wagner et al. 2014). In particular, keyword spotting is a popular, although naïve, sentiment analysis method. It is used here with the aim of acquiring a rough picture of the average general mood or sentiment in Spain, not of computing accurate information about the sentiment towards specific targets. Although keyword spotting may be prone to errors because of its difficulty in handling nuances of emotion such as sarcasm or irony, it is appropriate for short texts, where users express their opinions and feelings in a few sentences and words. The errors may cancel out when the individual scores are aggregated, and the method can prove more accurate than other sentiment analysis techniques designed to optimize per-document classification.

Figure 3: General sentiment indicator.

An emotional lexicon of Spanish words developed for psychological purposes (Stadthagen-Gonzalez et al. 2016) has been adapted in this work to assign sentiment polarity to tweets. As the Twitter streaming API (Twitter 2020) provides free access to random samples of Twitter data, geographically referenced tweets are gathered. Then, an average sentiment index that summarizes the separate scores in a single number is computed for each day (Rey-del-Castillo 2019b). Figure 3 shows the computed daily indicators, including a trend line, for the period between January 2016 and April 2022.
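A minimal keyword-spotting sketch follows; it assumes the Spanish valence lexicon is available as a dict mapping words to scores (in the spirit of the norms of Stadthagen-Gonzalez et al. 2016) and tweets as (date, text) pairs, with the toy lexicon at the end standing in for the real ~14,000-word resource.

```python
import re
from collections import defaultdict

def tweet_score(text, valence):
    """Average valence of the lexicon words found in one tweet."""
    words = re.findall(r"\w+", text.lower())
    scores = [valence[w] for w in words if w in valence]
    return sum(scores) / len(scores) if scores else None

def daily_sentiment(tweets, valence):
    """Average the individual tweet scores within each day."""
    per_day = defaultdict(list)
    for date, text in tweets:
        s = tweet_score(text, valence)
        if s is not None:
            per_day[date].append(s)
    return {day: sum(v) / len(v) for day, v in per_day.items()}

# Toy example; the real lexicon covers roughly 14,000 Spanish words.
valence = {"feliz": 8.5, "triste": 2.3}
print(daily_sentiment([("2022-01-01", "Feliz año nuevo")], valence))
```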

Table 1: Sentiment values four or more standard deviations from the mean.

Date        Sentiment  Std from mean  Event
2016-01-01  55.5        4.2           New Year's Day
2016-03-22  52.8       -4.0           Brussels terrorist attacks
2016-12-24  56.3        6.5           Christmas Eve
2016-12-25  56.1        5.9           Christmas Day
2017-01-01  55.5        4.0           New Year's Day
2017-10-01  52.5       -4.9           Catalonia independence referendum
2017-12-24  56.3        6.6           Christmas Eve
2017-12-25  56.1        5.8           Christmas Day
2018-01-01  55.6        4.5           New Year's Day
2018-03-11  52.7       -4.2           Barcelona pro-independence rally
2018-04-26  52.8       -4.0           Resignation of Madrid leader Cifuentes
2018-12-25  55.7        4.8           Christmas Day
2019-12-24  55.5        4.0           Christmas Eve
2019-12-25  55.7        4.8           Christmas Day
2022-01-01  55.5        4.0           New Year's Day
2022-02-24  52.3       -5.5           Russia invades Ukraine
2022-02-25  52.8       -4.0           One day after Russia invades Ukraine

The series is quite noisy, particularly at the beginning, when there were fewer tweets per day. The concept whose evolution the time series tries to measure, the average collective mood, lacks a precise definition and it is therefore difficult to assess its quality and accuracy. One thing that can be evaluated is whether the indicators obtained make sense, that is, whether meaning can be extracted from the peaks and troughs in the indices. To obtain a first idea of the days with extreme values in the series, Table 1 shows the points lying four or more standard deviations away from the mean. In this regard, celebrations can improve the general mood on a given day, and this is what is observed for celebration days or holidays such as New Year's Day and Christmas Day.

Likewise, the table shows that tragic or challenging events such as the Brussels terrorist attacks, the Catalonia independence referendum or the Russian invasion of Ukraine generate a sharp fall, that is, a significant worsening of the average mood throughout the country. The coronavirus pandemic (from March 2020) also produces a clear decline in the general sentiment. To analyze the behavior of the series, we follow the Box-Jenkins methodology (Box and Jenkins 1970). The sentiment index may be used to assess the effects of different social, political or economic events using intervention analysis (Tiao 1985), by introducing dummy variables in the models.
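A hedged sketch of such an intervention analysis with a step dummy is shown below; it assumes the daily sentiment index is a pandas Series with a DatetimeIndex, and the event date and ARIMA order are purely illustrative, not those identified in this work.

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def intervention_effect(y: pd.Series, event_date: str, order=(1, 0, 1)):
    """Estimate the level shift in `y` attributable to an event."""
    # Step dummy: 0 before the event, 1 from the event date onwards.
    step = (y.index >= event_date).astype(float)
    result = ARIMA(y, exog=step, order=order).fit()
    # The exogenous coefficient estimates the shift in the series' level.
    return result.params["x1"], result

# e.g., intervention_effect(y, "2020-03-14")  # start of the COVID-19 alarm state
```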

Newspapers sentiment indicator from web scraping

Using an approach similar to that of the previous case, other daily indicators can be built, now collecting the raw data by scraping certain websites. With this aim in mind, the analysis of newspapers' sentiments is deemed relevant, as newspapers can be considered instruments that shape public opinion. The Digital Periodical and Newspaper Library website [3] allows searching for articles that include given search words in most of the journals and newspapers written in Spanish, and the data scraped consist of the text fragments near those words. The scrutiny has been limited to opinions and sentiments in general, using the same search terms applied to narrow the tweets download, in order to build a sort of newspaper sentiment indicator. It has also been restricted to the five most read national generalist newspapers (ABC, El Mundo, El País, La Razón and La Vanguardia) to give an idea of the bigger picture. The indices are computed for the whole newspaper scope and also for each of the newspapers, exactly as in the previous section. The Global series is built using the data from the five newspapers together, not as an aggregate of the five newspaper series.
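An illustrative scraping sketch with requests and BeautifulSoup follows; the URL, query parameters and CSS selector are hypothetical placeholders, since the library's actual query interface is not documented here.

```python
import requests
from bs4 import BeautifulSoup

def fetch_snippets(search_term, page=1):
    """Return the text fragments surrounding a search word, one per match."""
    url = "https://example.org/search"  # placeholder, not the library's real URL
    resp = requests.get(url, params={"q": search_term, "page": page}, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # The CSS selector is hypothetical; it must be adapted to the real markup.
    return [node.get_text(strip=True) for node in soup.select("div.snippet")]
```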

Figure 4: Newspapers sentiment indicators.

Figure 4 shows the resulting time series, including the corresponding trend lines; the El Mundo and La Vanguardia indicators have less data at the beginning. It can be appreciated that the Global series has less noise; it would probably be even smoother if it included a larger number of newspapers. One feature common to all series is the decline and recovery due to the coronavirus pandemic, with different degrees of incidence. Their general performance may also be evaluated from their summary properties in Table 2. The ranges of the different series, extending from 4.5 for El Mundo to 5.9 for La Razón, suggest different levels of moderation or extremism in the way they express their opinions. Likewise, the most negative mean value is obtained for El País (53.6) and the most positive for La Vanguardia (54.3), all of them being very similar.

Table 2: Indicators' summary properties.

Indicator   Mean  Std   Range  Min   Median  Max
ABC         53.8  0.68  4.9    51.1  53.8    55.9
El Mundo    54.0  0.75  4.5    51.6  54.1    56.1
El País     53.6  0.67  5.3    50.9  53.6    56.2
La Razón    53.9  0.75  5.9    50.7  53.9    56.6
Vanguardia  54.3  0.66  5.7    50.8  54.3    56.5
GLOBAL      54.0  0.40  3.0    52.6  54.0    55.6

It is also interesting to compare their general behavior with that of the sentiment index based on Spanish tweets. When a similar assessment is conducted, Table 3 shows the only extreme values lying four or more standard deviations away from the corresponding mean. Curiously, these series are more homogeneous, without such a large number of extreme values, and these do not in general seem equally linked to special events, as if there were a kind of restraint in the way the newspapers express themselves. That is, people on Twitter seem to react more strongly and more swiftly than newspapers.

Table 3: Newspaper indicator values four or more standard deviations from the mean.

Indicator   Date        Std from mean  Event
ABC         2021-01-09  -4.0
País        2020-09-09  -4.0
Razón       2018-11-05  -4.3
Vanguardia  2021-03-16  -4.6           One year after COVID-19 alarm state
Vanguardia  2022-02-11  -4.3
Vanguardia  2022-02-25  -4.0           One day after Russia invades Ukraine

Conclusions

The previous section has shown the construction of indicators from Big and Open Data sources. The indicators cover very different origins, such as traffic sensors, social networking platforms or web scraping. This variety of sources requires more tools than those usually employed to calculate traditional statistical indicators: apart from Spark, a specialized platform for processing Big Data (Zaharia et al. 2016), geospatial processing and NLP (Jurafsky and Martin 2008) are also employed. The new tools provide in turn more opportunities for more complex analyses.

A daily periodicity has been considered appropriate to produce quick assessments of the corresponding evolution in all cases. This level of time detail is not common among current social or economic indicators, and thus provides useful insights for faster and more up-to-date analysis.

A noteworthy feature is that, due to the heterogeneity of the data sources, the methods to produce the statistical figures have to be developed ad hoc for each case. Furthermore, none of the indicators produced is adapted to traditional official statistical structures (observation units, definitions, classifications…), but they can complement the information available from other perspectives.

One thing observed when comparing the behavior of the two constructed indicators that measure general opinions and sentiments is that people seem to react more strongly and more quickly than newspapers to special events.

Other important aspects of the indicators produced as time series have not been presented due to space limitations. Being daily indicators, they might in theory have up to four periodic components: a weekly cycle, a monthly cycle, a quarterly cycle and an annual cycle. The analysis of their trend and seasonality as high-frequency series can also be carried out using suitable tools.

Lastly, the inclusion of intervention analysis (Tiao 1985) may help to weigh up the effects of specific economic and social events on each of the indicators.

References

Box, G. E. P., and G. Jenkins. 1970. Time Series Analysis: Forecasting and Control. San Francisco, CA: Holden-Day.
Brakel, J. van den. 2022. “New Data Sources and Inference Methods for Official Statistics.” In Statistics in the Public Interest. Springer Series in the Data Sciences. Springer. https://doi.org/10.1007/978-3-030-75460-0_22.
Jurafsky, D., and J. H. Martin. 2008. Speech and Language Processing (2nd Ed.). N. J.: Pearson Prentice Hall.
Python Core Team. 2015. Python: A Dynamic, Open Source Programming Language. Python Software Foundation. https://www.python.org/.
Rey-del-Castillo, P. 2019a. “A Preliminary Assessment of the Traffic Measures in Madrid City.” In CEUR Workshop Proceedings, Second International Workshop on Data Engineering and Analytics (WDEA 2019), 2486:52–64.
———. 2019b. “A Sentiment Index Based on Spanish Tweets.” Boletín de Estadística e Investigación Operativa BEIO 35 (2): 130–47.
Stadthagen-Gonzalez, H., C. Imbault, M. A. Pérez, and M. Brysbaert. 2016. “Norms of Valence and Arousal for 14,031 Spanish Words.” Behavior Research Methods 49: 111–23.
Stone, R., and S. J. Prais. 1952. “Systems of Aggregative Index Numbers and Their Compatibility.” The Economic Journal 62 (247): 565–83.
Tiao, G. C. 1985. “Autoregressive Moving Average Models, Intervention Problems and Outlier Detection in Time Series.” In Handbook of Statistics, Vol. 5: Time Series in the Time Domain, 85–118. Amsterdam: North-Holland.
Twitter. 2020. “Twitter API Access Levels and Versions.” Twitter Developer Platform. 2020. https://developer.twitter.com/en/docs/twitter-api/getting-started/about-twitter-api#v2-access-level.
Wagner, J., P. Arora, S. Cortes, U. Barman, D. Bogdanova, J. Foster, and L. Tounsi. 2014. “Aspect-Based Polarity Classification.” In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), 223–29.
Zaharia, M., R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, et al. 2016. “Apache Spark: A Unified Engine for Big Data Processing.” Communications of the ACM 59 (11): 56–65.


  1. https://datos.madrid.es/portal/site/egob↩︎

  2. https://datos.madrid.es/portal/site/egob↩︎

  3. https://www.bne.es/en/catalogs/digital-periodical-and-newspaper-library↩︎

