Pilar Rey-del-Castillo
Institute of Fiscal Studies, Madrid, Spain.
mpilar.rey.castillo@ief.hacienda.gob.es
Abstract
One of the biggest challenges facing official statistics today is the use of the massive amount of data generated on the web or by sensors and other electronic devices for the production of statistical figures. This paper presents the building of several statistical indicators from different Open Data sources, all constructed using a common methodological approach to estimate changes across time. The purpose of the paper is to show the different problems that must be addressed when using these data sources and to learn about the different ways to cope with them. The first Open Data source is traffic sensor data, where the information on the geographical location of the sensors makes it possible to compute traffic intensity indicators at a detailed geographical level. Apart from serving as proxies or lead indicators for economic activity, the figures can be used to measure the impact of different traffic arrangements in specific areas. For the following source, call records from a multichannel citizen attention service, the data have first been analyzed using Natural Language Processing tools to identify several categories of topics for the requests received. The remaining Open Data sources, Twitter messages and data scraped from a digital newspaper library website, are studied using similar tools in both cases. A rough idea of the evolution of general sentiment in Spain is obtained from the Twitter messages. From the scraped data, the evolution of the average opinions and sentiments in the country’s newspapers is computed in the same way. It is usually accepted that the ideas expressed in newspapers are relevant in shaping public opinion. On the other hand, an interesting result of our research is that individuals react more strongly and more quickly than newspapers to some social, political or economic events.
Keywords: Big data, Official statistics, Open Data, Periodic indicators.
Introduction
The coming of what is known as Big Data and other new data sources has brought a number of new opportunities for the production of official statistics. National statistical offices and other bodies are nowadays involved in different projects that try to use these data in an efficient way, while maintaining the level of relevance and high quality required (Brakel 2022). Many of these new data sources are privately owned and controlled, which makes it particularly difficult to access them in order to explore and evaluate their possible use as data for the production of statistical figures. As they are not designed for statistical exploitation, prior exploration and study are needed to discover their unknown data structure. The aim of this paper is precisely to take advantage of the worldwide movement for the opening of information, which makes it accessible, exploitable and shareable by anyone for any purpose. The idea is to present the building of several statistical indicators from different Open Data sources, showing some of the problems that must be addressed and learning about the different ways to cope with them, according to the type of information, the data available and the aim of the specific indicator. All of them have been built using a common methodological approach, trying to obtain new statistical figures that can serve as proxies for the evolution of social or economic phenomena of interest.
The rest of the paper is organized into four sections. Section 2 describes the common general methodological approach; Section 3 then sets out the results obtained using each one of the proposed Open Data sources; and finally, some comments and conclusions are presented in Section 4.
Methodological approach
The first aim of the indicators to be built is to estimate changes across time. It is known from statistical theory that it is more accurate to estimate changes (over time or space) than absolute figures, because some of the possible biases and errors cancel out when the change is computed from repeated measures taken with the same instrument every time.
As for their periodicity, daily indicators have been constructed in all cases. The data available would also allow a finer time granularity for all of them, but the daily level is sufficient and appropriate to produce quick assessments of the corresponding evolution. The progress over time is computed as the variation of each period in relation to a common period (known as the base period). That is, the evolution between base time (\(0\)) and time \(t\) is calculated as a simple index (Stone and Prais 1952), \(I_t = X_t /X_0\), where \(X_0\), \(X_t\) are the values of the variable in question at base time and time \(t\), respectively. When the variable of interest is calculated as an aggregate (total, mean value), the evolution can also be computed as an aggregate index, using chain-linked indexes, \[I_t = I_{t-1} \frac{\sum_i x_{it}}{\sum_i x_{it-1}}.\qquad(1)\]
Sometimes, to maintain a certain idea of the level of the aggregate, the value of the index at base time can be the value of the aggregate \(I_0 = A_0\), and the rest of the periods are recursively computed using the previous expression, i.e., \[A_t = A_{t-1} \frac{ \sum_i x_{it}}{ \sum_i x_{it-1}}.\qquad(2)\]
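As a sketch, the chaining in Eqs. (1)-(2) can be written in a few lines of Python; the function name and the toy values are illustrative, not part of the actual pipeline:

```python
def chain_index(aggregates, base_value=100.0):
    """Chain-linked index (Eq. 1): I_t = I_{t-1} * (X_t / X_{t-1}).

    aggregates: daily aggregate values sum_i x_it, in time order.
    base_value: value assigned to the base period (100, or A_0 as in Eq. 2).
    """
    index = [base_value]
    for prev, curr in zip(aggregates, aggregates[1:]):
        index.append(index[-1] * curr / prev)
    return index

# Toy example: three days of aggregate counts.
print(chain_index([100.0, 200.0, 50.0]))  # [100.0, 200.0, 50.0]
```

With `base_value=100` the result is an index in the usual sense; passing the base-period aggregate instead reproduces the level-preserving variant of Eq. (2).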
Another issue concerns the volume of the information. In general terms, the data gathered from the different sources cannot be processed using traditional statistical software and require tools specially developed for this purpose. Apache Spark (Zaharia et al. 2016), an open-source analytics engine for Big Data processing, has been used for the collection and first steps of the processing. In addition to Apache Spark, the Python language (Python Core Team 2015) has been used for the other processing and programming steps in this work.
Results using different open data sources
The next subsections describe the results for each one of the proposed Open Data sources.
Traffic sensors data
Users can explore and download the data from the open data portal offered by the local government of Madrid City1. Some of the datasets available on this site contain pre-processed data from traffic sensors located at specific points on the roads and streets of the city. These traffic sensors provide, among other information, data on their geographical location and on the number of vehicles passing by. After the end of each month, these data, measured in 15-minute intervals for each of the more than 4,000 sensors, are released. The purpose of this work is to monitor the traffic in Madrid City by computing indicators of its progress. The intensity in an area, defined as the average number of vehicles moving on all its roads and streets, can be approximated by the average number of vehicles passing by all the sensors located in the area. After a first pre-processing stage to standardize the data, the total number of vehicles by sensor and day is computed. Sometimes the sensors may be blocked due to power supply failures, environmental interference or other problems. In such cases, changes in the averages could be due to changes in the sensors’ location and/or activity and not to changes in the area’s traffic. To ensure validity and completeness, data editing is carried out. Missing and invalid data are imputed by a procedure specifically developed for this purpose and, subsequently, chain-linked indices are calculated to monitor the time evolution in each area (Rey-del-Castillo 2019a). The indicators are computed for the whole city, the M30 ring road and the urban area. Figures 1 (a) and 1 (b) exhibit the three daily series obtained for the period between January 2016 and March 2022.
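The processing chain just described (daily totals per sensor, area averages over the reporting sensors, chain-linked index) can be sketched in plain Python. The record layout and the values below are invented for illustration; the real pipeline additionally edits and imputes missing sensor data before chaining:

```python
from collections import defaultdict

# Invented 15-minute records: (sensor_id, date, vehicle_count).
records = [
    (1, "2022-01-01", 10), (1, "2022-01-01", 20),
    (2, "2022-01-01", 30), (2, "2022-01-01", 40),
    (1, "2022-01-02", 15), (1, "2022-01-02", 25),
    (2, "2022-01-02", 35), (2, "2022-01-02", 45),
]

# Step 1: total vehicles per sensor and day.
daily = defaultdict(int)
for sensor, day, vehicles in records:
    daily[(sensor, day)] += vehicles

# Step 2: area intensity = average over the sensors reporting each day.
totals, counts = defaultdict(int), defaultdict(int)
for (sensor, day), total in daily.items():
    totals[day] += total
    counts[day] += 1
intensity = {day: totals[day] / counts[day] for day in totals}

# Step 3: chain-linked index with the first day as base (= 100).
days = sorted(intensity)
index = {days[0]: 100.0}
for prev, curr in zip(days, days[1:]):
    index[curr] = index[prev] * intensity[curr] / intensity[prev]

print(index)
```

The chaining in Step 3 is only meaningful if the set of active sensors is stable between consecutive days, which is why the editing and imputation step for blocked sensors matters in practice.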
A significant feature in all series is the fall during the restrictions due to the coronavirus pandemic. The effects of the Filomena storm in January 2021 can also be seen.
Using the geographical information available, the effects of some traffic measures implemented in different areas can also be evaluated as an example of the utility of the time series produced. The analysis can also be extended to other variables obtained from traffic sensors and available at the portal.
Call records from a citizen attention service
The local government of Madrid City also offers access in its open data portal2 to the Personalized Attention Records (PARs) of Linea Madrid. There are diverse communication channels providing the primary information: 26 citizen attention offices distributed by borough, the 010 phone number, the website chat, the Facebook account and the Twitter account @lineamadrid. Each record contains data that can be studied to improve customer attention services. In particular, we are interested in studying variables such as the channels dealing with the services and the topics that attract the attention of citizens. Daily time series counting the number of requests or questions by channel and topic are calculated to give an idea of the way in which citizens’ interests are handled by municipal services.
The processing of the PARs is carried out in several steps, the most challenging being the use of Natural Language Processing (Jurafsky and Martin 2008) to derive practical categories or classes for classifying the topics the requests refer to, from the text variables that provide information on them. Figure 2 shows the word cloud for the words used to describe the topic. Once the requests have been classified into 15 topic categories, time series indices have been computed for channels and topics. These indicators have different rhythms, trends and seasonal behaviors and may help to improve the attention to citizens: for example, by comparing the average workload of the channels, studying the seasonal behavior by topic, or executing other natural language analysis procedures to automate the citizen service by creating online chatbots.
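The topic-classification step can be illustrated with a deliberately naïve sketch. The categories and keywords below are invented for the example; the actual work derives its 15 categories with NLP techniques rather than a fixed keyword list:

```python
import re

# Invented topic categories and keyword sets (illustrative only).
TOPIC_KEYWORDS = {
    "waste": {"rubbish", "container", "recycling"},
    "transport": {"bus", "parking", "traffic"},
    "taxes": {"tax", "fee", "payment"},
}

def classify_request(text):
    """Return the first topic whose keywords appear in the text, else 'other'."""
    words = set(re.findall(r"[a-záéíóúñü]+", text.lower()))
    for topic, keywords in TOPIC_KEYWORDS.items():
        if words & keywords:
            return topic
    return "other"

print(classify_request("Where can I pay the parking fee?"))  # transport
```

Once every request has a category, counting requests by day, channel and topic yields the series from which the indices are chained.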
Sentiment Index Based on Spanish Tweets
Microblogs such as Twitter are nowadays often studied using a diversity of methods from the Natural Language Processing (NLP) and text mining fields (Wagner et al. 2014). In particular, keyword spotting is a popular although naïve method within sentiment analysis. This procedure is used here with the aim of acquiring a rough idea of the average general mood or sentiment in Spain, not of computing accurate information about the sentiment towards specific targets. Although keyword spotting may be prone to errors because of its difficulties in handling nuances of emotion such as sarcasm and irony, it is appropriate for short texts, in which users condense their opinions and feelings into a few sentences and words. The errors may cancel out when aggregating the individual scores, and the method can turn out to be more accurate than other sentiment analysis techniques designed to optimize per-document classification.
An emotional lexicon for Spanish words developed for psychological purposes (Stadthagen-Gonzalez et al. 2016) has been tailored in this work to assign sentiment polarity to tweets. As the Twitter streaming API (Twitter 2020) provides free access to random samples of Twitter data, geographically referenced tweets are gathered. Then, an average sentiment index that summarizes the separate scores in a single number is computed for each day (Rey-del-Castillo 2019b). Figure 3 shows the computed daily indicators including a trend line for the period between January 2016 and April 2022.
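A minimal sketch of the keyword-spotting scoring follows; the mini-lexicon is invented for illustration, whereas the actual work tailors the Stadthagen-Gonzalez et al. (2016) Spanish valence norms:

```python
import re

# Invented mini-lexicon of word valences (illustrative only).
LEXICON = {"feliz": 8.5, "alegre": 7.9, "triste": 2.3, "horrible": 1.5}

def tweet_score(text):
    """Mean valence of lexicon words found in the tweet; None if no match."""
    words = re.findall(r"[a-záéíóúñü]+", text.lower())
    hits = [LEXICON[w] for w in words if w in LEXICON]
    return sum(hits) / len(hits) if hits else None

def daily_index(tweets):
    """Average the per-tweet scores over one day's matched tweets."""
    scores = [s for s in map(tweet_score, tweets) if s is not None]
    return sum(scores) / len(scores) if scores else None

print(daily_index(["Estoy muy feliz hoy", "Qué día tan triste y horrible"]))
```

Averaging first within tweets and then across tweets is one design choice; it keeps a long, emotional tweet from dominating the daily figure.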
Table 1: Days with sentiment values four or more standard deviations from the series mean.

Date | Sentiment | Std from mean | Event |
---|---|---|---|
2016-01-01 | 55.5 | 4.2 | January the first |
2016-03-22 | 52.8 | -4.0 | Brussels terrorist attacks |
2016-12-24 | 56.3 | 6.5 | Christmas’s Eve |
2016-12-25 | 56.1 | 5.9 | Christmas Day |
2017-01-01 | 55.5 | 4.0 | January the first |
2017-10-01 | 52.5 | -4.9 | Catalonia independence referendum |
2017-12-24 | 56.3 | 6.6 | Christmas’s Eve |
2017-12-25 | 56.1 | 5.8 | Christmas Day |
2018-01-01 | 55.6 | 4.5 | January the first |
2018-03-11 | 52.7 | -4.2 | Barcelona pro-independence rally |
2018-04-26 | 52.8 | -4.0 | Madrid leader Cifuentes resign |
2018-12-25 | 55.7 | 4.8 | Christmas Day |
2019-12-24 | 55.5 | 4.0 | Christmas’s Eve |
2019-12-25 | 55.7 | 4.8 | Christmas Day |
2022-01-01 | 55.5 | 4.0 | January the first |
2022-02-24 | 52.3 | -5.5 | Russia invades Ukraine |
2022-02-25 | 52.8 | -4.0 | One day after Russia invades Ukraine |
The series has a lot of noise, and is particularly noisy at the beginning, when there were fewer tweets per day. The concept whose evolution the time series tries to measure, the average collective mood, lacks a precise definition and it is thus difficult to assess its quality and accuracy. One thing that can be evaluated is whether the indicators obtained make sense, that is, whether meaning can be extracted from the peaks and troughs in the indices. To obtain a first idea of the days having extreme values in the series, Table 1 shows the points lying four or more standard deviations away from the mean value. In this regard, celebrations can improve the general mood on a given day, and this is what is observed for celebration days or holidays such as the first of January and Christmas Day.
Likewise, the table shows that tragic or challenging events such as the Brussels terrorist attacks, the Catalonia independence referendum or the Russian invasion of Ukraine generate a sharp fall, that is, a significant worsening of the average mood throughout the country. The coronavirus pandemic (from March 2020) also produces a clear decline in the general sentiment. To analyze the behavior of the series, we follow the Box-Jenkins methodology (Box and Jenkins 1970). The Sentiment index may be used to assess the effects of different social, political or economic events using intervention analysis (Tiao 1985), by introducing dummy variables into the models.
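The screening behind Table 1 can be sketched as follows; the toy series and the threshold are illustrative (the paper uses four standard deviations, which requires a longer series than the toy example):

```python
from statistics import mean, pstdev

def extreme_days(series, threshold=4.0):
    """Return (day, value, z-score) for the days lying at least `threshold`
    standard deviations away from the series mean."""
    values = list(series.values())
    mu, sigma = mean(values), pstdev(values)
    return [(day, value, (value - mu) / sigma)
            for day, value in series.items()
            if abs(value - mu) >= threshold * sigma]

# Toy series: nine ordinary days and one clearly happier day.
toy = {f"day{i}": 54.0 for i in range(1, 10)}
toy["day10"] = 56.0
print(extreme_days(toy, threshold=2.5))
```

Days flagged this way can then be matched against known events, as done in Table 1, or encoded as the dummy variables of an intervention analysis.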
Newspapers sentiment indicator from web scraping
Using an approach similar to that of the previous case, other daily indicators can be built, now collecting the raw data by scraping certain websites. With this aim in mind, the analysis of newspapers’ sentiments is deemed relevant, as newspapers can be considered instruments that shape public opinion. The Digital Periodical and Newspaper Library website3 allows searching for articles that include given search terms in most of the journals and newspapers written in Spanish, and the data scraped consist of the text fragments near these words. The scrutiny has been limited to opinions and sentiments in general, using the same search terms applied to narrow the tweets download, in order to build a similar kind of newspaper sentiment indicator. It has also been restricted to the five most read national generalist newspapers (ABC, El Mundo, El País, La Razón and La Vanguardia) to give an idea of the bigger picture. The indices are computed for the whole newspaper scope and also for each of the newspapers, exactly as in the previous section. The Global series is built using the pooled data from the five newspapers, not as an aggregate of the five newspaper series.
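The extraction of text fragments from a results page can be sketched with Python's standard-library HTML parser. The markup below is an invented stand-in, since the library website's real structure is not described here:

```python
from html.parser import HTMLParser

class SnippetParser(HTMLParser):
    """Collect the text of <p class="snippet"> elements from a results page."""
    def __init__(self):
        super().__init__()
        self.snippets = []
        self._active = False

    def handle_starttag(self, tag, attrs):
        if tag == "p" and ("class", "snippet") in attrs:
            self._active = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._active = False

    def handle_data(self, data):
        if self._active:
            self.snippets.append(data.strip())

# Invented stand-in for a search-results page.
page = '<div><p class="snippet">texto cercano al término</p><p>otro</p></div>'
parser = SnippetParser()
parser.feed(page)
print(parser.snippets)  # ['texto cercano al término']
```

The collected fragments are then scored with the same lexicon-based procedure as the tweets, and the daily scores are chained into indices per newspaper.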
Figure 4 shows the resulting time series, including the corresponding trend lines, with the El Mundo and La Vanguardia indicators having fewer data at the beginning. It can be appreciated that the Global series has less noise, and it would probably be smoother if a larger number of newspapers had been included. One feature common to all series is the decline and recovery due to the coronavirus pandemic, with different degrees of incidence. Their general performance may also be evaluated from their summary statistics in Table 2. The ranges of the different series, extending from 4.5 for El Mundo to 5.9 for La Razón, suggest different levels of moderation or extremism in the manner of expressing their opinions. Likewise, the most negative mean value is obtained for El País (53.6) and the most positive for La Vanguardia (54.3), although all of them are very similar.
Table 2: Summary statistics of the newspaper sentiment indicators.

Indicator | Mean | Std | Range | Min | Median | Max |
---|---|---|---|---|---|---|
ABC | 53.8 | 0.68 | 4.9 | 51.1 | 53.8 | 55.9 |
El Mundo | 54.0 | 0.75 | 4.5 | 51.6 | 54.1 | 56.1 |
El País | 53.6 | 0.67 | 5.3 | 50.9 | 53.6 | 56.2 |
La Razón | 53.9 | 0.75 | 5.9 | 50.7 | 53.9 | 56.6 |
Vanguardia | 54.3 | 0.66 | 5.7 | 50.8 | 54.3 | 56.5 |
GLOBAL | 54.0 | 0.40 | 3.0 | 52.6 | 54.0 | 55.6 |
It is also interesting to compare their general behaviour with that of the Sentiment index based on Spanish tweets. When a similar assessment is conducted, Table 3 shows the only extreme values lying four or more standard deviations from the corresponding mean. Curiously, the series are more homogeneous, without such a large number of extreme values. In general, these do not seem equally linked to special events, as if there were a kind of restraint in the way the newspapers express themselves. That is, people on Twitter seem to react more strongly and more swiftly than newspapers.
Table 3: Extreme values of the newspaper indicators, four or more standard deviations from the corresponding mean.

Indicator | Date | Std from mean | Event |
---|---|---|---|
ABC | 2021-01-09 | -4.0 | |
País | 2020-09-09 | -4.0 | |
Razón | 2018-11-05 | -4.3 | |
Vanguardia | 2021-03-16 | -4.6 | One year after COVID-19 alarm state |
Vanguardia | 2022-02-11 | -4.3 | |
Vanguardia | 2022-02-25 | -4.0 | One day after Russia invades Ukraine |
Conclusions
The previous section has shown the building of indicators from Big and Open Data sources. The indicators cover very different origins, such as traffic sensors, social networking platforms, or web scraping. This variety of sources requires more tools than those usually employed to calculate traditional statistical indicators. Apart from Spark, a specialized platform for processing Big Data (Zaharia et al. 2016), geospatial processing and NLP (Jurafsky and Martin 2008) techniques have been employed. The new tools in turn provide opportunities for more complex analyses.
A daily periodicity has been considered appropriate to produce quick assessments of the corresponding evolution for them all. This level of time detail is not common among current social or economic indicators, thus providing useful insights for faster and more timely analysis.
A feature to remark is that, due to the heterogeneity of data sources, the methods to produce the statistical figures should be developed ad hoc for each case. Furthermore, none of the indicators produced is adapted to traditional official statistical structures (observation units, definitions, classifications…) but can be a complement to the information available from other perspectives.
One thing observed when comparing the behavior of the two constructed indicators which measure general opinions and sentiments is that people seem to react more strongly and more quickly than newspapers to special events.
Other important aspects of the indicators produced as time series have not been presented due to space limitations. Being daily indicators, they might in theory have up to four periodic components: a weekly cycle, a monthly cycle, a quarterly cycle and an annual cycle. The analysis of their trend and seasonality as high-frequency series can also be carried out using suitable tools.
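As an illustration of the weekly component, a daily series can be profiled without any specialized time-series tool by averaging per weekday; the toy series below is invented:

```python
from datetime import date, timedelta

def weekly_profile(start, values):
    """Mean value per weekday (0 = Monday) minus the overall mean,
    for a daily series starting at `start`."""
    overall = sum(values) / len(values)
    by_dow = {dow: [] for dow in range(7)}
    for offset, value in enumerate(values):
        by_dow[(start + timedelta(days=offset)).weekday()].append(value)
    return {dow: sum(v) / len(v) - overall for dow, v in by_dow.items() if v}

# Toy two-week series where weekends run two points above weekdays.
start = date(2022, 1, 3)  # a Monday
values = [50, 50, 50, 50, 50, 52, 52] * 2
print(weekly_profile(start, values))
```

The same averaging idea extends to monthly, quarterly and annual cycles, although for serious analysis the high-frequency decomposition methods mentioned above are preferable.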
Lastly, the inclusion of intervention analysis (Tiao 1985) may help to weigh up the effects of specific economic and social events on each one of the indicators.
References