

Abstract of Towards a human-centric data economy

Santiago Andrés Azcoitia

  • Spurred by the widespread adoption of artificial intelligence and machine learning, "data" is becoming a key production factor, comparable in importance to capital, land, or labour in an increasingly digital economy. In spite of an ever-growing demand for third-party data in the B2B market, firms are generally reluctant to share their information. This is due to the unique characteristics of "data" as an economic good (a freely replicable, non-depletable asset holding a highly combinatorial and context-specific value), which leads digital companies to hoard and protect their "valuable" data assets, and to integrate across the whole value chain seeking to monopolise the provision of innovative services built upon them. As a result, most of these valuable assets still remain unexploited in corporate silos today.

    This situation is shaping the so-called data economy around a small number of champions, and it is hampering the benefits of a global data exchange on a large scale. Some analysts have estimated the potential value of the data economy at US$2.5 trillion globally by 2025. Not surprisingly, unlocking the value of data has become a central policy of the European Union, which estimated the size of the data economy at €827 billion for the EU27 over the same period. Within the scope of the European Data Strategy, the European Commission is also steering initiatives aimed at identifying relevant cross-industry use cases involving different verticals, and at enabling sovereign data exchanges to realise them.

    Among individuals, the massive collection and exploitation of personal data by digital firms in exchange for services, often with little or no consent, has raised a general concern about privacy and data protection. Apart from spurring recent legislative developments in this direction, this concern has prompted warnings about the unsustainability of the existing digital economy (a few digital champions, a potential negative impact on employment, growing inequality); some of these voices propose that people be paid for their data in a sort of worldwide data labour market as a potential solution to this dilemma.

    From a technical perspective, we are far from having the technology and algorithms required to enable such a human-centric data economy. Even its scope is still blurry, and the question of the value of data is, at the very least, controversial. Research works from different disciplines have studied the data value chain, different approaches to the value of data, how to price data assets, and novel data marketplace designs. At the same time, complex legal and ethical issues concerning the data economy have arisen around privacy, data protection, and ethical AI practices.

    In this dissertation, we start by exploring the data value chain and how entities trade data assets over the Internet. We carry out what is, to the best of our understanding, the most thorough survey of commercial data marketplaces. Based on this comprehensive survey, which analysed 104 entities, we have catalogued and characterised ten different business models, including those of personal information management systems (PIMS), companies born in the wake of recent data protection regulations that aim to empower end users to take control of their data. We have also identified the challenges faced by different types of entities, and the kinds of solutions and technology they use to provide their services. Through this extensive study, it has become clear to us that most of the challenges these entities face have to do with trust. On the one hand, sellers express an ambition for absolute control of their data, and demand a strong commitment from marketplaces to avoid unauthorised replication, resale or use of their data assets. On the other hand, potential buyers would benefit from testing data and knowing its value before closing a transaction, and from certifying that information comes from trustworthy data sources.

    Then we present a first-of-its-kind measurement study that sheds light on the prices of data in the market using a novel methodology. Having scraped metadata for hundreds of thousands of data products listed by 10 real-world data marketplaces and 30 other data providers, we found fewer than ten thousand that were non-free and included prices. We believe this is because prices are often left to direct negotiation between buyers and sellers, and also because most marketplaces use free data to bootstrap their platform and attract the first buyers, and only then commercial sellers. We learnt that comparing across marketplaces is far from simple. Not only do they use different categorisation hierarchies, but they also apply different criteria to label product categories. We circumvented this problem by using ML classifiers that learn the criteria a data marketplace follows to label datasets as belonging to a certain category, and apply the same criteria to label datasets in other data marketplaces.
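
    As an illustration of this harmonisation step, the following minimal sketch (assuming scikit-learn and purely hypothetical product descriptions and labels) trains a text classifier on one marketplace's own category labels and applies it to re-label products from another marketplace under the same taxonomy; it is not the exact model used in the dissertation.

      # Minimal sketch of cross-marketplace category harmonisation (illustrative
      # only). Descriptions and labels below are hypothetical, not scraped data.
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.pipeline import make_pipeline

      # Hypothetical training data from marketplace A: (description, category).
      descriptions_a = [
          "Monthly stock prices and fundamentals for US equities",
          "Verified B2B contact lists with emails and job titles",
          "Anonymised GPS traces of delivery vehicles in Europe",
      ]
      categories_a = ["Financial", "Marketing", "Geospatial"]

      # Learn marketplace A's labelling criteria from its own catalogue.
      classifier = make_pipeline(
          TfidfVectorizer(ngram_range=(1, 2)),
          LogisticRegression(max_iter=1000),
      )
      classifier.fit(descriptions_a, categories_a)

      # Re-label products listed on marketplace B under A's taxonomy, so that
      # prices can be compared within a single category hierarchy.
      descriptions_b = ["Daily stock prices for European equities"]
      print(classifier.predict(descriptions_b))  # likely ['Financial']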

    Focusing on the products that carried a price, some 4,200 of them, we observed the following:

    • Prices vary widely, from a few to several hundred thousand US dollars. The median price for data products sold under a subscription model is US$1,400 per month, and US$2,200 for those sold as a one-off purchase.

    • Products related to "Telecom", "Manufacturing", "Automotive" and "Gaming" command the highest median prices, while the most expensive products relate to "Retail and Marketing".

    • Using regression models, it is possible to fit the prices of commercial products from their features with an R² score above 0.84.

    • Due to the heterogeneity of the sample, no single feature drives prices; instead, we spotted meaningful features that drive the prices of specific categories of data. For example, the data update rate is a key price driver for financial and healthcare-related products, whereas geo-spatial localisation and the possibility of connecting data points from the same owner are key drivers for marketing data.

    • Overall, our models forecast market prices using features related to the category and nature of the different data products (e.g., "Financial", "Retail", "stock", "contact", "list"), features related to the products' volume and units, as well as singular characteristics extracted from the product descriptions (e.g., words like "custom", "accuracy", "quality"). Groups of features related to "what" and "how much" data a product contains drive 66% of its price.

    As a result, we also implement the basic building blocks of a novel data pricing tool capable of providing a hint of the market price of a new data product using just its metadata as input. Such a tool would bring more transparency to the prices of data products in the market, helping to price data assets and to dampen the price fluctuations inherent to nascent markets.
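
    A minimal sketch of such a metadata-based price estimator is shown below, assuming scikit-learn and a handful of hypothetical binary and volume features standing in for the category, volume and description features discussed above; the actual models and feature engineering in the dissertation are more elaborate.

      # Illustrative sketch of a metadata-based price estimator, not the exact
      # model from the dissertation. Feature names are hypothetical stand-ins
      # for the category, volume and description features mentioned above.
      import numpy as np
      import pandas as pd
      from sklearn.ensemble import GradientBoostingRegressor
      from sklearn.metrics import r2_score
      from sklearn.model_selection import train_test_split

      # Toy catalogue: one row per priced data product.
      catalogue = pd.DataFrame({
          "is_financial":     [1, 0, 0, 1, 0, 1, 0, 0],
          "is_marketing":     [0, 1, 0, 0, 1, 0, 1, 0],
          "rows_millions":    [5, 50, 2, 80, 10, 1, 200, 3],
          "update_daily":     [1, 0, 0, 1, 0, 1, 1, 0],
          "mentions_quality": [1, 1, 0, 1, 0, 0, 1, 1],
          "price_usd":        [2200, 900, 300, 6000, 1200, 1500, 4800, 250],
      })

      X = catalogue.drop(columns="price_usd")
      y = np.log1p(catalogue["price_usd"])   # prices span orders of magnitude

      X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.25, random_state=0)
      model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

      # A real catalogue of ~4,200 products would be needed to approach the
      # R² ≈ 0.84 reported above; with this toy sample the score is illustrative.
      print("R² on held-out products:", r2_score(y_test, model.predict(X_test)))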

    Next we turn to topics related to data marketplace design. In particular, we study how buyers can select and purchase suitable data for their tasks without requiring a priori access to such data to make a purchase decision, and how marketplaces can distribute the payoffs of a data transaction that combines data from different sources among the corresponding providers, be they individuals or firms. The difficulty of both problems is further exacerbated in a human-centric data economy, where buyers have to choose among data from thousands of individuals, and where marketplaces have to distribute payoffs to thousands of people contributing personal data to a specific transaction.

    Regarding the selection process, we compare different purchase strategies depending on the level of information available to data buyers at the time of making decisions. A first methodological contribution of our work is proposing a data evaluation stage before datasets are selected and purchased by buyers in a marketplace. We show that buyers can significantly improve the performance of the purchasing process just by being provided with a measurement of the performance of their models when trained by the marketplace on each individual eligible dataset. We design purchase strategies that exploit this functionality and call the resulting algorithm Try Before You Buy (TBYB); a simplified sketch of such a strategy follows the list below. Our work demonstrates over synthetic and real datasets that TBYB can lead to near-optimal data purchasing with only O(N) evaluations, instead of the exponential O(2^N) execution time needed to calculate the optimal purchase. In addition:

    • TBYB remains close to the optimal in most scenarios, and its benefit increases with the catalogue size.

    • TBYB is almost optimal when buying more data yields a progressively diminishing return in value for the buyer. Otherwise, TBYB finds it more difficult to match the optimal performance, although it still outperforms other heuristics.

    • The benefit of TBYB becomes maximal when prices of datasets do not correlate with their actual value for the buyer. When pricing reflects such value, the performance of TBYB is still superior but the gap with value-unaware strategies becomes smaller.

    • When dealing with personal data, TBYB can significantly reduce the number of individuals whose information is disclosed to buyers, hence helping to preserve privacy.
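
    The following sketch captures the spirit of such an evaluation-driven purchase strategy under simplifying assumptions: the marketplace exposes an oracle that trains and scores the buyer's model on a single candidate dataset, and the buyer ranks datasets by stand-alone value per dollar. The function names and the budget-based stopping rule are illustrative, not the dissertation's exact algorithm.

      # Illustrative Try Before You Buy style strategy under simplifying
      # assumptions: `evaluate_single` is a marketplace-side oracle that trains
      # the buyer's model on one candidate dataset and returns its accuracy,
      # without exposing the raw data. Names and the budget-based stopping rule
      # are hypothetical, not the dissertation's exact algorithm.
      from typing import Callable, Dict, List

      def tbyb_purchase(
          datasets: List[str],
          prices: Dict[str, float],
          evaluate_single: Callable[[str], float],
          budget: float,
      ) -> List[str]:
          """Rank datasets by stand-alone value (O(N) evaluations rather than
          the O(2^N) needed for the optimal bundle) and buy greedily."""
          # One marketplace-side evaluation per eligible dataset.
          scores = {d: evaluate_single(d) for d in datasets}

          # Buy in decreasing order of accuracy gained per unit of price.
          ranked = sorted(datasets, key=lambda d: scores[d] / prices[d],
                          reverse=True)

          bought, spent = [], 0.0
          for d in ranked:
              if spent + prices[d] <= budget:
                  bought.append(d)
                  spent += prices[d]
          return bought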

    With regards to the payoff distribution problem, we focus on computing the relative value of spatio-temporal datasets combined in marketplaces for predicting transportation demand and travel time in metropolitan areas, using large datasets of taxi rides from Chicago, Porto and New York. To do so, we introduce the Shapley value from cooperative game theory as a baseline metric for establishing the importance of each player (be they taxi companies or individual drivers) in the context of a coalition of data providers. The Shapley value is widely accepted for this purpose due to its salient properties (efficiency, symmetry, linearity, null player and strict desirability). At the same time, however, it entails serious computational challenges, since its direct calculation in a coalition of N players requires enumerating and computing the value of O(2^N) sub-coalitions. This may be feasible for a few tens of data providers, which is the case for companies in wholesale markets, but becomes impossible when considering hundreds or thousands of them in a retail data market setting. Furthermore, we look at the trade-off between fairness and scalability/practicality by studying and comparing against simpler heuristics used to estimate the value of data, based on the volume of data, the leave-one-out (LOO) value, measures of the amount of information of a data source such as Shannon's entropy, and metrics of the averageness of such data.
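
    For reference, the Shapley value assigns to each data provider i its marginal contribution averaged over all orders in which providers could join the coalition N, where v(S) denotes the model performance achieved with the data of sub-coalition S (the notation below is the standard one, restated here for convenience):

      \phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}}
          \frac{|S|!\,\bigl(|N|-|S|-1\bigr)!}{|N|!}
          \Bigl( v\bigl(S \cup \{i\}\bigr) - v(S) \Bigr)

    The sum ranges over every sub-coalition not containing i, which is precisely why a direct computation requires O(2^N) evaluations of v.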

    We first study the value of data fusion at the granularity of companies. Since the number of such companies covering the same geographical area is typically small, the relative value of their data can be computed directly from the definition of the Shapley value. This, however, becomes infeasible at the level of individual taxi drivers, who may number several thousand in large metropolitan areas. To address this issue, we compare different approximation techniques, and conclude that an ad hoc version of structured sampling performs much better than more popular approaches such as Monte Carlo and random sampling.
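
    For contrast with the structured-sampling approach favoured above, the following is a minimal sketch of the standard Monte Carlo permutation-sampling estimator of the Shapley value; the value oracle v(S) is assumed to be supplied by the marketplace, for example as the forecasting accuracy obtained with the data of sub-coalition S.

      # Baseline Monte Carlo permutation-sampling estimator of the Shapley
      # value, shown for contrast; the structured-sampling variant favoured in
      # the dissertation stratifies permutations rather than drawing them
      # uniformly. `value` is an assumed oracle returning v(S).
      import random
      from typing import Callable, Dict, List, Set

      def shapley_monte_carlo(
          players: List[str],
          value: Callable[[Set[str]], float],
          num_permutations: int = 1000,
      ) -> Dict[str, float]:
          estimates = {p: 0.0 for p in players}
          for _ in range(num_permutations):
              order = players[:]
              random.shuffle(order)
              coalition: Set[str] = set()
              prev = value(coalition)
              # Accumulate each player's marginal contribution in this order.
              for p in order:
                  coalition.add(p)
                  curr = value(coalition)
                  estimates[p] += curr - prev
                  prev = curr
          return {p: s / num_permutations for p, s in estimates.items()}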

    By applying our model and valuation algorithms to taxi-ride data from Chicago, Porto and New York, we find that sufficiently large companies hold enough information to independently predict the overall demand, at city level or in large districts, with over 96% accuracy. This effectively means that inter-company collaboration does not make much sense in such cases. In contrast, companies have to combine their data in order to achieve sufficient forecasting accuracy in smaller districts. We compute the relative value of different contributions in such cases by computing the Shapley value for each taxi company. We find that the values differ by several orders of magnitude, and that the importance of a given company's data can vary by as much as 10x across districts. More interestingly, the Shapley value of a company's dataset does not correlate with its volume, i.e., some companies that report relatively few rides have a larger impact on forecasting accuracy than companies that report many more rides. The LOO heuristic also fails to approximate the per-company value as given by Shapley.

    Similar phenomena are observed at the finer level of individual drivers. We show that by combining data from relatively few drivers one can easily detect peak hours at city level. At district level, however, more data needs to be combined, and this requires making use of our fastest approximations of the Shapley value based on structured sampling. Moreover, using trajectory data from taxis in Porto, we observe again, this time for estimating the travel time within a city, that the value of information contributed by each driver may vary wildly, and that it cannot be approximated based on the volume of rides they report nor via the LOO heuristic.

    Overall, using multiple datasets, different forecasting objectives, and different granularities, our work shows that computing, even approximately, the Shapley value seems to be a "necessary evil" if one wants to fairly split the value of a combined spatio-temporal dataset. Simple heuristics based on volume and LOO fail to approximate the results produced via the Shapley value. Other heuristics tailored to each problem, such as the similarity to the aggregate when predicting demand, or spatio-temporal Shannon entropy when predicting travel time in a city, do a better job at approximating Shapley. We believe that the fast-growing ecosystem of data marketplaces and PIMS can greatly benefit from these findings as it transitions from very basic towards more elaborate and fairer pricing schemes.
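
    As an illustration of the entropy heuristic mentioned above, a driver's data can be scored by the Shannon entropy of their rides over space-time bins; the binning into grid cells and hours below is a hypothetical stand-in for the discretisation used in the dissertation.

      # Illustrative spatio-temporal Shannon entropy heuristic: score a
      # driver's data by how evenly their rides spread over space-time cells.
      # The (grid_cell_id, hour_of_week) binning is a hypothetical stand-in.
      import math
      from collections import Counter
      from typing import Iterable, Tuple

      def spatiotemporal_entropy(rides: Iterable[Tuple[int, int]]) -> float:
          """`rides` yields (grid_cell_id, hour_of_week) pairs for one driver."""
          counts = Counter(rides)
          total = sum(counts.values())
          if total == 0:
              return 0.0
          return -sum((c / total) * math.log2(c / total)
                      for c in counts.values())

      # A driver covering many different cells and hours scores higher than one
      # repeating the same trip, and is expected to add more to the model.
      print(spatiotemporal_entropy([(1, 8), (2, 9), (3, 17), (4, 18)]))  # 2.0
      print(spatiotemporal_entropy([(1, 8)] * 4))                        # 0.0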

    We conclude with a number of open issues and propose further research directions that leverage the contributions and findings of this dissertation. These include monitoring data transactions to better measure data markets, and complementing market data with actual transaction prices to build a more accurate data pricing tool. A human-centric data economy would also require that the contributions of thousands of individuals to machine learning tasks be calculated daily. For that to be feasible, we need to further optimise the efficiency of data purchasing and payoff calculation processes in data marketplaces. In that direction, we also point to alternatives to repeatedly training and evaluating a model, both to select data with Try Before You Buy and to approximate the Shapley value. Finally, we discuss the challenges and potential technologies that could help build a federation of standardised data marketplaces.

    The data economy will develop fast in the upcoming years, and researchers from different disciplines will work together to unlock the value of data and make the most out of it. Perhaps the proposal that people be paid for their data and their contribution to the data economy will finally take off, or perhaps other proposals, such as a robot tax, will ultimately be used to balance the power between individuals and tech firms in the digital economy. Either way, we hope our work sheds light on the value of data, and contributes to making the price of data more transparent and, eventually, to moving towards a human-centric data economy.

