In 2017, DOMO estimated that 90 percent of the world's data had been created in the previous two years, and many data analysts expected the digital universe to be 40 times bigger by 2020. This massive amount of person-specific and sensitive data comes from disparate sources such as social networking sites, mobile phone applications and electronic medical record systems. The use of big data offers remarkable opportunities. For example, in a healthcare context, big data can be used to refine health policies, which benefits both individuals and society as a whole. At the same time, however, the privacy of the subjects to whom the data refer must be guaranteed. The data must be protected against attacks and data leakages (data protection), e.g. unauthorized people must not be able to access sensitive data. It should not be possible to re-identify any individual in the published data, even when other external or publicly available data are integrated (data anonymity). Additionally, access to the released data should not allow an attacker to increase his knowledge about confidential information related to any specific individual (data confidentiality). The Big Data Value Association, for instance, recognizes data protection and anonymization as one of the main priorities for research and innovation; it also identifies the need for efficient mechanisms for data storage and processing, and suggests the joint development of hardware and software for cloud data platforms.
The cloud is fast becoming the new normal and a suitable strategy in the big data context. In fact, the cloud is often the only feasible strategy due to the costs (software, hardware, energy, maintenance) associated with the storage and processing of big data. The 2017 State of the Cloud Survey estimated that 95 percent of enterprises had a cloud strategy. Nevertheless, even though companies run most of their workload in the cloud (around 79 percent of the total), the remaining 21 percent runs locally. This local workload may be traced back to the reluctance of data controllers to entrust their sensitive data to the cloud due to security and privacy concerns. For instance, the 2017 State of the Cloud Survey shows that 25 percent of respondents cite security as a major concern. The problem is not only that cloud service providers (CSPs) may read, use or even sell the data outsourced by their customers, but also that they may suffer attacks and data leakages that compromise data confidentiality. For instance, Ristenpart et al. show that co-resident virtual machines (VMs) can give rise to certain security vulnerabilities: if the attacker becomes a customer of the cloud and obtains a VM, he can use different information leakage attacks on the shared physical resources to gain access to the victim's (sensitive) information.
In this thesis, we tackle the problem of outsourcing to untrusted clouds, in a practical and privacy-preserving manner, two basic operations on non-encrypted sensitive data: scalar products and matrix products. These operations are useful to perform data analyses such as correlations between attributes or contingency tables, among others. Specifically, we propose several secure protocols to outsource to multiple clouds the computation of a variety of multivariate analyses on nominal data (frequency-based and semantic-based). These analyses are challenging, and they are even harder when data are nominal (i.e., textual, non-ordinal), because the standard arithmetic operators cannot be used.
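To see why scalar products suffice for frequency-based analyses on nominal data, note that counts over categorical attributes reduce to scalar products of binary indicator vectors. The following is a minimal illustration of that reduction (with made-up attribute names, not the thesis's protocols):

```python
def indicator(values, category):
    """Binary indicator vector: 1 where the record takes the category."""
    return [1 if v == category else 0 for v in values]

# Toy nominal data set with two categorical attributes.
eye_color = ["blue", "brown", "blue", "green", "blue"]
hair_color = ["fair", "dark", "dark", "fair", "fair"]

blue = indicator(eye_color, "blue")    # [1, 0, 1, 0, 1]
fair = indicator(hair_color, "fair")   # [1, 0, 0, 1, 1]

# The scalar product counts records with blue eyes AND fair hair,
# i.e., one cell of the eye/hair contingency table.
joint_count = sum(b * f for b, f in zip(blue, fair))
print(joint_count)  # 2
```

Thus, a secure scalar-product protocol directly yields contingency-table cells, from which other frequency-based statistics can be derived.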
Our protocols allow using the cloud not only to store sensitive non-encrypted data, but also to process them. We consider two variants of honest-but-curious clouds: clouds that do not share information with each other and clouds that may collude by sharing information with each other. Our protocols have been designed to outsource as much workload as possible to the clouds, in order to retain the cost-saving benefits of cloud computing while ensuring that the outsourced data stay split and, hence, privacy-protected from the clouds. In addition to analyzing the security of the proposed protocols, we also evaluate their performance against a baseline consisting of downloading plus local computation. The experiments on categorical data that we report on the Amazon cloud service show that, with our protocols, the data controller can save more than 99.999% of the runtime for the most demanding computations.
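The idea of keeping outsourced data split can be illustrated with additive splitting: each vector is divided into two random shares, so that a single share reveals nothing, yet the partial scalar products of the shares recombine exactly. The sketch below shows only this algebraic correctness; it is not the thesis's protocols, and in a real multi-cloud setting the cross terms must be computed without co-locating complementary shares:

```python
import random

MOD = 2**32  # work modulo a power of two so shares look uniformly random

def split(vec):
    """Additively split a vector into two random shares (mod MOD)."""
    share1 = [random.randrange(MOD) for _ in vec]
    share2 = [(v - s) % MOD for v, s in zip(vec, share1)]
    return share1, share2

def dot(a, b):
    """Scalar product modulo MOD."""
    return sum(x * y for x, y in zip(a, b)) % MOD

x = [3, 1, 4, 1, 5]
y = [2, 7, 1, 8, 2]

x1, x2 = split(x)
y1, y2 = split(y)

# Since x = x1 + x2 and y = y1 + y2 (mod MOD), the four partial
# products recombine to the true scalar product:
result = (dot(x1, y1) + dot(x1, y2) + dot(x2, y1) + dot(x2, y2)) % MOD
print(result == dot(x, y))  # True (both equal 35)
```

Each share in isolation is uniformly random, which is what allows honest-but-curious clouds that do not collude to store and process the shares without learning the underlying data.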
On the other hand, although data collection has become easier and more affordable than ever before, releasing data for secondary use (that is, for a purpose other than the one that triggered the data collection) remains very important: in most cases, researchers cannot afford to collect the data they need themselves. However, when the data released for secondary use refer to individuals, households or companies, the privacy of the data subjects must be taken into account. A great variety of statistical disclosure control (SDC) methods, which aim at releasing data that preserve their statistical validity while protecting the privacy of each data subject, are now available. Since sensitive information can be inferred in many ways from data releases, these masking methods are indispensable. Homer et al. show that participants in genomic research studies may be identified from the publication of aggregated research results, even when an individual contributes less than 0.1% of the total genomic DNA in a mixture. Greveler et al. show that the high-resolution energy consumption data transmitted by some smart meters to the utility company can be used to identify the TV shows and movies being watched in a target household. Coull et al. show that certain types of web pages viewed by users can be deduced from metadata about network flows, even when server IP addresses are replaced with pseudonyms. Finally, Goljan and Fridrich show how cameras can be identified from the noise in the images they produce. In this thesis, we also present a methodology to compare SDC methods for microdata in terms of how they perform on the risk-utility trade-off. Previous comparative studies usually start by selecting some parameter values for a set of SDC methods and then evaluate the disclosure risk and the information loss yielded by the methods for those parameterizations. In contrast, here we start by setting a certain risk level (resp. 
utility preservation level) and then find which parameter values are needed to attain that risk (resp. utility) under different SDC methods. Finally, once we have achieved an equivalent risk (resp. utility) level across methods, we evaluate the utility (resp. risk) provided by each method, in order to rank methods according to their utility preservation (resp. disclosure protection). This ranking is specific to the chosen level of risk (resp. utility) and to the original data set. The novelty of this comparison is not limited to the above-described methodology: we also justify and use general utility and risk measures that differ from those used in previous comparisons. Furthermore, we present experimental results of our methodology to compare the utility preservation of several methods given an equivalent level of risk for all of them. The experiments that we report on the CENSUS and EIA data sets, which are usual test sets in the SDC literature, show that the results differ between data sets. As a conclusion from the experimental analysis, the best strategy seems to be to compute several anonymizations at the desired level of disclosure risk and select the one that has the greatest utility.
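The comparison methodology can be sketched as follows. This is a toy illustration with hypothetical risk and utility measures and two stand-in masking methods (additive noise and rounding), not the measures or methods studied in the thesis; it only shows the structure of calibrating each method to a common risk level before comparing utilities:

```python
import random
import statistics

random.seed(0)
original = [random.gauss(50, 10) for _ in range(500)]

def additive_noise(data, p):
    """Stand-in SDC method 1: add Gaussian noise with std dev p."""
    return [x + random.gauss(0, p) for x in data]

def rounding(data, p):
    """Stand-in SDC method 2: round values to multiples of p."""
    return [round(x / p) * p for x in data]

def risk(orig, masked, tol=1.0):
    """Toy disclosure risk: fraction of records whose masked value
    stays within `tol` of the original (a record-linkage proxy)."""
    return sum(abs(a - b) <= tol for a, b in zip(orig, masked)) / len(orig)

def utility(orig, masked):
    """Toy utility: how well the mean is preserved (0 is best)."""
    return -abs(statistics.mean(orig) - statistics.mean(masked))

def calibrate(method, target_risk, lo=0.1, hi=100.0, steps=30):
    """Binary-search the method's parameter to attain the target risk
    (both toy methods mask more, i.e. lower risk, as p grows)."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if risk(original, method(original, mid)) > target_risk:
            lo = mid  # still too risky: mask more
        else:
            hi = mid
    return (lo + hi) / 2

# Equalize risk across methods, then rank them by residual utility.
for name, method in [("noise", additive_noise), ("rounding", rounding)]:
    p = calibrate(method, target_risk=0.2)
    masked = method(original, p)
    print(name, round(p, 2), round(utility(original, masked), 3))
```

The thesis methodology follows this pattern (and its dual, equalizing utility and comparing risk), but with the general risk and utility measures justified in the text and with real SDC methods and data sets.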