Differential Privacy - an introduction
Our previous post discussed the privacy risks posed by the ever-increasing amount of personal data collected, processed, and consumed today. Preventing the leakage of protected information when analyzing sensitive datasets is therefore one of the major challenges data science faces nowadays.
One may think that releasing only summary statistics about a dataset, for example the mean value of a certain feature, guarantees the privacy of its individual records: after all, no information specific to a single record is revealed. Unfortunately, this is far from true. To convince you of this, consider the following toy example. Imagine that at the beginning of every month your HR department sends out a newsletter with up-to-date information about the company crew. Among other things, the newsletter introduces the new employees starting in the upcoming month, congratulates the current employees celebrating their birthday during that month, and reports the average age of the current employees. Since people's ages are considered private information, only the average value is released. At first glance, inferring the age of any employee from this monthly summary statistic alone seems impossible.
However, consider this situation. Your boss is the only one celebrating his birthday this upcoming month, and he will leave the company before the end of the month. In addition, no new employees will join the company this month. By comparing the average age reported in this month's newsletter with the one reported in the next newsletter, anyone can work out your boss's age. A privacy breach just happened!
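To make this differencing attack concrete, here is a minimal Python sketch; the ages and headcounts below are made up purely for illustration:

```python
# Hypothetical numbers, purely for illustration.
ages_this_month = [34, 56, 41, 29, 62]   # 62 is the boss's age
ages_next_month = [34, 56, 41, 29]       # the boss has left, nobody joined

mean_this_month = sum(ages_this_month) / len(ages_this_month)  # 44.4, published now
mean_next_month = sum(ages_next_month) / len(ages_next_month)  # 40.0, published next month

# Knowing both headcounts, the two published averages pin down the missing age exactly:
boss_age = len(ages_this_month) * mean_this_month - len(ages_next_month) * mean_next_month
print(boss_age)  # 62.0 -- the boss's age has been reconstructed
```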
While the above scenario is just a toy example, it shows how releasing apparently harmless statistics may unintentionally pose subtle threats to the privacy of your data. Perhaps leaking a person's age is not such a big issue, but what about unintentionally revealing information about someone's health? Well, to use a euphemism, it is undesirable at best.
Luckily, in situations like these, differential privacy (DP) comes to the rescue to safeguard the privacy of your data. Let us go back to our example for an intuitive explanation of how DP works. To obtain a differentially-private estimate of the employees' average age, the actual age of each employee is first perturbed by adding a random number drawn from a suitable interval centered around zero, e.g., [-100, 100]. We will discuss in the next blog entries how the interval bounds are chosen. So, for example, if John's actual age is 34, his perturbed age may be 39, while the perturbed age of the 56-year-old Mike may be 47. The mean of the perturbed ages, rather than the actual ages, is then reported in the newsletter. Provided that enough employees are involved in the computation, the reported average age is a reliably good approximation of the actual average age. And, crucially, since the perturbed rather than the actual ages are used to compute the mean, a privacy breach like the one described above can no longer happen: neither your boss's age nor that of any other employee is revealed!
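Here is a minimal Python sketch of the idea just described, assuming (as in the example above) that the noise is drawn uniformly from [-100, 100]; the company size, the ages and the function names are invented for illustration only:

```python
import random

def perturb_age(age, bound=100):
    """Return the age plus noise drawn uniformly from [-bound, bound].
    The bound of 100 is the example interval from the text; how it should
    actually be chosen is covered in the follow-up posts."""
    return age + random.uniform(-bound, bound)

def private_mean_age(ages, bound=100):
    """Average the perturbed ages instead of the actual ones."""
    return sum(perturb_age(a, bound) for a in ages) / len(ages)

# A made-up company of 1,000 employees with ages between 22 and 65.
actual_ages = [random.randint(22, 65) for _ in range(1000)]

true_mean = sum(actual_ages) / len(actual_ages)
noisy_mean = private_mean_age(actual_ages)

print(f"actual mean age:    {true_mean:.2f}")
print(f"perturbed mean age: {noisy_mean:.2f}")  # close to the true mean, yet no single age was used as-is
```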
If you have read this far, I guess you have a big question in mind: how can the mean of the perturbed ages be a reliable estimate of the actual average age (provided enough perturbed ages are involved)? Since the random numbers added to each age are drawn from a distribution with zero mean, if you average over enough of them, they cancel each other out: the mean of the random numbers gets close to zero, so when averaging the perturbed ages you obtain the mean of the actual ages plus a value relatively close to zero. You thus obtain a reliable average age, even though no employee revealed their actual age! It should also be clear now why increasing the number of employees involved in the differentially-private computation improves the quality of the result.
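A quick simulation illustrates this cancellation effect; again, all numbers and names below are made up, and the uniform noise over [-100, 100] is just the example interval used throughout this post:

```python
import random

def estimation_error(n_employees, bound=100, trials=200):
    """Average absolute gap between the true mean age and the perturbed mean,
    estimated over several random trials (all figures are illustrative)."""
    total_gap = 0.0
    for _ in range(trials):
        ages = [random.randint(22, 65) for _ in range(n_employees)]
        true_mean = sum(ages) / n_employees
        noisy_mean = sum(a + random.uniform(-bound, bound) for a in ages) / n_employees
        total_gap += abs(noisy_mean - true_mean)
    return total_gap / trials

for n in (10, 100, 1000, 10000):
    print(f"{n:>5} employees -> average error ~ {estimation_error(n):.2f} years")
```

The reported error shrinks as the group grows, which is exactly why larger groups make the differentially-private average more accurate.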
This simple example shows how releasing differentially-private statistics about a large group of people preserves the privacy of every single individual in the group. Still, while this is not the only useful application of DP, DP is not a silver bullet for the data-privacy problem. In particular, private information may be leaked when repeatedly releasing differentially-private estimates of the same statistic over the same data. This vulnerability, and how to handle it properly, will be discussed in our next blog posts, where we will turn to the mathematical formulation of DP. With that more rigorous explanation, we will also motivate the choice of the interval [-100, 100] used to generate the random numbers in the example above.