Data Protection and Artificial Intelligence

Jona Boeddinghaus
2020-05-26

Data protection is an important prerequisite for a free and fair society. Only if citizens really have the right to determine how their personal data are used and published can they move freely in society.

Only this fundamental right to informational self-determination creates the trust in a functioning democracy that is needed for the free development of one's personality and for comprehensive freedom of action.

Effective data protection is also a guarantor of real equality and a practical safeguard against discrimination. It protects against unwanted exclusion and opaque evaluations and enables self-determined participation in public and private life.

In its legal form, data protection covers personal data. However, non-personal data are also worth protecting if they allow conclusions to be drawn about confidential information. Companies, for example, often have a strong interest in securely protecting data that describe business-critical processes or inventions.

Data protection should therefore always come first when storing and processing data.

On the other hand, there is a growing need to collect data and to process it sensibly. Business, research and public administration are storing more data than ever before. This runs counter to the data protection maxim of data minimization, which, as laid down in the European General Data Protection Regulation, requires that as little data as possible be processed for a given purpose. More sensitivity and control are clearly needed here. At the same time, the collection and use of data is not to be condemned per se: all empirical sciences are based on data, and data analysis is essential in research as well as in countless business processes and in public administration.

Data analyses can enable inventions or improve processes and thus contribute to greater prosperity; analyses of diseases or epidemics can provide important insights that save lives.

In recent years, one discipline of computer science has stood out in particular for speeding up and improving the analysis of large amounts of data: Artificial Intelligence, or more precisely its sub-field, machine learning. Here, data scientists develop algorithms that improve automatically based on the data made available to them. Machine learning mainly relies on complex mathematical and statistical models that are particularly good at uncovering previously hidden, general properties of data sets. On the one hand, this makes machine learning a powerful tool for data analysis that is successfully used in industry to optimize processes, in research for faster and more precise development, e.g. the discovery of new drug combinations, and in healthcare for better diagnosis and treatment of diseases. On the other hand, it is precisely this type of data processing that poses major challenges for data protection.

There are various methods for protecting secret or personal data during processing. Probably the best known and most popular approaches are anonymization and its sister method, pseudonymization: pieces of information worth protecting (such as names) are removed from the data or replaced by placeholders. The resulting data records apparently no longer contain any personal data and are therefore largely freely usable under EU data protection law. The logic here is: if the data no longer contain any personal information, there is no conflict with the principle of informational self-determination (simply because this "self" obviously no longer appears in the data). Data protection law requires that re-identifying a person from a data record must be sufficiently difficult, and, leaving aside the often underestimated consequences for information security, this requirement appears to be met by a carefully carried out anonymization.
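To see what pseudonymization amounts to in practice, here is a minimal sketch in Python (the record fields and the salt are invented for illustration): direct identifiers are replaced by salted hashes, while quasi-identifiers such as birth date and postcode remain in the data.

```python
# Minimal pseudonymization sketch (illustrative only; all field names are hypothetical).
import hashlib

SALT = "replace-with-a-secret-salt"

def pseudonymize(record: dict) -> dict:
    """Replace the direct identifier with a salted hash; keep everything else."""
    out = dict(record)
    out["name"] = hashlib.sha256((SALT + record["name"]).encode()).hexdigest()[:16]
    return out

record = {"name": "Alice Example", "birth_date": "1987-03-12",
          "postcode": "1010", "diagnosis": "J45"}
print(pseudonymize(record))
# The diagnosis is still linked to a unique combination of birth date and postcode.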

Unfortunately, anonymized data is by no means secure. With machine learning methods in particular, it is easy to obtain information about individual people (or other data points) from supposedly completely anonymized data. The following examples illustrate this:

In August 2016, the Australian government published a data set online containing the medical and pharmaceutical billing records of approximately 2.9 million people. Identifying information and billing data had previously been pseudonymized. Australian scientists subsequently showed that individuals can be identified in this data set with the help of only a few known facts or a little public information (C. Culnane, B. Rubinstein, V. Teague, 2017, https://arxiv.org/abs/1712.05627), and with them their medical histories, which are extremely worth protecting. In another interesting work, scientists from the Belgian Université catholique de Louvain (UCLouvain) and Imperial College London demonstrated how easily almost any data set can be de-anonymized (L. Rocher, J. Hendrickx, Y.-A. de Montjoye, 2019, https://www.nature.com/articles/s41467-019-10933-3). They estimate, for example, that 15 demographic attributes suffice to correctly identify 99.98% of Americans, in a data set that would be considered completely secure in terms of data protection law.
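The underlying linkage mechanism is simple. The following toy sketch (all values invented) shows how a handful of publicly known attributes can single out one record, and thus its diagnosis, in a "pseudonymized" table:

```python
# Illustrative linkage attack on toy data (all values invented).
pseudonymized = [
    {"pid": "a91f", "birth_year": 1987, "postcode": "1010", "sex": "F", "diagnosis": "J45"},
    {"pid": "c3d2", "birth_year": 1987, "postcode": "1030", "sex": "M", "diagnosis": "E11"},
    {"pid": "77be", "birth_year": 1990, "postcode": "1010", "sex": "F", "diagnosis": "I10"},
]

# Publicly known about the target (e.g. from social media): year of birth, postcode, sex.
known = {"birth_year": 1987, "postcode": "1010", "sex": "F"}

matches = [r for r in pseudonymized
           if all(r[k] == v for k, v in known.items())]
if len(matches) == 1:
    print("Re-identified:", matches[0]["pid"], "diagnosis:", matches[0]["diagnosis"])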

Even where direct de-identification is not possible, advanced statistical evaluations readily allow conclusions to be drawn about properties of individual data points that should actually be protected. For example, repeated requests for aggregated information (such as means or sums) can reveal whether a particular data point is part of a data set or not. Machine learning models can be used to estimate the probability that a person is, say, a member of a particular organization or even has certain attributes. These so-called "membership" and "attribute disclosure" attacks based on advanced analytical methods can easily be applied to supposedly completely anonymized data and therefore represent a major threat to effective data protection.
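As a concrete illustration, the following sketch (with invented salary figures and a hypothetical aggregate-only interface) shows a simple differencing attack: two harmless-looking sum queries reveal one individual's protected value exactly.

```python
# Minimal differencing attack on toy data (values invented).
salaries = {"alice": 52000, "bob": 61000, "carol": 58000}

def total_salary(exclude=None):
    # Hypothetical aggregate interface that only ever returns sums, never raw values.
    return sum(v for k, v in salaries.items() if k != exclude)

# Two aggregate queries are enough to isolate one person's value.
leaked_alice = total_salary() - total_salary(exclude="alice")
print(leaked_alice)  # 52000: Alice's salary, although only sums were released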

The unwanted disclosure of group memberships or personal attributes can seriously impair informational self-determination. If acquaintances or potential employers know more about a person than that person is willing to reveal, or if, for example, an insurance company knows more than the person has stated, very unpleasant social and economic consequences can follow.

The same applies to the unwanted disclosure of company secrets. Research results and confidential business data are repeatedly the target of attacks on the information security of institutions and companies, and advanced statistical analysis is an increasingly popular means of gathering such information.

However, prohibiting machine learning for these reasons is neither practical nor advisable because, as described above, its benefits for research and development, for example in healthcare, are enormous. Nevertheless, this form of analysis needs to be regulated, especially with regard to its effects on data protection. A machine learning model that is ready for use by definition contains a wealth of information about the data on which it was trained, and anonymizing or pseudonymizing this data (or altering it in any other way) is not sufficient for effective protection. What is needed instead is a secure, quantifiable and thus verifiable training method that ensures the data remain protected. Fortunately, such a method exists: differential privacy. We are working to publish only models that were trained according to the principle of differential privacy and that combine maximum usability of the data with the highest level of data protection.
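To make the core idea tangible, the sketch below shows, in plain NumPy with invented data and noise settings, the mechanism behind differentially private training in the DP-SGD style: each example's gradient is clipped to bound its individual influence, and calibrated noise is added before the model update. This is only an illustration of the principle, not the training procedure implemented in DQ0; a real deployment also needs a privacy accountant to track the resulting (epsilon, delta) guarantee.

```python
# Illustrative differentially private gradient descent (DP-SGD style) on toy data.
# Data, clipping bound and noise scale are invented; no privacy accounting is done here.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                                       # toy features
y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]) > 0).astype(float)    # toy labels

w = np.zeros(5)
clip_norm, noise_std, lr = 1.0, 1.0, 0.1

for step in range(100):
    # Per-example gradients of the logistic loss: (sigmoid(x.w) - y) * x
    preds = 1.0 / (1.0 + np.exp(-(X @ w)))
    per_example_grads = (preds - y)[:, None] * X                    # shape (n, d)

    # Clip each example's gradient so no single person can dominate the update
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_norm)

    # Add Gaussian noise scaled to the clipping bound, then average and step
    noisy_sum = clipped.sum(axis=0) + rng.normal(scale=noise_std * clip_norm, size=5)
    w -= lr * noisy_sum / len(X)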

If you would like to learn more about this mathematical method and how we at Gradient Zero apply this principle to machine learning in a secure software platform, visit dq0.io or contact us at dq0@gradient0.com.
