
Understanding machine unlearning: In the context of the right to be forgotten




Contents

  1. Introduction

  2. Right to be Forgotten

  3. Machine Learning

  4. Machine Unlearning

  5. Discussion

Introduction


In this blog post, we look at the right to be forgotten through the lens of a lesser-known data science method: machine unlearning. To do that, we will first touch upon what the right to be forgotten is and why it matters in the data science world. In the second part, we will explain machine learning, and then machine unlearning. In the discussion, we give examples of both methods from the perspective of the right to be forgotten so that the issue can be grasped correctly.


Right to be Forgotten


The right to be forgotten can be understood as protecting a person's rights when their data is processed unnecessarily [1]. The right can be invoked when there is no longer a need to keep a person's data in storage; when a person's data has been processed unlawfully; when data was processed with a person's consent but that person now intends to withdraw it; and so on. On the regulatory side, the right is set out in Article 17 of the GDPR [2]. That law may appear to bind only the countries of the EU; however, one must not forget that organizations outside the EU also have to comply with it when processing data subjects' personal information, provided the territorial-scope conditions of Article 3 of the GDPR apply [3]. On the Turkish side of the story, positioning the right to be forgotten within the regulations is a difficult task, because Law No. 6698 does not mention the right directly [4]. Finding the right to be forgotten in other Turkish regulations is difficult as well. The only exception is that Turkish Supreme Court decisions mention the term as it appears in the GDPR [5].


Machine Learning


Machine learning is a family of methods that learn from data sets. One can predict a target value one would like to reach; forecasting monthly sales would be an example. One can also classify the variables in a data set: coding monthly sales above x units as `yes` and sales below that as `no` would be an example of classification. Finally, one can let the machine decide what to do with untagged variables in the data set, for example finding patterns in pictures, songs, or texts. Kumar et al. (2020) surveyed how different industries approach machine learning methods. The participating organizations came from cybersecurity, healthcare, government, consulting, banking, social media analytics, publishing, agriculture, urban planning, food processing, and translation. The takeaway for readers is that all of these sectors are plausible grounds for machine learning applications. A minimal sketch of the prediction and classification examples follows below.
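
To make the prediction and classification examples concrete, here is a minimal sketch in Python using scikit-learn. The data, the threshold x = 250, and the column meanings are invented for illustration.

```python
# Minimal sketch: predicting monthly sales (regression) and coding
# sales above a threshold as yes/no (classification).
# The data below is invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features: advertising spend and number of sales reps.
X = np.array([[10, 2], [20, 3], [30, 4], [40, 5], [50, 6]])
monthly_sales = np.array([120, 180, 260, 330, 410])  # units sold

# Prediction: estimate a month's sales from the features.
reg = LinearRegression().fit(X, monthly_sales)
print(reg.predict([[35, 4]]))  # predicted units for a new month

# Classification: code months selling more than x = 250 units as "yes".
labels = np.where(monthly_sales > 250, "yes", "no")
clf = DecisionTreeClassifier().fit(X, labels)
print(clf.predict([[35, 4]]))  # "yes" or "no"
```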


Machine Unlearning


The most intuitive way to describe this method is as data deletion (Cao & Yang, 2015). What distinguishes it from machine learning is the complete removal of the data of the people in question from the learning process, and that removal raises serious discussions about how to carry it out. In addition, there is no data system to build as there is in machine learning. Namely, in machine learning the data set is first made suitable through data preparation, data cleaning, and data processing; it is then made ready for modeling with train, test, and validation splits, and the system learns while these steps run. Moreover, data engineering keeps the flow of data into the data set continuous. While all of these steps are in motion, the question of how to delete the personal information of a data subject from the trained model is a genuinely thought-provoking one.
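
The simplest (and most expensive) way to honor a deletion request is exact unlearning: drop the person's records and retrain from scratch. A minimal sketch follows; the data table and the `user_id` column are invented for illustration.

```python
# Naive exact unlearning: remove the requesting user's rows and retrain.
# Correct by construction, but the retraining cost is exactly the problem
# that machine-unlearning research tries to avoid.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical training table with a user identifier.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 4],
    "x1": [0.2, 0.5, 0.1, 0.9, 0.7, 0.4],
    "x2": [1.0, 0.3, 0.8, 0.2, 0.6, 0.5],
    "y":  [0, 0, 1, 1, 1, 0],
})

def unlearn_and_retrain(df: pd.DataFrame, user_id: int) -> LogisticRegression:
    """Delete every row belonging to user_id, then retrain from scratch."""
    remaining = df[df["user_id"] != user_id]
    model = LogisticRegression()
    model.fit(remaining[["x1", "x2"]], remaining["y"])
    return model

model = unlearn_and_retrain(df, user_id=1)  # user 1 exercised their right
```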


Discussion


In this section, where the intersection of the method called machine unlearning and the right to be forgotten is examined, academic studies published on these subjects will be drawn upon. The question at this point is how machine learning models, which keep existing through continuous data updates, can serve the legitimate interests of both sides: the businesses and the people who request data deletion. Considering that the data subject constitutes the cornerstone of the aforementioned laws, it would be reasonable to expect data processors to fulfill deletion requests without delay [6]. On the other hand, the fact that the concept of legitimate interest in the personal data protection law is built only on the legitimate interests of data controllers raises the question of whose interests should carry the weight in data processing: individuals or data controllers [7]. From that question, how the sensitivity of a deletion is to be weighed becomes the starting point of the discussion. On the European Union side of the issue, the perspective of protecting the rights of data subjects is the same (in a more comprehensive form), and in addition there are texts addressing the legitimate interests of data subjects [8].


As mentioned, these data deletion processes were brought into the literature by Cao and Yang (2015). The deletion they deal with consists of complete deletion of data and timely deletion of data. The purpose of the complete-deletion part is that the effect of the deleted records on the prediction results should not be noticeable; while this happens, the data set keeps updating itself and producing new estimates. Timely deletion means a rapid deletion process intended to prevent possible attacks during changes to the data set. To put these methods in simpler language: in a scenario where a team of 100 people has an average age of 34, after a deletion request the published figures simply become, say, 95 people with a slightly different average, and a third party looking at the table sees only that the team has 95 members with that average, never who was removed. As an example of timely deletion, take the same 100-person data set: after the personal information is deleted, applying the change to the average salary without attracting the attention of third parties prevents possible attacks. The point is to create a setting in which both security and privacy are preserved.
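
A minimal sketch of that idea, assuming an invented table of team members: only the recomputed aggregates are ever published, so the release after a deletion does not single out who left.

```python
# Sketch of publishing only aggregates after a deletion request.
# The team data is invented for illustration.
import numpy as np

ages = np.array([34] * 50 + [20] * 25 + [48] * 25)  # 100 people, mean 34.0

def publish_aggregates(ages: np.ndarray) -> dict:
    """Release only the head count and the average, never the rows."""
    return {"count": len(ages), "average_age": round(ages.mean(), 1)}

print(publish_aggregates(ages))       # {'count': 100, 'average_age': 34.0}

# Five members exercise their right to be forgotten.
remaining = np.delete(ages, [50, 51, 52, 53, 54])
print(publish_aggregates(remaining))  # updated count and average only
```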


On the more technical side of the process, the discussion is about how to obtain the result of a deletion without retraining on the training data set; it can be said that the debate revolves around solving exactly this (Gupta et al., 2021). In this context, it would not be right to move on without mentioning differential privacy, a method that finds its place in the articles and, naturally, in the data deletion processes themselves (Brophy & Lowd, 2021). Let us unpack this technical point a little with the differential privacy method. We will continue with the example of the "Game of Thrones" data set, which users of R Studio may also be familiar with. As in the examples above, there is no real personal data here, only information that would disclose the status of individuals or, more specifically, their personal information. Let us treat the Game of Thrones records as if we were examining real persons' information and try to explain the process a little further.
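
As a rough illustration of what differential privacy adds, here is a minimal sketch of the Laplace mechanism applied to a published count. The scores, the threshold, and the privacy budget epsilon are invented for illustration.

```python
# Minimal sketch of the Laplace mechanism, a basic building block of
# differential privacy: noise calibrated to the query's sensitivity is
# added, so one person's presence or absence barely shifts the output.
import numpy as np

rng = np.random.default_rng(0)

def dp_count(values: np.ndarray, epsilon: float) -> float:
    """Differentially private count: the sensitivity of a count is 1."""
    sensitivity = 1.0
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return len(values) + noise

# Hypothetical "likelihood of death" scores for 100 characters.
scores = rng.uniform(size=100)
high_risk = scores[scores > 0.8]

print(dp_count(high_risk, epsilon=1.0))  # noisy count, differs per release
```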


[Image: Game of Thrones Dataset - Before Removing Identifiers (Differential Privacy)]

As can be seen in the picture above, the individuals' information is fixed: the statistics listed for the characters follow the "Probability Likelihood of Dead" ratio in this example. In other words, even if the persons' names are expressed as numbers and the data is shared periodically, when one of the persons asks for their data to be deleted, it will be clear from the data set above who made the request.
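
A minimal sketch of that leak, with invented pseudonyms and scores: comparing two periodic releases immediately reveals which pseudonym asked to be deleted.

```python
# Sketch of why pseudonymization alone fails under periodic publishing:
# diffing two releases exposes exactly whose row disappeared.
# Pseudonyms and scores are invented for illustration.
release_1 = {101: 0.91, 102: 0.35, 103: 0.77, 104: 0.12}
release_2 = {101: 0.91, 102: 0.35, 104: 0.12}  # after one deletion request

removed = set(release_1) - set(release_2)
print(f"Pseudonym(s) that requested deletion: {removed}")  # {103}
```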

[Image: Game of Thrones Dataset - After Removing Identifiers (Differential Privacy)]

What the pictures tried to exemplify is the differential privacy dimension of the work. On the technical side, the aim is to show how the "Probability Likelihood of Dead" ratio from the Game of Thrones universe affects the predictions made on ever-changing data sets. Let us continue the example by showing different outputs of another data set. In the first image it is possible to observe the variables of the data set ordered by their importance; it may be difficult to foresee the cost of correcting the different result that appears in the second image. Below, before the images, is a sketch of how such an ordering can shift once records are deleted.
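
This is a minimal sketch of that shift, assuming a randomly generated, basketball-style data set and using scikit-learn's RandomForestRegressor as a stand-in for a bagged ensemble of trees.

```python
# Sketch: feature importances of a bagged tree ensemble before and after
# a deletion request, to show how the published ordering can shift.
# The basketball-style data is randomly generated for illustration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 3))  # e.g. minutes, shots, assists
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)
features = ["minutes", "shots", "assists"]

def importance_report(X, y):
    """Fit the ensemble and list features by descending importance."""
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    order = np.argsort(model.feature_importances_)[::-1]
    return [(features[i], round(model.feature_importances_[i], 3)) for i in order]

print("before deletion:", importance_report(X, y))

# 50 players exercise their right to be forgotten; retrain on the rest.
keep = np.ones(n, dtype=bool)
keep[:50] = False
print("after deletion: ", importance_report(X[keep], y[keep]))
```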


[Image: Bagging Train Result 1 (Bootstrap Aggregating) - Basketball Players Stats per Season, 49 Leagues]

This data set, produced from basketball statistics, will not make sense at first glance. But once the published result changes, as below, it becomes difficult to speak of a functioning machine learning system. In the final analysis, this shows that the right to be forgotten and machine learning methods have both, one way or another, entered our lives. No personal data was intentionally included in the images used, so that any data leakage is prevented. The point of this article is that, at a moment when many focus on machine learning itself, both the legal and the technical sides of the work can be overlooked. We also open up discussion on different points because we think it can make the debates enjoyable. We end the article here; in future versions we may apply machine unlearning methods in practice, still avoiding any detail about personal data.

[Image: Bagging Train Result 2 (Bootstrap Aggregating) - Basketball Players Stats per Season, 49 Leagues]

Bibliography

  • Brophy, J., & Lowd, D. (2021, July). Machine unlearning for random forests. In International Conference on Machine Learning (pp. 1092-1104). PMLR.

  • Cao, Y., & Yang, J. (2015, May). Towards making systems forget with machine unlearning. In 2015 IEEE Symposium on Security and Privacy (pp. 463-480). IEEE.

  • Gupta, V., Jung, C., Neel, S., Roth, A., Sharifi-Malvajerdi, S., & Waites, C. (2021). Adaptive machine unlearning. Advances in Neural Information Processing Systems, 34.

  • Kumar, R. S. S., Nyström, M., Lambert, J., Marshall, A., Goertzel, M., Comissoneru, A., ... & Xia, S. (2020, May). Adversarial machine learning-industry perspectives. In 2020 IEEE Security and Privacy Workshops (SPW) (pp. 69-75). IEEE.



Internet Resources

Datasets

Game of Thrones Dataset : http://awoiaf.westeros.org



