
Eternal Sunshine of the Mechanical Mind: Machine Learning vs the Right to be Forgotten

by Meem Arafat Manab
Edited by Joseph Boyer & Montserrat Guzmán

(This is an abridged version of a longer article that can be read on arxiv.org. Please refer to that version for all academic references.)

The Legal Backdrop

The Right to be Forgotten, in the case of the European Union, is of rather limited scope. Often referred to as the Right to Erasure (of data), and enshrined in GDPR’s Articles 17 and 19 (and Recital 65), it grants EU citizens the right to request the deletion of their personal data without undue delay. This can lead to the removal of their records from, for example, search engine results, as evidenced in the Google Spain case, whose ruling established a right to delisting: the person’s associated data ceases to appear in search results. There has been discussion of how this deletion may not be as instantaneous as we would like it to be, because search engines, for example, keep large amounts of data on multiple servers and in caches (pronounced like ‘cash’) on individual computers to make the search process as fast and smooth as possible. But ultimately, this line of reasoning treats data as information kept in some table or some database, and with machine learning, that might not entirely be the case.

Welcome to the Machine

The last three decades have seen an unprecedented boom in artificial intelligence. Much of it owes its existence to machine learning, and more specifically to its sub-branch deep learning, in which large amounts of data are fed into an algorithm built from layers of neural networks that identifies patterns in the data and then uses those patterns to either categorize existing data or generate new data. Facial recognition, image-generating AI like DALL-E and Midjourney, large language models such as GPT-4 and their user interfaces like ChatGPT and Bard, and deepfakes are all examples of this progress in deep learning.

Deep learning models employ parameters, known as weights and biases, for their pattern recognition and data generation. The parameter values are derived from ‘training’ data provided to the algorithm and are then used either for classification or for the creation of new data. These models are statistical in nature, and their size, i.e. the number of their parameters, has generally been smaller than the data they are trained on. Take GPT-4 as a reference. According to some sources, it has 1.7 trillion parameters and was trained on 13 trillion tokens (small chunks of words and sentences). The information in these tokens is then remembered through the parameter values, so each token might be remembered using a handful of parameters, and each parameter might in turn be responsible for remembering several different tokens. Parameters, in this case, are doing what neurons do in our brains. If we want GPT-4 to “forget” the sentence “EMILDAI is the best program to study data protection and privacy”, assuming each word in this sentence is a token, deleting a few parameter values from among the 1.7 trillion should be enough.
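
A quick back-of-envelope calculation, using the unconfirmed figures above, shows why this sharing of memory between parameters is unavoidable. A minimal Python sketch (the numbers are the reported estimates, not official ones):

```python
# Back-of-envelope arithmetic with the reported (unconfirmed) GPT-4 figures.
parameters = 1.7e12  # reported parameter count
tokens = 13e12       # reported number of training tokens

print(f"{parameters / tokens:.2f} parameters per token")  # prints: 0.13 parameters per token
# At roughly 0.13 parameters per training token on average, no parameter can
# be dedicated to one token alone: any parameter that helps remember one
# token must also help remember many others.
```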

The problem is that we do not know which parameter, in reality, captures which information. It is as yet impossible to completely reverse-engineer a model and say with certainty that these “neurons”, these parameters, are responsible for remembering this specific piece of data and this piece only. Just as with an animal brain, trying to delete one piece of data might even lead to a loss of information related to something else. We cannot say with absolute certainty which records would be affected, since we do not know which pieces of information are correlated in the model’s internal representation. There have been recorded instances where deleting the information of one person resulted in the deletion of data about other people with the same or similar names. Researchers have been comparing neural networks to a hypothetical black box since their early days, and if these networks are not simply a collection of data but rather a black box that consumes data and performs intelligent tasks based on it, deleting the data is bound to be challenging.
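
To make this entanglement concrete, here is a deliberately tiny and entirely hypothetical sketch: a toy network memorizes two records that share features (think of two people with overlapping names), and zeroing the weights that matter most for one record corrupts the other as well. The network, the records, and the crude “salience” heuristic are all invented for illustration, not taken from any real unlearning method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy records that share a feature, standing in for two people
# with overlapping names. Purely illustrative.
X = np.array([[1.0, 0.0, 1.0],    # record A ("Clementine K.")
              [1.0, 1.0, 0.0]])   # record B (a near-namesake)
Y = np.array([[1.0, 0.0],         # label for A
              [0.0, 1.0]])        # label for B

W1 = rng.normal(0.0, 0.5, (3, 6))  # first-layer weights
W2 = rng.normal(0.0, 0.5, (6, 2))  # second-layer weights

def forward(X, W1, W2):
    H = np.tanh(X @ W1)            # hidden activations
    return H, H @ W2               # hidden layer and output

# Train with plain gradient descent until both records are memorized.
for _ in range(2000):
    H, out = forward(X, W1, W2)
    err = out - Y
    gW2 = H.T @ err
    gH = (err @ W2.T) * (1.0 - H**2)   # backprop through tanh
    gW1 = X.T @ gH
    W1 -= 0.05 * gW1
    W2 -= 0.05 * gW2

# "Erase" record A by zeroing the hidden units that matter most for it,
# a crude stand-in for parameter-level deletion.
salience = np.abs(W1 * X[0][:, None]).sum(axis=0)  # rough per-unit importance
W1[:, np.argsort(salience)[-3:]] = 0.0

_, out = forward(X, W1, W2)
print("record A output after erasure:", out[0].round(2))
print("record B output after erasure:", out[1].round(2))  # B drifts too
```

Because the two records share features, the parameters that store one unavoidably help store the other, which is the same name-collision effect described above.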

Plausible Solutions and the Digital Amnesiac

To make these machine learning models forget some data, we can retrain the model from scratch, a process known in the literature as exact unlearning (sketched in code below). The cost here would be wildly disproportionate: some sources estimate that training GPT-4 took 34 days on 1,024 servers (that is, 1,024 high-quality CPUs and GPUs). Now, if we wanted the personal data of the authors of this article, for example, to be deleted from GPT-4, retraining it would, in addition to the time and the servers, require around 50 GWh of energy, enough to power 44 million US households (about a third of the United States, and 24 times the number of households in Ireland) for an hour. The other option, using the currently available machine unlearning algorithms, would mean deleting the data with some uncertainty, which is equivalent to saying that there will always remain a tiny but real chance that our data has not been deleted at all after our request has been processed. There is also a great deal of ongoing research into differential privacy, which essentially means adding noise to all personal data for maximum anonymization; but in that case, the performance of the AI decreases severely as well. As we moved from rule-based to probabilistic artificial intelligence, we also moved from spreadsheets to neural networks, and deletion now means more than selecting cells in a table and deleting the corresponding records. We could also, of course, always have a court rule that no machine learning model should ever be trained on people’s personal data.
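
For contrast, exact unlearning is trivial to state and brutal to pay for. A minimal sketch (the model class, its fit method, and the record-based dataset are hypothetical interfaces invented here, not any real library’s API):

```python
def exact_unlearn(model_class, dataset, record_to_forget, **hyperparams):
    """Guaranteed forgetting by full retraining (hypothetical interfaces).

    Correct by construction: the new model never sees the deleted record.
    The catch is the cost: for a GPT-4-scale model, this single call is the
    weeks-of-compute, gigawatt-hours-of-energy retraining described above.
    """
    retained = [record for record in dataset if record != record_to_forget]
    fresh_model = model_class(**hyperparams)  # re-initialise every parameter
    fresh_model.fit(retained)                 # retrain from scratch
    return fresh_model
```

Approximate machine unlearning algorithms try to avoid that retraining cost by editing the trained parameters directly, and that shortcut is precisely where the residual uncertainty described above comes from.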

None of the versions of the AI Act currently available on the internet addresses the gap between machine learning and the Right to Erasure, although they go to great lengths to explain what machine learning is, and some previous legal literature had sparingly acknowledged the disparity before. Meanwhile, some computational researchers have argued that machine unlearning is achievable only mathematically, not practically. I believe that the more machine learning models come to resemble an anatomical brain, both in performance and in structure, the more irreconcilable they will become with the Right to be Forgotten. Can we ask a brain to forget us, even if we created it ourselves? It would perhaps be wiser never to introduce ourselves to that brain in the first place. Otherwise, in addition to forgetting one person named Clementine K., the mechanical mind could end up forgetting songs and fruits of the same name, leaving a digital amnesiac instead of an AI standing before us.

Meem Arafat Manab is a former lecturer in mathematics and computer science at BRAC University in Dhaka, Bangladesh. Torn between their interests in deep learning and public policy, they are currently preoccupied with the safety, reliability, and legitimacy of artificial intelligence. Their other research interests include critical pedagogy, machine translation, and econometrics. At the moment, they are a full-time master’s student in the EMILDAI program, with a specialization in cybersecurity.