European data protection laws pose a significant threat to OpenAI's ChatGPT. The service is already banned in Italy, and OpenAI has a mere week to comply with the regulations, but compliance may prove extremely challenging because of the way its training data was collected.
ChatGPT has become one of the most controversial technologies of the 21st century, and opinions on its potential impact diverge sharply. Some believe it will render human workers obsolete, while others fear it will upend our way of life entirely. The truth likely lies somewhere in between.
For years, the prevailing belief in the field of AI has been that more training data leads to better models. OpenAI's earlier models, GPT-2 and GPT-3, were trained on vast amounts of text, totaling 40 and 570 gigabytes, respectively. OpenAI has not disclosed the size of the dataset used for GPT-4.
The fixation on collecting ever more training data has raised concerns about its costs and environmental impact, particularly in terms of water and energy usage. This may have been a factor behind recent statements by Sam Altman that further progress will not come from simply making models bigger, and that there will be no GPT-5.
The issue of training data has also come under scrutiny from European authorities investigating OpenAI. There are allegations that OpenAI scraped personal data, such as names and email addresses, without obtaining proper consent from the individuals concerned. To comply with European data protection laws, OpenAI would need to either obtain consent or demonstrate a legitimate interest in collecting the data. It would also need to explain how ChatGPT uses that data, offer correction mechanisms, and allow people to request deletion or object to the use of their data.
OpenAI is unlikely to argue that it obtained consent when scraping data, leaving the company with the challenge of proving legitimate interest. That would mean convincing regulators that ChatGPT is essential enough to justify collecting personal data without consent.
One of the major concerns raised by experts is how ChatGPT uses the data people provide during chats. Users often share private and sensitive information about their mental health, physical well-being, and personal opinions. Under European law, users must have the option to delete their chat logs to protect their privacy. However, experts believe that OpenAI may struggle to identify and remove individuals' data from its models, because the company lacks robust data record-keeping practices.
As a former Google employee put it, tech companies often lack documentation and knowledge about how their AI training data was collected and annotated. OpenAI could have avoided this predicament by implementing robust data record-keeping practices from the outset, but it did not. The reality is that tech companies rarely document or fully understand how data is collected and used in their AI models.