The global AI explosion has sharpened the need for a common-sense, human-centered methodology for dealing with data privacy and ownership. Leading the way is Europe’s General Data Protection Regulation (GDPR), but there is more than just personally identifiable information (PII) at stake in the modern marketplace.
What about the data we produce as content and art? It is certainly not legal to copy someone else’s work and then present it as your own. But there are artificial intelligence systems out there that scrape as much human-generated content from the web as possible in order to generate similar content.
Could the GDPR or other EU-focused policies protect this type of content? As it turns out, like most things in the machine learning world, it depends on the data.
Privacy vs. Ownership
The primary purpose of the GDPR is to protect European citizens from harmful actions and consequences related to the misuse, abuse, or exploitation of their private information. It is of little use to citizens (or organizations) when it comes to intellectual property (IP) protection.
Unfortunately, the policies and regulations put in place to protect intellectual property, as far as we know, are not equipped to cover data scraping and anonymization. This makes it difficult to understand exactly where the regulations apply when it comes to scraping the web for content.
These techniques and the data they obtain are used to create huge databases for use in training large AI models such as OpenAI’s GPT-3 and DALL-E 2 systems.
The only way to teach AI to imitate humans is to expose it to human-generated data. The more data you feed into an AI system, the more capable its output tends to be.
It works like this: Imagine that you draw a picture of a flower and post it on an online forum for artists. Using scraping techniques, a tech outfit absorbs your image along with billions of others so it can create a huge dataset of artwork. The next time someone asks an AI to create an image of a “flower”, there’s a greater than zero chance that your work will show up in the AI’s interpretation of the prompt.
It remains an open question as to whether this use is ethical.
Public Data vs. Personally Identifiable Information
While the regulatory oversight of the General Data Protection Regulation (GDPR) can be described as far-reaching when it comes to protecting private information and granting Europeans the right to erasure, it doesn’t seem to do much to protect content from being scraped. However, this does not mean that the General Data Protection Regulation (GDPR) and other EU regulations are completely ineffective in this regard.
Individuals and organizations have to follow very specific rules when scraping personally identifiable information so as not to run afoul of the law, a mistake that can become very costly.
For example, it has become almost impossible for Clearview AI, a company that builds facial recognition databases for government use by scraping social media data, to do business in Europe. European Union regulators from at least seven countries have already issued or recommended hefty fines for the company’s refusal to comply with the General Data Protection Regulation (GDPR) and similar regulations.
On the other side of the spectrum, companies like Google, OpenAI, and Meta employ similar data-scraping practices, either directly or by purchasing or using scraped datasets, for many of their AI models without any repercussions. And while big tech companies have faced their fair share of fines in Europe, very few offenses have involved data scraping.
Why not ban scraping?
Scraping may appear, on the surface, to be a practice with too much potential for abuse not to ban outright. However, for many organizations that rely on scraping, the data collected is not necessarily “content” or “personally identifiable information”, but rather information that can serve the public.
We reached out to the UK’s data privacy agency, the Information Commissioner’s Office (ICO), to learn how scraping techniques and internet-scale datasets are regulated and to understand why it is important not to over-regulate.
An ICO spokesperson told TNW:
Using publicly available information can bring many benefits, from research to the development of new products, services and innovations – including in the field of artificial intelligence. However, when this information is personal data, it is important to understand that data protection law applies. This is the case whether the techniques used to collect the data include scraping or something else.
In other words, it is more about the type of data used than how it is collected.
Whether you copy images from Facebook profiles or use machine learning to scrape the web for labeled images, you will likely be in conflict with the General Data Protection Regulation (GDPR) and other European privacy regulations if you build a facial recognition engine without the consent of the people whose faces are in its database.
But it’s generally okay to scrape the internet for massive amounts of data as long as you anonymize it or make sure that there is no personally identifiable information in the dataset.
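As a rough illustration of the second condition, making sure scraped text carries no PII, here is a minimal Python sketch. It is not a compliance tool: the function names (`redact_pii`, `contains_pii`) and both patterns are hypothetical, chosen only for this example, and they catch nothing beyond simple email addresses and phone numbers. Names, street addresses, and faces would pass straight through.

```python
import re

# Illustrative patterns only (assumptions for this sketch); real PII
# detection needs far broader coverage than emails and phone numbers.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def contains_pii(text: str) -> bool:
    """Return True if the text matches either illustrative pattern."""
    return bool(EMAIL_RE.search(text) or PHONE_RE.search(text))


def redact_pii(text: str) -> str:
    """Replace matched emails and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    scraped = [
        "A drawing of a flower",
        "Contact jane@example.com or +44 20 7946 0958 today",
    ]
    for record in scraped:
        print(redact_pii(record))
```

A pipeline could either drop records where `contains_pii` returns True or keep them in redacted form; which is appropriate depends on the data and on legal advice, not on the code.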
More gray areas
However, even within the permitted use cases, there are still some gray areas regarding private information.
GPT-2 and GPT-3, for example, are known to occasionally output personally identifiable information in the form of addresses, phone numbers, and other details that were apparently absorbed into their corpora from large-scale training datasets.
Here, as the company behind GPT-2 and GPT-3 is clearly taking steps to mitigate this, the General Data Protection Regulation (GDPR) and similar regulations are doing their job.
Simply put, we can either choose not to train large AI models or give the companies training them the opportunity to explore edge cases and try to mitigate concerns.
What may be required is a GDUR, a general data use regulation, something that can provide clear guidance on how human-generated content can be used legally in large datasets.
At the very least, it seems worth having a conversation about whether European citizens should have the same right to have the content they create removed from datasets as they do their profile pictures.
Currently, in the UK and across Europe, the right to erasure appears to extend only to our personally identifiable information. Anything else we put online is likely to end up in some AI training dataset.