A Modern Marriage: How AI Powered By Blockchain Could Protect IP Rights
“The creative industries have long been raising concerns that their IP is being unfairly used to train AI systems without consent and without compensation. The lack of even a voluntary code will not allay these concerns.”
“The industry is asking for transparency on what models have and haven’t been trained on, and what works are being used. The IPO hasn’t found answers to those questions.”
The quotes, from the Culture, Media and Sport Committee and the Design and Artists Copyright Society respectively, surfaced from the rubble after recent talks between rights holders and the Intellectual Property Office (IPO) to establish a UK voluntary code of practice on copyright and AI collapsed.1
The Culture, Media and Sport Committee, the Design and Artists Copyright Society and senior figures from the UK’s creative sectors were among the participants who were unable to find a solution to this topical issue.
While the proliferation of generative AI into mainstream society via AI models like ChatGPT and Midjourney over the last year has been breathtaking for many, it has more than likely left regulators and legislators breathless as they try to catch up to the current technological age.
Generative AI (GenAI) tools rely heavily on models which are trained on massive data sets, sometimes scraped from the World Wide Web, to generate outputs in response to user prompts. However, these data sets are often extremely large, and it is usually hard for third parties to gain access to them or to truly know what is in them. This in turn makes it difficult for (1) copyright owners to establish that their work has been used as input or training data, (2) copyright owners to seek compensation for the unauthorised use of their copyrighted works and (3) users of AI tools to minimise the risk of inadvertently infringing copyright by generating AI outputs that are similar to copyrighted works. These problems are explored further in my colleagues’ recent insight pieces, “Copyright and AI: Part 1 - Teaching the machine” and “Copyright and AI: Part 2 - Infringement by machine?”.
The Current Landscape
Currently, private entities gather large amounts of data behind closed doors for their own benefit. For example, Facebook constantly gathers information from its users, which is then traded and used for advertising. Some data is gathered from sources that sit behind restrictions such as paywalls. However, most data scraping draws on online sources that are accessible to the public on the World Wide Web. For example, the WebVid dataset contains 10m video preview clips solely from Shutterstock2. This dataset was used, with Shutterstock’s permission, to train Meta AI’s Make-A-Video AI system. A problem arises when copyright works are used without the right owner’s permission. OpenAI’s Generative Pre-trained Transformer 3 (GPT-3) large language model (LLM) was trained on 499 billion tokens3 of data from online sources. It is speculated that OpenAI, the developer of GPT, used a mixture of publicly available data and allegedly unlawfully acquired materials, such as the Books2 dataset containing about 294,000 titles from various authors, a few of whom have since commenced proceedings4.
There are questions around whether the profits generated from the use of this body of data are distributed fairly. Although numerous intellectual property cases are ongoing in the UK, the US and elsewhere, the profits earned by generative AI companies are rarely shared with the owners of rights in the materials that make up the training dataset. Let us imagine that AI companies suddenly have a magnanimous change of heart and want to start compensating right owners for their content. They create a database recording how much of each individual contributor’s content is fed in as training input, and they decide to maintain this database and pay out right owners accordingly. The issue here is that right owners would not only need to trust companies like Meta or OpenAI to compensate them fairly for the use of their data, but would also need to trust auditors to confirm that these companies are paying data owners correctly for the use of their content and copyright. One must remember that these firms are non-deterministic hierarchical structures, which ultimately concentrate decision-making power in a board of people, albeit one answerable to owners or stakeholders, so there is an inherent possibility of bias or greed.
The Big Picture
Before breaking down the issues above, let us look at the big picture. In the long run, economic growth is and has been fundamentally powered by technological trajectories of innovation. For instance, the railways (industrial technology) required standardised clocks and timetables (institutional technology) to make them useful at scale.5 The internet (industrial technology) required institutional standards like TCP/IP (Transmission Control Protocol/Internet Protocol) to facilitate interoperability and global use at scale. The principle here is that industrial technologies create economic value at scale only when coupled with institutional technologies. GenAI is a digital technology born of mathematics and computers, but it is an ‘industrial’ technology that, in this author’s view, needs to be institutionally contained with rules to make it economically valuable.
This is where the union of “creative”, “non-deterministic” generative AI and “deterministic”, “reliable”, “transparent” blockchain technology and its smart contracts comes in. The blockchain, as an ‘institutional technology’, can potentially help solve more than just copyright issues for copyright owners and AI tool users, but copyright is the focus of this article. A blockchain-backed governance solution is best suited to new databases of safe or copyright-friendly content that are set up from now on. At this point, you might be perplexed. Surely blockchain technology, widely regarded by the mainstream media simply as a hotbed for cryptocurrency scams, cannot offer a legitimate solution to tackling copyright infringement and securing fair compensation for copyrighted works?
The blockchain is a type of distributed ledger technology that uses cryptography to provide an immutable record of transactions on a decentralised network without any centralised authority. This framework is well suited to hosting records that must not be compromised. Additionally, every transaction on a public blockchain is transparent and available for public viewing. While opacity in data scraping processes can often be a property of generative AI architecture, the inherent transparency and security of blockchain networks can provide on-chain guardrails within which GenAI models can work their magic to the benefit of society.
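To make the idea of an immutable record more concrete, the following is a minimal sketch in Python of the hash-chaining principle that underpins a blockchain. It is an illustration only: it runs in memory on one machine, with no network, consensus mechanism or digital signatures, and the function names (`block_hash`, `append_block`, `is_valid`) and the example payloads are this author's assumptions rather than part of any real blockchain platform.

```python
import hashlib
import json

GENESIS_PREV_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, payload, prev_hash):
    """Hash a block's contents together with the previous block's hash."""
    body = json.dumps(
        {"index": index, "payload": payload, "prev_hash": prev_hash},
        sort_keys=True,
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_block(chain, payload):
    """Append a new record, linking it to the hash of the record before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_PREV_HASH
    index = len(chain)
    chain.append(
        {
            "index": index,
            "payload": payload,
            "prev_hash": prev_hash,
            "hash": block_hash(index, payload, prev_hash),
        }
    )
    return chain


def is_valid(chain):
    """Recompute every hash; any edit to an earlier record breaks the chain."""
    expected_prev = GENESIS_PREV_HASH
    for i, block in enumerate(chain):
        if block["index"] != i or block["prev_hash"] != expected_prev:
            return False
        recomputed = block_hash(block["index"], block["payload"], block["prev_hash"])
        if block["hash"] != recomputed:
            return False
        expected_prev = block["hash"]
    return True


# Hypothetical usage: two licensing records, then an attempted alteration.
ledger = []
append_block(ledger, {"work": "photo-001", "owner": "alice", "licensed": True})
append_block(ledger, {"work": "essay-042", "owner": "bob", "licensed": True})
valid_before = is_valid(ledger)
ledger[0]["payload"]["owner"] = "mallory"  # tamper with an earlier record
valid_after = is_valid(ledger)
```

Because each record's hash depends on the record before it, altering an earlier entry is detectable by anyone who re-runs the check. On a real network, many independent participants hold copies of the ledger and perform that check, which is what removes the need for a single trusted record-keeper.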
Companies training AI models may challenge this notion and argue against using a blockchain-based system that would remove human intervention and shine a glaring light on their opaque AI-training processes. However, given the number of copyright infringement lawsuits already in the public eye, like the ones against OpenAI and Meta, the blockchain could provide the basis of a negotiated solution that enables the transparent remuneration of copyright owners.
Why use the blockchain?
In the digital era, one’s digital identity and ownership of digital assets will become increasingly important as technology develops. As more people transition their businesses and lifestyles to the online world, owners of copyrighted works will realise they are able to verify ownership of their own data via digital assets on the blockchain. One way to tokenise data would be via non-fungible tokens (NFTs), which can be used to verify origins of various forms of media including images, texts, videos and music.6 The question right owners may ask is, why would they go through this process and tokenise their data on the blockchain to voluntarily give GenAI models training material?
Firstly, copyright infringement occurs where someone uses the whole or a substantial part of the right owner’s work without their permission or an applicable defence. If right owners voluntarily provide their copyrighted material for use, this removes the issue of copyright infringement in respect of that material, whether it arises from the input of training material into AI models or from the AI-generated output of these models. This potentially provides an avenue for solving problems (1) and (3) set out above, although of course the right owners must be rewarded for doing so.
Secondly, by tokenising their data and giving permission for GenAI models to use it, right owners benefit by being able to track the use of their data. If the infrastructure around the data troves that GenAI models scrape is put on-chain as a base for AI machine learning, a synergistic relationship between the blockchain and the GenAI model could emerge. The blockchain provides a transparent record of data, giving AI models a clear framework for their operations. The immutability of the blockchain can reveal to copyright owners whether copyrighted data, and whose, is being relied upon as training input to generate AI outputs. Such data would also be protected from leakage and tampering by the block encryption of the learning data7 and the peer-to-peer nature of the technology8 respectively. Being able to track the data input behind AI outputs gives right owners and AI tool users a level of trust and security in what would otherwise be a grey area.
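The tracking described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of how any existing NFT platform works: the `ProvenanceRegistry` class, its method names and the model identifier are hypothetical, the registry lives in memory rather than on-chain, and a work is identified by a SHA-256 fingerprint of its exact bytes, which would not recognise cropped, re-encoded or otherwise modified copies.

```python
import hashlib


def fingerprint(work_bytes):
    """Identify a work by the SHA-256 hash of its exact bytes."""
    return hashlib.sha256(work_bytes).hexdigest()


class ProvenanceRegistry:
    """In-memory stand-in for an on-chain registry of works and their uses."""

    def __init__(self):
        self._owners = {}  # fingerprint -> owner who registered the work
        self._usage = []   # append-only log of (model_id, fingerprint)

    def register(self, owner, work_bytes):
        """Record who owns a work; reject a conflicting later claim."""
        fp = fingerprint(work_bytes)
        existing = self._owners.get(fp)
        if existing is not None and existing != owner:
            raise ValueError("work already registered to another owner")
        self._owners[fp] = owner
        return fp

    def record_training_use(self, model_id, work_bytes):
        """Log that a registered work was used to train a given model."""
        fp = fingerprint(work_bytes)
        if fp not in self._owners:
            raise KeyError("work is not registered, so no permission is on record")
        self._usage.append((model_id, fp))

    def uses_for(self, owner):
        """Let an owner see every logged training use of their works."""
        return [(m, fp) for m, fp in self._usage if self._owners[fp] == owner]


# Hypothetical usage
registry = ProvenanceRegistry()
photo_fp = registry.register("alice", b"raw bytes of a photograph")
registry.register("bob", b"text of an essay")
registry.record_training_use("example-model-v1", b"raw bytes of a photograph")
alice_uses = registry.uses_for("alice")
bob_uses = registry.uses_for("bob")
```

In this sketch a right owner can query the log and see which models have recorded a use of their registered works, while an unregistered work cannot be logged at all. In a blockchain setting the same log would be written to the shared ledger so that neither the AI company nor the right owner could alter it after the fact.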
If those are not good enough reasons for using a blockchain solution to help solve this problem, maybe the third will help.
If copyright owners are incentivised, this can encourage them to provide private data for generative AI companies to use in training their AI models. GenAI models are tools that are being used today to increase productivity and efficiency and, by extension, financial gain. If the wave of legal actions is to stop, these gains need to be shared between the model owners and the right owners.
The immutability of blockchain-enabled smart contracts is a feature that eliminates the need for human trust. Each party can be assured that no one has any more or less authority over a user’s rights or assets than was explicitly agreed in advance. This provides certainty and predictability. When LLM operators earn revenue from models trained on material linked to these NFTs, programmable smart contracts interacting with the blockchain can, upon analysis of audit trails on the decision-making patterns of algorithms9, allocate part of that revenue to the right owner as a royalty payment according to the weight or proportion of the owner’s data that was used in training. This addresses problem (2).
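The allocation logic such a smart contract might apply can be illustrated with a short Python sketch. This is written in Python for readability rather than in a smart contract language, and it rests on several assumptions of this author: the function name `split_royalties` is hypothetical, each owner's weight is a whole-number measure of their contribution (for example a token count), the royalty pool is held in pence so that no fractions of a penny are lost, and any leftover pence go to the owners with the largest remainders.

```python
def split_royalties(pool_pence, weights):
    """Divide a royalty pool pro rata to each owner's contribution weight.

    pool_pence: total royalty pool as a whole number of pence.
    weights: mapping of owner -> non-negative whole-number weight.
    Returns a mapping of owner -> pence, summing exactly to pool_pence.
    """
    if pool_pence < 0:
        raise ValueError("pool must not be negative")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must not be negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one owner must have a positive weight")

    # Whole-pence share for each owner, rounded down.
    shares = {owner: pool_pence * w // total for owner, w in weights.items()}

    # Hand out pence lost to rounding, largest remainder first
    # (ties broken alphabetically so the result is reproducible).
    leftover = pool_pence - sum(shares.values())
    ranked = sorted(weights, key=lambda o: (-(pool_pence * weights[o] % total), o))
    for owner in ranked[:leftover]:
        shares[owner] += 1
    return shares


# Hypothetical usage: a pool of 1,000 pence and three contributors.
payout = split_royalties(1000, {"alice": 600, "bob": 300, "carol": 100})
even_split = split_royalties(100, {"a": 1, "b": 1, "c": 1})
```

The point of placing this rule in a smart contract, rather than in a company's internal database, is that the formula is fixed and visible in advance and is applied automatically, so right owners do not need to rely on the payer or on auditors to apply it faithfully. How contribution weights should be measured in practice is a separate and harder question that this sketch does not attempt to answer.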
As an example of arguably similar solutions already in existence, datalatte10 is a blockchain/AI solution that allows users to engage in conversations with its LLM chatbot to generate and share insightful data. This data is then tokenised into NFTs, and the owner has full authority over who accesses the data and how it is used. If the tokenised data is used in a query, the owner receives a payment for it.
The Open Music Initiative is another example of a collective effort to build an open-source protocol for managing music and copyright data. Their solution provides copyright owners with control, recognition, and compensation for their works.11
Conclusion
With the development of emerging technologies moving at a breakneck pace, I believe the symbiotic relationship between blockchain technology and generative AI will be the most efficient and effective means of solving the copyright infringement issues that have emerged at both the training input and output stages of generative AI models.
Copyright owners in this new technological age may have the opportunity to monetise their content effectively while retaining control over it. At the same time, generative AI owners can have access to legitimate content to train their models. As the saying goes, change is the only constant. It is time for the industry to embrace this modern marriage.
1 S Speight, (2024), UK fails in bid to create AI voluntary code as talks collapse
2 S Willison, (2022), Exploring 10m scraped Shutterstock videos used to train Meta’s Make-A-Video text-to-video model
3 A Guadamuz, (2024), A Scanner Darkly: Copyright Liability and Exceptions in Artificial Intelligence Inputs and Outputs
4 As an update on 12 February 2024: The US District Judge has dismissed 5 of the 6 claims, leaving the claim for direct copyright infringement open, and has ordered the plaintiffs to file an amended complaint.
5 C Berg, S Davidson, J Potts, (2023), Institutions to constrain chaotic robots: why generative AI needs blockchain
6 Chainlink, (2023), Use Cases of AI in Blockchain
7 J Kim, N Park, (2020), Blockchain-Based Data-Preserving AI Learning Environment Model for AI Cybersecurity Systems in IoT Service Environments
8 H Luo, J Luo, A V Vasilakos, (2023), BC4LLM: Trusted Artificial Intelligence When Blockchain Meets Large Language Models
9 Chainlink, (2023), Use Cases of AI in Blockchain
10 www.datalatte.com
11 Our Open Protocols Approach — Open Music Initiative (open-music.org)