AI video training data: finding common ground between AI companies and creators

Controversies over data, intellectual property, and licensing go hand in hand with generative AI (GenAI). The machine learning algorithms behind GenAI models require data to identify the patterns and interdependencies that enable them to generate suitable responses to prompts. Data volume and quality are therefore fundamental to the effectiveness of AI models.

AI developers – including Anthropic, OpenAI, Stability AI, and Midjourney – have trained their models using a combination of data sources, including data scraped from websites. Publishers and creators have accused these companies of failing to obtain explicit permission to use their content for training. As a result, a number of lawsuits are ongoing between well-known publishers, including The New York Times, Dow Jones & Company, the New York Post, and Getty Images, as plaintiffs and various AI developers as defendants.

In these cases, the courts are evaluating whether AI developers infringed the copyright of these publishers and creators by training their models on this data. The developers claim that the training of their models falls under the "fair use" doctrine in the US or Articles 3 and 4 of the Directive on Copyright in the Digital Single Market in the EU. Both frameworks permit the unlicensed use of copyrighted works in specific circumstances, such as text and data mining for scientific research. It remains unclear exactly how these provisions apply to the training of AI models, and the outcome of these lawsuits will be critical in setting the future direction of GenAI development.

Proceeding with video generation models in an unclear legal context

With these critical questions still unresolved, the race to develop ever more advanced AI models continues. Video generation is currently a key area of competition, with OpenAI's Sora, Runway, Invideo AI, Vyond, and many others all vying for market share. This puts companies working on AI video models in a difficult position. If they proceed slowly while awaiting the outcomes of the ongoing lawsuits, they risk falling behind the competition. If they power ahead, using every available data source, they might find themselves on the wrong side of the law.

There are three ways developers can walk this tightrope: licensing agreements, waiting to see how regulation develops, and platforms that connect creators directly with developers. These strategies are not mutually exclusive, and AI companies will most likely combine them.

Option 1: licensing agreements with publishers and creators for video data

Striking individual deals with publishers and content creators has been the go-to method for AI developers. OpenAI currently has licensing agreements with prominent media outlets including The Associated Press, The Financial Times, The Atlantic, The Guardian, and Le Monde. Google, Microsoft, and many others have similar deals in place.

These agreements sidestep copyright-related controversies and have proven an effective means of acquiring high-quality video data for training AI models. Some developers even claim to have trained newly released video generation models on licensed video data alone. For a lesser-known company, this can be a way to stand out in a very competitive AI market with a distinctive value proposition.

The complexities of licensing agreements

While licensing agreements seem a viable solution for AI developers, this approach is far from straightforward. For smaller, less well-known creators, joining a publisher or negotiating deals with developers directly is not a realistic option. These creators risk being marginalised in AI-based web ecosystems, as their work will have little to no influence on AI outputs.

For developers, licensing agreements also have downsides. Individual deals take significant time and effort to negotiate and finalise. Even with substantial resources, developers racing to build the next generation of AI models cannot secure all the data they need in time through one-off deals. And since AI development, and the everyday use of these tools, is an ongoing process, dealmaking is never truly finished: developers would have to strike new agreements continually while being outpaced by the demand for training data.

Option 2: wait and see

A second option for AI developers is to continue with their existing practices while waiting to see how regulation develops. Recent legal and policy developments suggest a shift towards easier data access for AI developers. A regional court in Hamburg, Germany, recently ruled that website disclaimers forbidding data mining do not prevent mining carried out for scientific research whose results are made freely and publicly available. Meanwhile, the UK Government is considering adding an exception to its copyright legislation to allow data mining for commercial use, which would bring it closer to EU practice.

However, this strategy carries considerable uncertainty, which makes it an imperfect standalone choice. The law might not always side with developers, especially where commercial use is involved. Operating in a grey area also strains the relationship between creators and AI developers and risks prolonged conflict, which serves no one. So while regulation catches up, this ecosystem of digital value creation needs another way to move forward and build positive relationships within it.

Option 3: connecting creators directly with AI developers

Some have already started looking for alternatives. Their aim is to directly connect AI developers with creators who are willing to have their content used for model training.

Platforms that allow creators to license their content for AI training are therefore emerging. They simplify the licensing process, connecting AI developers with willing creators and giving both sides the opportunity for mutually beneficial cooperation.

While platforms that offer remuneration work best for well-known creators, more ambitious models aim to let any willing creator have their content used for AI training. For example, Adobe uses its Adobe Stock platform to source video footage for training its Firefly AI video model, and the company sees this approach to data acquisition as a key differentiator that should make the model more attractive to creators.

However, YouTube leads the way in this area. Since December 2024, it has allowed users to opt into having their videos used for AI training. Creators are presented with a list of YouTube's AI partners, including Adobe, Amazon, Anthropic, Apple, Meta, Nvidia, and Microsoft, and can select which of these partners may train AI on their data. This enables the creation of huge video datasets that developers can use without infringing on the wishes of content creators, while creators themselves decide how they want to be part of the AI revolution.
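In data-pipeline terms, an opt-in scheme like this amounts to per-partner consent flags attached to each video, which a developer must respect before ingesting anything into a training corpus. The sketch below is purely illustrative, not based on any real YouTube or partner API; the record structure and field names (`video_id`, `allowed_partners`) are assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical record of a video and the AI partners its creator opted into.
# Field names are illustrative only and do not reflect any real platform API.
@dataclass
class VideoRecord:
    video_id: str
    creator: str
    allowed_partners: set[str] = field(default_factory=set)

def build_training_corpus(videos: list[VideoRecord], partner: str) -> list[str]:
    """Return only the video IDs whose creators opted in for this partner."""
    return [v.video_id for v in videos if partner in v.allowed_partners]

# Usage example with made-up data.
catalogue = [
    VideoRecord("vid-001", "creator_a", {"Anthropic", "Adobe"}),
    VideoRecord("vid-002", "creator_b", set()),            # opted out entirely
    VideoRecord("vid-003", "creator_c", {"Anthropic"}),
]

print(build_training_corpus(catalogue, "Anthropic"))  # ['vid-001', 'vid-003']
print(build_training_corpus(catalogue, "Adobe"))      # ['vid-001']
```

The point of the sketch is simply that consent becomes a filter applied at ingestion time, so each partner ends up with a different corpus reflecting creators' choices.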

Searching for a way forward

The use of data to train GenAI video models will remain the subject of complex, ongoing negotiation. The emerging debate over whether AI-generated or AI-assisted material can itself be copyrighted only adds further uncertainty to an already tricky question.

While we wait for legal decisions, AI companies and creators will need to use their judgment on how to proceed. Yet the ecosystem has already shown it can find its own solutions. Through innovation and cooperation, the market may be able to thread its way forward without heavy-handed external restrictions, which has repeatedly proven the most productive path.
