AI Training Data Copyright Litigation: Enterprise Risk Management Guide

28 December 2025

When The New York Times sued OpenAI and Microsoft for copyright infringement, alleging their AI models were trained on millions of Times articles without permission, it crystallised a legal risk that enterprises can no longer ignore: using AI systems may expose your organisation to copyright liability for how those systems were trained.

Over 50 active lawsuits target AI companies—OpenAI, Anthropic, Google, Meta, Microsoft, Nvidia, Perplexity, and others—for allegedly using copyrighted books, articles, photographs, source code, and artwork as training data without authorisation or compensation. Plaintiffs include major publishers, bestselling authors, visual artists, photographers, musicians, and software developers.

The legal question at the centre of these cases is deceptively simple: Is training AI models on copyrighted works "fair use," or does it require explicit permission and licensing?

The answer remains unsettled. Three judges have ruled on fair use in AI training cases, reaching different conclusions. The U.S. Copyright Office released guidance stating "some uses of copyrighted works for generative AI training will qualify as fair use, and some will not"—a studied ambiguity that reflects genuine legal uncertainty.

For enterprises deploying or developing AI systems, this uncertainty creates material risk. This article breaks down the litigation landscape, the emerging legal framework, and what enterprises must do to manage copyright exposure.

The Copyright Litigation Landscape

As of February 2026, the copyright-vs-AI litigation tracker includes over 50 active cases. Here are the most significant:

The New York Times v. OpenAI & Microsoft

The Times alleges that OpenAI's GPT models and Microsoft's Copilot were trained on millions of Times articles scraped from the web without permission. The complaint argues this constitutes:

Unauthorised copying – Reproducing Times articles to create training datasets
Derivative work creation – Using those articles to create AI model "weights" that encode Times content
Market substitution – Enabling users to get Times-quality information without subscribing to the Times

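A federal judge denied OpenAI's motion to dismiss the copyright claims, allowing the case to proceed. A denial at this stage means only that the claims are plausibly pleaded, not that fair use has been rejected, but it does show that a blanket fair use defence will not end such cases early.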

Authors Guild & Individual Authors v. OpenAI, Meta, Anthropic

Multiple cases brought by bestselling authors—including John Grisham, George R.R. Martin, Jodi Picoult, Michael Chabon, and others—allege AI companies trained models on pirated ebook libraries without authorisation.

Key allegations:

AI companies knowingly used datasets like "Books3" compiled from piracy sites
AI models can reproduce passages from copyrighted books, demonstrating unlawful copying
AI-generated content competes directly with authors' works in the marketplace

Anthropic reached a $1.5 billion settlement with one author class action, though other cases remain active.

Visual Artists v. Stability AI, Midjourney, DeviantArt

Artists sued image generation platforms for training on billions of copyrighted images scraped from the web, including from portfolios on DeviantArt, ArtStation, and Behance.

The complaint argues AI systems can generate images "in the style of" specific artists, effectively creating derivative works that dilute the artists' markets.

Music Publishers v. Anthropic

Concord Music Group and other publishers sued Anthropic (maker of Claude) for training on copyrighted song lyrics. The case shows that copyright risk extends beyond books and articles to other categories of creative work.

Thomson Reuters v. ROSS Intelligence

In the first major AI copyright ruling, a federal court found that ROSS Intelligence's use of Thomson Reuters' Westlaw legal headnotes as training data was not fair use.

Key findings:

ROSS's use was commercial and non-transformative
ROSS was refused a Westlaw licence and then trained on "Bulk Memos" that a third party compiled from Westlaw headnotes
ROSS's AI tool competed directly with Thomson Reuters' Westlaw product
The use affected Thomson Reuters' potential market for AI training data

Whilst ROSS Intelligence involves non-generative AI, the decision's reasoning may influence how courts analyse generative AI cases.

The Fair Use Framework: Four Factors Under Scrutiny

U.S. copyright law's fair use doctrine (17 U.S.C. § 107) allows limited use of copyrighted works without permission under certain circumstances. Courts evaluate four statutory factors:

Factor 1: Purpose and Character of the Use

Transformativeness is the central inquiry: Does the AI training use the copyrighted work for a fundamentally different purpose than the original?

Arguments in favour of AI companies:

Training isn't "reading" books—it's extracting statistical patterns
Models don't store copies of works; they create mathematical representations
AI-generated content serves different purposes than training materials

Arguments against AI companies:

Copying entire works for commercial gain isn't transformative just because the copying method is sophisticated
If AI systems can reproduce or closely paraphrase training data, the use substitutes for the original
Market substitution defeats transformativeness

Copyright Office position: Training on "large, diverse datasets" will often be transformative, but transformativeness alone doesn't establish fair use. Commerciality weighs against fair use.

Factor 2: Nature of the Copyrighted Work

This factor considers whether the work is factual (favours fair use) or creative (disfavours fair use).

Most AI training datasets include highly creative works: fiction, poetry, visual art, music, photography. This factor typically weighs against AI companies in creative content cases.

Factor 3: Amount and Substantiality Used

AI training typically involves copying entire works—complete novels, full articles, entire image files—rather than excerpts.

AI companies argue: Training requires access to complete works to learn language patterns, but the works aren't reproduced in outputs.

Copyright holders counter: Wholesale copying is presumptively unfair. If AI systems can regenerate substantial portions of training data, the copying is both quantitatively and qualitatively excessive.

Emerging view: Courts may permit copying entire works where justified by technical necessity, provided the use is transformative. How the data was obtained also matters: sourcing from pirated datasets seriously undermines a fair use defence, although the rulings so far differ on how much weight piracy carries.

Factor 4: Effect on the Potential Market

This factor examines both direct market harm (lost sales) and derivative market harm (impact on licensing opportunities).

Market substitution risk:

If users get AI-generated content instead of buying/subscribing to original works, that's market harm
If AI systems can produce content "in the style of" a specific creator, that dilutes the creator's market
Even if individual AI outputs don't infringe, the aggregate effect of AI-generated competition can harm markets

The training data licensing market:

Copyright holders argue that AI companies' unlicensed use destroys the emerging market for licensed training data. Companies like Adobe, Shutterstock, and Getty Images now offer AI training licences—proving a market exists.

The Copyright Office noted: "The speed and scale at which AI systems generate content pose a serious risk of diluting markets for works of the same kind as in their training data."

Thomson Reuters explicitly found that the existence of a potential AI training data market weighed against fair use, even though Thomson Reuters itself wasn't yet licensing data for AI when ROSS copied it.

The Critical Distinction: Lawful vs. Pirated Data Sources

One of the clearest principles emerging from litigation is that how training data was acquired matters: lawful acquisition strengthens a fair use defence, and piracy weakens it.

In Bartz v. Anthropic, the court emphasised: "If there is a legal way to acquire copyrighted works, then the developer must pursue that route instead of sourcing from pirated libraries."

This creates a two-tier analysis:

Data obtained lawfully (via licensing, purchased books, publicly accessible websites with no robots.txt restrictions):
→ Fair use remains possible if the use is transformative and doesn't harm markets

Data obtained from pirated sources (ebook piracy sites, illegally shared datasets, circumventing paywalls):
→ Fair use defence severely weakened or eliminated

Multiple lawsuits allege AI companies knowingly used pirated datasets:

Books3 – A dataset of 196,000 pirated ebooks widely used to train language models
LibGen and Z-Library – Piracy sites from which training corpora were allegedly compiled
LAION-5B – A dataset of 5.85 billion image-text pairs that allegedly includes copyrighted photographs scraped without authorisation

If courts find AI companies knowingly trained on pirated data, fair use defences become far harder to sustain. This is among the greatest copyright risks facing AI developers.

Enterprise Liability: When Are AI Users Exposed?

Most enterprises don't train their own foundation models—they use AI systems built by OpenAI, Anthropic, Google, or Microsoft. Does this insulate them from copyright liability?

Not necessarily.

Direct Infringement Risk

If your enterprise uses an AI system that generates output substantially similar to copyrighted works, you could face direct infringement claims for:

Reproduction – If AI outputs copy protected works
Distribution – If you publish AI-generated content that infringes
Derivative works – If AI outputs are unauthorised derivatives of copyrighted works

The fact that an AI system generated the infringing content doesn't shield you from liability. Direct copyright infringement is a strict liability tort: intent and knowledge are not required to establish liability, although they can affect the damages awarded.

Contributory and Vicarious Infringement

Enterprises deploying AI systems could face secondary liability theories:

Contributory infringement – If you knowingly facilitate infringement by providing AI tools to others

Vicarious liability – If you profit from infringement and have the ability to control it

These theories haven't yet been tested extensively in AI contexts, but they represent plausible exposure for platforms and enterprises offering AI services.

Indemnification Gaps

Many AI service providers offer limited indemnification for copyright claims:

OpenAI's Copyright Shield: Covers certain enterprise customers for copyright claims related to ChatGPT outputs, provided customers used OpenAI's content filtering and safety systems.

Microsoft's Copilot Copyright Commitment: Indemnifies enterprise customers for IP claims arising from Copilot use, subject to numerous conditions and exclusions.

Google's Generative AI Indemnification: Covers certain Google Cloud AI products but excludes cases where customers modified models, used third-party data, or violated acceptable use policies.

Critically, these indemnifications typically exclude:

Customers who fine-tuned models on their own copyrighted datasets
Use of AI in ways that violate the provider's terms of service
Claims about how the underlying model was trained (these are generally directed at the provider rather than the customer)

High-Risk Enterprise Scenarios

Scenario 1: AI-Generated Marketing Content

Your marketing team uses an AI image generator to create product photography. The AI outputs closely resemble a famous photographer's copyrighted work.

Risk: The photographer sues your company for copyright infringement. You argue the AI tool generated the image, but courts hold you—the publisher—liable for distributing infringing content.

Mitigation:

Run all AI-generated content through reverse image search to check for similarity to existing works
Avoid prompts like "in the style of [artist name]"
Maintain insurance coverage for IP claims
Document that you used filtering and review processes

Scenario 2: AI-Assisted Code Development

Developers use GitHub Copilot to accelerate coding. Copilot suggests code snippets that reproduce copyrighted code from open-source projects with incompatible licences.

Risk: Your software incorporates copyrighted code without proper licensing. The original developers sue for infringement and demand the software be withdrawn.

Mitigation:

Require developers to review all AI-generated code suggestions
Use code scanning tools to detect copied open-source code
Maintain strict policies on acceptable AI tool usage
Document compliance with open-source licence requirements

Scenario 3: Custom Model Training on Proprietary Data

Your enterprise fine-tunes a language model using industry reports, competitor materials, and third-party research papers to create a specialised business intelligence tool.

Risk: You've created derivative training datasets using copyrighted materials without permission. Rights holders sue for unauthorised copying and distribution.

Mitigation:

Conduct copyright clearance before incorporating third-party materials
Use only licensed or public domain training data
Maintain detailed provenance records for all training data
Implement data governance processes for custom model development
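
The provenance records recommended above can be kept in many forms. The following is a minimal Python sketch, not a prescribed format: the `ProvenanceRecord` structure, its field names, and the `CLEARED_BASES` policy set are all hypothetical and would need to be adapted with legal counsel.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

# Hypothetical policy: acquisition bases this organisation treats as cleared.
CLEARED_BASES = {"licensed", "public_domain", "owned"}


@dataclass(frozen=True)
class ProvenanceRecord:
    """One training document and the basis on which it was acquired."""
    source: str       # where the document came from (URL, vendor, internal system)
    basis: str        # e.g. "licensed", "public_domain", "owned", "unknown"
    licence_ref: str  # contract or licence identifier; empty if none
    sha256: str       # content hash tying the record to the exact file


def make_record(content: bytes, source: str, basis: str,
                licence_ref: str = "") -> ProvenanceRecord:
    """Build a record, hashing the content so it can be verified later."""
    digest = hashlib.sha256(content).hexdigest()
    return ProvenanceRecord(source, basis, licence_ref, digest)


def needs_clearance(record: ProvenanceRecord) -> bool:
    """Flag records whose acquisition basis is not on the cleared list."""
    return record.basis not in CLEARED_BASES


def to_json_line(record: ProvenanceRecord) -> str:
    """Serialise a record as one JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


records = [
    make_record(b"internal sales report", "intranet/reports/q3", "owned"),
    make_record(b"third-party research paper",
                "https://example.com/paper.pdf", "unknown"),
]
flagged = [r.source for r in records if needs_clearance(r)]
print(flagged)
```

Here the third-party paper is flagged because its acquisition basis is unknown, which is the point at which a copyright clearance step would be triggered before the document enters a training set.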

Enterprise Risk Mitigation Strategies

1. AI Vendor Due Diligence

Not all AI providers have equal copyright risk profiles. Evaluate vendors on:

Training data provenance – Does the vendor disclose data sources? Were sources lawfully obtained?
Active litigation – Is the vendor currently defending copyright lawsuits? What are the allegations?
Indemnification terms – What protection does the vendor offer? What are the exclusions?
Output filtering – Does the system include safeguards against reproducing training data?

Recommended approach:

Maintain a vendor risk registry tracking copyright litigation status
Require vendors to warrant that training data was lawfully obtained
Negotiate enhanced indemnification for enterprise deployments
Include right-to-audit clauses for training data practices
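
A vendor risk registry of the kind described above can start as a simple structured record per vendor. The sketch below is illustrative only: the `VendorRisk` class, its fields, and the vendor names are hypothetical, and it lists open issues rather than computing a score, since any weighting would be a policy decision.

```python
from dataclasses import dataclass, field


@dataclass
class VendorRisk:
    """Registry entry for one AI vendor; all field names are illustrative."""
    name: str
    discloses_training_sources: bool
    warrants_lawful_acquisition: bool
    indemnifies_outputs: bool
    filters_outputs: bool
    active_copyright_suits: list[str] = field(default_factory=list)

    def open_issues(self) -> list[str]:
        """Return the due-diligence questions that still lack a good answer."""
        checks = [
            (self.discloses_training_sources, "training data sources not disclosed"),
            (self.warrants_lawful_acquisition, "no warranty of lawful data acquisition"),
            (self.indemnifies_outputs, "no indemnification for output claims"),
            (self.filters_outputs, "no output filtering against reproduction"),
        ]
        issues = [message for passed, message in checks if not passed]
        if self.active_copyright_suits:
            issues.append(f"{len(self.active_copyright_suits)} active copyright suit(s)")
        return issues


# Hypothetical vendors used purely as placeholders.
registry = {
    v.name: v
    for v in [
        VendorRisk("ExampleVendorA", True, True, True, True),
        VendorRisk("ExampleVendorB", False, False, True, True,
                   active_copyright_suits=["Hypothetical v. ExampleVendorB"]),
    ]
}

for name, vendor in registry.items():
    print(name, vendor.open_issues())
```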

2. Output Verification Processes

Implement systematic review of AI-generated content before publication:

For text content:

Run outputs through plagiarism detection tools (Copyscape, Turnitin, Copyleaks)
Search for exact phrase matches to identify potential copying
Require human review of all external-facing AI content
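
The exact-phrase check listed above can be approximated in-house for reference texts you already hold. This is a rough Python sketch using shared word n-grams. The function names and the six-word window are assumptions, and it is no substitute for the commercial tools named above, which search far larger corpora.

```python
import re


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lower-case the text, split it into words, and return all n-word runs."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_phrases(candidate: str, reference: str, n: int = 6) -> list[str]:
    """Return every n-word phrase that appears verbatim in both texts."""
    overlap = word_ngrams(candidate, n) & word_ngrams(reference, n)
    return sorted(" ".join(gram) for gram in overlap)


reference = "The quick brown fox jumps over the lazy dog near the quiet river bank"
ai_output = ("Our mascot, the quick brown fox jumps over the lazy dog, "
             "appears in every advert")

matches = shared_phrases(ai_output, reference)
for phrase in matches:
    print("verbatim overlap:", phrase)
```

Any non-empty result would be a reason to route the output to human review rather than proof of infringement. Short or common phrases will also match, so the window size needs tuning.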

For images:

Use reverse image search (Google Images, TinEye) to check for similar existing works
Avoid generating images of recognisable people, characters, or trademarked objects
Maintain records of generation prompts and parameters

For code:

Deploy code scanning tools that detect copied open-source snippets
Verify licence compatibility for any detected open-source code
Maintain software bill of materials (SBOM) documentation
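
Once a scanner reports matches, the licence compatibility step can be partly automated. The sketch below assumes a hypothetical `Finding` record carrying an SPDX licence identifier and a hypothetical allowlist for a closed-source product. Which licences belong on that list is a legal policy decision, not something the code decides.

```python
from dataclasses import dataclass

# Hypothetical policy: licences pre-approved for this product. Illustrative only.
ALLOWED_LICENCES = {"MIT", "BSD-3-Clause", "Apache-2.0"}


@dataclass(frozen=True)
class Finding:
    """One scanner match between our code and an open-source project."""
    file: str             # path in our repository
    matched_project: str  # project the snippet appears to come from
    spdx_id: str          # SPDX licence identifier, or "" if the scanner found none


def triage(findings: list[Finding]) -> dict[str, list[Finding]]:
    """Split findings into pre-approved licences and those needing legal review."""
    result: dict[str, list[Finding]] = {"allowed": [], "needs_review": []}
    for finding in findings:
        bucket = "allowed" if finding.spdx_id in ALLOWED_LICENCES else "needs_review"
        result[bucket].append(finding)
    return result


findings = [
    Finding("src/util/retry.py", "example-retry-lib", "MIT"),
    Finding("src/core/parser.py", "example-parser", "GPL-3.0-only"),
    Finding("src/core/cache.py", "example-cache", ""),
]

triaged = triage(findings)
for finding in triaged["needs_review"]:
    print("review:", finding.file, finding.spdx_id or "unknown licence")
```

Unknown licences are deliberately sent to review rather than treated as allowed, so a missing identifier never passes silently.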

3. Licensing and Permission Workflows

For custom model development or fine-tuning:

Licence commercial training data – Use datasets from providers like Adobe Stock, Shutterstock AI, or Getty Images that explicitly permit AI training
Obtain copyright clearance – For third-party materials, get written permission for AI training use
Use public domain content – Prioritise training data from works where copyright has expired or was never claimed
Respect robots.txt – Don't scrape websites that explicitly disallow automated access
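
For the robots.txt point above, Python's standard library includes a parser that a collection pipeline can consult before fetching a page. The sketch below parses an in-memory example file; the crawler names and URLs are made up, and a real pipeline would fetch each site's actual robots.txt. Passing this check shows only that automated access was not disallowed, not that the content is licensed for training.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; the bot names here are hypothetical.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The training crawler is blocked from the whole site.
print(parser.can_fetch("ExampleTrainingBot", "https://example.com/articles/1"))  # False
# Other crawlers may fetch public pages...
print(parser.can_fetch("OtherBot", "https://example.com/articles/1"))            # True
# ...but not the disallowed path.
print(parser.can_fetch("OtherBot", "https://example.com/private/report"))        # False
```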

4. Insurance and Legal Protections

IP insurance coverage – Ensure your E&O or IP policy covers AI-related infringement claims
Contractual allocation of risk – When commissioning AI-generated work from agencies or vendors, require indemnification for IP claims
Legal reserves – Budget for potential litigation defence costs

5. Governance and Documentation

Create audit trails demonstrating reasonable precautions:

AI usage policies – Document approved tools, prohibited uses, and review requirements
Training logs – For custom models, maintain records of data sources, licensing agreements, and permissions
Review documentation – Keep records showing that content was reviewed before publication
Incident response plans – Establish procedures for handling copyright claims related to AI outputs

The Evolving Legal Landscape

Judicial Developments to Watch

Over the next 12 to 24 months, multiple cases will proceed to summary judgment and trial:

New York Times v. OpenAI – May establish precedent on whether news article training is fair use
Authors Guild cases – Will determine whether training on pirated ebooks defeats fair use
Visual artists cases – Could clarify whether style mimicry constitutes infringement

Legal observers don't expect definitive Supreme Court guidance for several years. Until then, enterprises must navigate conflicting district court decisions.

Legislative Possibilities

Congress may intervene if courts fail to provide clarity. Potential approaches:

Compulsory licensing regime: AI companies pay into a collective licensing fund; rights holders receive proportional compensation based on usage.

Opt-out registries: Copyright holders can register works as off-limits for AI training; unauthorised use loses fair use protection.

Transparency mandates: AI companies must disclose training data sources; users can verify whether specific works were used.

Private right of action expansion: Proposed legislation would allow individual creators to sue for unauthorised use rather than relying on class actions.

International Divergence

Whilst the U.S. debate centres on fair use, other jurisdictions are adopting different frameworks:

European Union: The AI Act and Copyright Directive create transparency obligations but don't resolve the fundamental training question. EU courts may take a more restrictive view of text and data mining exceptions.

United Kingdom: Proposed a text and data mining exception for commercial AI training, then withdrew it following creator backlash. The issue remains unresolved.

Japan: Takes a permissive approach, generally allowing AI training without requiring permission for non-expressive uses.

Multinational enterprises must comply with the rules of every jurisdiction where they operate, which in practice often means designing policies around the most restrictive regime.

Strategic Recommendations for Enterprises

Short Term (0-6 Months)

Conduct AI copyright risk assessment
- Inventory all AI tools used across the organisation
- Identify which vendors face active copyright litigation
- Map high-risk use cases (public-facing content, code generation, custom training)
Review vendor contracts
- Evaluate indemnification terms and exclusions
- Negotiate enhanced protections for enterprise deployments
- Require vendors to warrant lawful data acquisition
Implement output verification
- Deploy plagiarism detection for text
- Use reverse image search for visual content
- Scan AI-generated code for copied snippets

Medium Term (6-12 Months)

Develop AI governance framework
- Create formal policies on acceptable AI tool usage
- Establish approval workflows for custom model training
- Implement data provenance tracking for training datasets
Train legal and compliance teams
- Educate staff on AI copyright risks
- Establish escalation procedures for potential infringement
- Create incident response playbooks
Secure appropriate insurance
- Verify IP insurance covers AI-related claims
- Consider cyber insurance for data breach implications
- Budget for potential litigation defence costs

Long Term (12+ Months)

Build licensing relationships
- Establish agreements with content providers for training data
- Participate in collective licensing organisations if they emerge
- Contribute to industry standards for responsible AI development
Develop proprietary alternatives
- Create owned datasets from public domain or licensed content
- Train custom models on verified, legally obtained data
- Reduce dependence on third-party AI systems with uncertain provenance
Monitor legal developments
- Track ongoing litigation outcomes
- Adapt policies as judicial precedents emerge
- Engage with industry groups shaping legislation

Conclusion: Managing Uncertainty While Innovation Continues

The copyright-vs-AI legal battle represents one of the most significant intellectual property conflicts of the digital age. The outcome will determine whether AI development continues on its current trajectory or fundamentally restructures around licensed training data.

For enterprises, the uncertainty creates a challenging risk environment:

Fair use doctrine hasn't caught up with AI technology
Courts are reaching inconsistent conclusions
Legislative solutions remain years away
Several AI vendors face litigation that could threaten their business models

Yet enterprises cannot simply wait for legal clarity. AI adoption continues to accelerate, and competitive pressure demands engagement with these technologies.

The path forward requires:

Rigorous vendor diligence: Evaluate AI providers' copyright risk profiles. Prefer vendors using licensed training data or offering robust indemnification.

Systematic output verification: Implement processes to detect copied content before publication. Don't rely solely on AI systems' built-in filters.

Clear governance frameworks: Establish policies defining acceptable AI usage, approval requirements, and review procedures.

Proactive licensing: For custom models, prioritise lawfully obtained training data. Lawful acquisition is one of the clearest factors strengthening a fair use position.

Continuous monitoring: Track litigation outcomes and adjust strategies as precedents emerge.

The enterprises that will successfully navigate this transition are those treating copyright risk as a core component of AI strategy—not an afterthought. In an environment where a single court decision can reshape the entire landscape, preparation and adaptability matter more than premature certainty.

The legal question may remain unsettled, but the business imperative is clear: enterprises must manage AI copyright risk deliberately, or risk catastrophic exposure when clarity finally arrives.