When The New York Times sued OpenAI and Microsoft for copyright infringement, alleging their AI models were trained on millions of Times articles without permission, it crystallised a legal risk that enterprises can no longer ignore: using AI systems may expose your organisation to copyright liability for how those systems were trained.
Over 50 active lawsuits target AI companies—OpenAI, Anthropic, Google, Meta, Microsoft, Nvidia, Perplexity, and others—for allegedly using copyrighted books, articles, photographs, source code, and artwork as training data without authorisation or compensation. Plaintiffs include major publishers, bestselling authors, visual artists, photographers, musicians, and software developers.
The legal question at the centre of these cases is deceptively simple: Is training AI models on copyrighted works "fair use," or does it require explicit permission and licensing?
The answer remains unsettled. Three judges have ruled on fair use in AI training cases, reaching different conclusions. The U.S. Copyright Office released guidance stating "some uses of copyrighted works for generative AI training will qualify as fair use, and some will not"—a studied ambiguity that reflects genuine legal uncertainty.
For enterprises deploying or developing AI systems, this uncertainty creates material risk. This article breaks down the litigation landscape, the emerging legal framework, and what enterprises must do to manage copyright exposure.
The Copyright Litigation Landscape
As of February 2026, the copyright-vs-AI litigation tracker includes over 50 active cases. Here are the most significant:
The New York Times v. OpenAI & Microsoft
The Times alleges that OpenAI's GPT models and Microsoft's Copilot were trained on millions of Times articles scraped from the web without permission. The complaint argues this constitutes:
- Unauthorised copying – Reproducing Times articles to create training datasets
- Derivative work creation – Using those articles to create AI model "weights" that encode Times content
- Market substitution – Enabling users to get Times-quality information without subscribing to the Times
A federal judge denied OpenAI's motion to dismiss the core copyright claims, allowing the case to proceed. A motion to dismiss does not decide fair use, so that question remains open for later stages of the case.
Authors Guild & Individual Authors v. OpenAI, Meta, Anthropic
Multiple cases brought by bestselling authors—including John Grisham, George R.R. Martin, Jodi Picoult, Michael Chabon, and others—allege AI companies trained models on pirated ebook libraries without authorisation.
Key allegations:
- AI companies knowingly used datasets like "Books3" compiled from piracy sites
- AI models can reproduce passages from copyrighted books, demonstrating unlawful copying
- AI-generated content competes directly with authors' works in the marketplace
Anthropic reached a $1.5 billion settlement in one authors' class action, though other cases remain active.
Visual Artists v. Stability AI, Midjourney, DeviantArt
Artists sued image generation platforms for training on billions of copyrighted images scraped from the web, including from portfolios on DeviantArt, ArtStation, and Behance.
The complaint argues AI systems can generate images "in the style of" specific artists, effectively creating derivative works that dilute the artists' markets.
Music Publishers v. Anthropic
Concord Music Group and other publishers sued Anthropic (maker of the Claude models) for training on copyrighted song lyrics. The case shows that copyright risk extends beyond books and articles to other categories of creative work.
Thomson Reuters v. ROSS Intelligence
In the first major AI copyright ruling, a federal court found that ROSS Intelligence's use of Thomson Reuters' Westlaw legal headnotes as training data was not fair use.
Key findings:
- ROSS's use was commercial and non-transformative
- ROSS obtained its training data as "Bulk Memos" compiled from Westlaw headnotes by a third party, after Thomson Reuters declined to license its content
- ROSS's AI tool competed directly with Thomson Reuters' Westlaw product
- The use affected Thomson Reuters' potential market for AI training data
Whilst ROSS Intelligence involves non-generative AI, the decision's reasoning may influence how courts analyse generative AI cases.
The Fair Use Framework: Four Factors Under Scrutiny
U.S. copyright law's fair use doctrine (17 U.S.C. § 107) allows limited use of copyrighted works without permission under certain circumstances. Courts evaluate four statutory factors:
Factor 1: Purpose and Character of the Use
Transformativeness is the central inquiry: Does the AI training use the copyrighted work for a fundamentally different purpose than the original?
Arguments in favour of AI companies:
- Training isn't "reading" books—it's extracting statistical patterns
- Models don't store copies of works; they create mathematical representations
- AI-generated content serves different purposes than training materials
Arguments against AI companies:
- Copying entire works for commercial gain isn't transformative just because the copying method is sophisticated
- If AI systems can reproduce or closely paraphrase training data, the use substitutes for the original
- Market substitution defeats transformativeness
Copyright Office position: Training on "large, diverse datasets" will often be transformative, but transformativeness alone doesn't establish fair use. Commerciality weighs against fair use.
Factor 2: Nature of the Copyrighted Work
This factor considers whether the work is factual (favours fair use) or creative (disfavours fair use).
Most AI training datasets include highly creative works: fiction, poetry, visual art, music, photography. This factor typically weighs against AI companies in creative content cases.
Factor 3: Amount and Substantiality Used
AI training typically involves copying entire works—complete novels, full articles, entire image files—rather than excerpts.
AI companies argue: Training requires access to complete works to learn language patterns, but the works aren't reproduced in outputs.
Copyright holders counter: Wholesale copying is presumptively unfair. If AI systems can regenerate substantial portions of training data, the copying is both quantitatively and qualitatively excessive.
Emerging consensus: Courts may permit copying entire works where justified by technical necessity, provided the data was lawfully obtained and the use is transformative. However, sourcing from pirated datasets weighs heavily against fair use and, on some courts' reasoning, defeats it for those copies.
Factor 4: Effect on the Potential Market
This factor examines both direct market harm (lost sales) and derivative market harm (impact on licensing opportunities).
Market substitution risk:
- If users get AI-generated content instead of buying/subscribing to original works, that's market harm
- If AI systems can produce content "in the style of" a specific creator, that dilutes the creator's market
- Even if individual AI outputs don't infringe, the aggregate effect of AI-generated competition can harm markets
The training data licensing market:
Copyright holders argue that AI companies' unlicensed use destroys the emerging market for licensed training data. Companies like Adobe, Shutterstock, and Getty Images now offer AI training licences—proving a market exists.
The Copyright Office noted: "The speed and scale at which AI systems generate content pose a serious risk of diluting markets for works of the same kind as in their training data."
The Thomson Reuters court explicitly found that the existence of a potential AI training data market weighed against fair use, even though Thomson Reuters itself wasn't yet licensing data for AI when ROSS copied it.
The Critical Distinction: Lawful vs. Pirated Data Sources
One of the clearest principles emerging from litigation is that fair use requires lawful acquisition of training data.
In Bartz v. Anthropic, the court emphasised: "If there is a legal way to acquire copyrighted works, then the developer must pursue that route instead of sourcing from pirated libraries."
This creates a two-tier analysis:
Data obtained lawfully (via licensing, purchased books, publicly accessible websites with no robots.txt restrictions):
→ Fair use remains possible if the use is transformative and doesn't harm markets
Data obtained from pirated sources (ebook piracy sites, illegally shared datasets, circumventing paywalls):
→ Fair use defence severely weakened or eliminated
Multiple lawsuits allege AI companies knowingly used pirated datasets:
- Books3 – A dataset of 196,000 pirated ebooks widely used to train language models
- LibGen and Z-Library – Piracy sites from which training corpora were allegedly compiled
- LAION-5B – A dataset of 5.85 billion image-text pairs that allegedly includes copyrighted photographs scraped without authorisation
If courts find AI companies knowingly trained on pirated data, fair use defences collapse. This is the single greatest copyright risk facing AI developers.
Enterprise Liability: When Are AI Users Exposed?
Most enterprises don't train their own foundation models—they use AI systems built by OpenAI, Anthropic, Google, or Microsoft. Does this insulate them from copyright liability?
Not necessarily.
Direct Infringement Risk
If your enterprise uses an AI system that generates output substantially similar to copyrighted works, you could face direct infringement claims for:
- Reproduction – If AI outputs copy protected works
- Distribution – If you publish AI-generated content that infringes
- Derivative works – If AI outputs are unauthorised derivatives of copyrighted works
The fact that an AI system generated the infringing content doesn't shield you from liability. Copyright infringement is a strict liability tort: intent and knowledge are not required to establish liability, although they can affect the damages awarded.
Contributory and Vicarious Infringement
Enterprises deploying AI systems could face secondary liability theories:
Contributory infringement – If you knowingly facilitate infringement by providing AI tools to others
Vicarious liability – If you profit from infringement and have the ability to control it
These theories haven't yet been tested extensively in AI contexts, but they represent plausible exposure for platforms and enterprises offering AI services.
Indemnification Gaps
Many AI service providers offer limited indemnification for copyright claims:
OpenAI's Copyright Shield: Covers certain enterprise customers for copyright claims related to ChatGPT outputs, provided customers used OpenAI's content filtering and safety systems.
Microsoft's Copilot Copyright Commitment: Indemnifies enterprise customers for IP claims arising from Copilot use, subject to numerous conditions and exclusions.
Google's Generative AI Indemnification: Covers certain Google Cloud AI products but excludes cases where customers modified models, used third-party data, or violated acceptable use policies.
Critically, these indemnifications typically exclude:
- Customers who fine-tuned models on their own copyrighted datasets
- Use of AI in ways that violate the provider's terms of service
- Claims arising from how the underlying model was trained (these remain the provider's liability)
High-Risk Enterprise Scenarios
Scenario 1: AI-Generated Marketing Content
Your marketing team uses an AI image generator to create product photography. The AI outputs closely resemble a famous photographer's copyrighted work.
Risk: The photographer sues your company for copyright infringement. You argue the AI tool generated the image, but a court could still hold you, as the publisher, liable for distributing infringing content.
Mitigation:
- Run all AI-generated content through reverse image search to check for similarity to existing works
- Avoid prompts like "in the style of [artist name]"
- Maintain insurance coverage for IP claims
- Document that you used filtering and review processes
Scenario 2: AI-Assisted Code Development
Developers use GitHub Copilot to accelerate coding. Copilot suggests code snippets that reproduce copyrighted code from open-source projects with incompatible licences.
Risk: Your software incorporates copyrighted code without proper licensing. The original developers sue for infringement and demand the software be withdrawn.
Mitigation:
- Require developers to review all AI-generated code suggestions
- Use code scanning tools to detect copied open-source code
- Maintain strict policies on acceptable AI tool usage
- Document compliance with open-source licence requirements
Scenario 3: Custom Model Training on Proprietary Data
Your enterprise fine-tunes a language model using industry reports, competitor materials, and third-party research papers to create a specialised business intelligence tool.
Risk: You've created derivative training datasets using copyrighted materials without permission. Rights holders sue for unauthorised copying and distribution.
Mitigation:
- Conduct copyright clearance before incorporating third-party materials
- Use only licensed or public domain training data
- Maintain detailed provenance records for all training data
- Implement data governance processes for custom model development
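A provenance record of the kind listed above could be sketched as follows. This is a minimal illustration, not a compliance standard: the field names, the `ALLOWED_BASES` set, and the `contracts/EX-001` reference are all hypothetical, and whether an item is actually cleared is a legal judgement the code cannot make.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import date

# Hypothetical policy: only these acquisition bases may enter a training set.
ALLOWED_BASES = {"licensed", "public_domain"}


@dataclass(frozen=True)
class ProvenanceRecord:
    """One training-data item and the evidence of how it was acquired."""
    source_name: str    # human-readable name of the document or dataset
    acquired_from: str  # vendor, publisher, or archive it came from
    licence_basis: str  # "licensed", "public_domain", or "unverified"
    evidence_ref: str   # pointer to the contract or clearance record
    acquired_on: str    # ISO date of acquisition
    sha256: str         # fingerprint of the exact bytes that were ingested


def make_record(content: bytes, source_name: str, acquired_from: str,
                licence_basis: str, evidence_ref: str,
                acquired_on: date) -> ProvenanceRecord:
    """Hash the ingested bytes so the record is tied to one exact file."""
    digest = hashlib.sha256(content).hexdigest()
    return ProvenanceRecord(source_name, acquired_from, licence_basis,
                            evidence_ref, acquired_on.isoformat(), digest)


def cleared_for_training(record: ProvenanceRecord) -> bool:
    """Admit an item only with a recognised basis and evidence on file."""
    has_evidence = bool(record.evidence_ref.strip())
    return record.licence_basis in ALLOWED_BASES and has_evidence


record = make_record(b"Example report text", "Example industry report",
                     "Example Publisher Ltd", "licensed",
                     "contracts/EX-001", date(2026, 1, 15))
unverified = make_record(b"Scraped text", "Unknown PDF", "web",
                         "unverified", "", date(2026, 1, 15))

print(json.dumps(asdict(record), indent=2))
print(cleared_for_training(record), cleared_for_training(unverified))
```

Hashing the ingested bytes means the record can later be matched to the exact file that entered the training set, which is useful if a rights holder asks whether a specific work was used.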
Enterprise Risk Mitigation Strategies
1. AI Vendor Due Diligence
Not all AI providers have equal copyright risk profiles. Evaluate vendors on:
- Training data provenance – Does the vendor disclose data sources? Were sources lawfully obtained?
- Active litigation – Is the vendor currently defending copyright lawsuits? What are the allegations?
- Indemnification terms – What protection does the vendor offer? What are the exclusions?
- Output filtering – Does the system include safeguards against reproducing training data?
Recommended approach:
- Maintain a vendor risk registry tracking copyright litigation status
- Require vendors to warrant that training data was lawfully obtained
- Negotiate enhanced indemnification for enterprise deployments
- Include right-to-audit clauses for training data practices
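One way to keep such a registry is as structured data that can be queried for unresolved diligence items. The sketch below assumes a simple in-memory list; the `VendorEntry` fields and the fictional vendors are illustrative only, and the checks mirror the evaluation criteria above rather than any established scoring method.

```python
from dataclasses import dataclass


@dataclass
class VendorEntry:
    """One row in a hypothetical vendor risk registry."""
    name: str
    discloses_data_sources: bool       # training data provenance
    warrants_lawful_acquisition: bool  # contractual warranty obtained
    active_copyright_suits: int        # count tracked by the legal team
    indemnifies_outputs: bool          # indemnification terms in place
    filters_outputs: bool              # safeguards against reproducing data
    audit_right: bool                  # right-to-audit clause negotiated


def open_issues(vendor: VendorEntry) -> list[str]:
    """List the diligence items still unresolved for a vendor."""
    issues = []
    if not vendor.discloses_data_sources:
        issues.append("training data sources not disclosed")
    if not vendor.warrants_lawful_acquisition:
        issues.append("no warranty of lawful data acquisition")
    if vendor.active_copyright_suits > 0:
        issues.append(f"{vendor.active_copyright_suits} active copyright suit(s)")
    if not vendor.indemnifies_outputs:
        issues.append("no output indemnification")
    if not vendor.filters_outputs:
        issues.append("no output filtering")
    if not vendor.audit_right:
        issues.append("no right-to-audit clause")
    return issues


registry = [
    VendorEntry("Vendor A (fictional)", True, True, 0, True, True, True),
    VendorEntry("Vendor B (fictional)", False, False, 2, True, False, False),
]

# Review the vendors with the most unresolved items first.
for vendor in sorted(registry, key=lambda v: len(open_issues(v)), reverse=True):
    print(vendor.name, open_issues(vendor))
```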
2. Output Verification Processes
Implement systematic review of AI-generated content before publication:
For text content:
- Run outputs through plagiarism detection tools (Copyscape, Turnitin, Copyleaks)
- Search for exact phrase matches to identify potential copying
- Require human review of all external-facing AI content
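The exact-phrase check can be approximated in-house when a reference text is available. The sketch below compares word 8-grams between an AI draft and a known work; the window size of 8 and the function names are assumptions chosen for illustration, and a match only flags a passage for human review. It does not establish infringement, and it will miss close paraphrase.

```python
import re


def word_ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lower-cased word n-grams in a text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_phrases(candidate: str, reference: str, n: int = 8) -> list[str]:
    """List every n-word phrase that appears verbatim in both texts."""
    overlap = word_ngrams(candidate, n) & word_ngrams(reference, n)
    return sorted(" ".join(gram) for gram in overlap)


# Public-domain reference text (Dickens) and a draft that reuses part of it.
reference = ("It was the best of times, it was the worst of times, "
             "it was the age of wisdom")
draft = ("Our quarter was mixed: it was the best of times, "
         "it was the worst of times for retail.")

for phrase in shared_phrases(draft, reference):
    print("verbatim overlap:", phrase)
```

A shorter window catches more reuse but produces more false positives from common phrases, so the threshold is a tuning decision rather than a fixed rule.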
For images:
- Use reverse image search (Google Images, TinEye) to check for similar existing works
- Avoid generating images of recognisable people, characters, or trademarked objects
- Maintain records of generation prompts and parameters
For code:
- Deploy code scanning tools that detect copied open-source snippets
- Verify licence compatibility for any detected open-source code
- Maintain software bill of materials (SBOM) documentation
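Licence verification can be partly automated once components and their licences are known. The sketch below triages a component list against a policy expressed in SPDX licence identifiers. The `APPROVED` and `NEEDS_LEGAL_REVIEW` sets are hypothetical placeholders, since which licences are acceptable depends on how the software is distributed, and the component names are invented.

```python
from dataclasses import dataclass

# Hypothetical policy sets keyed by SPDX licence identifier. Which licences
# are acceptable is a decision for your legal team, not a given.
APPROVED = {"MIT", "BSD-3-Clause", "Apache-2.0"}
NEEDS_LEGAL_REVIEW = {"GPL-3.0-only", "AGPL-3.0-only"}


@dataclass(frozen=True)
class Component:
    name: str
    version: str
    licence: str  # SPDX identifier, or "" when the scanner found none


def triage(components: list[Component]) -> dict[str, list[str]]:
    """Sort components into approved, needs_review, and unknown buckets."""
    buckets: dict[str, list[str]] = {
        "approved": [], "needs_review": [], "unknown": [],
    }
    for component in components:
        label = f"{component.name}=={component.version}"
        if component.licence in APPROVED:
            buckets["approved"].append(label)
        elif component.licence in NEEDS_LEGAL_REVIEW:
            buckets["needs_review"].append(label)
        else:
            # Missing or unrecognised licences are never silently approved.
            buckets["unknown"].append(label)
    return buckets


detected = [
    Component("example-http-lib", "2.1.0", "MIT"),
    Component("example-parser", "0.9.3", "GPL-3.0-only"),
    Component("copied-snippet", "n/a", ""),
]
result = triage(detected)
print(result)
```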
3. Licensing and Permission Workflows
For custom model development or fine-tuning:
- Licence commercial training data – Use datasets from providers like Adobe Stock, Shutterstock AI, or Getty Images that explicitly permit AI training
- Obtain copyright clearance – For third-party materials, get written permission for AI training use
- Use public domain content – Prioritise training data from works where copyright has expired or was never claimed
- Respect robots.txt – Don't scrape websites that explicitly disallow automated access
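The robots.txt check can be done with Python's standard library before any page is fetched. The sketch below parses an inline robots.txt so it runs without network access; `ExampleTrainingBot` and the URLs are invented for illustration. Note that this checks only the site's stated crawling preference and says nothing about the copyright status of the content.

```python
from urllib.robotparser import RobotFileParser

# Inline robots.txt for illustration; a real crawler would fetch the
# site's own /robots.txt instead.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


article = "https://example.com/articles/1"
private = "https://example.com/private/report"

print(may_fetch(ROBOTS_TXT, "ExampleTrainingBot", article))
print(may_fetch(ROBOTS_TXT, "OtherBot", article))
print(may_fetch(ROBOTS_TXT, "OtherBot", private))
```

Here the site blocks the hypothetical training crawler entirely while allowing other agents everywhere except `/private/`, a pattern some publishers use to opt out of AI training crawls specifically.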
4. Insurance and Legal Protections
- IP insurance coverage – Ensure your errors and omissions (E&O) or IP policy covers AI-related infringement claims
- Contractual allocation of risk – When commissioning AI-generated work from agencies or vendors, require indemnification for IP claims
- Legal reserves – Budget for potential litigation defence costs
5. Governance and Documentation
Create audit trails demonstrating reasonable precautions:
- AI usage policies – Document approved tools, prohibited uses, and review requirements
- Training logs – For custom models, maintain records of data sources, licensing agreements, and permissions
- Review documentation – Keep records showing that content was reviewed before publication
- Incident response plans – Establish procedures for handling copyright claims related to AI outputs
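A usage policy is easier to audit when it is written as data and every decision is logged. The sketch below is one possible shape, with hypothetical tool names, use categories, and decision strings; a real policy would be defined by legal and compliance teams and stored somewhere more durable than an in-memory list.

```python
from datetime import datetime, timezone

# Hypothetical policy tables: approved tools, prohibited uses, and uses
# that require human review before release.
APPROVED_TOOLS = {"text-assistant", "code-assistant"}
PROHIBITED_USES = {"imitate_named_artist", "train_on_unlicensed_data"}
REVIEW_REQUIRED_USES = {"external_publication", "production_code"}

audit_log: list[dict[str, str]] = []


def evaluate_request(tool: str, use: str, requester: str) -> str:
    """Decide whether a proposed AI use is allowed, and record the decision."""
    if tool not in APPROVED_TOOLS:
        decision = "blocked: tool not approved"
    elif use in PROHIBITED_USES:
        decision = "blocked: prohibited use"
    elif use in REVIEW_REQUIRED_USES:
        decision = "allowed: human review required"
    else:
        decision = "allowed"
    # Every decision is logged, including refusals, to build the audit trail.
    audit_log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "requester": requester,
        "tool": tool,
        "use": use,
        "decision": decision,
    })
    return decision


print(evaluate_request("text-assistant", "external_publication", "marketing"))
print(evaluate_request("image-generator", "internal_mockup", "design"))
print(evaluate_request("text-assistant", "imitate_named_artist", "marketing"))
```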
The Evolving Legal Landscape
Judicial Developments to Watch
Over the next 12 to 24 months, multiple cases are expected to proceed to summary judgment and trial:
- New York Times v. OpenAI – May establish precedent on whether news article training is fair use
- Authors Guild cases – Will determine whether training on pirated ebooks defeats fair use
- Visual artists cases – Could clarify whether style mimicry constitutes infringement
Legal observers don't expect definitive Supreme Court guidance for several years. Until then, enterprises must navigate conflicting district court decisions.
Legislative Possibilities
Congress may intervene if courts fail to provide clarity. Potential approaches:
Compulsory licensing regime: AI companies pay into a collective licensing fund; rights holders receive proportional compensation based on usage.
Opt-out registries: Copyright holders can register works as off-limits for AI training; unauthorised use loses fair use protection.
Transparency mandates: AI companies must disclose training data sources; users can verify whether specific works were used.
Private right of action expansion: Proposed legislation would allow individual creators to sue for unauthorised use rather than relying on class actions.
International Divergence
Whilst the U.S. debate centres on fair use, other jurisdictions are adopting different frameworks:
European Union: The AI Act and Copyright Directive create transparency obligations but don't resolve the fundamental training question. EU courts may take a more restrictive view of text and data mining exceptions.
United Kingdom: Proposed a text and data mining exception for commercial AI training, then withdrew it following creator backlash. The issue remains unresolved.
Japan: Takes a permissive approach, generally allowing AI training without requiring permission for non-expressive uses.
Multinational enterprises must comply with the rules of each jurisdiction where they operate, which in practice often means designing policies around the most restrictive regime.
Strategic Recommendations for Enterprises
Short Term (0-6 Months)
- Conduct AI copyright risk assessment
  - Inventory all AI tools used across the organisation
  - Identify which vendors face active copyright litigation
  - Map high-risk use cases (public-facing content, code generation, custom training)
- Review vendor contracts
  - Evaluate indemnification terms and exclusions
  - Negotiate enhanced protections for enterprise deployments
  - Require vendors to warrant lawful data acquisition
- Implement output verification
  - Deploy plagiarism detection for text
  - Use reverse image search for visual content
  - Scan AI-generated code for copied snippets
Medium Term (6-12 Months)
- Develop AI governance framework
  - Create formal policies on acceptable AI tool usage
  - Establish approval workflows for custom model training
  - Implement data provenance tracking for training datasets
- Train legal and compliance teams
  - Educate staff on AI copyright risks
  - Establish escalation procedures for potential infringement
  - Create incident response playbooks
- Secure appropriate insurance
  - Verify IP insurance covers AI-related claims
  - Consider cyber insurance for data breach implications
  - Budget for potential litigation defence costs
Long Term (12+ Months)
- Build licensing relationships
  - Establish agreements with content providers for training data
  - Participate in collective licensing organisations if they emerge
  - Contribute to industry standards for responsible AI development
- Develop proprietary alternatives
  - Create owned datasets from public domain or licensed content
  - Train custom models on verified, legally obtained data
  - Reduce dependence on third-party AI systems with uncertain provenance
- Monitor legal developments
  - Track ongoing litigation outcomes
  - Adapt policies as judicial precedents emerge
  - Engage with industry groups shaping legislation
Conclusion: Managing Uncertainty While Innovation Continues
The copyright-vs-AI legal battle represents one of the most significant intellectual property conflicts of the digital age. The outcome will determine whether AI development continues on its current trajectory or fundamentally restructures around licensed training data.
For enterprises, the uncertainty creates a challenging risk environment:
- Fair use doctrine hasn't caught up with AI technology
- Courts are reaching inconsistent conclusions
- Legislative solutions remain years away
- AI vendors face existential litigation
Yet enterprises cannot simply wait for legal clarity. AI adoption continues to accelerate, and competitive pressure demands engagement with these technologies.
The path forward requires:
Rigorous vendor diligence: Evaluate AI providers' copyright risk profiles. Prefer vendors using licensed training data or offering robust indemnification.
Systematic output verification: Implement processes to detect copied content before publication. Don't rely solely on AI systems' built-in filters.
Clear governance frameworks: Establish policies defining acceptable AI usage, approval requirements, and review procedures.
Proactive licensing: For custom models, prioritise lawfully obtained training data. The single clearest legal principle is that fair use requires lawful acquisition.
Continuous monitoring: Track litigation outcomes and adjust strategies as precedents emerge.
The enterprises that will successfully navigate this transition are those treating copyright risk as a core component of AI strategy—not an afterthought. In an environment where a single court decision can reshape the entire landscape, preparation and adaptability matter more than premature certainty.
The legal question may remain unsettled, but the business imperative is clear: enterprises must manage AI copyright risk deliberately, or risk catastrophic exposure when clarity finally arrives.